Hi folks,

I just stepped up from 1.3.1 to 1.4.0, and the most notable difference for me so far is the DataFrame reader/writer.

Previously:

    val myData = hiveContext.load("s3n://someBucket/somePath/", "parquet")

Now:

    val myData = hiveContext.read.parquet("s3n://someBucket/somePath")
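(For anyone comparing the two APIs: as far as I understand it, the new parquet() call is just shorthand for the generic reader form below, where format() selects the data source and load() supplies the path.)

    // Generic form of the new DataFrameReader API in 1.4.0,
    // equivalent to hiveContext.read.parquet(...)
    val myData = hiveContext.read
      .format("parquet")
      .load("s3n://someBucket/somePath")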
Using the original code it didn't actually fire an action, but the DataFrame reader does. So if I have, say, 1 GB of data in a Parquet file partitioned into 1,000 chunks, I now have to sit there waiting for it to scan those 1,000 chunks. It seems to be doing some sort of validation; is there any way to switch that off, or to stop it hitting the source so much?

I should also note the Parquet files were written with 1.6RC3, whereas I think Spark 1.4.0 is using Parquet 1.7.x.
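The kind of switch I'm hoping for would look something like the sketch below, i.e. turning off the up-front schema discovery/merging pass over every part-file's footer. To be clear, this is a guess on my part: the "mergeSchema" data source option and the "spark.sql.parquet.mergeSchema" setting may only exist in versions newer than 1.4.0.

    // Sketch only: disable Parquet schema merging so the reader does not
    // have to read every part-file's footer before returning a DataFrame.
    // NOTE: both knobs below are assumptions on my part and may not be
    // available in Spark 1.4.0.
    hiveContext.setConf("spark.sql.parquet.mergeSchema", "false")

    val myData = hiveContext.read
      .option("mergeSchema", "false")   // per-read override (assumed)
      .parquet("s3n://someBucket/somePath")

If someone knows the actual config key (or can confirm there isn't one in 1.4.0), that would be much appreciated.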