Hi folks,

I just stepped up from 1.3.1 to 1.4.0, and the most notable difference for me so far is the DataFrame reader/writer.

Previously:

    val myData = hiveContext.load("s3n://someBucket/somePath/", "parquet")

Now:

    val myData = hiveContext.read.parquet("s3n://someBucket/somePath")
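(For anyone comparing the two APIs: as far as I understand it, the new parquet() call is just shorthand for the generic reader form below, where format() selects the data source and load() supplies the path.)

    // Generic form of the new DataFrameReader API in 1.4.0,
    // equivalent to hiveContext.read.parquet(...)
    val myData = hiveContext.read
      .format("parquet")
      .load("s3n://someBucket/somePath")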
Using the original code it didn't actually fire an action, but the DataFrame reader does. So if I have, say, 1 GB of data in a Parquet file partitioned into 1,000 chunks, I now have to sit there waiting for it to scan those 1,000 chunks. It seems to be doing some sort of validation; is there any way to switch that off, or to stop it hitting the source so much?

I should also note the Parquet files were written with 1.6RC3, whereas I think Spark 1.4.0 is using Parquet 1.7.x.
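The kind of switch I'm hoping for would look something like the sketch below, i.e. turning off the up-front schema discovery/merging pass over every part-file's footer. To be clear, this is a guess on my part: the "mergeSchema" data source option and the "spark.sql.parquet.mergeSchema" setting may only exist in versions newer than 1.4.0.

    // Sketch only: disable Parquet schema merging so the reader does not
    // have to read every part-file's footer before returning a DataFrame.
    // NOTE: both knobs below are assumptions on my part and may not be
    // available in Spark 1.4.0.
    hiveContext.setConf("spark.sql.parquet.mergeSchema", "false")

    val myData = hiveContext.read
      .option("mergeSchema", "false")   // per-read override (assumed)
      .parquet("s3n://someBucket/somePath")

If someone knows the actual config key (or can confirm there isn't one in 1.4.0), that would be much appreciated.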