[ https://issues.apache.org/jira/browse/SPARK-6921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516299#comment-14516299 ]
zhangxiongfei commented on SPARK-6921:
--------------------------------------

I think the reason for this issue is as follows:

1) When the API "sqlContext.parquetFile()" is invoked, a NewHadoopRDD named "baseRDD" is created, and the Hadoop Configuration is broadcast at that point. This means the compute() method of that NewHadoopRDD will use the unchanged Configuration (block size 32M); changing the Configuration afterwards has no effect on compute(), which runs on the executor side.

2) Setting the Tachyon block size to 256M with sc.hadoopConfiguration.setLong("fs.local.block.size",268435456) only changes the local (driver-side) Configuration.

3) Saving the DataFrame to Tachyon with "saveAsParquetFile" submits a Spark job. That job distributes a closure named "writeShard" to the executors, and the changed Configuration (block size 256M) is distributed along with it.

4) When the task starts, a Hadoop FileSystem is first instantiated for the compute() method of NewHadoopRDD to read the data from Tachyon; this instance is created from the unchanged Configuration (block size 32M) fetched from the broadcast. Next, the closure "writeShard" asks for a FileSystem instance to write data to Tachyon, and because FileSystem instances are cached by default it gets the same instance mentioned above, so "writeShard" also uses the unchanged Configuration (block size 32M). As a result, the block size of every file written on the executor side is 32M, while the metadata files (_common_metadata) have a block size of 256M because they are written on the driver side.

In summary, this issue is caused by the FileSystem cache mechanism. The workaround is to disable the cache, e.g.:

sc.hadoopConfiguration.setBoolean("fs.tachyon.impl.disable.cache",true)
OR
sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache",true)

I verified this workaround on HDFS and Tachyon, but the issue affects all Hadoop-compatible FileSystems.
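A minimal end-to-end sketch of this workaround in spark-shell, assuming the Spark 1.3 API and a Tachyon filesystem reachable through tachyon:// URIs; the paths are the ones from the issue description quoted below and are only illustrative:

// Disable the FileSystem cache for the tachyon:// scheme so that the write path
// creates a new FileSystem from the current Configuration instead of reusing the
// cached instance that was built with the 32M block size during the read.
sc.hadoopConfiguration.setBoolean("fs.tachyon.impl.disable.cache", true)

// Same sequence as in the report: load, then raise the block size, then save.
val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)  // 256 MB
ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")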
> Spark SQL API "saveAsParquetFile" will output tachyon file with different
> block size
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-6921
>                 URL: https://issues.apache.org/jira/browse/SPARK-6921
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: zhangxiongfei
>            Priority: Critical
>
> I ran the code below in the Spark shell to access Parquet files in Tachyon.
> 1. First, created a DataFrame by loading a bunch of Parquet files from Tachyon:
> val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m");
> 2. Second, set "fs.local.block.size" to 256M to make sure that the block size
> of the output files in Tachyon is 256M:
> sc.hadoopConfiguration.setLong("fs.local.block.size",268435456)
> 3. Third, saved the above DataFrame as Parquet files stored in Tachyon:
> ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test");
> After the code ran successfully, the output Parquet files were stored in
> Tachyon, but the files have different block sizes. Below is the information on
> those files in the path
> "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
> File Name             Size        Block Size   In-Memory   Pin   Creation Time
> _SUCCESS              0.00 B      256.00 MB    100%        NO    04-13-2015 17:48:23:519
> _common_metadata      1088.00 B   256.00 MB    100%        NO    04-13-2015 17:48:23:741
> _metadata             22.71 KB    256.00 MB    100%        NO    04-13-2015 17:48:23:646
> part-r-00001.parquet  177.19 MB   32.00 MB     100%        NO    04-13-2015 17:46:44:626
> part-r-00002.parquet  177.21 MB   32.00 MB     100%        NO    04-13-2015 17:46:44:636
> part-r-00003.parquet  177.02 MB   32.00 MB     100%        NO    04-13-2015 17:46:45:439
> part-r-00004.parquet  177.21 MB   32.00 MB     100%        NO    04-13-2015 17:46:44:845
> part-r-00005.parquet  177.40 MB   32.00 MB     100%        NO    04-13-2015 17:46:44:638
> part-r-00006.parquet  177.33 MB   32.00 MB     100%        NO    04-13-2015 17:46:44:648
> It seems that the API saveAsParquetFile does not distribute/broadcast the
> Hadoop Configuration to executors the way other APIs such as saveAsTextFile do.
> The configuration "fs.local.block.size" only takes effect on the driver.
> If I set that configuration before loading the Parquet files, the problem is gone.
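The reporter's last observation suggests a second workaround that does not require disabling the cache: set the block size before the first read, so the cached FileSystem instance is created from a Configuration that already carries 256M and is then reused as-is by the write path. A minimal sketch, again with the illustrative paths from the report:

// Set the desired block size before anything touches the tachyon:// scheme.
sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)  // 256 MB

// The FileSystem created (and cached) for this read now uses 256 MB blocks,
// and the subsequent write reuses the same cached instance.
val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")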