[ 
https://issues.apache.org/jira/browse/SPARK-6921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516299#comment-14516299
 ] 

zhangxiongfei commented on SPARK-6921:
--------------------------------------

I think the reason for this issue is as follows:
1) When the API "sqlContext.parquetFile()" is invoked, a NewHadoopRDD named
"baseRDD" is created and the Hadoop Configuration is broadcast at that point.
This means the compute() method of that NewHadoopRDD will use the unchanged
Configuration (block size 32M), and changing the Configuration afterwards has
no effect on compute(), which runs on the executors.
2) Setting the Tachyon block size to 256M with
"sc.hadoopConfiguration.setLong("fs.local.block.size",268435456)" only changes
the local (driver-side) Configuration.
3) Saving the DataFrame to Tachyon with "saveAsParquetFile" submits a Spark
job. That job distributes a closure named "writeShard" to the executors, and
the changed Configuration (block size 256M) is distributed along with it.
4) When a task starts, a Hadoop FileSystem is first instantiated for the
compute() method of NewHadoopRDD to read the data from Tachyon; this FileSystem
instance is created from the unchanged Configuration (block size 32M) fetched
from the broadcast. Next, the closure "writeShard" asks for a FileSystem
instance to write data to Tachyon, and because FileSystem instances are cached
by default it gets back the same instance created above, so "writeShard" also
ends up using the unchanged Configuration (block size 32M), as sketched below.
As a result, the block size of every file written on the executors is 32M,
while the metadata files (_common_metadata, _metadata) have a 256M block size
because they are written on the driver.
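To make step 4 concrete, here is a minimal sketch (not from the issue itself,
just an illustration of the Hadoop FileSystem cache): FileSystem.get() keys its
cache on the URI scheme, authority and user, not on the Configuration, so a
later get() with a changed Configuration returns the instance created earlier.

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem

    // First consumer (like NewHadoopRDD.compute) builds a FileSystem with the old 32M setting.
    val oldConf = new Configuration()
    oldConf.setLong("fs.local.block.size", 33554432L)           // 32 MB
    val readFs = FileSystem.get(new URI("tachyon://tachyonserver:19998"), oldConf)

    // Second consumer (like the writeShard closure) passes the updated 256M setting ...
    val newConf = new Configuration()
    newConf.setLong("fs.local.block.size", 268435456L)          // 256 MB
    val writeFs = FileSystem.get(new URI("tachyon://tachyonserver:19998"), newConf)

    // ... but the cache key ignores the Configuration, so the same instance comes back
    // and the 256M setting is never seen by the writer.
    assert(readFs eq writeFs)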
In summary, this issue is caused by the FileSystem cache mechanism. The
workaround is to disable the cache for the scheme in use, for example:
sc.hadoopConfiguration.setBoolean("fs.tachyon.impl.disable.cache",true) or
sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache",true)
I verified this workaround on HDFS and Tachyon, but the issue affects all
Hadoop-compatible FileSystems.
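
For reference, a minimal spark-shell sketch of this workaround, assuming the
same paths as in the issue description below:

    // Disable the FileSystem cache for the tachyon:// scheme before anything is read,
    // so each task builds its FileSystem from the Configuration it actually receives.
    sc.hadoopConfiguration.setBoolean("fs.tachyon.impl.disable.cache", true)
    sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456L)   // 256 MB

    val df = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
    // With the cache disabled, writeShard creates a fresh FileSystem from the distributed
    // Configuration, so the 256M block size takes effect on the executors as well.
    df.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")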



> Spark SQL API "saveAsParquetFile" will output tachyon file with different block size
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-6921
>                 URL: https://issues.apache.org/jira/browse/SPARK-6921
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: zhangxiongfei
>            Priority: Critical
>
> I ran the code below in the Spark shell to access parquet files in Tachyon.
>   1. First, created a DataFrame by loading a bunch of Parquet files in Tachyon:
>     val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m");
>   2. Second, set "fs.local.block.size" to 256M to make sure that the block size of the output files in Tachyon is 256M:
>     sc.hadoopConfiguration.setLong("fs.local.block.size",268435456)
>   3. Third, saved the above DataFrame into Parquet files stored in Tachyon:
>     ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test");
>  After the above code ran successfully, the output parquet files were stored in
> Tachyon, but the files have different block sizes. Below is the information for
> those files in the path
> "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
>  File Name               Size         Block Size   In-Memory   Pin   Creation Time
>  _SUCCESS                0.00 B       256.00 MB    100%        NO    04-13-2015 17:48:23:519
>  _common_metadata        1088.00 B    256.00 MB    100%        NO    04-13-2015 17:48:23:741
>  _metadata               22.71 KB     256.00 MB    100%        NO    04-13-2015 17:48:23:646
>  part-r-00001.parquet    177.19 MB    32.00 MB     100%        NO    04-13-2015 17:46:44:626
>  part-r-00002.parquet    177.21 MB    32.00 MB     100%        NO    04-13-2015 17:46:44:636
>  part-r-00003.parquet    177.02 MB    32.00 MB     100%        NO    04-13-2015 17:46:45:439
>  part-r-00004.parquet    177.21 MB    32.00 MB     100%        NO    04-13-2015 17:46:44:845
>  part-r-00005.parquet    177.40 MB    32.00 MB     100%        NO    04-13-2015 17:46:44:638
>  part-r-00006.parquet    177.33 MB    32.00 MB     100%        NO    04-13-2015 17:46:44:648
>  It seems that the API saveAsParquetFile does not distribute/broadcast the
> Hadoop Configuration to the executors the way other APIs such as
> saveAsTextFile do. The configuration "fs.local.block.size" only takes effect
> on the driver.
>  If I set that configuration before loading the parquet files, the problem is gone.
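>  A minimal sketch of that last point, using the same paths as above: setting
> the block size before the first read means the FileSystem instance that gets
> cached is already created with the 256M value, so the executors inherit it.
>
>     sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456L)   // 256 MB, set before any read
>     val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
>     ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")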


