[ https://issues.apache.org/jira/browse/SPARK-6921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509349#comment-14509349 ]

zhangxiongfei commented on SPARK-6921:
--------------------------------------

I think the root cause may be the following:
1) When "SQLContext.parquetFile()" is invoked, an instance of the case class
"ParquetRelation2" is created:
     def parquetFile(paths: String*): DataFrame = {
       baseRelationToDataFrame(parquet.ParquetRelation2(paths, Map.empty)(this))
     }
   From this point on, the field "val sqlContext: SQLContext" of the
"ParquetRelation2" instance is no longer the same instance as the original one
instantiated in the Spark shell.
2) Try to set the Hadoop configuration:
     sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
   This code does NOT change the "val sqlContext: SQLContext" field of the
"ParquetRelation2" instance; it only changes the original sqlContext.
3) Save the current DataFrame as Parquet files.
   "saveAsParquetFile()" uses the cloned "sqlContext" to write the DataFrame,
so "fs.local.block.size" is still the default 32 MB.
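The mechanism above can be modeled without Spark. The following is a minimal sketch; "FakeSQLContext" and "FakeParquetRelation" are hypothetical stand-ins, not Spark classes, and the snapshot-of-a-map is only an analogy for the cloned sqlContext described above:

```scala
import scala.collection.mutable

// Hypothetical stand-in for a SQLContext carrying a mutable Hadoop configuration.
class FakeSQLContext {
  // 32 MB default, mirroring the default "fs.local.block.size".
  val hadoopConf: mutable.Map[String, Long] =
    mutable.Map("fs.local.block.size" -> 33554432L)
}

// Hypothetical stand-in for ParquetRelation2: the configuration it will use
// is fixed when the relation is created.
case class FakeParquetRelation(sqlContext: FakeSQLContext) {
  val capturedConf: Map[String, Long] = sqlContext.hadoopConf.toMap
}

object Demo {
  def main(args: Array[String]): Unit = {
    val ctx = new FakeSQLContext
    val relation = FakeParquetRelation(ctx)            // 1) load: relation created first
    ctx.hadoopConf("fs.local.block.size") = 268435456L // 2) set 256 MB afterwards
    // 3) the relation still sees the 32 MB default
    println(relation.capturedConf("fs.local.block.size")) // 33554432
  }
}
```

Mutating the context after the relation exists changes only the context, which matches the observed behavior: the write path keeps the 32 MB default.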


> Spark SQL API "saveAsParquetFile" will output tachyon file with different 
> block size
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-6921
>                 URL: https://issues.apache.org/jira/browse/SPARK-6921
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: zhangxiongfei
>            Priority: Blocker
>
> I ran the code below in the Spark shell to access Parquet files in Tachyon.
>   1. First, created a DataFrame by loading a bunch of Parquet files in Tachyon:
>      val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
>   2. Second, set "fs.local.block.size" to 256 MB to make sure that the block
>      size of the output files in Tachyon is 256 MB:
>      sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
>   3. Third, saved the above DataFrame into Parquet files stored in Tachyon:
>      ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
> After the above code ran successfully, the output Parquet files were stored in
> Tachyon, but these files have different block sizes. Below is the information
> for those files in the path
> "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
>   File Name               Size        Block Size   In-Memory   Pin   Creation Time
>   _SUCCESS                0.00 B      256.00 MB    100%        NO    04-13-2015 17:48:23:519
>   _common_metadata        1088.00 B   256.00 MB    100%        NO    04-13-2015 17:48:23:741
>   _metadata               22.71 KB    256.00 MB    100%        NO    04-13-2015 17:48:23:646
>   part-r-00001.parquet    177.19 MB   32.00 MB     100%        NO    04-13-2015 17:46:44:626
>   part-r-00002.parquet    177.21 MB   32.00 MB     100%        NO    04-13-2015 17:46:44:636
>   part-r-00003.parquet    177.02 MB   32.00 MB     100%        NO    04-13-2015 17:46:45:439
>   part-r-00004.parquet    177.21 MB   32.00 MB     100%        NO    04-13-2015 17:46:44:845
>   part-r-00005.parquet    177.40 MB   32.00 MB     100%        NO    04-13-2015 17:46:44:638
>   part-r-00006.parquet    177.33 MB   32.00 MB     100%        NO    04-13-2015 17:46:44:648
> It seems that the API saveAsParquetFile does not distribute/broadcast the
> Hadoop configuration to the executors the way other APIs such as
> saveAsTextFile do. The configuration "fs.local.block.size" only takes effect
> on the driver.
> If I set that configuration before loading the Parquet files, the problem is gone.
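The reporter's last sentence suggests the fix is purely a matter of ordering. A sketch of that ordering in the Spark shell follows; the URIs and calls are the reporter's own, and this needs a live Spark 1.3 / Tachyon setup, so it is illustrative only, not something verified here:

```scala
// Set the block size BEFORE loading, so the SQLContext captured by the
// Parquet relation already carries the 256 MB setting when it is created.
sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
```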



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
