[ https://issues.apache.org/jira/browse/SPARK-6921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509349#comment-14509349 ]
zhangxiongfei commented on SPARK-6921:
--------------------------------------

I think the root cause may be the following:

1) When "SQLContext.parquetFile()" is invoked, an instance of the case class "ParquetRelation2" is created:

    def parquetFile(paths: String*): DataFrame =
      baseRelationToDataFrame(parquet.ParquetRelation2(paths, Map.empty)(this))

From then on, the field "val sqlContext: SQLContext" of the "ParquetRelation2" instance is no longer the same instance as the original one instantiated in the Spark shell.

2) Trying to set the Hadoop configuration afterwards:

    sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)

does NOT change the "val sqlContext: SQLContext" field of the "ParquetRelation2" instance; it only changes the original sqlContext.

3) Saving the current DataFrame as Parquet files: "saveAsParquetFile()" uses the cloned "sqlContext" to write the DataFrame, so "fs.local.block.size" is still the default 32 MB.

> Spark SQL API "saveAsParquetFile" will output tachyon file with different
> block size
> ------------------------------------------------------------------------------------
>
>                  Key: SPARK-6921
>                  URL: https://issues.apache.org/jira/browse/SPARK-6921
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 1.3.0
>             Reporter: zhangxiongfei
>             Priority: Blocker
>
> I ran the code below in the Spark shell to access Parquet files in Tachyon.
> 1. First, I created a DataFrame by loading a bunch of Parquet files from Tachyon:
>     val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m");
> 2. Second, I set "fs.local.block.size" to 256 MB to make sure that the block
> size of the output files in Tachyon would be 256 MB:
>     sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
> 3. Third, I saved the above DataFrame as Parquet files stored in Tachyon:
>     ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test");
> After the code ran successfully, the output Parquet files were stored in
> Tachyon, but these files have different block sizes. Below is the information on
> those files in the path
> "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
>
> File Name             Size       Block Size  In-Memory  Pin  Creation Time
> _SUCCESS              0.00 B     256.00 MB   100%       NO   04-13-2015 17:48:23:519
> _common_metadata      1088.00 B  256.00 MB   100%       NO   04-13-2015 17:48:23:741
> _metadata             22.71 KB   256.00 MB   100%       NO   04-13-2015 17:48:23:646
> part-r-00001.parquet  177.19 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:626
> part-r-00002.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:636
> part-r-00003.parquet  177.02 MB  32.00 MB    100%       NO   04-13-2015 17:46:45:439
> part-r-00004.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:845
> part-r-00005.parquet  177.40 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:638
> part-r-00006.parquet  177.33 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:648
>
> It seems that the API saveAsParquetFile does not distribute/broadcast the
> Hadoop configuration to the executors the way other APIs such as
> saveAsTextFile do. The configuration "fs.local.block.size" only takes effect on
> the driver.
> If I set that configuration before loading the Parquet files, the problem is gone.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
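The capture-at-construction behavior the comment describes can be sketched with a minimal, Spark-free Scala model. ConfModel and RelationModel are made-up names standing in for Hadoop's Configuration and for ParquetRelation2; this is an illustration of the claimed mechanism, not actual Spark source:

```scala
import scala.collection.mutable

// Hypothetical stand-in for a Hadoop Configuration: a mutable key-value store.
final class ConfModel {
  val settings: mutable.Map[String, Long] =
    mutable.Map("fs.local.block.size" -> 33554432L) // 32 MB default
  def cloneConf(): ConfModel = {
    val c = new ConfModel
    c.settings ++= settings
    c
  }
}

// A relation that captures a configuration at construction time,
// analogous to the comment's claim that ParquetRelation2 captures
// its own SQLContext when parquetFile() builds it.
final case class RelationModel(conf: ConfModel) {
  def blockSize: Long = conf.settings("fs.local.block.size")
}

// Usage: build the relation from a cloned configuration, then mutate
// the original afterwards, as the bug report's step 2 does.
val original = new ConfModel
val relation = RelationModel(original.cloneConf())
original.settings("fs.local.block.size") = 268435456L // 256 MB, set too late
// The captured copy never sees the update, mirroring the 32 MB output blocks:
println(relation.blockSize) // prints 33554432
```

Under this model, the reporter's workaround also falls out naturally: setting the value before the relation is constructed means the clone inherits it.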