Re: pyspark DataFrameWriter ignores customized settings?

2018-03-20 Thread Ryan Blue
To clarify what's going on here: dfs.blocksize and dfs.block.size set the HDFS block size (the spark.hadoop. prefix copies these properties into the Hadoop configuration). The Parquet "block size" is more accurately called the "row group size", but it is set with the unfortunately-named property parquet.block.size.
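A minimal sketch of the distinction; the sizes and application name are illustrative, not from the thread:

from pyspark.sql import SparkSession

# Both properties travel through the spark.hadoop. prefix into the Hadoop
# configuration, but they control different things.
spark = (
    SparkSession.builder
    .appName("block-size-vs-row-group-size")                       # illustrative name
    .config("spark.hadoop.dfs.blocksize", 256 * 1024 * 1024)       # HDFS block size, in bytes
    .config("spark.hadoop.parquet.block.size", 128 * 1024 * 1024)  # Parquet row group size, in bytes
    .getOrCreate()
)

With this configuration, df.write.parquet(...) cuts row groups at roughly 128 MB inside 256 MB HDFS blocks.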

Re: pyspark DataFrameWriter ignores customized settings?

2018-03-16 Thread chhsiao1981
Hi all, I found the answer at the following link: https://forums.databricks.com/questions/918/how-to-set-size-of-parquet-output-files.html I can successfully set the Parquet block size with spark.hadoop.parquet.block.size. The following is the sample code: # init block_size = 512 * 1024 conf = …
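The sample code is cut off in the archive preview. A plausible completion, following the approach in the linked Databricks answer (the DataFrame and output path are placeholders):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# init
block_size = 512 * 1024  # 512 KB

# spark.hadoop.parquet.block.size is copied into the Hadoop configuration,
# where the Parquet writer reads it as the row group size.
conf = SparkConf().set("spark.hadoop.parquet.block.size", block_size)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

df = spark.range(1000 * 1000)  # placeholder DataFrame
df.write.mode("overwrite").parquet("hdfs:///tmp/parquet_block_size_test")  # placeholder path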

Re: pyspark DataFrameWriter ignores customized settings?

2018-03-16 Thread chhsiao1981
Hi all, It looks like it's a Parquet-specific issue. I can successfully write with a 512k block size if I use df.write.csv() or df.write.text(). (The csv write succeeds once I put hadoop-lzo-0.4.15-cdh5.13.0.jar into the jars dir.) Sample code: block_size = 512 * 1024 conf = SparkConf().s…
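A sketch of that setup (the paths and DataFrame are placeholders; the 512 KB value is the one used in the thread):

from pyspark import SparkConf
from pyspark.sql import SparkSession

block_size = 512 * 1024  # 512 KB
# Note: the NameNode rejects block sizes below dfs.namenode.fs-limits.min-block-size
# (default 1 MB) unless that limit is lowered.

conf = (
    SparkConf()
    .set("spark.hadoop.dfs.blocksize", block_size)   # current property name
    .set("spark.hadoop.dfs.block.size", block_size)  # legacy spelling
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# text() needs a single string column; csv() does not.
df = spark.range(100 * 1000).selectExpr("cast(id AS string) AS value")
df.write.mode("overwrite").text("hdfs:///tmp/text_block_size_test")
df.write.mode("overwrite").csv("hdfs:///tmp/csv_block_size_test")

The block size of the output files can be checked with hdfs dfs -stat "%o" /tmp/text_block_size_test/part-*.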

pyspark DataFrameWriter ignores customized settings?

2018-03-10 Thread Chuan-Heng Hsiao
Hi all, I am using spark-2.2.1-bin-hadoop2.7 in standalone mode (Python 3.5.2, Ubuntu 16.04). I intended to have a DataFrame write to HDFS with a customized block size but failed. However, the corresponding RDD can successfully write with the customized block size. Could you help me f…
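The message is truncated in the archive. A minimal sketch of the setup being described, reconstructed from the later replies (paths and data are placeholders):

from pyspark import SparkConf
from pyspark.sql import SparkSession

block_size = 512 * 1024  # 512 KB, the size used elsewhere in the thread

conf = SparkConf().set("spark.hadoop.dfs.blocksize", block_size)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

# The RDD text write picks up dfs.blocksize from the Hadoop configuration.
sc.parallelize(range(100 * 1000)).map(str).saveAsTextFile("hdfs:///tmp/rdd_block_size_test")

# The DataFrame Parquet write sizes its output by parquet.block.size instead,
# so dfs.blocksize alone does not give the expected result (see the replies above).
spark.range(100 * 1000).write.mode("overwrite").parquet("hdfs:///tmp/df_block_size_test")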