Re: pyspark DataFrameWriter ignores customized settings?

2018-03-20 Thread Ryan Blue
To clarify what's going on here: dfs.blocksize and dfs.block.size set the HDFS block size (the spark.hadoop. prefix copies these properties into the Hadoop configuration). The Parquet "block size" is more accurately called the "row group size", but it is set with the unfortunately-named property parquet.block.size.
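A minimal sketch of the distinction; the sizes and application name are illustrative, not from the thread:

from pyspark.sql import SparkSession

# Both properties travel through the spark.hadoop. prefix into the Hadoop
# configuration, but they control different things.
spark = (
    SparkSession.builder
    .appName("block-size-vs-row-group-size")                       # illustrative name
    .config("spark.hadoop.dfs.blocksize", 256 * 1024 * 1024)       # HDFS block size, in bytes
    .config("spark.hadoop.parquet.block.size", 128 * 1024 * 1024)  # Parquet row group size, in bytes
    .getOrCreate()
)

With this configuration, df.write.parquet(...) cuts row groups at roughly 128 MB inside 256 MB HDFS blocks.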

Re: pyspark DataFrameWriter ignores customized settings?

2018-03-16 Thread chhsiao1981
Hi all, I found the answer at the following link: https://forums.databricks.com/questions/918/how-to-set-size-of-parquet-output-files.html I can successfully set the Parquet block size with spark.hadoop.parquet.block.size. The following is the sample code: # init block_size = 512 * 1024 conf = …
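The sample code is cut off in the archive preview. A plausible completion, following the approach in the linked Databricks answer (the DataFrame and output path are placeholders):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# init
block_size = 512 * 1024  # 512 KB

# spark.hadoop.parquet.block.size is copied into the Hadoop configuration,
# where the Parquet writer reads it as the row group size.
conf = SparkConf().set("spark.hadoop.parquet.block.size", block_size)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

df = spark.range(1000 * 1000)  # placeholder DataFrame
df.write.mode("overwrite").parquet("hdfs:///tmp/parquet_block_size_test")  # placeholder path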

Re: pyspark DataFrameWriter ignores customized settings?

2018-03-16 Thread chhsiao1981
Hi all, It looks like it's a Parquet-specific issue. I can successfully write with a 512k block size if I use df.write.csv() or df.write.text(). (The csv write succeeds once I put hadoop-lzo-0.4.15-cdh5.13.0.jar into the jars dir.) Sample code: block_size = 512 * 1024 conf = SparkConf().s…
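A sketch of that setup (the paths and DataFrame are placeholders; the 512 KB value is the one used in the thread):

from pyspark import SparkConf
from pyspark.sql import SparkSession

block_size = 512 * 1024  # 512 KB
# Note: the NameNode rejects block sizes below dfs.namenode.fs-limits.min-block-size
# (default 1 MB) unless that limit is lowered.

conf = (
    SparkConf()
    .set("spark.hadoop.dfs.blocksize", block_size)   # current property name
    .set("spark.hadoop.dfs.block.size", block_size)  # legacy spelling
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# text() needs a single string column; csv() does not.
df = spark.range(100 * 1000).selectExpr("cast(id AS string) AS value")
df.write.mode("overwrite").text("hdfs:///tmp/text_block_size_test")
df.write.mode("overwrite").csv("hdfs:///tmp/csv_block_size_test")

The block size of the output files can be checked with hdfs dfs -stat "%o" /tmp/text_block_size_test/part-*.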

pyspark DataFrameWriter ignores customized settings?

2018-03-10 Thread Chuan-Heng Hsiao
Hi all, I am using spark-2.2.1-bin-hadoop2.7 in standalone mode (Python 3.5.2, Ubuntu 16.04). I intended to have a DataFrame write to HDFS with a customized block size but failed. However, the corresponding RDD can successfully write with the customized block size. Could you help me f…
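The message is truncated in the archive. A minimal sketch of the setup being described, reconstructed from the later replies (paths and data are placeholders):

from pyspark import SparkConf
from pyspark.sql import SparkSession

block_size = 512 * 1024  # 512 KB, the size used elsewhere in the thread

conf = SparkConf().set("spark.hadoop.dfs.blocksize", block_size)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

# The RDD text write picks up dfs.blocksize from the Hadoop configuration.
sc.parallelize(range(100 * 1000)).map(str).saveAsTextFile("hdfs:///tmp/rdd_block_size_test")

# The DataFrame Parquet write sizes its output by parquet.block.size instead,
# so dfs.blocksize alone does not give the expected result (see the replies above).
spark.range(100 * 1000).write.mode("overwrite").parquet("hdfs:///tmp/df_block_size_test")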