To clarify what's going on here: dfs.blocksize and dfs.block.size set the
HDFS block size (the spark.hadoop. prefix passes these properties through to
the Hadoop configuration). The Parquet "block size" is more accurately called
the "row group size", but it is set using the unfortunately-named property
parquet.block.size.
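For concreteness, here is a minimal sketch (not from the original posts; the
512 KB value and the use of the internal _jsc handle are just for
illustration) of how those properties flow through the spark.hadoop. prefix
into the Hadoop configuration:

from pyspark.sql import SparkSession

block_size = 512 * 1024  # 512 KB, illustration only

spark = (SparkSession.builder
         .config("spark.hadoop.dfs.blocksize", str(block_size))       # HDFS block size
         .config("spark.hadoop.dfs.block.size", str(block_size))      # legacy name for the same setting
         .config("spark.hadoop.parquet.block.size", str(block_size))  # Parquet row-group size
         .getOrCreate())

# The spark.hadoop. prefix is stripped and the remainder lands in the Hadoop
# configuration that Spark's writers use:
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("dfs.blocksize"))       # "524288"
print(hadoop_conf.get("parquet.block.size"))  # "524288"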
Hi all,
Found the answer from the following link:
https://forums.databricks.com/questions/918/how-to-set-size-of-parquet-output-files.html
I can successfully set the Parquet block size with
spark.hadoop.parquet.block.size.
The following is the sample code:
# init
block_size = 512 * 1024
conf =
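(The snippet above is cut off in the archive. A minimal reconstruction of the
idea, with a placeholder DataFrame and output path, might look like this:)

from pyspark import SparkConf
from pyspark.sql import SparkSession

# init
block_size = 512 * 1024  # 512 KB
conf = (SparkConf()
        .set("spark.hadoop.parquet.block.size", str(block_size))  # Parquet row-group size
        .set("spark.hadoop.dfs.blocksize", str(block_size))       # HDFS block size
        .set("spark.hadoop.dfs.block.size", str(block_size)))     # legacy name

spark = SparkSession.builder.config(conf=conf).getOrCreate()

df = spark.range(0, 1000000)  # placeholder data
df.write.mode("overwrite").parquet("hdfs:///tmp/parquet_block_size_test")  # placeholder path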
Hi all,
Looks like it's a Parquet-specific issue.
I can successfully write with a 512k block size
if I use df.write.csv() or df.write.text().
(The CSV write succeeds once I put hadoop-lzo-0.4.15-cdh5.13.0.jar
into the jars directory.)
sample code:
block_size = 512 * 1024
conf =
SparkConf().s
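(Again the snippet is cut off in the archive; a sketch of the working CSV/text
case, with placeholder data and paths, could look like this:)

from pyspark import SparkConf
from pyspark.sql import SparkSession

block_size = 512 * 1024  # 512 KB
conf = (SparkConf()
        .set("spark.hadoop.dfs.blocksize", str(block_size))
        .set("spark.hadoop.dfs.block.size", str(block_size)))

spark = SparkSession.builder.config(conf=conf).getOrCreate()

df = spark.range(0, 1000000).selectExpr("CAST(id AS STRING) AS value")  # placeholder data
df.write.mode("overwrite").csv("hdfs:///tmp/csv_block_size_test")    # written with 512k HDFS blocks
df.write.mode("overwrite").text("hdfs:///tmp/text_block_size_test")  # text() needs a single string column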
Hi all,
I am using spark-2.2.1-bin-hadoop2.7 in standalone mode.
(Python version: 3.5.2, on Ubuntu 16.04)
I intended to have a DataFrame write to HDFS with a customized block size, but
it failed. However, the corresponding RDD can successfully write with the
customized block size.
Could you help me figure out what is going on?
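For reference, a sketch (placeholder paths and data, not the original code) of
the two write paths being compared, sharing the same block-size settings:

from pyspark.sql import SparkSession

block_size = 512 * 1024  # 512 KB

spark = (SparkSession.builder
         .config("spark.hadoop.dfs.blocksize", str(block_size))
         .config("spark.hadoop.dfs.block.size", str(block_size))
         .getOrCreate())
sc = spark.sparkContext

# RDD path: saveAsTextFile picks up the block size from the Hadoop configuration
sc.parallelize(range(1000000)).map(str).saveAsTextFile("hdfs:///tmp/rdd_out")

# DataFrame path: the Parquet case that appeared to ignore the block size until
# spark.hadoop.parquet.block.size was also set (see the follow-up posts above)
spark.range(0, 1000000).write.mode("overwrite").parquet("hdfs:///tmp/df_out")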