I am not an expert in Parquet.

Looking at PARQUET-166, it seems that parquet.block.size should be lower
than dfs.blocksize.
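
If that is what you are hitting, you could try keeping the Parquet row group
size at or below the HDFS block size when you write. A rough sketch based on
your script (untested on my side; the 256MB value and the target table name
are just examples):

  val DFS_BLOCK_SIZE = 1024 * 1024 * 512      // 512MB HDFS blocks
  val PARQUET_BLOCK_SIZE = 1024 * 1024 * 256  // keep row groups within one HDFS block
  sc.hadoopConfiguration.setInt("dfs.blocksize", DFS_BLOCK_SIZE)
  sc.hadoopConfiguration.setInt("parquet.block.size", PARQUET_BLOCK_SIZE)
  sqlContext.table("data_1g").repartition(1).saveAsTable("partition_test3")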

Have you tried Spark 1.5.2 to see if the problem persists?
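
Also, to double-check the block size a written file actually got (besides
fsck), something along these lines should work from the shell; the path below
is the one from your mail:

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(sc.hadoopConfiguration)
  val status = fs.getFileStatus(
    new Path("/user/hive/warehouse/partition_test2/part-r-00001.gz.parquet"))
  println(status.getBlockSize)  // block size the file was created with, in bytes
  println(fs.getFileBlockLocations(status, 0, status.getLen).length)  // number of blocks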

Cheers

On Mon, Nov 30, 2015 at 1:55 AM, Jung <jb_j...@naver.com> wrote:

> Hello,
> I use Spark 1.4.1 and Hadoop 2.2.0.
> It may be a stupid question, but I cannot understand why the Hadoop option
> "dfs.blocksize" sometimes doesn't affect the number of blocks.
> When I run the script below,
>
>   val BLOCK_SIZE = 1024 * 1024 * 512 // set to 512MB, Hadoop default is 128MB
>   sc.hadoopConfiguration.setInt("parquet.block.size", BLOCK_SIZE)
>   sc.hadoopConfiguration.setInt("dfs.blocksize", BLOCK_SIZE)
>   sc.parallelize(1 to 500000000, 24).repartition(3).toDF.saveAsTable("partition_test")
>
> it creates 3 files like this.
>
>   221.1 M  /user/hive/warehouse/partition_test/part-r-00001.gz.parquet
>   221.1 M  /user/hive/warehouse/partition_test/part-r-00002.gz.parquet
>   221.1 M  /user/hive/warehouse/partition_test/part-r-00003.gz.parquet
>
> To check how many blocks are in a file, I ran the command "hdfs fsck
> /user/hive/warehouse/partition_test/part-r-00001.gz.parquet -files -blocks".
>
>   Total blocks (validated):      1 (avg. block size 231864402 B)
>
> This is the expected case because the maximum block size changed from 128MB to 512MB.
> In my real-world data, I have a bunch of files:
>
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00001.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00002.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00003.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00004.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00005.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00006.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00007.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00008.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00009.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00010.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00011.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00012.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00013.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00014.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00015.gz.parquet
>   14.4 M  /user/hive/warehouse/data_1g/part-r-00016.gz.parquet
>
> Each file consists of 1 block (avg. block size 15141395 B), and I run almost
> the same code as before.
>
>   val BLOCK_SIZE = 1024 * 1024 * 512 // set to 512MB, Hadoop default is 128MB
>   sc.hadoopConfiguration.setInt("parquet.block.size", BLOCK_SIZE)
>   sc.hadoopConfiguration.setInt("dfs.blocksize", BLOCK_SIZE)
>   sqlContext.table("data_1g").repartition(1).saveAsTable("partition_test2")
>
> It creates one file.
>
>  231.0 M  /user/hive/warehouse/partition_test2/part-r-00001.gz.parquet
>
> But it consists of 2 blocks. It seems dfs.blocksize was not applied.
>
>   /user/hive/warehouse/partition_test2/part-r-00001.gz.parquet 242202143 bytes, 2 block(s):  OK
>   0. BP-2098986396-192.168.100.1-1389779750403:blk_1080124727_6385839 len=134217728 repl=2
>   1. BP-2098986396-192.168.100.1-1389779750403:blk_1080124728_6385840 len=107984415 repl=2
>
> Because of this, Spark reads it as 2 partitions even though I repartitioned
> the data into 1 partition. If the file size after repartitioning is a little
> more than 128MB and I save it again, it writes 2 files of about 128MB and 1MB.
> This is very important for me because I use the repartition method many times.
> Please help me figure this out.
>
>   Jung
