How can I write a Parquet file with min/max statistics?

2018-01-24 10:30 GMT+09:00 Stephen Joung <step...@vcnc.co.kr>:
> Hi, I am trying to use Spark SQL filter pushdown, and especially want to
> use row group skipping with Parquet files.
>
> And I guessed that I need a Parquet file with min/max statistics.
>
> ----
>
> On the Spark master branch, I tried to write a single column with "a", "b", "c"
> to Parquet file f1:
>
> scala> List("a", "b", "c").toDF("field1").coalesce(1).write.parquet("f1")
>
> But the saved file does not have statistics (min, max):
>
> $ ls f1/*.parquet
> f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
> $ parquet-tools meta f1/*.parquet
> file:        file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
> creator:     parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)
> extra:       org.apache.spark.sql.parquet.row.metadata =
>              {"type":"struct","fields":[{"name":"field1","type":"string","nullable":true,"metadata":{}}]}
>
> file schema: spark_schema
> --------------------------------------------------------------------------------
> field1: OPTIONAL BINARY O:UTF8 R:0 D:1
>
> row group 1: RC:3 TS:48 OFFSET:4
> --------------------------------------------------------------------------------
> field1: BINARY SNAPPY DO:0 FPO:4 SZ:50/48/0.96 VC:3 ENC:BIT_PACKED,RLE,PLAIN ST:[no stats for this column]
>
> ----
>
> Any pointer or comment would be appreciated.
> Thank you.
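
A likely explanation, for anyone hitting the same output: I believe parquet-mr
1.8.x deliberately omits min/max statistics for BINARY (string) columns,
because the old statistics were computed with signed byte-wise comparison,
which is incorrect for UTF8 data (see PARQUET-251 and PARQUET-686). Correctly
ordered min_value/max_value statistics for binary columns arrive, if I recall
correctly, with parquet-mr 1.10+ (PARQUET-1025). Numeric columns are not
affected, so a quick sanity check is to repeat the experiment with an Int
column. A minimal sketch, assuming the same spark-shell session as above; the
output path "f2" is just an illustrative name:

scala> // Same experiment as above, but with an Int column instead of String.
scala> // parquet-mr 1.8.x does write min/max stats for numeric types.
scala> List(1, 2, 3).toDF("field1").coalesce(1).write.parquet("f2")

$ parquet-tools meta f2/*.parquet

If this explanation is right, the field1 column chunk in the meta output for
f2 should carry an ST:[...] entry with min 1 and max 3, rather than "no stats
for this column".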