How can I write a Parquet file with min/max statistics?

2018-01-24 10:30 GMT+09:00 Stephen Joung <step...@vcnc.co.kr>:
> Hi, I am trying to use Spark SQL filter pushdown, and especially want to
> use row group skipping with Parquet files.
>
> And I guessed that I need a Parquet file with min/max statistics.
>
> ----
>
> On the Spark master branch, I tried to write a single column with "a", "b", "c"
> to Parquet file f1:
>
> scala> List("a", "b", "c").toDF("field1").coalesce(1).write.parquet("f1")
>
> But the saved file does not have statistics (min, max):
>
> $ ls f1/*.parquet
> f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
> $ parquet-tools meta f1/*.parquet
> file:        file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
> creator:     parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)
> extra:       org.apache.spark.sql.parquet.row.metadata =
>              {"type":"struct","fields":[{"name":"field1","type":"string","nullable":true,"metadata":{}}]}
>
> file schema: spark_schema
> --------------------------------------------------------------------------------
> field1: OPTIONAL BINARY O:UTF8 R:0 D:1
>
> row group 1: RC:3 TS:48 OFFSET:4
> --------------------------------------------------------------------------------
> field1: BINARY SNAPPY DO:0 FPO:4 SZ:50/48/0.96 VC:3 ENC:BIT_PACKED,RLE,PLAIN ST:[no stats for this column]
>
> ----
>
> Any pointer or comment would be appreciated.
> Thank you.
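
A likely explanation, for anyone hitting the same output: I believe parquet-mr
1.8.x deliberately omits min/max statistics for BINARY (string) columns,
because the old statistics were computed with signed byte-wise comparison,
which is incorrect for UTF8 data (see PARQUET-251 and PARQUET-686). Correctly
ordered min_value/max_value statistics for binary columns arrive, if I recall
correctly, with parquet-mr 1.10+ (PARQUET-1025). Numeric columns are not
affected, so a quick sanity check is to repeat the experiment with an Int
column. A minimal sketch, assuming the same spark-shell session as above; the
output path "f2" is just an illustrative name:

scala> // Same experiment as above, but with an Int column instead of String.
scala> // parquet-mr 1.8.x does write min/max stats for numeric types.
scala> List(1, 2, 3).toDF("field1").coalesce(1).write.parquet("f2")

$ parquet-tools meta f2/*.parquet

If this explanation is right, the field1 column chunk in the meta output for
f2 should carry an ST:[...] entry with min 1 and max 3, rather than "no stats
for this column".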