Spark 1.6.x does not add any STAT metadata to Parquet files, so every file is scanned when a filter is applied. For example:
(1 to 300000).map(i => (i, i.toString)).toDF("a", "b").sort("a").coalesce(1).write.format("parquet").saveAsTable("metrics")

./parquet-meta /user/hive/warehouse/metrics/*.parquet
file:        file:/user/hive/warehouse/metrics/part-r-00000-6552bc8f-ec05-4ce8-ad8d-dc22bdd2e502.gz.parquet
creator:     parquet-mr version 1.6.0
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}}]}

file schema: spark_schema
--------------------------------------------------------------------------------
a:           OPTIONAL INT32 R:0 D:1
b:           OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:300000 TS:4089139 OFFSET:4
--------------------------------------------------------------------------------
a:           INT32 GZIP DO:0 FPO:4 SZ:415087/1200095/2.89 VC:300000 ENC:BIT_PACKED,RLE,PLAIN
b:           BINARY GZIP DO:0 FPO:415091 SZ:667334/2889044/4.33 VC:300000 ENC:BIT_PACKED,RLE,PLAIN

On Tue, Jan 17, 2017 at 9:41 PM, Michael Segel <msegel_had...@hotmail.com> wrote:

> Hi,
> Lexicographically speaking, min/max should work because strings support
> comparison operators. So anything which supports comparison operators
> (<, >, <=, >=, ==, ...) can also support min and max functions as well.
>
> I guess the question is whether Spark supports this, and if not, why?
> Yes, it makes sense.
>
>
> > On Jan 17, 2017, at 9:17 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> >
> > Hello,
> >
> > I am not sure what you mean by min/max for strings; I do not know if
> > this makes sense. What the ORC format has is bloom filters for strings
> > etc. - are you referring to this?
> >
> > In order to apply min/max filters, Spark needs to read the metadata of
> > the file. Whether or not the filter is applied, you can see from the
> > number of bytes read.
> >
> > Best regards
> >
> >> On 17 Jan 2017, at 15:28, djiang <dji...@dataxu.com> wrote:
> >>
> >> Hi,
> >>
> >> I have been looking into how Spark stores statistics (min/max) in
> >> Parquet, as well as how it uses that info for query optimization.
> >> I have a few questions.
> >> First, the setup: Spark 2.1.0. The following sets up a DataFrame of
> >> 1000 rows, with one long-type column and one string-type column.
> >> They are sorted by different columns, though.
> >>
> >> scala> spark.sql("select id, cast(id as string) text from
> >> range(1000)").sort("id").write.parquet("/secret/spark21-sortById")
> >> scala> spark.sql("select id, cast(id as string) text from
> >> range(1000)").sort("text").write.parquet("/secret/spark21-sortByText")
> >>
> >> I added some code to parquet-tools to print out stats and examine the
> >> generated parquet files:
> >>
> >> hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar meta
> >> /secret/spark21-sortById/part-00000-39f7ac12-6038-46ee-b5c3-d7a5a06e4425.snappy.parquet
> >> file:        file:/secret/spark21-sortById/part-00000-39f7ac12-6038-46ee-b5c3-d7a5a06e4425.snappy.parquet
> >> creator:     parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
> >> extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"text","type":"string","nullable":false,"metadata":{}}]}
> >>
> >> file schema: spark_schema
> >> --------------------------------------------------------------------------------
> >> id:          REQUIRED INT64 R:0 D:0
> >> text:        REQUIRED BINARY O:UTF8 R:0 D:0
> >>
> >> row group 1: RC:5 TS:133 OFFSET:4
> >> --------------------------------------------------------------------------------
> >> id:          INT64 SNAPPY DO:0 FPO:4 SZ:71/81/1.14 VC:5 ENC:PLAIN,BIT_PACKED STA:[min: 0, max: 4, num_nulls: 0]
> >> text:        BINARY SNAPPY DO:0 FPO:75 SZ:53/52/0.98 VC:5 ENC:PLAIN,BIT_PACKED
> >>
> >> hadoop jar parquet-tools-1.9.1-SNAPSHOT.jar meta
> >> /secret/spark21-sortByText/part-00000-3d7eac74-5ca0-44a0-b8a6-d67cc38a2bde.snappy.parquet
> >> file:        file:/secret/spark21-sortByText/part-00000-3d7eac74-5ca0-44a0-b8a6-d67cc38a2bde.snappy.parquet
> >> creator:     parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
> >> extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"text","type":"string","nullable":false,"metadata":{}}]}
> >>
> >> file schema: spark_schema
> >> --------------------------------------------------------------------------------
> >> id:          REQUIRED INT64 R:0 D:0
> >> text:        REQUIRED BINARY O:UTF8 R:0 D:0
> >>
> >> row group 1: RC:5 TS:140 OFFSET:4
> >> --------------------------------------------------------------------------------
> >> id:          INT64 SNAPPY DO:0 FPO:4 SZ:71/81/1.14 VC:5 ENC:PLAIN,BIT_PACKED STA:[min: 0, max: 101, num_nulls: 0]
> >> text:        BINARY SNAPPY DO:0 FPO:75 SZ:60/59/0.98 VC:5 ENC:PLAIN,BIT_PACKED
> >>
> >> So the question is: why does Spark, particularly 2.1.0, only generate
> >> min/max for numeric columns, but not for string (BINARY) fields, even
> >> when the string field is included in the sort? Maybe I missed a
> >> configuration?
> >>
> >> The second issue is how I can confirm Spark is utilizing the min/max:
> >> scala> sc.setLogLevel("INFO")
> >> scala> spark.sql("select * from parquet.`/secret/spark21-sortById` where
> >> id=4").show
> >> I got many lines like this:
> >> 17/01/17 09:23:35 INFO FilterCompat: Filtering using predicate:
> >> and(noteq(id, null), eq(id, 4))
> >> 17/01/17 09:23:35 INFO FileScanRDD: Reading File path:
> >> file:///secret/spark21-sortById/part-00000-39f7ac12-6038-46ee-b5c3-d7a5a06e4425.snappy.parquet,
> >> range: 0-558, partition values: [empty row]
> >> ...
> >> 17/01/17 09:23:35 INFO FilterCompat: Filtering using predicate:
> >> and(noteq(id, null), eq(id, 4))
> >> 17/01/17 09:23:35 INFO FileScanRDD: Reading File path:
> >> file:///secret/spark21-sortById/part-00193-39f7ac12-6038-46ee-b5c3-d7a5a06e4425.snappy.parquet,
> >> range: 0-574, partition values: [empty row]
> >> ...
> >>
> >> The issue is that it looks like Spark is scanning every file, even
> >> though, from the min/max, it should be able to determine that only
> >> part-00000 has the relevant data. Or maybe I read it wrong, and Spark
> >> is in fact skipping the files? Or can Spark only use partition values
> >> for data skipping?
> >>
> >> Thanks,
> >>
> >> Dong
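
To actually confirm whether the min/max statistics help, keep in mind that Parquet predicate pushdown works at row-group granularity: Spark still lists and opens every file to read its footer, so the FileScanRDD lines above do not by themselves show whether row groups were skipped. As Jörn says, the number of bytes read is the better signal. Below is a minimal sketch for the Spark 2.1 shell, assuming the /secret/spark21-sortById data from above; the listener is just one illustrative way to total up input bytes (the stage input size in the web UI shows the same information):

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Total inputMetrics.bytesRead over every finished task.
val bytesRead = new AtomicLong(0L)
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.taskMetrics != null) {
      bytesRead.addAndGet(taskEnd.taskMetrics.inputMetrics.bytesRead)
    }
  }
})

// Rebuild the query on each run so the current pushdown setting is picked
// up when the physical plan is created.
def bytesForQuery(): Long = {
  bytesRead.set(0L)
  spark.sql("select * from parquet.`/secret/spark21-sortById` where id = 4").collect()
  Thread.sleep(1000) // listener events arrive asynchronously
  bytesRead.get
}

// The scan node of the plan should list the pushed predicate, e.g.
// PushedFilters: [IsNotNull(id), EqualTo(id,4)]
spark.sql("select * from parquet.`/secret/spark21-sortById` where id = 4").explain()

spark.conf.set("spark.sql.parquet.filterPushdown", "true") // the default
println(s"pushdown on:  ${bytesForQuery()} bytes")

spark.conf.set("spark.sql.parquet.filterPushdown", "false")
println(s"pushdown off: ${bytesForQuery()} bytes")

If the id statistics are honored, the run with pushdown enabled should read noticeably fewer bytes, since each of these small files holds a single row group that can be ruled out from its footer alone.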