steveloughran commented on PR #3452: URL: https://github.com/apache/parquet-java/pull/3452#issuecomment-4157307880
There's now a new benchmark which writes a file using the same simple schema as I'm using in iceberg https://github.com/apache/iceberg/pull/15629 , and tries to do a projection on it.

```
SELECT id, category, variant_get('nested.varcategory') FROM table
```

Review by Copilot

---

Setup: 1M rows, 4-field nested variant (idstr, varid, varcategory, col4), querying varcategory only. SingleShotTime, 15 iterations, @Fork(0).

Raw Results

```
┌───────────────────────────┬──────────┬───────────────┬─────────┬────────┐
│ Benchmark                 │ shredded │ Score (ms/op) │ Error   │ µs/row │
├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
│ readAllRecords            │ false    │ 728.514       │ ±11.253 │ 0.729  │
├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
│ readProjectedFileSchema   │ false    │ 760.287       │ ±3.314  │ 0.760  │
├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
│ readProjectedLeanSchema   │ false    │ 1405.264      │ ±8.399  │ 1.405  │
├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
│ readAllRecords            │ true     │ 1315.615      │ ±14.598 │ 1.316  │
├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
│ readProjectedFileSchema   │ true     │ 1297.870      │ ±19.621 │ 1.298  │
├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
│ readProjectedLeanSchema   │ true     │ 725.618       │ ±10.574 │ 0.726  │
└───────────────────────────┴──────────┴───────────────┴─────────┴────────┘
```

Speedup/Penalty vs readAllRecords Baseline

```
┌───────────────────────────┬──────────────────┬──────────────────┐
│ Benchmark                 │ shredded=false   │ shredded=true    │
├───────────────────────────┼──────────────────┼──────────────────┤
│ readProjectedFileSchema   │ −4% (overhead)   │ +1% (noise)      │
├───────────────────────────┼──────────────────┼──────────────────┤
│ readProjectedLeanSchema   │ −93% penalty     │ +45% speedup     │
└───────────────────────────┴──────────────────┴──────────────────┘
```

* Lean schema projection is the only technique that skips columns. Projecting the full file schema (readProjectedFileSchema) gives zero benefit in either case: Parquet still reads all column chunks.
* Lean schema + shredded takes 45% less time than reading all columns. Skipping the idstr, varid, and col4 typed columns saves ~590ms per 1M rows.
* Lean schema + unshredded is 93% slower. The lean schema requests typed_value.varcategory, which does not exist in the unshredded file. Parquet handles the missing columns at every row, which is more expensive than reading the single binary blob directly.
* Schema detection in ReadSupport.init() is essential. Applying containsField("typed_value") to choose between the lean and full schema prevents the unshredded penalty while preserving the shredded speedup.

Recommendation

Always detect the file layout in ReadSupport.init() and apply the lean projection only when the file was written with a shredded schema. For unshredded files, use the full file schema or no projection.

----

If you have a query with a pushdown predicate that wants to look inside a variant, creating a MessageType schema referring to the shredded values is counterproductive unless you know that the variant is shredded. That can be determined by looking at the file schema and using `.containsField("typed_value")` to see whether the target variant has a `typed_value` field, i.e. was written shredded.

```java
import org.apache.parquet.hadoop.api.InitContext;
import org.apache.parquet.hadoop.api.ReadSupport.ReadContext;
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.MessageType;

@Override
public ReadContext init(InitContext context) {
  MessageType fileSchema = context.getFileSchema();
  GroupType nested = fileSchema.getType("nested").asGroupType();
  // A shredded variant carries a typed_value group; only then is the lean projection safe.
  if (nested.containsField("typed_value")) {
    return new ReadContext(VARCATEGORY_PROJECTION);
  }
  // Unshredded file: a projection designed for typed columns provides no benefit and
  // causes schema mismatch overhead, so fall back to the full file schema.
  return new ReadContext(fileSchema);
}
```
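
For anyone who wants to re-derive the percentages from the raw scores, here is a minimal sketch. `ProjectionTimings` and its method names are hypothetical helpers made up for this comment; they are not part of the benchmark or of parquet-java. The inputs are the ms/op values from the results table, and the row count is the 1M from the setup line.

```java
/** Hypothetical helper: relative timings derived from the ms/op scores in the results table. */
final class ProjectionTimings {

    /** Percent change in elapsed time: negative means less time than the baseline, positive means more. */
    static double percentChange(double baselineMs, double measuredMs) {
        return (measuredMs - baselineMs) / baselineMs * 100.0;
    }

    /** Per-row cost in microseconds for a run over the given number of rows. */
    static double microsPerRow(double totalMs, long rows) {
        return totalMs * 1000.0 / rows;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;

        // readAllRecords baselines, taken from the results table.
        double unshreddedAll = 728.514;
        double shreddedAll = 1315.615;

        // readProjectedLeanSchema scores, taken from the results table.
        double unshreddedLean = 1405.264;
        double shreddedLean = 725.618;

        System.out.printf("unshredded lean: %+.1f%% time, %.3f us/row%n",
                percentChange(unshreddedAll, unshreddedLean),
                microsPerRow(unshreddedLean, rows));
        System.out.printf("shredded lean:   %+.1f%% time, %.3f us/row%n",
                percentChange(shreddedAll, shreddedLean),
                microsPerRow(shreddedLean, rows));
        System.out.printf("shredded lean saves %.0f ms per run%n",
                shreddedAll - shreddedLean);
    }
}
```

Note the sign convention differs from the summary table: the helper reports change in elapsed time, so the "+45% speedup" shows up as roughly 45% less time and the "−93% penalty" as roughly 93% more time. The two are not symmetric ratios; 45% less time corresponds to about a 1.8x throughput gain.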
