steveloughran commented on PR #3452:
URL: https://github.com/apache/parquet-java/pull/3452#issuecomment-4157307880

   There's now a new benchmark which writes a file using the same simple schema 
as I'm using in Iceberg (https://github.com/apache/iceberg/pull/15629), and 
tries to do a projection on it.
   
   ```
    SELECT id, category, variant_get('nested.varcategory') FROM table
   ```
   
   Review by Copilot:
   
   ---
     Setup: 1M rows, 4-field nested variant (idstr, varid, varcategory, col4), 
querying varcategory only. SingleShotTime, 15 iterations, @Fork(0).
   
     Raw Results
   ```
   
     ┌───────────────────────────┬──────────┬───────────────┬─────────┬────────┐
     │ Benchmark                 │ shredded │ Score (ms/op) │ Error   │ µs/row │
     ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
     │ readAllRecords            │ false    │ 728.514       │ ±11.253 │ 0.729  │
     ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
     │ readProjectedFileSchema   │ false    │ 760.287       │ ±3.314  │ 0.760  │
     ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
     │ readProjectedLeanSchema   │ false    │ 1405.264      │ ±8.399  │ 1.405  │
     ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
     │ readAllRecords            │ true     │ 1315.615      │ ±14.598 │ 1.316  │
     ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
     │ readProjectedFileSchema   │ true     │ 1297.870      │ ±19.621 │ 1.298  │
     ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
     │ readProjectedLeanSchema   │ true     │ 725.618       │ ±10.574 │ 0.726  │
     └───────────────────────────┴──────────┴───────────────┴─────────┴────────┘
   ```
   
     Speedup/Penalty vs readAllRecords Baseline
   ```
   
     ┌───────────────────────────┬──────────────────┬──────────────────┐
     │ Benchmark                 │ shredded=false   │ shredded=true    │
     ├───────────────────────────┼──────────────────┼──────────────────┤
     │ readProjectedFileSchema   │ −4% (overhead)   │ +1% (noise)      │
     ├───────────────────────────┼──────────────────┼──────────────────┤
     │ readProjectedLeanSchema   │ −93% penalty     │ +45% speedup     │
     └───────────────────────────┴──────────────────┴──────────────────┘
   ```
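
The percentages above follow directly from the raw scores. Here is a minimal sketch of the arithmetic, assuming the sign convention that positive means faster than the readAllRecords baseline for the same shredded setting. The class and method names are made up for illustration and are not part of the benchmark.

```java
import java.util.Locale;

// Hypothetical helper, not part of the PR: recomputes the derived columns
// from the raw SingleShotTime scores quoted in the results table.
class ProjectionSpeedup {

    /** Percent change against the baseline; positive means less time (faster). */
    static double percentFaster(double baselineMs, double scoreMs) {
        return (baselineMs - scoreMs) / baselineMs * 100.0;
    }

    /** Converts a ms/op score into µs per row for the given row count. */
    static double microsPerRow(double scoreMs, long rows) {
        return scoreMs * 1000.0 / rows;
    }

    public static void main(String[] args) {
        // Baselines: readAllRecords for each value of the shredded parameter.
        double unshreddedAll = 728.514;
        double shreddedAll = 1315.615;

        System.out.printf(Locale.ROOT, "FileSchema, unshredded: %+.1f%%%n",
            percentFaster(unshreddedAll, 760.287));
        System.out.printf(Locale.ROOT, "LeanSchema, unshredded: %+.1f%%%n",
            percentFaster(unshreddedAll, 1405.264));
        System.out.printf(Locale.ROOT, "FileSchema, shredded:   %+.1f%%%n",
            percentFaster(shreddedAll, 1297.870));
        System.out.printf(Locale.ROOT, "LeanSchema, shredded:   %+.1f%%%n",
            percentFaster(shreddedAll, 725.618));

        // Absolute saving for lean + shredded, the "~590ms per 1M rows" figure.
        System.out.printf(Locale.ROOT, "Saved: %.0f ms%n", shreddedAll - 725.618);
    }
}
```

The two FileSchema differences (about 4% and 1%) are small next to the LeanSchema swings, which fits the reading that projecting the full file schema changes very little.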
   
   
   * Lean schema projection is the only technique that skips columns. 
Projecting the full file schema (readProjectedFileSchema) gives zero benefit in 
either case — Parquet still reads all column chunks.
   * Lean schema + shredded = 45% faster than reading all columns. Skipping 
idstr, varid, and col4 typed columns saves ~590ms per 1M rows.
   * Lean schema + unshredded = 93% slower. The lean schema requests 
typed_value.varcategory, which does not exist in the unshredded file. Parquet 
has to handle the missing columns for every row, which is more expensive than 
reading the single binary blob directly.
   * Schema detection in ReadSupport.init() is essential. Applying 
containsField("typed_value") to choose between lean and full schema prevents 
the unshredded penalty while preserving the shredded speedup.
   
     Recommendation
   
     Always detect file layout in ReadSupport.init() and apply the lean 
projection only when the file was written with a shredded schema. For 
unshredded files, use the full file schema or no projection.
   ----
   
   If you have a query with a pushdown predicate that wants to look inside a 
variant, creating a MessageType schema referring to the shredded values is 
counterproductive unless you know that the variant is shredded.
   
   That can be determined by looking at the file schema and calling 
`.containsField("typed_value")` on the target variant's group to see whether 
it has any shredded values.
   
   ```java
       @Override
       public ReadContext init(InitContext context) {
         MessageType fileSchema = context.getFileSchema();
         GroupType nested = fileSchema.getType("nested").asGroupType();
         if (nested.containsField("typed_value")) {
           return new ReadContext(VARCATEGORY_PROJECTION);
         }
         // Unshredded file: a projection designed for typed columns provides no
         // benefit and causes schema mismatch overhead, so fall back to the
         // full file schema.
         return new ReadContext(fileSchema);
       }
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

