alamb opened a new issue, #16374:
URL: https://github.com/apache/datafusion/issues/16374

   ### Is your feature request related to a problem or challenge?
   
   One of the common criticisms of parquet based query systems is that they 
don't have some particular type of index (e.g. HyperLogLog and more specialized 
/ advanced structures)
   
   I have written extensively about why these arguments are not compelling to 
me, for example: Accelerating Query Performance of Apache Parquet using 
Specialized Indexes: https://youtu.be/74YsJT1-Rdk
   
   Here are relevant examples in datafusion of how to use such indexes:
   
   * 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
   * 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
   
   However, both of those examples use "external indexes" -- the index is 
stored separately from the parquet file. 
   
   Managing the index information separately from the parquet file is likely more 
operationally complex (as you now have to keep 2 files in sync, for example) 
and this is sometimes cited (again!) as a reason we need a new file format. For 
example, here is a recent post to this effect from amudai: 
   
https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md#extensible-metadata-and-hierarchical-statistics
   
   > Parquet lacks a standardized and extensible mechanism for augmenting data 
with index artifacts, most notably inverted term indexes for full-text search. 
While workarounds exist, such as maintaining indexes and their metadata outside 
of Parquet, these solutions quickly become complex, fragile, and difficult to 
manage
   
   However, there is no reason you can't add such an index *inside* a parquet 
file as well (though other readers will not know how to use it)
   
   
   ### Describe the solution you'd like
   
   I would like an example that shows how to write and read a specialized index 
*inside* a parquet file
   
   
   
   ### Describe alternatives you've considered
   
   Ideally I would love to see a full text inverted index stored in the parquet 
file but that might be too much for an example
   
   Something simpler might be a "distinct values" type index. I think a good 
example might be:
   
   1. Read an existing parquet file, and compute distinct values (using a 
DataFusion plan perhaps) for one column
   2. Write a new parquet file that includes the index (write the index bytes 
to the file somewhere and then add [custom key/value 
metadata](https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_key_value_metadata)
 to the parquet footer that references it)
   3. Show how to open the parquet file, read the footer metadata, use the 
custom metadata to find the special index, and decode it. 
   
   Basically something like this:
   
   ```text
       Example creating parquet file that                      
     contains specialized indexes that are                     
            ignored by other readers                           
                                                               
                                                               
                                                               
            ┌──────────────────────┐                           
            │┌───────────────────┐ │                           
            ││     DataPage      │ │      Standard Parquet     
            │└───────────────────┘ │      Data / pages         
            │┌───────────────────┐ │                           
            ││     DataPage      │ │                           
            │└───────────────────┘ │                           
            │        ...           │                           
            │                      │                           
            │┌───────────────────┐ │                           
            ││     DataPage      │ │                           
            │└───────────────────┘ │                           
            │┏━━━━━━━━━━━━━━━━━━━┓ │                           
            │┃                   ┃ │        key/value metadata 
            │┃   Special Index   ┃◀┼ ─ ─    that points at the 
            │┃                   ┃ │     │  special index      
            │┗━━━━━━━━━━━━━━━━━━━┛ │                           
            │╔═══════════════════╗ │     │                     
            │║                   ║ │                           
            │║  Parquet Footer   ║ │     │  Footer includes    
            │║                   ║ ┼ ─ ─ ─  thrift-encoded     
            │║                   ║ │        ParquetMetadata    
            │╚═══════════════════╝ │                           
            └──────────────────────┘                           
                                                               
                  Parquet File                                 
   ```
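
The three steps and the layout above could be sketched roughly as follows. This is a std-only illustration, not the requested example: it does not use the `parquet` crate, the `SketchFile` struct only stands in for a real parquet file and its footer, and the metadata key names and the newline-separated index encoding are assumptions made up for this sketch. A real version would write the index bytes through the parquet writer and record the location with the key/value metadata API linked in step 2.

```rust
use std::collections::{BTreeSet, HashMap};

// Hypothetical footer keys; a real file could use any agreed-upon names.
const INDEX_OFFSET_KEY: &str = "distinct_index_offset";
const INDEX_LENGTH_KEY: &str = "distinct_index_length";

/// Stand-in for a parquet file: raw bytes plus footer key/value metadata.
struct SketchFile {
    bytes: Vec<u8>,
    footer_kv: HashMap<String, String>,
}

/// Assumed encoding: distinct values as newline-separated UTF-8.
/// (Values containing '\n' would need a different encoding.)
fn encode_index(values: &BTreeSet<String>) -> Vec<u8> {
    let parts: Vec<&str> = values.iter().map(|s| s.as_str()).collect();
    parts.join("\n").into_bytes()
}

/// Steps 1 and 2: compute distinct values, append the index after the data
/// pages, and record where it lives in the footer metadata.
fn write_with_index(data_pages: &[u8], column: &[&str]) -> SketchFile {
    let distinct: BTreeSet<String> = column.iter().map(|s| s.to_string()).collect();
    let index = encode_index(&distinct);

    let mut bytes = data_pages.to_vec();
    let offset = bytes.len(); // the index starts right after the data pages
    bytes.extend_from_slice(&index);

    let mut footer_kv = HashMap::new();
    footer_kv.insert(INDEX_OFFSET_KEY.to_string(), offset.to_string());
    footer_kv.insert(INDEX_LENGTH_KEY.to_string(), index.len().to_string());
    SketchFile { bytes, footer_kv }
}

/// Step 3: use the footer metadata to locate and decode the index.
/// Returns None if the keys are missing or malformed, which is what a
/// reader that does not know about the index would effectively see.
fn read_index(file: &SketchFile) -> Option<BTreeSet<String>> {
    let offset: usize = file.footer_kv.get(INDEX_OFFSET_KEY)?.parse().ok()?;
    let length: usize = file.footer_kv.get(INDEX_LENGTH_KEY)?.parse().ok()?;
    let end = offset.checked_add(length)?;
    let raw = file.bytes.get(offset..end)?;
    let text = std::str::from_utf8(raw).ok()?;
    Some(text.lines().map(String::from).collect())
}

fn main() {
    let file = write_with_index(b"<data pages>", &["a", "b", "a", "c"]);
    let index = read_index(&file).expect("index should be present");
    // A query with `col = 'z'` could skip this file entirely.
    println!("may contain 'b': {}", index.contains("b"));
    println!("may contain 'z': {}", index.contains("z"));
}
```

Storing only an offset and length in the footer keeps the metadata small, and readers that do not recognize the keys simply never look at the extra bytes.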
   
   
   
   ### Additional context
   
   _No response_



