[ 
https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485215#comment-17485215
 ] 

Cheng Lian commented on SPARK-37980:
------------------------------------

[~prakharjain09], as you've mentioned, customizing the Parquet code paths in 
Spark to achieve this goal is not straightforward. At the same time, this 
functionality is quite useful in general: I can imagine it enabling other 
systems in the Parquet ecosystem to build more sophisticated indexing 
solutions. Instead of doing heavy customization in Spark, would it be better 
to make the changes in upstream {{parquet-mr}} so that other systems can 
benefit from them more easily?

> Extend METADATA column to support row indices for file based data sources
> -------------------------------------------------------------------------
>
>                 Key: SPARK-37980
>                 URL: https://issues.apache.org/jira/browse/SPARK-37980
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3
>            Reporter: Prakhar Jain
>            Priority: Major
>
> Spark recently added hidden metadata column support for file-based 
> data sources as part of SPARK-37273.
> We should extend it to also support ROW_INDEX/ROW_POSITION.
>  
> Meaning of ROW_POSITION:
> ROW_INDEX/ROW_POSITION is the index of a row within a file. E.g., the 5th 
> row in the file will have ROW_INDEX 5.
>  
> Use cases: 
> Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple 
> uniquely identifies a row in a table. This information can be used to mark 
> rows, e.g. by an indexer.
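
The use case in the description can be illustrated with a minimal sketch in plain Python (not Spark or parquet-mr code). It shows how a (fileName, rowIndex) pair identifies a row and how a set of such pairs can mark rows. The file names, the helper `with_row_ids`, and the `start` parameter are hypothetical; in particular, whether numbering starts at 0 or 1 is an assumption left configurable here.

```python
from typing import Dict, Iterator, List, Set, Tuple

# A row identifier: (file name, index of the row within that file).
RowId = Tuple[str, int]


def with_row_ids(
    files: Dict[str, List[dict]], start: int = 0
) -> Iterator[Tuple[RowId, dict]]:
    """Yield ((file_name, row_index), row) for every row of every file.

    `start` is the index given to the first row of each file; the
    numbering base is an assumption of this sketch.
    """
    for file_name, rows in files.items():
        for row_index, row in enumerate(rows, start=start):
            yield (file_name, row_index), row


# Hypothetical table consisting of two data files.
table = {
    "part-0000.parquet": [{"id": 10}, {"id": 11}, {"id": 12}],
    "part-0001.parquet": [{"id": 20}, {"id": 21}],
}

# Every row gets an identifier that is unique across the whole table.
row_ids = [rid for rid, _ in with_row_ids(table)]

# "Marking" rows: e.g. an indexer records identifiers of rows to skip.
marked: Set[RowId] = {("part-0000.parquet", 1)}
visible = [row for rid, row in with_row_ids(table) if rid not in marked]
```

The row index alone repeats across files, so the file name is what makes the pair unique within the table.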



