[GitHub] [parquet-format] raduteo commented on pull request #164: PARQUET-1950: Define core features
raduteo commented on pull request #164: URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771239827 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [parquet-format] raduteo commented on pull request #164: PARQUET-1950: Define core features
raduteo commented on pull request #164: URL: https://github.com/apache/parquet-format/pull/164#issuecomment-772102343

> +1 to @emkornfield's comment - the intent of this is to establish a clear baseline about what is supported widely in practice - there are a bunch of Parquet features that are in the standard but are hard to use in practice because they don't have read support from other implementations. I think it should ultimately make it easier to get adoption on new features because the status of each feature will be clearer.

Thank you @emkornfield and @timarmstrong for the clarifications! Btw, I am 100% in favor of the current initiative. I can relate to the world of pain one has to go through navigating parquet incompatibilities, and I can definitely see how this can mitigate those issues while allowing the standard and underlying implementations to evolve.
[GitHub] [parquet-format] raduteo commented on pull request #164: PARQUET-1950: Define core features
raduteo commented on pull request #164: URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771239827

@gszadovszky and @emkornfield it's highly coincidental that I was just looking into cleaning up apache/arrow#8130 when I noticed this thread. External column chunk support is one of the key features that attracted me to parquet in the first place, and I would like the chance to lobby for keeping it and actually expanding its adoption - I already have the complete PR mentioned above and I can help with supporting it across other implementations.

There are a few major domains where I see this as a valuable component:

1. Allowing concurrent reads of fully flushed row groups while a parquet file is still being appended to. A slight variant of this is allowing subsequent row group appends to a parquet file without impacting potential readers.
2. Being able to aggregate multiple data sets in a master parquet file: one scenario is cumulative recordings, like stock prices that get collected daily and need to be presented as one unified historical file; another is the case of enrichment, where we want to add new columns to an existing data set.
3. Allowing for bi-temporal changes to a parquet file: external column chunks allow one to apply small corrections by creating delta files and new footers that simply swap out the chunks that require changes and point to the new ones.

If the above use cases are addressed by other parquet overlays, or they don't line up with the intended usage of parquet, I can look elsewhere, but it seems like a huge opportunity, and the development cost of supporting it is quite minor by comparison.
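To make use case 3 above concrete, here is a minimal sketch of the footer-swap idea. It is a toy in-memory model, not the parquet-format Thrift structures or any library's API: the class names `ColumnChunk`, `RowGroup` and `Footer`, the helper `swap_chunk`, and the file name `corrections.parquet` are all hypothetical. The only piece borrowed from the format is the notion that a column chunk can carry an optional `file_path` pointing at another file.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ColumnChunk:
    column: str                      # column name (simplified; hypothetical model)
    file_offset: int                 # where the chunk starts in its file
    file_path: Optional[str] = None  # None means "same file as the footer"


@dataclass(frozen=True)
class RowGroup:
    columns: Tuple[ColumnChunk, ...]


@dataclass(frozen=True)
class Footer:
    row_groups: Tuple[RowGroup, ...]


def swap_chunk(footer: Footer, row_group_index: int, column: str,
               delta_path: str, delta_offset: int) -> Footer:
    """Return a new footer in which one chunk points into a delta file.

    The original footer is left untouched, so readers holding it keep
    seeing the uncorrected data.
    """
    groups = list(footer.row_groups)
    old_group = groups[row_group_index]
    new_columns = tuple(
        replace(c, file_path=delta_path, file_offset=delta_offset)
        if c.column == column else c
        for c in old_group.columns
    )
    groups[row_group_index] = RowGroup(new_columns)
    return Footer(tuple(groups))


# Usage: one row group with two columns, then a correction to "price".
base = Footer((RowGroup((ColumnChunk("symbol", 4), ColumnChunk("price", 1024))),))
corrected = swap_chunk(base, 0, "price", "corrections.parquet", 4)
```

In this model the corrected footer shares every unchanged chunk with the base footer, and only the replaced chunk refers to the delta file, which is what keeps the correction small.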