[GitHub] [parquet-format] raduteo commented on pull request #164: PARQUET-1950: Define core features

2021-02-02 Thread GitBox


raduteo commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771239827







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-format] raduteo commented on pull request #164: PARQUET-1950: Define core features

2021-02-02 Thread GitBox


raduteo commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-772102343


   > +1 to @emkornfield's comment - the intent of this is to establish a clear 
baseline about what is supported widely in practice - there are a bunch of 
Parquet features that are in the standard but are hard to use in practice 
because they don't have read support from other implementations. I think it 
should ultimately make it easier to get adoption on new features because the 
status of each feature will be clearer.
   
   Thank you @emkornfield and @timarmstrong for the clarifications! 
   Btw, I am 100% in favor of the current initiative. I can relate to the 
world of pain one has to go through navigating Parquet incompatibilities, and I 
can definitely see how this can mitigate those issues while allowing the 
standard and underlying implementations to evolve.







[GitHub] [parquet-format] raduteo commented on pull request #164: PARQUET-1950: Define core features

2021-02-01 Thread GitBox


raduteo commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771239827


   @gszadovszky and @emkornfield, by coincidence I was just looking into 
cleaning up apache/arrow#8130 when I noticed this thread.
   External column chunk support is one of the key features that attracted me 
to Parquet in the first place, and I would like the chance to lobby for keeping 
it and actually expanding its adoption. The PR mentioned above is already 
complete, and I can help with supporting it across other implementations.
   There are a few major domains where I see this as a valuable component:
   1. Allowing concurrent reads of fully flushed row groups while the Parquet 
file is still being appended to. A slight variant of this is allowing subsequent 
row group appends to a Parquet file without impacting potential readers.
   2. Being able to aggregate multiple data sets in a master Parquet file. One 
scenario is cumulative recordings, like stock prices that get collected daily 
and need to be presented as one unified historical file; another is enrichment, 
where we want to add new columns to an existing data set.
   3. Allowing for bi-temporal changes to a Parquet file: external column chunks 
allow one to apply small corrections by creating delta files and new footers 
that swap out the chunks that require changes and point to the new ones.
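   A rough sketch of how use case 3 could look at the metadata level. This is 
not code from the PR: the classes and the `patch_chunk` helper are hypothetical 
stand-ins for the footer structures, and the file names are made up. The only 
thing taken from the format itself is that a column chunk carries an optional 
`file_path` next to its offset.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ColumnChunk:
    # Hypothetical, simplified stand-in for a column chunk entry in the footer.
    column: str
    file_path: Optional[str]  # None means "stored in the same file as the footer"
    file_offset: int


@dataclass(frozen=True)
class RowGroup:
    columns: Tuple[ColumnChunk, ...]


@dataclass(frozen=True)
class Footer:
    row_groups: Tuple[RowGroup, ...]


def patch_chunk(footer: Footer, row_group_index: int, column: str,
                delta_path: str, delta_offset: int) -> Footer:
    """Return a new footer in which one chunk points into a delta file.

    The original footer is left untouched, so both versions stay readable.
    """
    patched = []
    for i, row_group in enumerate(footer.row_groups):
        if i != row_group_index:
            patched.append(row_group)  # unchanged row groups are shared as-is
            continue
        columns = tuple(
            replace(c, file_path=delta_path, file_offset=delta_offset)
            if c.column == column else c
            for c in row_group.columns
        )
        patched.append(RowGroup(columns=columns))
    return Footer(row_groups=tuple(patched))


# Two row groups, each with a "price" and a "volume" chunk in the main file.
base = Footer(row_groups=(
    RowGroup(columns=(ColumnChunk("price", None, 4),
                      ColumnChunk("volume", None, 1024))),
    RowGroup(columns=(ColumnChunk("price", None, 2048),
                      ColumnChunk("volume", None, 3072))),
))

# Correct the "price" chunk of the second row group via a small delta file.
corrected = patch_chunk(base, 1, "price", "corrections/delta-0001.parquet", 4)
```

   Readers holding the old footer keep seeing the original data, while readers 
given `corrected` see the fix; only the one chunk reference differs.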
   
   If the above use cases are addressed by other Parquet overlays, or they 
don't line up with the intended usage of Parquet, I can look elsewhere, but it 
seems like a huge opportunity, and the development cost of supporting it is 
quite minor by comparison.


