Re: Interest in Parquet V3

2024-05-20 Thread Parth Chandra
Hi Parquet team,

 It is very exciting to see this effort. Thanks Micah for starting this.

 For most use cases that our team sees, the broad areas for improvement
appear to be:
   1) Optimizing for cloud storage (latency is high, seeks are expensive)
   2) Optimized metadata reading - we've seen 30% (sometimes more) of
Spark's scan operator time spent in reading footers.
   3) Anything that improves support for data lakes.

  Also I'll be happy to help wherever I can.

Parth
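
A note on point 2, as a minimal sketch rather than any particular reader's
implementation: a Parquet file ends with the serialized footer metadata,
followed by a 4-byte little-endian footer length and the 4-byte magic `PAR1`.
A reader that does not already know the footer size has to fetch the tail of
the file before it can locate the metadata, which is one reason footer reads
are sensitive to object-store latency. The function name `footer_range` and
the sizes used below are hypothetical.

```python
import struct

MAGIC = b"PAR1"  # trailing magic bytes of an unencrypted-footer Parquet file


def footer_range(tail: bytes, file_size: int) -> tuple[int, int]:
    """Return (offset, length) of the footer metadata.

    `tail` must be the last 8 bytes of the file:
    4-byte little-endian footer length followed by the magic bytes.
    """
    if len(tail) != 8 or tail[4:] != MAGIC:
        raise ValueError("tail does not look like the end of a Parquet file")
    (length,) = struct.unpack("<I", tail[:4])
    offset = file_size - 8 - length
    if offset < len(MAGIC):  # the file also starts with the 4-byte magic
        raise ValueError("footer length is larger than the file allows")
    return offset, length


# Hypothetical example: a 5000-byte file whose footer is 1000 bytes long.
example_tail = struct.pack("<I", 1000) + MAGIC
offset, length = footer_range(example_tail, file_size=5000)
```

On a remote store this means one range read for the tail and, unless that read
was large enough to include the footer, a second range read for the metadata
itself.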

On Sun, May 19, 2024 at 10:59 AM Xinli shang 
wrote:

> Sorry I am late to the party! It's great to see this discussion! Thank you
> everyone for the many good points and thank you, Micah, for starting the
> discussion and putting it together into a document, which is very helpful!
> I agree with most of the points we discussed above, and we need to improve
> Parquet and sometimes even speed up to catch up with industry changes.
>
> With that said, we need people to work on it, as Julien mentioned. The
> document
> <https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit>
> that Micah created covers pretty much everything we discussed here. I
> encourage all of us to contribute by raising questions, providing
> suggestions, adding missing functionality, etc. Once we reach a consensus
> on each topic, we can create different tracks and working streams to kick
> off the implementations.
>
> I believe continuously improving Parquet would benefit the industry more
> than creating a new format, which could add friction. These improvement
> ideas are exciting opportunities. If you, your team members, or friends
> have time and interest, please encourage them to contribute.
>
> Our Parquet community meeting is next week, on May 28, 2024. We can have
> discussions there if you can join. Currently, it is scheduled for 7:00 am
> PDT, but I can change it according to the majority's availability.
>
> On Fri, May 17, 2024 at 3:58 PM Rok Mihevc  wrote:
>
> > Hi all,
> >
> > I've discussed with my colleagues and we would dedicate two engineers for
> > 4-6 months on tasks related to implementing the format changes. We're
> > already active in design discussions and can help with C++, Rust and C#
> > implementations. I thought it'd be good to state this explicitly FWIW.
> >
> > Our main areas of interest are efficient reads for tables with wide
> schemas
> > and faster random rowgroup access [1].
> >
> > To work around the wide schemas issue we actually implemented an internal
> > tool [2] for storing index information into a separate file which allows
> > for reading only the necessary subset of metadata. We would offer this
> > approach for consideration as a possible way to solve the wide schema
> > problem.
> >
> > [1] https://github.com/apache/arrow/issues/39676
> > [2] https://github.com/G-Research/PalletJack
> >
> > Rok
> >
> > On Sun, May 12, 2024 at 12:59 AM Micah Kornfield 
> > wrote:
> >
> > > Hi Parquet Dev,
> > > I wanted to start a conversation within the community about working on
> a
> > > new revision of Parquet.  For context there have been a bunch of new
> > > formats [1][2][3] that show there is decent room for improvement across
> > > data encodings and how metadata is organized.
> > >
> > > Specifically, in a new format revision I think we should be thinking
> > about
> > > the following areas for improvements:
> > > 1.  More efficient encodings that allow for data skipping and SIMD
> > > optimizations.
> > > 2.  More efficient metadata handling for deserialization and projection
> > to
> > > address areas when metadata deserialization time is not trivial [4].
> > > 3.  Possibly thinking about different encodings instead of
> > > repetition/definition for repeated and nested fields
> > > 4.  Support for optimizing semi-structured data (e.g. JSON or Variant
> > type)
> > > that can shred elements into individual columns (a recent thread in
> > Iceberg
> > > mentions doing this at the metadata level [5])
> > >
> > > I think the goals of V3 would be to provide existing API compatibility
> as
> > > broadly as possible (possibly with some performance loss) and expose
> new
> > > API surface areas where appropriate to make use of new elements.  New
> > > encodings could be backported so they can be made use of without
> metadata
> > > changes.  I think unfortunately that for points 2 and 3 we would want
> to
> > > break file level compatibility.  More thought would be needed to
> consider
> > > whether 4 could be backported effectively.
> > >
> > > This is a non-trivial amount of work to get good coverage across
> > > implementations, so before putting together a more formal proposal it
> > > would be nice to know:
> > >
> > > 1.  If there is an appetite in the general community to consider these
> > > changes
> > > 2.  If anybody from the community is interested in collaborating on
> > > proposals/implementation in this area.
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1] 

Next community call

2024-05-20 Thread Felix Cheung
Hi folks, how can I find information about how to join?


Re: [DISCUSS] rename parquet-mr to parquet-java?

2024-05-20 Thread Julien Le Dem
Thank you Andrew!

On Mon, May 20, 2024 at 7:05 AM Andrew Lamb  wrote:

> Here is the infrastructure ticket with the request to rename the
> repository: https://issues.apache.org/jira/browse/INFRA-25802

Re: [DISCUSS] rename parquet-mr to parquet-java?

2024-05-20 Thread Andrew Lamb
Here is the infrastructure ticket with the request to rename the
repository: https://issues.apache.org/jira/browse/INFRA-25802

On Fri, May 17, 2024 at 1:28 PM Prem Sahoo  wrote:

> +1 as it will be an apt name.
> Sent from my iPhone
>
> > On May 17, 2024, at 12:32 PM, Daniel Weeks  wrote:
> >
> > +1 agree, much cleaner naming
> >
> > -Dan
> >
> >> On Fri, May 17, 2024 at 8:46 AM Chao Sun  wrote:
> >>
> >> +1 too. The name has been confusing for a very long time.
> >>
> >>> On Fri, May 17, 2024 at 8:40 AM Fokko Driesprong 
> wrote:
> >>>
> >>> +1 - I think it is much clearer to anyone.
> >>>
> >>> GitHub will handle all the redirects from the old to the new name, so
> no
> >>> reason from my end to not rename it :)
> >>>
> >>> Cheers, Fokko
> >>>
> >>> On Fri, May 17, 2024 at 17:30 Julien Le Dem wrote:
> >>>
>  +1
>  I should have named it that to start with.
> 
> 
>  On Fri, May 17, 2024 at 3:27 AM Wang, Yuming  >>>
>  wrote:
> 
> > +10086
> >
> > From: Uwe L. Korn 
> > Date: Thursday, May 16, 2024 at 15:41
> > To: dev@parquet.apache.org 
> > Subject: Re: [DISCUSS] rename parquet-mr to parquet-java?
> > External Email
> >
> > very heavy +1
> >
> > This would help a lot.
> >
> > On Thu, May 16, 2024, at 4:19 AM, Gang Wu wrote:
> >> +1 on renaming the repo to reduce confusion.
> >>
> >> However, the java library still uses the "parquet-mr" prefix to write
> >> its application version [1] and it is consumed by downstream projects
> >> like parquet-cpp [2] as well.
> >>
> >> [1]
> >> https://github.com/search?q=repo%3Aapache%2Fparquet-mr+parquet-mr+language%3AJava&type=code&l=Java
> >>
> >> [2]
> >> https://github.com/search?q=repo%3Aapache%2Farrow+parquet-mr+language%3AC%2B%2B+&type=code
> >>
> >>
> >> Best,
> >> Gang
> >>
> >> On Thu, May 16, 2024 at 12:47 AM Vinoo Ganesh <
> >>> vinoo.gan...@gmail.com>
> >> wrote:
> >>
> >>> +1, I think this will make things a lot clearer! (non-binding)
> >>>
> >>> 
> >>>
> >>>
> >>> On Wed, May 15, 2024 at 12:36 PM Jacques Nadeau <
> >> jacq...@apache.org
> 
> >>> wrote:
> >>>
>  +1000
> 
>  On Wed, May 15, 2024 at 6:30 AM Andrew Lamb <
> >>> andrewlam...@gmail.com
> >
>  wrote:
> 
> > Julien had a great suggestion[1] to rename the parquet-mr repository to
> > parquet-java to reduce confusion about its content.
> >
> >> This looks great. Thank you for taking the initiative. Hadoop is not
> >> required indeed. Perhaps at some point we should rename parquet-mr to
> >> parquet-java?
> >
> > Having just renamed https://github.com/apache/arrow-datafusion to
> > https://github.com/apache/datafusion I think this would be a relatively
> > painless experience as all existing links still work
> >
> > I filed a ticket here
> 

Re: [DISCUSS] Propose changing the default branch of the parquet-site repo

2024-05-20 Thread Andrew Lamb
I have filed an issue[1] with this request


[1] https://issues.apache.org/jira/browse/INFRA-25801

On Wed, May 15, 2024 at 6:54 PM Julien Le Dem  wrote:

> +1
>
> On Wed, May 15, 2024 at 4:15 AM Andrew Lamb 
> wrote:
>
> > I plan to wait until next week to allow anyone else who has an opinion to
> > share it here and then, assuming no objections, will file a ticket with
> > ASF Infra.
> >
> > Andrew
> >
> > On Sun, May 12, 2024 at 3:57 AM Uwe L. Korn  wrote:
> >
> > > +1
> > >
> > > On Sun, May 12, 2024, at 9:31 AM, Gang Wu wrote:
> > > > +1
> > > >
> > > > This makes sense. I was also confused when I had access to
> > > > parquet-site for the first time.
> > > >
> > > > Thanks Andrew!
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Sun, May 12, 2024 at 3:15 AM Vinoo Ganesh  >
> > > wrote:
> > > >
> > > >> +1, this would be great. It's something Xinli and I discussed when
> we
> > > first
> > > >> made the website updates, but it ended up falling off of the list.
> It
> > > would
> > > >> be great to have this updated.
> > > >>
> > > >> 
> > > >>
> > > >>
> > > >> On Sat, May 11, 2024 at 8:52 PM Andrew Lamb  >
> > > >> wrote:
> > > >>
> > > >> > Hello,
> > > >> >
> > > >> > I would like to propose changing the default branch of the
> > > >> > parquet-site repo from `asf-site` to `production`
> > > >> >
> > > >> > The `asf-site` branch hosts the static files of the site (aka what
> > > >> > is built from the source in the `development` branch). Thus, since
> > > >> > it is the default branch, that is what appears when people open the
> > > >> > parquet-site[1] repo
> > > >> >
> > > >> > I made a PR to update the readme in the asf-site branch[2] but I
> > > >> > think it would be better if we changed the default branch to
> > > >> > production. This requires an INFRA JIRA ticket[3], which I am happy
> > > >> > to file, but wanted to discuss here first.
> > > >> >
> > > >> > Andrew Lamb
> > > >> > (Apache DataFusion/Arrow PMC, ASF member)
> > > >> >
> > > >> > p.s.  my not-so-secret agenda is to improve the adoption of the
> > > >> > parquet file format by helping with communication and coordination.
> > > >> > The parquet.apache.org website plays a key role in this, and thus I
> > > >> > want to help lower the barrier to help maintain (and update) it.
> > > >> >
> > > >> >
> > > >> > [1]: https://github.com/apache/parquet-site
> > > >> > [2]: https://github.com/apache/parquet-site/pull/57
> > > >> > [3]:
> > > >> > https://github.com/apache/infrastructure-asfyaml?tab=readme-ov-file#default_branch
> > > >> >
> > > >>
> > >
> >
>


Re: Is Parquet Meant As a Standalone Database or is a Catalog/Metastore Required?

2024-05-20 Thread Uwe L. Korn
Hello all,

I work in environments where both usages exist. The single file approach, at 
least in this setting, comes from the fact that a lot of input data for ML 
pipelines has historically been a single CSV file dump. As a lot of data 
analysis tools have also been single-threaded, people are just way too used to 
the single file approach. A lot of them simply don't know about the existence 
and benefits of table formats on top of Parquet.

In half of all cases, the single file approach actually seems sufficient to me, 
but once querying, multi-threading or larger data sets are involved, a table 
format would be much better. These people are not so deep in the engineering 
"world" and thus continue to assume that the single file approach scales.

I think we should advertise the table formats a bit more in our documentation, 
since we're already working on it anyway, to avoid such questions coming up.

I personally think that there is an upper limit per file depending on the use 
case. Once you go beyond that or have situations like updating your dataset 
while using it at the same time, you should definitely have a table format on 
top.

Best
Uwe

On Sun, May 19, 2024, at 11:56 AM, Andrew Lamb wrote:
>>  and this is a scenario in which the file does really need to be
> self-contained.
>
> What do you mean by "self-contained"?
>
> If the usecase is exchanging data via files, perhaps only the (relatively
> small) metadata about types / how to read the file (rather than potentially
> large min/max statistics) is required?
>
> If the usecase is replacing loading csv files into a database or other
> system (and building indexes, etc during the load) so querying it is
> faster, then the additional metadata seems warranted.
>
> I think the beauty of parquet is that it is an efficient data exchange format
> and comes with features that make queries reasonably fast without requiring a
> second system (e.g. database) to manage. However, if you want to have even
> faster performance you can build a second system on top of parquet (with
> catalogs / indexes, etc).
>
> BTW with systems like DataFusion, for example, it is relatively
> straightforward to build an index that prunes parquet files based on
> predicates and information stored in parquet metadata without even opening
> the files at query time. See this example [1].
>
> Andrew
>
> [1] https://github.com/apache/datafusion/pull/10549
>
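
The pruning idea Andrew describes can be illustrated with a small sketch. This
is not DataFusion's API and not the code from the linked PR; it assumes a
hypothetical in-memory index that holds per-file min/max values for one integer
column (as could be collected from Parquet column statistics) and keeps only
the files whose value range can overlap a range predicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical index entry: min/max of one column for one file."""
    path: str
    min_value: int
    max_value: int


def prune(index: list[FileStats], lower: int, upper: int) -> list[str]:
    """Return paths of files that may contain rows with lower <= x <= upper.

    A file is skipped only when its [min, max] range cannot overlap the
    predicate range, so pruning never drops a file with matching rows.
    """
    return [
        f.path
        for f in index
        if f.max_value >= lower and f.min_value <= upper
    ]


# Hypothetical index built ahead of time from file metadata.
index = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# Predicate 150 <= x <= 220: only the last two files need to be opened.
candidates = prune(index, 150, 220)
```

The index is consulted at query time without touching the files themselves;
files that survive pruning may still contain no matching rows, so the reader
still applies the predicate when scanning them.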
>
> On Sat, May 18, 2024 at 12:10 PM Curt Hagenlocher 
> wrote:
>
>> While CSV is still the undisputed monarch of exchanging data via files,
>> Parquet is arguably "top 3" -- and this is a scenario in which the file
>> does really need to be self-contained.
>>
>> On Sat, May 18, 2024 at 9:01 AM Raphael Taylor-Davies
>>  wrote:
>>
>> > Hi Fokko,
>> >
>> > I am aware of catalogs such as iceberg, my question was if in the design
>> > of parquet we can assume the existence of such a catalog.
>> >
>> > Kind Regards,
>> >
>> > Raphael
>> >
>> > On 18 May 2024 16:18:22 BST, Fokko Driesprong  wrote:
>> > >Hey Raphael,
>> > >
>> > >Thanks for reaching out here. Have you looked into table formats such as
>> > >Apache Iceberg? This seems to fix the problem that you're describing
>> > >
>> > >A table format adds an ACID layer to the file format and acts as a fully
>> > >functional database. In the case of Iceberg, a catalog is required for
>> > >atomicity, and alternatives like Delta Lake also seem to trend into that
>> > >direction
>> > ><https://github.com/orgs/delta-io/projects/10/views/1?pane=issue=57584023>.
>> > >
>> > >> I'm conscious that for many users this responsibility is instead
>> > >> delegated to a catalog that maintains its own index structures and
>> > >> statistics, only relies on the parquet metadata for very late stage
>> > >> pruning, and may therefore see limited benefit from revisiting the
>> > >> parquet metadata structures.
>> > >
>> > >
>> > >This is exactly what Iceberg offers, it provides additional metadata to
>> > >speed up the planning process:
>> > >https://iceberg.apache.org/docs/nightly/performance/
>> > >
>> > >Kind regards,
>> > >Fokko
>> > >
>> > >On Sat, May 18, 2024 at 16:40 Raphael Taylor-Davies wrote:
>> > >
>> > >> Hi All,
>> > >>
>> > >> The recent discussions about metadata make me wonder where a storage
>> > >> format ends and a database begins, as people seem to have differing
>> > >> expectations of parquet here. In particular, one school of thought
>> > >> posits that parquet should suffice as a standalone technology, where
>> > >> users can write parquet files to a store and efficiently query them
>> > >> directly with no additional technologies. However, others instead view
>> > >> parquet as a storage format for use in conjunction with some sort of
>> > >> catalog / metastore. These two approaches naturally place very
>> different
>> > >> demands on the parquet format. The former case incentivizes
>> constructing
>> >