Hi Gang,

I don't think it's feasible to make a new module this way, as a lot of the 
supporting code for this part of the codebase (codecs, etc.) resides in 
parquet-hadoop. This means the new module would likely require a dependency on 
parquet-hadoop, making it pretty useless. This could be avoided by porting the 
supporting classes over to the new core module, but that could cause similar 
issues.

As for replacing the Hadoop dependencies with hadoop-client-api and 
hadoop-client-runtime, this could indeed be nice for some use cases. It could 
avoid a big chunk of the Hadoop-related issues, though we would still require 
users to package parts of it. There are some convoluted ways this can be 
achieved now, which we could support out of the box, at least for writing to 
disk. I would like to think of this as more of a temporary solution though, as 
we would still be forcing pretty big dependencies on users that oftentimes do 
not need them.
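
To make the "convoluted way" concrete, it boils down to implementing the file 
abstraction by hand. A rough sketch below; note I've declared the interface 
shapes inline from memory to keep it self-contained, so treat the signatures 
as illustrative rather than the exact API in parquet-common 
(org.apache.parquet.io.OutputFile / PositionOutputStream):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative stand-ins for parquet-common's interfaces (shapes from
// memory, not the exact API).
interface OutputFile {
  PositionOutputStream create(long blockSizeHint) throws IOException;
  PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException;
  boolean supportsBlockSize();
  long defaultBlockSize();
}

abstract class PositionOutputStream extends OutputStream {
  // The writer needs the current position to record column-chunk offsets.
  public abstract long getPos() throws IOException;
}

// A local-filesystem OutputFile that tracks its own write position,
// which is all the writer needs from Hadoop's FSDataOutputStream.
class LocalOutputFile implements OutputFile {
  private final Path path;

  LocalOutputFile(Path path) { this.path = path; }

  private PositionOutputStream wrap(OutputStream out) {
    return new PositionOutputStream() {
      private long pos = 0;
      @Override public long getPos() { return pos; }
      @Override public void write(int b) throws IOException {
        out.write(b);
        pos++;
      }
      @Override public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        pos += len;
      }
      @Override public void close() throws IOException { out.close(); }
    };
  }

  @Override public PositionOutputStream create(long blockSizeHint) throws IOException {
    // Fail if the file already exists, mirroring "create" semantics.
    return wrap(Files.newOutputStream(path,
        StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE));
  }

  @Override public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
    // Default options truncate an existing file.
    return wrap(Files.newOutputStream(path));
  }

  @Override public boolean supportsBlockSize() { return false; }
  @Override public long defaultBlockSize() { return 0; }
}
```

Something along these lines shipped out of the box would already remove the 
need for a Hadoop Configuration just to write a local file.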

It seems to me that properly decoupling the reader/writer code from this 
dependency will likely require breaking changes, as it is hardwired into a 
large part of the logic. Maybe something to consider for the next major 
release?

Best regards,
Atour
________________________________
From: Gang Wu <ust...@gmail.com>
Sent: Friday, June 9, 2023 4:32 PM
To: dev@parquet.apache.org <dev@parquet.apache.org>
Subject: Re: Parquet without Hadoop dependencies

That may break many downstream projects. At least we cannot break
parquet-hadoop (or any existing module). If you can add a new module
like parquet-core that provides limited reader/writer features without Hadoop
support, and then make parquet-hadoop depend on parquet-core, that
would be acceptable.

One possible workaround is to replace the various Hadoop dependencies
with hadoop-client-api and hadoop-client-runtime in parquet-mr. This
may make it much easier for users to add the Hadoop dependency. But they
are only available from Hadoop 3.0.0 onwards.
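
For reference, the swap in a consumer POM would look roughly like this (the
version number is just an example; these shaded artifacts only exist for
Hadoop 3.x):

```xml
<!-- Shaded client artifacts instead of the full Hadoop dependency tree.
     Version is illustrative; hadoop-client-api/-runtime require 3.0.0+. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-api</artifactId>
  <version>3.3.5</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-runtime</artifactId>
  <version>3.3.5</version>
  <scope>runtime</scope>
</dependency>
```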

On Fri, Jun 9, 2023 at 3:18 PM Atour Mousavi Gourabi <at...@live.com> wrote:

> Hi Gang,
>
> Backward compatibility does indeed seem challenging here, especially as
> I'd rather see the writers/readers moved out of parquet-hadoop after
> they've been decoupled. What are your thoughts on this?
>
> Best regards,
> Atour
> ________________________________
> From: Gang Wu <ust...@gmail.com>
> Sent: Friday, June 9, 2023 3:32 AM
> To: dev@parquet.apache.org <dev@parquet.apache.org>
> Subject: Re: Parquet without Hadoop dependencies
>
> Hi Atour,
>
> Thanks for bringing this up!
>
> From what I observed in PARQUET-1822, I think it is a valid use
> case to support Parquet reading/writing without Hadoop installed.
> The challenge is backward compatibility. It would be great if you could
> work on it.
>
> Best,
> Gang
>
> On Fri, Jun 9, 2023 at 12:24 AM Atour Mousavi Gourabi <at...@live.com>
> wrote:
>
> > Dear all,
> >
> > The Java implementations of the Parquet readers and writers seem pretty
> > tightly coupled to Hadoop (see: PARQUET-1822). For some projects, this
> > can cause issues as it's an unnecessary and big dependency when you
> > might just need to write to disk. Is there any appetite here for
> > separating the Hadoop code and supporting more convenient ways to write
> > to disk out of the box? I am willing to work on these changes but would
> > like some pointers on whether such patches would be reviewed and
> > accepted, as PARQUET-1822 has been open for over three years now.
> >
> > Best regards,
> > Atour Mousavi Gourabi
> >
>
