I would like to know if there are any changes that anyone would like to
suggest for the pull request <https://github.com/apache/thrift/pull/2439> that
adds support for partial Thrift deserialization. Is there something I can
do from my side to make progress on this request?

On Sun, Aug 29, 2021 at 6:50 PM Bhalchandra Pandit <[email protected]>
wrote:

> Hi Duru and Yuxuan,
> I was encouraged by the initial interest both of you showed in this
> feature. Would you be able to help Jens review the pull request?
>
> That will help me open source sooner a number of other useful features
> built on top of partial deserialization.
>
> For example, we can efficiently support columnar access through Hive or
> Spark when the underlying file stores data as serialized Thrift objects.
> Such access can be more efficient compared to columnar formats such as
> Parquet.
>
> Thanks for your help.
>
> Kumar
>
>
>
> On Wed, Aug 25, 2021 at 8:43 AM Bhalchandra Pandit <[email protected]>
> wrote:
>
>> Hi Jens,
>> Thanks for reviewing. For the others who may want to review, I have added
>> README.md file that provides technical details on the overall design and
>> includes description of the major implementation components.
>>
>> Kumar
>>
>> On Wed, Aug 25, 2021 at 12:05 AM Jens Geyer <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I added myself already as a reviewer, but it certainly would help if
>>> some
>>> other (especially Java) people could also review.
>>>
>>> Stay tuned, will do my part soon.
>>>
>>> Have fun,
>>> JensG
>>>
>>> -----Ursprüngliche Nachricht-----
>>> From: Bhalchandra Pandit
>>> Sent: Tuesday, August 24, 2021 4:25 AM
>>> To: [email protected] ; Jens Geyer
>>> Subject: Re: partial deserialization of Thrift
>>>
>>> I would like to know if there are any changes that anyone would like to
>>> suggest for the pull request <https://github.com/apache/thrift/pull/2439
>>> >
>>> that adds support for partial Thrift deserialization.
>>>
>>> On Mon, Aug 16, 2021 at 12:37 PM Bhalchandra Pandit <
>>> [email protected]>
>>> wrote:
>>>
>>> > I created a pull request for this change:
>>> > https://github.com/apache/thrift/pull/2439
>>> >
>>> > I have also added a README.md file that provides technical details of
>>> > partial Thrift deserialization so that it is easier to understand the
>>> Java
>>> > implementation. It will be also useful if this support is to be added
>>> for
>>> > other languages.
>>> >
>>> > Please let me know if you have any questions or suggestions for
>>> changes to
>>> > the code.
>>> >
>>> > On Wed, Aug 4, 2021 at 12:10 AM Jens Geyer <[email protected]> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> thanks!
>>> >>
>>> >> What I can say from a more general/formal viewpoint, the package
>>> >> namespaces
>>> >> etc must be adjusted (e.g. package com.pinterest.commons.thrift) and
>>> all
>>> >> files you plan to contribute need to have the appropriate ASF license
>>> >> header.
>>> >>
>>> >> What would be really nice, especially when we look forward to
>>> integrate
>>> >> it
>>> >> into other languages, adding some technical
>>> documentation/specification
>>> >> to
>>> >> make it easier to implement and have some common, documented view on
>>> the
>>> >> feature.
>>> >>
>>> >> Last not least having appropriate test coverage would be great.
>>> >>
>>> >> Anything else beyond that ... whatever makes sense.
>>> >>
>>> >> Have fun,
>>> >> JensG
>>> >>
>>> >>
>>> >> -----Ursprüngliche Nachricht-----
>>> >> From: Bhalchandra Pandit
>>> >> Sent: Wednesday, August 4, 2021 3:33 AM
>>> >> To: [email protected]
>>> >> Subject: Re: partial deserialization of Thrift
>>> >>
>>> >> I have added basic set of files at the location below. Currently they
>>> are
>>> >> unchanged from their Pinterest version. I would like to get some
>>> guidance
>>> >> on the type of changes I should make. All of them have many tests
>>> though
>>> >> they are not included in this initial skeleton for now.
>>> >>
>>> >>
>>> >>
>>> https://github.com/bhalchandrap/thrift/tree/sample/lib/java/src/org/apache/thrift/partial
>>> >>
>>> >> On Wed, Jul 28, 2021 at 4:59 PM Bhalchandra Pandit <
>>> [email protected]
>>> >> >
>>> >> wrote:
>>> >>
>>> >> > I opened the following jira epic for this contribution:
>>> >> > https://issues.apache.org/jira/browse/THRIFT-5443
>>> >> >
>>> >> > On Mon, Jul 12, 2021 at 11:58 PM Duru Can Celasun <
>>> [email protected]>
>>> >> > wrote:
>>> >> >
>>> >> >> On Tue, 13 Jul 2021, at 01:07, Bhalchandra Pandit wrote:
>>> >> >> > Thanks for showing interest and for raising valid questions.
>>> >> >> >
>>> >> >> > I do not currently have a forked branch to show how it is done.
>>> >> >> However, I
>>> >> >> > can certainly make it happen. It will take me a couple of weeks
>>> or
>>> >> >> > so
>>> >> >> to do
>>> >> >> > it. I will need to get myself familiarized with the contribution
>>> >> >> process.
>>> >> >>
>>> >> >> The short version is that you need to open a ticket on Jira [1] and
>>> >> then
>>> >> >> send a PR against the master branch on Github [2] with the title
>>> >> >> "THRIFT-NNNN: Your title". Everything else can be figured out
>>> later.
>>> >> >>
>>> >> >> [1] https://issues.apache.org/jira/projects/THRIFT/issues
>>> >> >>
>>> >> >> [2] https://github.com/apache/thrift/
>>> >> >>
>>> >> >> >
>>> >> >> > To answer other questions:
>>> >> >> >
>>> >> >> >    1. The current implementation is in Java. However, it can
>>> also be
>>> >> >> easily
>>> >> >> >    ported to other languages. Happy to help in that direction as
>>> >> well.
>>> >> >> >    2. The solution does not modify the definition of any
>>> structure
>>> >> (eg,
>>> >> >> >    Payload => SlimPayload). The output instance has the same type
>>> >> >> regardless
>>> >> >> >    of whether we use full deserialization or partial
>>> >> >> > deserialization.
>>> >> >> In that
>>> >> >> >    sense, it is 100% compatible in both directions (partial <=>
>>> >> full).
>>> >> >> The
>>> >> >> >    solution enables selectively deserializing any arbitrary
>>> nested
>>> >> >> subset.
>>> >> >> >
>>> >> >> > Kumar
>>> >> >> >
>>> >> >> > On Mon, Jul 12, 2021 at 4:10 PM Duru Can Celasun <
>>> >> [email protected]>
>>> >> >> > wrote:
>>> >> >> >
>>> >> >> > > This is definitely an interesting problem at scale and it'd be
>>> >> great
>>> >> >> to
>>> >> >> > > have a solution upstream.
>>> >> >> > >
>>> >> >> > > I second Yuxuan's questions. From the blog post it seems you
>>> have
>>> >> an
>>> >> >> > > implementation for Java, but it would be great to have at
>>> least
>>> >> >> > > one
>>> >> >> more.
>>> >> >> > >
>>> >> >> > > On Tue, 13 Jul 2021, at 00:00, Yuxuan Wang wrote:
>>> >> >> > > >  Hi Bhalchandra,
>>> >> >> > > >
>>> >> >> > > > Do you have any open source code (e.g. forked thrift) to
>>> show
>>> >> >> > > > how
>>> >> >> you
>>> >> >> > > > implemented it? The blog post stated what's the problem but
>>> >> really
>>> >> >> lack
>>> >> >> > > > information on the solution side:
>>> >> >> > > >
>>> >> >> > > > 1. How did you solve the problem?
>>> >> >> > > > 2. In which language(s) did you implement the solution?
>>> >> >> > > >
>>> >> >> > > > Without that information it's really not much we can do here.
>>> >> >> > > >
>>> >> >> > > > Also just based on the problem you described, a simple
>>> solution
>>> >> >> would be
>>> >> >> > > > just to duplicate the struct definition and remove the
>>> fields
>>> >> >> > > > you
>>> >> >> don't
>>> >> >> > > > need, for example:
>>> >> >> > > >
>>> >> >> > > > // Original struct
>>> >> >> > > > struct Payload {
>>> >> >> > > >   1: optional Type1 field1,
>>> >> >> > > >   2: optional Type2 field2,
>>> >> >> > > >   3: optional Type3 field3,
>>> >> >> > > >   ...
>>> >> >> > > > }
>>> >> >> > > >
>>> >> >> > > > // Slim struct
>>> >> >> > > > struct SlimPayload {
>>> >> >> > > >   1: optional Type1 field1,
>>> >> >> > > >   // field2 removed because we don't care about it in this
>>> use
>>> >> case
>>> >> >> > > >   3: optional SlimType3 field3,
>>> >> >> > > >   ...
>>> >> >> > > > }
>>> >> >> > > >
>>> >> >> > > > But of course it's hard to keep SlimPayload in sync with the
>>> >> >> original
>>> >> >> > > > Payload so I can see there are some values to have some
>>> helpers
>>> >> to
>>> >> >> help
>>> >> >> > > > that, but as long as you don't make breaking changes into
>>> >> >> > > > Payload
>>> >> >> "keep
>>> >> >> > > > them in sync" is a false problem.
>>> >> >> > > >
>>> >> >> > > > On Mon, Jul 12, 2021 at 12:56 PM Bhalchandra Pandit
>>> >> >> > > > <[email protected]> wrote:
>>> >> >> > > >
>>> >> >> > > > > Hi All,
>>> >> >> > > > > I work for Pinterest. I developed a technique for partial
>>> >> >> > > deserialization
>>> >> >> > > > > of Thrift that has been very useful in significantly
>>> improving
>>> >> >> > > efficiency
>>> >> >> > > > > of the data processing at Pinterest. I would like to
>>> >> >> > > > > contribute
>>> >> >> that
>>> >> >> > > > > feature to Apache Thrift. More details on this technique
>>> are
>>> >> >> available
>>> >> >> > > in
>>> >> >> > > > > this blog I recently wrote:
>>> >> >> > > > >
>>> >> >> > > > >
>>> >> >> > >
>>> >> >>
>>> >>
>>> https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
>>> >> >> > > > >
>>> >> >> > > > > I would like to know if any of you are interested in
>>> helping
>>> >> with
>>> >> >> > > > > contributing this work to the main branch.
>>> >> >> > > > >
>>> >> >> > > > > Kumar
>>> >> >> > > > >
>>> >> >> > > >
>>> >> >> > >
>>> >> >> >
>>> >> >>
>>> >> >
>>> >>
>>> >>
>>>
>>>

Reply via email to