I would like to know if there are any changes that anyone would like to suggest for the pull request <https://github.com/apache/thrift/pull/2439> that adds support for partial Thrift deserialization. Is there something I can do from my side to make progress on this request?
On Sun, Aug 29, 2021 at 6:50 PM Bhalchandra Pandit <[email protected]> wrote: > Hi Duru and Yuxuan, > I was encouraged by the initial interest both of you showed in this > feature. Would you be able to help Jens review the pull request? > > That will help me open source sooner a number of other useful features > built on top of partial deserialization. > > For example, we can efficiently support columnar access through Hive or > Spark when the underlying file stores data as serialized Thrift objects. > Such access can be more efficient compared to columnar formats such as > Parquet. > > Thanks for your help. > > Kumar > > > > On Wed, Aug 25, 2021 at 8:43 AM Bhalchandra Pandit <[email protected]> > wrote: > >> Hi Jens, >> Thanks for reviewing. For the others who may want to review, I have added >> README.md file that provides technical details on the overall design and >> includes description of the major implementation components. >> >> Kumar >> >> On Wed, Aug 25, 2021 at 12:05 AM Jens Geyer <[email protected]> wrote: >> >>> Hi, >>> >>> I added myself already as a reviewer, but it certainly would help if >>> some >>> other (especially Java) people could also review. >>> >>> Stay tuned, will do my part soon. >>> >>> Have fun, >>> JensG >>> >>> -----Ursprüngliche Nachricht----- >>> From: Bhalchandra Pandit >>> Sent: Tuesday, August 24, 2021 4:25 AM >>> To: [email protected] ; Jens Geyer >>> Subject: Re: partial deserialization of Thrift >>> >>> I would like to know if there are any changes that anyone would like to >>> suggest for the pull request <https://github.com/apache/thrift/pull/2439 >>> > >>> that adds support for partial Thrift deserialization. >>> >>> On Mon, Aug 16, 2021 at 12:37 PM Bhalchandra Pandit < >>> [email protected]> >>> wrote: >>> >>> > I created a pull request for this change: >>> > https://github.com/apache/thrift/pull/2439 >>> > >>> > I have also added a README.md file that provides technical details of >>> > partial Thrift deserialization so that it is easier to understand the >>> Java >>> > implementation. It will be also useful if this support is to be added >>> for >>> > other languages. >>> > >>> > Please let me know if you have any questions or suggestions for >>> changes to >>> > the code. >>> > >>> > On Wed, Aug 4, 2021 at 12:10 AM Jens Geyer <[email protected]> wrote: >>> > >>> >> Hi, >>> >> >>> >> thanks! >>> >> >>> >> What I can say from a more general/formal viewpoint, the package >>> >> namespaces >>> >> etc must be adjusted (e.g. package com.pinterest.commons.thrift) and >>> all >>> >> files you plan to contribute need to have the appropriate ASF license >>> >> header. >>> >> >>> >> What would be really nice, especially when we look forward to >>> integrate >>> >> it >>> >> into other languages, adding some technical >>> documentation/specification >>> >> to >>> >> make it easier to implement and have some common, documented view on >>> the >>> >> feature. >>> >> >>> >> Last not least having appropriate test coverage would be great. >>> >> >>> >> Anything else beyond that ... whatever makes sense. >>> >> >>> >> Have fun, >>> >> JensG >>> >> >>> >> >>> >> -----Ursprüngliche Nachricht----- >>> >> From: Bhalchandra Pandit >>> >> Sent: Wednesday, August 4, 2021 3:33 AM >>> >> To: [email protected] >>> >> Subject: Re: partial deserialization of Thrift >>> >> >>> >> I have added basic set of files at the location below. Currently they >>> are >>> >> unchanged from their Pinterest version. I would like to get some >>> guidance >>> >> on the type of changes I should make. All of them have many tests >>> though >>> >> they are not included in this initial skeleton for now. >>> >> >>> >> >>> >> >>> https://github.com/bhalchandrap/thrift/tree/sample/lib/java/src/org/apache/thrift/partial >>> >> >>> >> On Wed, Jul 28, 2021 at 4:59 PM Bhalchandra Pandit < >>> [email protected] >>> >> > >>> >> wrote: >>> >> >>> >> > I opened the following jira epic for this contribution: >>> >> > https://issues.apache.org/jira/browse/THRIFT-5443 >>> >> > >>> >> > On Mon, Jul 12, 2021 at 11:58 PM Duru Can Celasun < >>> [email protected]> >>> >> > wrote: >>> >> > >>> >> >> On Tue, 13 Jul 2021, at 01:07, Bhalchandra Pandit wrote: >>> >> >> > Thanks for showing interest and for raising valid questions. >>> >> >> > >>> >> >> > I do not currently have a forked branch to show how it is done. >>> >> >> However, I >>> >> >> > can certainly make it happen. It will take me a couple of weeks >>> or >>> >> >> > so >>> >> >> to do >>> >> >> > it. I will need to get myself familiarized with the contribution >>> >> >> process. >>> >> >> >>> >> >> The short version is that you need to open a ticket on Jira [1] and >>> >> then >>> >> >> send a PR against the master branch on Github [2] with the title >>> >> >> "THRIFT-NNNN: Your title". Everything else can be figured out >>> later. >>> >> >> >>> >> >> [1] https://issues.apache.org/jira/projects/THRIFT/issues >>> >> >> >>> >> >> [2] https://github.com/apache/thrift/ >>> >> >> >>> >> >> > >>> >> >> > To answer other questions: >>> >> >> > >>> >> >> > 1. The current implementation is in Java. However, it can >>> also be >>> >> >> easily >>> >> >> > ported to other languages. Happy to help in that direction as >>> >> well. >>> >> >> > 2. The solution does not modify the definition of any >>> structure >>> >> (eg, >>> >> >> > Payload => SlimPayload). The output instance has the same type >>> >> >> regardless >>> >> >> > of whether we use full deserialization or partial >>> >> >> > deserialization. >>> >> >> In that >>> >> >> > sense, it is 100% compatible in both directions (partial <=> >>> >> full). >>> >> >> The >>> >> >> > solution enables selectively deserializing any arbitrary >>> nested >>> >> >> subset. >>> >> >> > >>> >> >> > Kumar >>> >> >> > >>> >> >> > On Mon, Jul 12, 2021 at 4:10 PM Duru Can Celasun < >>> >> [email protected]> >>> >> >> > wrote: >>> >> >> > >>> >> >> > > This is definitely an interesting problem at scale and it'd be >>> >> great >>> >> >> to >>> >> >> > > have a solution upstream. >>> >> >> > > >>> >> >> > > I second Yuxuan's questions. From the blog post it seems you >>> have >>> >> an >>> >> >> > > implementation for Java, but it would be great to have at >>> least >>> >> >> > > one >>> >> >> more. >>> >> >> > > >>> >> >> > > On Tue, 13 Jul 2021, at 00:00, Yuxuan Wang wrote: >>> >> >> > > > Hi Bhalchandra, >>> >> >> > > > >>> >> >> > > > Do you have any open source code (e.g. forked thrift) to >>> show >>> >> >> > > > how >>> >> >> you >>> >> >> > > > implemented it? The blog post stated what's the problem but >>> >> really >>> >> >> lack >>> >> >> > > > information on the solution side: >>> >> >> > > > >>> >> >> > > > 1. How did you solve the problem? >>> >> >> > > > 2. In which language(s) did you implement the solution? >>> >> >> > > > >>> >> >> > > > Without that information it's really not much we can do here. >>> >> >> > > > >>> >> >> > > > Also just based on the problem you described, a simple >>> solution >>> >> >> would be >>> >> >> > > > just to duplicate the struct definition and remove the >>> fields >>> >> >> > > > you >>> >> >> don't >>> >> >> > > > need, for example: >>> >> >> > > > >>> >> >> > > > // Original struct >>> >> >> > > > struct Payload { >>> >> >> > > > 1: optional Type1 field1, >>> >> >> > > > 2: optional Type2 field2, >>> >> >> > > > 3: optional Type3 field3, >>> >> >> > > > ... >>> >> >> > > > } >>> >> >> > > > >>> >> >> > > > // Slim struct >>> >> >> > > > struct SlimPayload { >>> >> >> > > > 1: optional Type1 field1, >>> >> >> > > > // field2 removed because we don't care about it in this >>> use >>> >> case >>> >> >> > > > 3: optional SlimType3 field3, >>> >> >> > > > ... >>> >> >> > > > } >>> >> >> > > > >>> >> >> > > > But of course it's hard to keep SlimPayload in sync with the >>> >> >> original >>> >> >> > > > Payload so I can see there are some values to have some >>> helpers >>> >> to >>> >> >> help >>> >> >> > > > that, but as long as you don't make breaking changes into >>> >> >> > > > Payload >>> >> >> "keep >>> >> >> > > > them in sync" is a false problem. >>> >> >> > > > >>> >> >> > > > On Mon, Jul 12, 2021 at 12:56 PM Bhalchandra Pandit >>> >> >> > > > <[email protected]> wrote: >>> >> >> > > > >>> >> >> > > > > Hi All, >>> >> >> > > > > I work for Pinterest. I developed a technique for partial >>> >> >> > > deserialization >>> >> >> > > > > of Thrift that has been very useful in significantly >>> improving >>> >> >> > > efficiency >>> >> >> > > > > of the data processing at Pinterest. I would like to >>> >> >> > > > > contribute >>> >> >> that >>> >> >> > > > > feature to Apache Thrift. More details on this technique >>> are >>> >> >> available >>> >> >> > > in >>> >> >> > > > > this blog I recently wrote: >>> >> >> > > > > >>> >> >> > > > > >>> >> >> > > >>> >> >> >>> >> >>> https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4 >>> >> >> > > > > >>> >> >> > > > > I would like to know if any of you are interested in >>> helping >>> >> with >>> >> >> > > > > contributing this work to the main branch. >>> >> >> > > > > >>> >> >> > > > > Kumar >>> >> >> > > > > >>> >> >> > > > >>> >> >> > > >>> >> >> > >>> >> >> >>> >> > >>> >> >>> >> >>> >>>
