Ananth,

If your goal is to merge the parquet files, then why not use these files as the source instead of going back to Kafka?
Thomas

On Fri, Jun 10, 2016 at 4:42 PM, Ananth Gundabattula <[email protected]> wrote:

> Thanks for the thoughts, Siyuan.
>
> Yes, I agree that the problem is inherently a batch-oriented problem. We are
> hoping to build upon the window concepts to simulate a batch design. (The
> primary reason is that we do not want two different ETL processing pipeline
> platforms within our ecosystem.)
>
> We are using Kafka as the source of data over which multiple data
> processing frameworks (ETL, M/L frameworks, etc.) run. Hence Kafka
> is being used both for streaming (primarily ETL - Apex system) and batch
> use cases (primarily M/L).
>
> I shall create a ticket.
>
> Regards,
> Ananth
>
> On Sat, Jun 11, 2016 at 7:15 AM, [email protected] <[email protected]>
> wrote:
>
>> Hi Ananth,
>> Unlike files, Kafka is usually for streaming cases. Correct me if I'm
>> wrong, but your use case seems like batch processing. We didn't consider an end
>> offset in our Kafka input operator design, but it could be a useful
>> feature. Unfortunately there is no easy way, as far as I know, to extend the
>> existing operator to achieve that.
>>
>> OffsetManager is not designed for an end offset. It's only
>> a customizable callback to update the committed offsets. And the start
>> offsets it loads are intended for stateful application restart.
>>
>> Can you create a ticket and elaborate on your use case there? Thanks!
>>
>> Regards,
>> Siyuan
>>
>> On Friday, June 10, 2016, Ananth Gundabattula <[email protected]>
>> wrote:
>>
>>> Hello All,
>>>
>>> I was wondering what the community's thoughts would be on the
>>> following?
>>>
>>> We are using the Kafka 0.9 input operator to read from a few topics. We are
>>> using this stream to generate a parquet file. Now this approach is all good
>>> for a beginner's use case.
>>> At a later point in time, we would like to
>>> "merge" all of the parquet files previously generated, and for this I would
>>> like to reprocess data exactly from a particular offset inside each of the
>>> partitions. Each of the partitions will have its own starting and ending
>>> offsets that I need to process.
>>>
>>> I was wondering if there is an easy way to extend the Kafka 0.9 operator
>>> (perhaps along the lines of the offset manager in the 0.8 versions of the
>>> Kafka operator). Thoughts, please?
>>>
>>> Regards,
>>> Ananth
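
For illustration only, the bounded reprocessing described in the original question (a separate starting and ending offset for each partition) could be sketched roughly as below. This is not the Apex Kafka input operator and not its OffsetManager; `OffsetRangeTracker` and all of its method names are hypothetical, and the sketch assumes the end offset is exclusive. A consumer would seek each partition to `startOffset(...)`, pass every record it reads through `accept(...)`, and stop once `isDone()` returns true.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Apex Malhar or the Kafka client):
 * tracks a bounded [start, endExclusive) offset range per partition.
 */
final class OffsetRangeTracker {
    // partition id -> {start, endExclusive}
    private final Map<Integer, long[]> ranges = new HashMap<>();
    // partition id -> lowest offset not yet seen inside the range
    private final Map<Integer, Long> nextOffset = new HashMap<>();

    /** Registers the range to reprocess for one partition. */
    void addRange(int partition, long start, long endExclusive) {
        if (start < 0 || endExclusive < start) {
            throw new IllegalArgumentException("invalid range for partition " + partition);
        }
        ranges.put(partition, new long[] {start, endExclusive});
        nextOffset.put(partition, start);
    }

    /** Offset the reader should seek to before consuming this partition. */
    long startOffset(int partition) {
        long[] range = ranges.get(partition);
        if (range == null) {
            throw new IllegalArgumentException("unknown partition " + partition);
        }
        return range[0];
    }

    /** Returns true if the record lies inside the range and should be processed. */
    boolean accept(int partition, long offset) {
        long[] range = ranges.get(partition);
        if (range == null || offset < range[0]) {
            return false;
        }
        if (offset >= range[1]) {
            // Seeing a record at or past the end means the range is exhausted,
            // even if some offsets inside it were never delivered.
            nextOffset.put(partition, range[1]);
            return false;
        }
        nextOffset.put(partition, Math.max(nextOffset.get(partition), offset + 1));
        return true;
    }

    /** True once the partition has been read up to its end offset. */
    boolean isPartitionDone(int partition) {
        long[] range = ranges.get(partition);
        return range == null || nextOffset.get(partition) >= range[1];
    }

    /** True once every registered partition has reached its end offset. */
    boolean isDone() {
        for (Integer partition : ranges.keySet()) {
            if (!isPartitionDone(partition)) {
                return false;
            }
        }
        return true;
    }
}
```

With the plain Kafka consumer, the start side of this would typically map onto `assign` and `seek`; the end side has to be enforced by the reading code itself, which is the part the tracker stands in for.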
