Ananth,

If your goal is to merge the parquet files, why not use those files as the
source instead of going back to Kafka?

Thomas




On Fri, Jun 10, 2016 at 4:42 PM, Ananth Gundabattula <
[email protected]> wrote:

> Thanks for the thoughts Siyuan.
>
> Yes, I agree that the problem is inherently batch-oriented. We are
> hoping to build upon the window concepts to simulate a batch design. (The
> primary reason is that we do not want two different ETL processing pipeline
> platforms within our ecosystem.)
>
> We are using Kafka as the source of data over which multiple data
> processing frameworks (ETL, M/L frameworks, etc.) run. Hence Kafka is
> being used both for streaming (primarily ETL, the Apex system) and for
> batch use cases (primarily M/L).
>
> I shall create a ticket.
>
> Regards,
> Ananth
>
>
>
> On Sat, Jun 11, 2016 at 7:15 AM, [email protected] <[email protected]>
> wrote:
>
>> Hi Ananth,
>> Unlike files, Kafka is usually used for streaming cases. Correct me if I'm
>> wrong, but your use case seems like batch processing. We didn't consider an
>> end offset in our Kafka input operator design, but it could be a useful
>> feature. Unfortunately there is no easy way, as far as I know, to extend
>> the existing operator to achieve that.
>>
>> OffsetManager is not designed for end offsets. It is only a
>> customizable callback to update the committed offsets, and the start
>> offsets it loads are meant for restarting a stateful application.
>>
>> Can you create a ticket and elaborate on your use case there? Thanks!
>>
>> Regards,
>> Siyuan
>>
>>
>>
>>
>>
>> On Friday, June 10, 2016, Ananth Gundabattula <[email protected]>
>> wrote:
>>
>>> Hello All,
>>>
>>> I was wondering what the community's thoughts would be on the
>>> following?
>>>
>>> We are using the Kafka 0.9 input operator to read from a few topics. We
>>> are using this stream to generate a parquet file. This approach is fine
>>> for a beginner's use case. At a later point in time, we would like to
>>> "merge" all of the parquet files previously generated, and for this I
>>> would like to reprocess data starting exactly from a particular offset
>>> inside each of the partitions. Each partition will have its own starting
>>> and ending offsets that I need to process.
>>>
>>> I was wondering if there is an easy way to extend the Kafka 0.9 operator
>>> (perhaps along the lines of the offset manager in the 0.8 versions of the
>>> Kafka operator). Thoughts, please?
>>>
>>> Regards,
>>> Ananth
>>>
>>
>
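
For illustration only, the per-partition start and end offsets described in
the original question could be tracked with a small helper like the one
below. This is a minimal sketch under stated assumptions, not something the
Apex Kafka input operator provides: the class name OffsetRangeTracker and
all of its methods are hypothetical, and the end offset is assumed to be
exclusive. It uses plain Java with no Kafka dependency. A caller would still
need to position its reader at startOffset() for each partition (for
example with KafkaConsumer.seek in the 0.9 client) and stop reading once
isDone() returns true.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; not part of the Apex Malhar or Kafka APIs. */
class OffsetRangeTracker {

    /** Offsets to process for one partition: start inclusive, end exclusive. */
    private static final class Range {
        final long start;
        final long end;
        long next; // lowest offset that has not been handled yet

        Range(long start, long end) {
            if (start < 0 || end < start) {
                throw new IllegalArgumentException("invalid range: " + start + ".." + end);
            }
            this.start = start;
            this.end = end;
            this.next = start;
        }
    }

    private final Map<Integer, Range> ranges = new HashMap<>();

    /** Registers the offsets [start, end) to reprocess for a partition. */
    void addRange(int partition, long start, long end) {
        ranges.put(partition, new Range(start, end));
    }

    /** Offset the reader should be positioned at before consuming. */
    long startOffset(int partition) {
        return range(partition).start;
    }

    /**
     * Returns true if the record at this offset should be processed.
     * A record at or beyond the end offset marks the partition as finished,
     * which also covers topics whose offsets have gaps.
     */
    boolean accept(int partition, long offset) {
        Range r = range(partition);
        if (offset >= r.end) {
            r.next = r.end;
            return false;
        }
        if (offset < r.next) {
            return false; // before the range, or already processed
        }
        r.next = offset + 1;
        return true;
    }

    /** True once every offset in the partition's range has been handled. */
    boolean isPartitionDone(int partition) {
        Range r = range(partition);
        return r.next >= r.end;
    }

    /** True once all registered partitions are finished. */
    boolean isDone() {
        for (Range r : ranges.values()) {
            if (r.next < r.end) {
                return false;
            }
        }
        return true;
    }

    private Range range(int partition) {
        Range r = ranges.get(partition);
        if (r == null) {
            throw new IllegalArgumentException("unknown partition: " + partition);
        }
        return r;
    }
}
```

Keeping the range bookkeeping separate from the consumer keeps the sketch
independent of any particular operator or client version.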
