https://issues.apache.org/jira/browse/HUDI-288 tracks this....



On Tue, Oct 1, 2019 at 10:17 AM Vinoth Chandar <[email protected]> wrote:

>
> I think this has come up before.
>
> +1 to the point Pratyaksh mentioned. I would like to add a few more:
>
> - Schemas could be fetched dynamically from a registry based on the
> topic/dataset name. Solvable.
> - The Hudi keys, partition fields and the other inputs you need for
> configuring Hudi need to be standardized. Solvable using dataset-level
> overrides.
> - You will get one RDD from Kafka with data for multiple topics. This now
> needs to be forked into multiple datasets. We need to cache the Kafka RDD
> in memory; otherwise we will recompute and re-read the input from Kafka
> every time. Expensive, but solvable (see the sketch after this list).
> - Finally, you will be writing different parquet schemas to different
> files, and if you are running with num_core > 2, also concurrently. At
> Uber, we originally did that and it became an operational nightmare to
> isolate bad topics from good ones. Pretty tricky!
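>
> To make the caching/forking point concrete, here is a rough, hypothetical
> sketch using plain spark-streaming-kafka-0-10 APIs (this is not an existing
> DeltaStreamer code path; kafkaParams, offsetRanges and the actual per-topic
> decode/write are assumed to be set up elsewhere):
>
> import java.util.List;
> import java.util.Map;
> import org.apache.spark.api.java.JavaPairRDD;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.storage.StorageLevel;
> import org.apache.spark.streaming.kafka010.KafkaUtils;
> import org.apache.spark.streaming.kafka010.LocationStrategies;
> import org.apache.spark.streaming.kafka010.OffsetRange;
> import scala.Tuple2;
>
> public class MultiTopicIngestSketch {
>   static void ingestAllTopics(JavaSparkContext jsc, Map<String, Object> kafkaParams,
>                               OffsetRange[] offsetRanges, List<String> topics) {
>     // Extract (topic, payload) up front so we never cache raw ConsumerRecords.
>     JavaPairRDD<String, byte[]> records = KafkaUtils.<String, byte[]>createRDD(
>             jsc, kafkaParams, offsetRanges, LocationStrategies.PreferConsistent())
>         .mapToPair(rec -> new Tuple2<String, byte[]>(rec.topic(), rec.value()));
>     // Cache once so the per-topic forks below do not re-read Kafka every time.
>     records.persist(StorageLevel.MEMORY_AND_DISK());
>     for (String topic : topics) {
>       JavaRDD<byte[]> payloads =
>           records.filter(t -> t._1().equals(topic)).map(Tuple2::_2);
>       // ... decode payloads with this topic's schema and write its own Hudi dataset ...
>     }
>     records.unpersist();
>   }
> }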
>
> In all, we could support this and call out these caveats well.
>
> In terms of work,
>
> - We can either introduce multi-source support in DeltaStreamer natively
> (more involved design work is needed to specify how each input stream maps
> to each output stream),
> - (Or) we can write a new tool that wraps the current DeltaStreamer, uses
> the Kafka topic regex to identify all the topics that need to be ingested,
> and creates one DeltaStreamer per topic within a SINGLE Spark application
> (a rough sketch of this follows below).
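>
> A very rough sketch of option 2, just to show the shape of it (assuming the
> org.apache.hudi package layout; topicsMatching and buildConfigForTopic are
> made-up placeholders, not existing Hudi utilities):
>
> import java.util.Arrays;
> import java.util.List;
> import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaSparkContext;
>
> public class MultiTopicDeltaStreamerSketch {
>   public static void main(String[] args) throws Exception {
>     JavaSparkContext jsc =
>         new JavaSparkContext(new SparkConf().setAppName("multi-topic-delta-streamer"));
>     for (String topic : topicsMatching(args[0])) {
>       // One DeltaStreamer per topic, all sharing the single SparkContext.
>       new HoodieDeltaStreamer(buildConfigForTopic(topic), jsc).sync();
>     }
>     jsc.stop();
>   }
>
>   static List<String> topicsMatching(String regex) {
>     // The real tool would list topics from Kafka (e.g. via AdminClient) and
>     // filter them by the regex; hardcoded here to keep the sketch short.
>     return Arrays.asList("topic_a", "topic_b");
>   }
>
>   static HoodieDeltaStreamer.Config buildConfigForTopic(String topic) {
>     HoodieDeltaStreamer.Config cfg = new HoodieDeltaStreamer.Config();
>     cfg.targetBasePath = "/data/hudi/" + topic;  // hypothetical layout
>     cfg.targetTableName = topic;
>     // ... plus per-topic source, schema provider, record key / partition configs ...
>     return cfg;
>   }
> }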
>
>
> Any takers for this?  Should be a pretty cool project, doable in a week or
> two.
>
> /thanks/vinoth
>
> On Tue, Oct 1, 2019 at 12:39 AM Pratyaksh Sharma <[email protected]>
> wrote:
>
>> Hi Gurudatt,
>>
>> With a minimal code change, you can subscribe to multiple Kafka topics using
>> the KafkaOffsetGen.java class (a rough illustration of the subscription idea
>> is below). I feel the bigger problem in this case is going to be managing
>> multiple target schemas, because we register the ParquetWriter with a single
>> target schema at a time. I would also like to know whether we have a
>> workaround for such a case.
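>>
>> For illustration, this is how a plain Kafka consumer subscribes to a topic
>> pattern; KafkaOffsetGen would need the equivalent when computing offset
>> ranges (the broker, group id and regex below are placeholders):
>>
>> import java.util.Properties;
>> import java.util.regex.Pattern;
>> import org.apache.kafka.clients.consumer.KafkaConsumer;
>> import org.apache.kafka.common.serialization.StringDeserializer;
>>
>> public class PatternSubscribeSketch {
>>   public static void main(String[] args) {
>>     Properties props = new Properties();
>>     props.put("bootstrap.servers", "localhost:9092");
>>     props.put("group.id", "hudi-multi-topic-ingest");
>>     props.put("key.deserializer", StringDeserializer.class.getName());
>>     props.put("value.deserializer", StringDeserializer.class.getName());
>>     try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
>>       // One regex subscription covering every source table's topic.
>>       consumer.subscribe(Pattern.compile("upstream_db\\..*"));
>>       // poll(), group records by record.topic(), and hand each group to a
>>       // writer registered with that topic's own target schema.
>>     }
>>   }
>> }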
>>
>> On Tue, Oct 1, 2019 at 12:33 PM Gurudatt Kulkarni <[email protected]>
>> wrote:
>>
>> > Hi All,
>> >
>> > I have a use case where I need to pull multiple tables (say close to 100)
>> > into Hadoop. Do we need to schedule 100 Hudi jobs to pull these tables?
>> > Can there be a workaround where there is one Hudi application pulling from
>> > multiple Kafka topics? This would avoid creating multiple SparkSessions
>> > and the memory overhead that comes with them.
>> >
>> > Regards,
>> > Gurudatt
>> >
>>
>
