Regarding RDD vs. DataFrame: the historical reason is that RDDs provide the 
low-level API and finer control that Hudi needs to manage various aspects of 
writing. 
On a related note, if you look at the current approach with Flink support, the 
input batch is being parameterized to support different processing engines.
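
To illustrate what parameterizing the input batch means, here is a minimal sketch in Python. It is not Hudi code: the class names (WriteClient, ListWriteClient, PartitionedWriteClient) and the (key, payload) record shape are hypothetical and chosen only for illustration. The idea is that the shared write logic is written once against a type parameter, and each engine only supplies how its own batch type is turned into records.

```python
from abc import ABC, abstractmethod
from typing import Dict, Generic, Iterable, List, Tuple, TypeVar

Record = Tuple[str, dict]   # (record key, payload); hypothetical record shape
I = TypeVar("I")            # engine-specific input batch type


class WriteClient(ABC, Generic[I]):
    """Hypothetical engine-agnostic writer; not an actual Hudi class."""

    def __init__(self) -> None:
        # In-memory stand-in for a table, keyed by record key.
        self.table: Dict[str, dict] = {}

    @abstractmethod
    def to_records(self, batch: I) -> Iterable[Record]:
        """Turn the engine's batch into plain records."""

    def upsert(self, batch: I) -> int:
        # Shared write logic lives here, independent of the engine.
        count = 0
        for key, payload in self.to_records(batch):
            self.table[key] = payload  # a later record overwrites an earlier one
            count += 1
        return count


class ListWriteClient(WriteClient[List[Record]]):
    """Stands in for an engine that hands over an in-memory list."""

    def to_records(self, batch: List[Record]) -> Iterable[Record]:
        return batch


class PartitionedWriteClient(WriteClient[List[List[Record]]]):
    """Stands in for an engine whose batch is split into partitions."""

    def to_records(self, batch: List[List[Record]]) -> Iterable[Record]:
        for partition in batch:
            yield from partition
```

Both clients reuse the same upsert logic; only the batch type and its conversion differ, which is roughly the role the engine-specific input type plays.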
    On Tuesday, December 1, 2020, 02:08:05 AM PST, songj songj 
<songjun...@gmail.com> wrote:  
 
Thanks for the reply!
Could you help explain the two questions I asked above?

On Tue, Dec 1, 2020 at 5:17 PM, Trevor <wowtua...@gmail.com> wrote:

> Hi songj,
>
> DeltaStreamer can be understood as a packaged Spark DataSource. You only
> need to set the required parameters, which makes it more convenient for
> data ingestion.
>
> Best,
>
> Trevor
>
>
> wowtua...@gmail.com
>
> From: songj songj
> Date: 2020-12-01 16:48
> To: dev
> Subject: Re: why not use spark datasource in DeltaStreamer
> Spark Structured Streaming can consume Kafka through the Kafka data source and
> use foreachBatch to do insert/upsert/... into Hudi.
> Is that similar to DeltaStreamer?
>
> On Tue, Dec 1, 2020 at 4:28 PM, songj songj <songjun...@gmail.com> wrote:
>
> > Hi, I have some questions:
> >
> > 1. DeltaStreamer has its own Source<JavaRDD<String>> to consume source
> > data such as Kafka. Why not use the Spark DataSource directly?
> >
> > 2. Hudi has a lot of logic that uses RDDs. Why not use Spark DataFrames?
> >
> > I just want to know the background of the above implementation, thanks!
> >
>  
