Re: Pyspark with hudi scripts

Vinoth Govindarajan Wed, 08 Apr 2020 22:12:26 -0700

Sorry, I mixed up the names in my last comment and forgot to include the jar 
information.


Hi Yaswanth,
You need to include the following three jar files, using the --jars option of 
either the spark-submit or the pyspark command, before using the 
"org.apache.hudi" format in your code to create Hudi datasets.

- https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.11/0.5.2-incubating/hudi-spark-bundle_2.11-0.5.2-incubating.jar
- https://repo1.maven.org/maven2/org/apache/avro/avro/1.8.2/avro-1.8.2.jar
- https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.11/2.4.5/spark-avro_2.11-2.4.5.jar

Note: Tested with Spark 2.4.5 and Scala 2.11. Make sure the Scala version 
matches across all the jar files.
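
To make the Scala-version check less error-prone, here is a minimal sketch 
that builds the --jars value from the three URLs above. The helper names 
(scala_versions, jars_argument) are hypothetical and not part of Hudi or 
Spark, and it assumes the Scala version appears as a "_2.11"-style suffix in 
the jar file name, as it does for the jars listed here.

```python
import re

# Jar URLs copied from the list above.
HUDI_JARS = [
    "https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.11/0.5.2-incubating/hudi-spark-bundle_2.11-0.5.2-incubating.jar",
    "https://repo1.maven.org/maven2/org/apache/avro/avro/1.8.2/avro-1.8.2.jar",
    "https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.11/2.4.5/spark-avro_2.11-2.4.5.jar",
]


def scala_versions(jars):
    """Return the set of Scala versions found in the jar file names.

    Hypothetical helper: jars without a "_<major>.<minor>-" suffix
    (for example avro-1.8.2.jar) are treated as Scala-independent.
    """
    versions = set()
    for jar in jars:
        file_name = jar.rsplit("/", 1)[-1]
        match = re.search(r"_(\d+\.\d+)-", file_name)
        if match:
            versions.add(match.group(1))
    return versions


def jars_argument(jars):
    """Build the comma-separated --jars value, refusing mixed Scala versions."""
    versions = scala_versions(jars)
    if len(versions) > 1:
        raise ValueError("mixed Scala versions: %s" % sorted(versions))
    return ",".join(jars)


# Print a command line that can be pasted into a shell.
print("pyspark --jars " + jars_argument(HUDI_JARS))
```

The same comma-separated value should work with spark-submit; if you prefer, 
download the jars first and pass local paths instead of URLs.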

Thanks,
Vinoth

On 2020/04/09 01:24:56, Vinoth Govindarajan <vinoth.govindara...@gmail.com> 
wrote: 
> Hi Udit,
> You can use the scripts provided by Yaswanth for reading/writing the hudi 
> dataset using pyspark.
> 
> I need to understand your requirements a little better before adding formal 
> support.
> 
> Are you looking for a Python command-line tool similar to DeltaStreamer 
> (https://hudi.apache.org/docs/writing_data.html#deltastreamer) for both the 
> Hudi reader and writer, or are you interested in using the Data Source APIs, 
> like this:
> 
> hudiOpts = {
>     "hoodie.datasource.write.recordkey.field": "uuid",
>     "hoodie.datasource.write.precombine.field": "update_timestamp",
>     "hoodie.datasource.write.operation": "upsert",
>     "hoodie.table.name": "tmp.stock_ticker"
> }
> basePath = "/tmp/stock_ticker"
> # Parentheses let the chained calls span several lines in Python.
> (inputDF.write.format("org.apache.hudi")
>        .options(**hudiOpts)
>        .mode("append")
>        .save(basePath))
>
> basePath = "/tmp/stock_ticker/*"
> # Read through the SparkSession (`spark`); a DataFrame has no `read` attribute.
> outputDF = spark.read.format("org.apache.hudi").load(basePath)
> 
> Thanks,
> Vinoth
> 
> 
> On 2020/04/09 00:39:49, Vinoth Chandar <vin...@apache.org> wrote: 
> > Thanks Udit!  I also believe there will be a PR soon for pySpark and we
> > should have formal support next release.
> > 
> > 
> > 
> > On Wed, Apr 8, 2020 at 4:49 PM Mehrotra, Udit <udi...@amazon.com.invalid>
> > wrote:
> > 
> > > Hi Yaswanth,
> > >
> > > PFA an example I prepared some time back which can help you get started.
> > >
> > > Thanks,
> > > Udit
> > >
> > > On 4/8/20, 3:21 PM, "Atluri Yaswanth" <yaswanth.atl...@gmail.com> wrote:
> > >
> > >
> > >     Hi Team,
> > >
> > >     I would like to know whether there are any PySpark scripts to upsert
> > >     data into a Hudi dataset.
> > >
> > >     I am working with Scala now, but I want to use PySpark because my data
> > >     is not in a good format (I need to use various libraries inside).
> > >
> > >     Thanks in advance
> > >     yaswanth
> > >
> > >
> > >
> > 
>
