Sorry, I mixed up the names in my last comment and forgot to provide the jar info.
Hi Yaswanth,

You need to include the following three jar files using the --jars option of either the spark-submit or the pyspark command before using the "org.apache.hudi" format in your code to create Hudi datasets:

- https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.11/0.5.2-incubating/hudi-spark-bundle_2.11-0.5.2-incubating.jar
- https://repo1.maven.org/maven2/org/apache/avro/avro/1.8.2/avro-1.8.2.jar
- https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.11/2.4.5/spark-avro_2.11-2.4.5.jar

Note: Tested with Spark 2.4.5 and Scala 2.11. Make sure the Scala version matches across all the jar files.

Thanks,
Vinoth

On 2020/04/09 01:24:56, Vinoth Govindarajan <vinoth.govindara...@gmail.com> wrote:
> Hi Udit,
> You can use the scripts provided by Yaswanth for reading/writing the hudi
> dataset using pyspark.
>
> I need to understand your requirements a little bit more to add formal support.
>
> Are you looking for a python command-line tool similar to deltastreamer
> (https://hudi.apache.org/docs/writing_data.html#deltastreamer) for both hudi
> reader/writer
> or interested in using Data Source APIs like
>
> hudiOpts = {
>     "hoodie.datasource.write.recordkey.field": "uuid",
>     "hoodie.datasource.write.precombine.field": "update_timestamp",
>     "hoodie.datasource.write.operation": "upsert",
>     "hoodie.table.name": "tmp.stock_ticker"
> }
> basePath = "/tmp/stock_ticker"
> inputDF.write.format("org.apache.hudi") \
>     .options(**hudiOpts) \
>     .mode("Append") \
>     .save(basePath)
>
> # spark is the SparkSession (predefined in the pyspark shell)
> basePath = "/tmp/stock_ticker/*"
> outputDF = spark.read.format("org.apache.hudi").load(basePath)
>
> Thanks,
> Vinoth
>
>
> On 2020/04/09 00:39:49, Vinoth Chandar <vin...@apache.org> wrote:
> > Thanks Udit! I also believe there will be a PR soon for pySpark and we
> > should have formal support next release.
> >
> > On Wed, Apr 8, 2020 at 4:49 PM Mehrotra, Udit <udi...@amazon.com.invalid>
> > wrote:
> >
> > > Hi Yaswanth,
> > >
> > > PFA an example I prepared sometime back which can help you get started.
> > >
> > > Thanks,
> > > Udit
> > >
> > > On 4/8/20, 3:21 PM, "Atluri Yaswanth" <yaswanth.atl...@gmail.com> wrote:
> > >
> > >     Hi Team,
> > >
> > >     I would like to know are there any scripts in PySpark to upsert the
> > >     data in hudi dataset.
> > >
> > >     I am working with Scala now, but i want to use Pyspark as my data is
> > >     not in good format(i need to use various libraries inside).
> > >
> > >     Thanks in advance
> > >     yaswanth
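
P.S. In case it helps, here is a minimal sketch of how the --jars argument from the top of this thread could be assembled and sanity-checked for a single Scala version. The helper names (scala_versions, build_command) and the script name hudi_upsert.py are made up for illustration and are not part of Hudi or Spark. It also assumes the machine running spark-submit can reach Maven Central over https; if it cannot, download the jars first and pass local paths instead.

```python
import os
import re
from urllib.parse import urlparse

# The three jars listed in this thread (tested with Spark 2.4.5 / Scala 2.11).
HUDI_JARS = [
    "https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.11/0.5.2-incubating/hudi-spark-bundle_2.11-0.5.2-incubating.jar",
    "https://repo1.maven.org/maven2/org/apache/avro/avro/1.8.2/avro-1.8.2.jar",
    "https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.11/2.4.5/spark-avro_2.11-2.4.5.jar",
]


def scala_versions(jars):
    """Collect the Scala versions encoded in jar file names (e.g. '_2.11-')."""
    versions = set()
    for jar in jars:
        name = os.path.basename(urlparse(jar).path)
        match = re.search(r"_(\d+\.\d+)-", name)
        if match:  # plain Java jars such as avro carry no Scala suffix
            versions.add(match.group(1))
    return versions


def build_command(jars, script=None, launcher="spark-submit"):
    """Build an argument list; --jars expects one comma-separated value."""
    found = scala_versions(jars)
    if len(found) > 1:
        raise ValueError(f"Mixed Scala versions in jars: {sorted(found)}")
    cmd = [launcher, "--jars", ",".join(jars)]
    if script:  # the interactive pyspark shell needs no script argument
        cmd.append(script)
    return cmd


if __name__ == "__main__":
    # hudi_upsert.py is a hypothetical script name.
    print(" ".join(build_command(HUDI_JARS, script="hudi_upsert.py")))
```

Calling build_command(HUDI_JARS, launcher="pyspark") gives the equivalent arguments for the interactive shell. The version check only looks at file names, so it will not catch a jar that was renamed.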