[ https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130943#comment-17130943 ]
Vinoth Govindarajan commented on HUDI-783:
------------------------------------------

Status update:
 * -Explain how to read/write hudi datasets using pyspark in a blog post/documentation.- Done
 ** Here is the documentation: [https://hudi.apache.org/docs/quick-start-guide.html#pyspark-example]
 ** There is a separate ticket for the blog post: HUDI-825
 * -Add the hudi-pyspark module to the hudi demo docker along with the instructions.- Done
 ** pyspark is now supported in the hudi demo docker; here is the [PR|https://github.com/apache/hudi/pull/1632]
 * -Make the package available as part of the [spark packages index|https://spark-packages.org/] and [python package index|https://pypi.org/]- Done
 ** Since hudi is already an Apache project, we can use it directly as a package:

{code:java}
export PYSPARK_PYTHON=$(which python3)
spark-2.4.4-bin-hadoop2.7/bin/pyspark \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
{code}

> Add official python support to create hudi datasets using pyspark
> -----------------------------------------------------------------
>
>                 Key: HUDI-783
>                 URL: https://issues.apache.org/jira/browse/HUDI-783
>             Project: Apache Hudi
>          Issue Type: Wish
>          Components: Utilities
>            Reporter: Vinoth Govindarajan
>            Assignee: Vinoth Govindarajan
>            Priority: Major
>              Labels: features, pull-request-available
>             Fix For: 0.6.0
>
>
> *Goal:*
> As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
> # Create a hudi-pyspark package that users can import and start reading/writing hudi datasets.
> # Explain how to read/write hudi datasets using pyspark in a blog post/documentation.
> # Add the hudi-pyspark module to the hudi demo docker along with the instructions.
> # Make the package available as part of the [spark packages index|https://spark-packages.org/] and [python package index|https://pypi.org/]
>
> The hudi-pyspark package should implement the HUDI data source API for Apache Spark, through which HUDI files can be read as a DataFrame and written to any Hadoop-supported file system.
>
> The usage pattern after we launch this feature should be something like this:
>
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or include the hudi-pyspark package in your Spark applications using spark-shell, pyspark, or spark-submit:
> {code:java}
> $SPARK_HOME/bin/spark-shell --packages org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
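
For reference, here is a minimal PySpark sketch of the read/write flow this ticket targets, meant to be run inside a pyspark session launched with the --packages option from the status update. The option keys follow the Hudi quick-start guide as best I recall them and should be checked against the linked documentation. The helper names (build_hudi_options, write_then_read), the table name, the base path, and the column names in the usage comment are hypothetical, and the glob suffix is an assumption that depends on how deep the partition path is.

```python
def build_hudi_options(table_name, record_key, partition_field,
                       precombine_field, operation="upsert"):
    """Build Hudi datasource write options as a plain dict (no Spark needed)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


def write_then_read(spark, df, base_path, hudi_options, glob_suffix="/*/*"):
    """Write df as a Hudi dataset, then load it back as a DataFrame.

    Assumption: glob_suffix has one "*" per partition level plus one for
    the data files, so "/*/*" fits a single-level partition path.
    """
    # "overwrite" recreates the dataset at base_path if it already exists.
    (df.write.format("org.apache.hudi")
        .options(**hudi_options)
        .mode("overwrite")
        .save(base_path))
    return spark.read.format("org.apache.hudi").load(base_path + glob_suffix)


# Hypothetical usage inside the pyspark shell, where `spark` already exists
# and `df` has columns uuid, partitionpath and ts:
#   opts = build_hudi_options("hudi_trips_cow", "uuid", "partitionpath", "ts")
#   snapshot = write_then_read(spark, df, "file:///tmp/hudi_trips_cow", opts)
#   snapshot.show()
```

Keeping the option dict in its own function means the configuration can be inspected or unit-tested without a Spark session; only write_then_read needs the Hudi bundle on the classpath.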