[jira] [Commented] (HUDI-783) Add official python support to create hudi datasets using pyspark

Vinoth Govindarajan (Jira) Wed, 10 Jun 2020 10:02:22 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130943#comment-17130943
 ]


Vinoth Govindarajan commented on HUDI-783:
------------------------------------------

Status update:
 * -Explain how to read/write hudi datasets using pyspark in a blog 
post/documentation.- Done
 ** Here is the documentation: 
[https://hudi.apache.org/docs/quick-start-guide.html#pyspark-example]
 ** There is a separate ticket for blog post - HUDI-825
 * -Add the hudi-pyspark module to the hudi demo docker along with the 
instructions.- Done
 ** pyspark is now supported in the hudi demo docker; here is the 
[PR|https://github.com/apache/hudi/pull/1632]
 * -Make the package available as part of the [spark packages 
index|https://spark-packages.org/] and [python package 
index|https://pypi.org/]- Done
 ** Since hudi is already part of the Apache project, we can use it directly 
as a package:
 ** 
{code:java}
# Use python3 as the interpreter for the pyspark shell
export PYSPARK_PYTHON=$(which python3)
# Pull the Hudi bundle and spark-avro in through --packages
spark-2.4.4-bin-hadoop2.7/bin/pyspark \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
{code}
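
For reference, here is a minimal sketch of how a dataset might be written and read once the pyspark shell above is running. It is an illustration rather than the documented example: the helper names (hudi_write_options, write_hudi, read_hudi), the table name, the field names and the path glob are all hypothetical, and the linked quick start guide remains the authoritative reference.

```python
# Sketch only: helper names, table/field names and the glob are assumptions
# made for illustration; they are not part of hudi itself.

def hudi_write_options(table_name, record_key, partition_field,
                       precombine_field, operation="upsert"):
    """Build the option dict handed to the hudi Spark datasource writer."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
    }


def write_hudi(df, base_path, options, mode="append"):
    """Write a Spark DataFrame to base_path through the hudi datasource."""
    df.write.format("org.apache.hudi").options(**options).mode(mode).save(base_path)


def read_hudi(spark, base_path, glob):
    """Load a hudi dataset as a DataFrame; glob must match the partition depth."""
    return spark.read.format("org.apache.hudi").load(base_path + glob)


# Inside the pyspark shell (where `spark` already exists) usage could look like:
#   opts = hudi_write_options("trips", "uuid", "partitionpath", "ts")
#   write_hudi(df, "file:///tmp/trips", opts, mode="overwrite")
#   trips = read_hudi(spark, "file:///tmp/trips", "/*/*/*/*")
```

Keeping the options in a plain dict makes it easy to switch the write operation (for example to "insert") without touching the write call itself.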

> Add official python support to create hudi datasets using pyspark
> -----------------------------------------------------------------
>
>                 Key: HUDI-783
>                 URL: https://issues.apache.org/jira/browse/HUDI-783
>             Project: Apache Hudi
>          Issue Type: Wish
>          Components: Utilities
>            Reporter: Vinoth Govindarajan
>            Assignee: Vinoth Govindarajan
>            Priority: Major
>              Labels: features, pull-request-available
>             Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|https://pypi.org/]
> The hudi-pyspark package should implement the HUDI data source API for 
> Apache Spark, so that HUDI files can be read as DataFrames and written to 
> any Hadoop-supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
> Include hudi-pyspark package in your Spark Applications using:
> spark-shell, pyspark, or spark-submit
> {code:java}
> > $SPARK_HOME/bin/spark-shell --packages org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
