[ https://issues.apache.org/jira/browse/DATAFU-148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849318#comment-16849318 ]
Russell Jurney edited comment on DATAFU-148 at 5/28/19 4:22 AM:
----------------------------------------------------------------

[~matterhayes]' code review brings up an important point: the API should decorate the `pyspark.sql.DataFrame` API. Why not have an {code}activate(){code} or {code}initialize(){code} method that adds these methods to the DataFrame class? [pymongo_spark|https://github.com/mongodb/mongo-hadoop/blob/master/spark/src/main/python/pymongo_spark.py] (part of [mongo-hadoop|https://github.com/mongodb/mongo-hadoop]) does this to add methods like {code}pyspark.rdd.RDD.saveToMongoDB{code}, which make the API consistent with PySpark's. See: https://github.com/mongodb/mongo-hadoop/tree/master/spark/src/main/python#usage

You use it like this:

{code:python}
import pymongo_spark
pymongo_spark.activate()
...
some_rdd.saveToMongoDB('mongodb://localhost:27017/db.output_collection')
{code}

And internally it looks like this:

{code:python}
def activate():
    """Activate integration between PyMongo and PySpark.

    This function only needs to be called once.
    """
    # Patch methods in rather than extending these classes. Many RDD methods
    # result in the creation of a new RDD, whose exact type is beyond our
    # control. However, we would still like to be able to call any of our
    # methods on the resulting RDDs.
    pyspark.rdd.RDD.saveToMongoDB = saveToMongoDB
{code}

Changing the way PySpark users use Spark's API and requiring spark-datafu users to run things the way they currently have to will probably ensure that the library isn't popular. A rough sketch of how this pattern could look for datafu-spark appears after the quoted issue details below.


> Setup Spark sub-project
> -----------------------
>
>                 Key: DATAFU-148
>                 URL: https://issues.apache.org/jira/browse/DATAFU-148
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>         Attachments: patch.diff, patch.diff
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Create a skeleton Spark sub-project for Spark code to be contributed to DataFu



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
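
For illustration, here is a minimal sketch of how the same activate() pattern might look in datafu-spark. The module name datafu_spark, the method name dedup_with_order, and its plain-PySpark body are assumptions made up for this example, not the actual DataFu API; a real implementation would presumably delegate to the DataFu JVM code instead.

{code:python}
# Hypothetical sketch only -- the names and the implementation below are
# placeholders for illustration, not the actual datafu-spark API.
import pyspark.sql
from pyspark.sql import Window
from pyspark.sql import functions as F


def dedup_with_order(self, group_col, order_col):
    """Keep one row per value of group_col, choosing the row with the
    highest order_col (a stand-in for a DataFu-style dedup helper)."""
    w = Window.partitionBy(group_col).orderBy(F.col(order_col).desc())
    return (self.withColumn('_rn', F.row_number().over(w))
                .filter(F.col('_rn') == 1)
                .drop('_rn'))


def activate():
    """Patch DataFu helpers onto pyspark.sql.DataFrame.

    Only needs to be called once, mirroring pymongo_spark.activate().
    DataFrames created later (joins, filters, etc.) pick the methods up
    automatically because they are patched onto the class itself.
    """
    pyspark.sql.DataFrame.dedup_with_order = dedup_with_order
{code}

Usage would then look like ordinary PySpark code:

{code:python}
import datafu_spark          # hypothetical module name
datafu_spark.activate()
deduped = some_df.dedup_with_order('user_id', 'timestamp')
{code}

Since the methods live on the DataFrame class rather than on a wrapper object, call sites stay ordinary PySpark code, which is the consistency the pymongo_spark example above is meant to show.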