[ https://issues.apache.org/jira/browse/DATAFU-148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849318#comment-16849318 ]
Russell Jurney edited comment on DATAFU-148 at 5/28/19 4:22 AM:
----------------------------------------------------------------

[~matterhayes]' code review brings up an important point: the API should decorate the `pyspark.sql.DataFrame` API. Why not have an {code}activate(){code} or {code}initialize(){code} method that adds these methods to the DataFrame class? [pymongo_spark|https://github.com/mongodb/mongo-hadoop/blob/master/spark/src/main/python/pymongo_spark.py] (part of [mongo-hadoop|https://github.com/mongodb/mongo-hadoop]) does this to add methods like {code}pyspark.rdd.RDD.saveToMongoDB{code}, which make the API consistent with PySpark's. See: https://github.com/mongodb/mongo-hadoop/tree/master/spark/src/main/python#usage

You use it like this:

{code:python}
import pymongo_spark
pymongo_spark.activate()
...
some_rdd.saveToMongoDB('mongodb://localhost:27017/db.output_collection')
{code}

And internally it looks like this:

{code:python}
def activate():
    """Activate integration between PyMongo and PySpark.

    This function only needs to be called once.
    """
    # Patch methods in rather than extending these classes. Many RDD methods
    # result in the creation of a new RDD, whose exact type is beyond our
    # control. However, we would still like to be able to call any of our
    # methods on the resulting RDDs.
    pyspark.rdd.RDD.saveToMongoDB = saveToMongoDB
{code}

Changing the way PySpark users use Spark's API and requiring spark-datafu users to run things the way they currently have to will probably ensure that the library isn't popular. A rough sketch of how this pattern could look for datafu-spark appears after the quoted issue details below.


> Setup Spark sub-project
> -----------------------
>
>                 Key: DATAFU-148
>                 URL: https://issues.apache.org/jira/browse/DATAFU-148
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>         Attachments: patch.diff, patch.diff
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Create a skeleton Spark sub-project for Spark code to be contributed to DataFu



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
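
For illustration, here is a minimal sketch of how the same activate() pattern might look in datafu-spark. The module name datafu_spark, the method name dedup_with_order, and its plain-PySpark body are assumptions made up for this example, not the actual DataFu API; a real implementation would presumably delegate to the DataFu JVM code instead.

{code:python}
# Hypothetical sketch only -- the names and the implementation below are
# placeholders for illustration, not the actual datafu-spark API.
import pyspark.sql
from pyspark.sql import Window
from pyspark.sql import functions as F


def dedup_with_order(self, group_col, order_col):
    """Keep one row per value of group_col, choosing the row with the
    highest order_col (a stand-in for a DataFu-style dedup helper)."""
    w = Window.partitionBy(group_col).orderBy(F.col(order_col).desc())
    return (self.withColumn('_rn', F.row_number().over(w))
                .filter(F.col('_rn') == 1)
                .drop('_rn'))


def activate():
    """Patch DataFu helpers onto pyspark.sql.DataFrame.

    Only needs to be called once, mirroring pymongo_spark.activate().
    DataFrames created later (joins, filters, etc.) pick the methods up
    automatically because they are patched onto the class itself.
    """
    pyspark.sql.DataFrame.dedup_with_order = dedup_with_order
{code}

Usage would then look like ordinary PySpark code:

{code:python}
import datafu_spark          # hypothetical module name
datafu_spark.activate()
deduped = some_df.dedup_with_order('user_id', 'timestamp')
{code}

Since the methods live on the DataFrame class rather than on a wrapper object, call sites stay ordinary PySpark code, which is the consistency the pymongo_spark example above is meant to show.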