[ https://issues.apache.org/jira/browse/DATAFU-148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849318#comment-16849318 ]
Russell Jurney edited comment on DATAFU-148 at 5/28/19 4:15 AM:
----------------------------------------------------------------

Why not have an `activate()` or `initialize()` method and then add these methods to the `DataFrame` class? [`pymongo_spark`](https://github.com/mongodb/mongo-hadoop/blob/master/spark/src/main/python/pymongo_spark.py) (part of [mongo-hadoop](https://github.com/mongodb/mongo-hadoop)) does this to add methods like `pyspark.rdd.RDD.saveToMongoDB`, which keeps the API consistent with PySpark's. See: https://github.com/mongodb/mongo-hadoop/tree/master/spark/src/main/python#usage

You use it like this:

{code:python}
import pymongo_spark

pymongo_spark.activate()

...

some_rdd.saveToMongoDB('mongodb://localhost:27017/db.output_collection')
{code}

And internally it looks like this:

{code:python}
def activate():
    """Activate integration between PyMongo and PySpark.

    This function only needs to be called once.
    """
    # Patch methods in rather than extending these classes. Many RDD methods
    # result in the creation of a new RDD, whose exact type is beyond our
    # control. However, we would still like to be able to call any of our
    # methods on the resulting RDDs.
    pyspark.rdd.RDD.saveToMongoDB = saveToMongoDB
{code}

> Setup Spark sub-project
> -----------------------
>
>                 Key: DATAFU-148
>                 URL: https://issues.apache.org/jira/browse/DATAFU-148
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>         Attachments: patch.diff, patch.diff
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Create a skeleton Spark sub-project for Spark code to be contributed to DataFu
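For reference, here is a minimal sketch of how the same `activate()` pattern could look in a DataFu Spark Python wrapper. The `datafu_spark` module name and the `drop_exact_duplicates` helper are hypothetical, not DataFu's actual API; the helper just wraps `DataFrame.distinct()` so the example stays self-contained and runnable against a plain PySpark install:

{code:python}
# Hypothetical sketch only -- module and method names are illustrative,
# not part of DataFu's real API.
import pyspark.sql


def drop_exact_duplicates(df):
    """Hypothetical DataFu-style helper: drop fully duplicated rows.

    Stands in for whatever real helpers the sub-project would expose;
    here it simply delegates to DataFrame.distinct().
    """
    return df.distinct()


def activate():
    """Patch the helper methods onto pyspark.sql.DataFrame.

    Mirrors pymongo_spark.activate(): DataFrame transformations return
    new DataFrames whose construction we don't control, so patching the
    class keeps the methods available on every result. Call once per
    process.
    """
    pyspark.sql.DataFrame.drop_exact_duplicates = drop_exact_duplicates
{code}

Usage would then mirror the pymongo_spark example above: import the module, call `activate()` once, and every DataFrame (including those produced by later transformations) gains the patched methods.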