DPark can also run locally without a Mesos cluster (single-threaded or multi-process).
I also think that running PySpark without the JVM in local mode will help development, so pysparkling and DPark are both useful.

On Fri, May 29, 2015 at 1:36 PM, Sven Kreiss <s...@svenkreiss.com> wrote:
> I have to admit that I never ran DPark. I think the goals are very
> different. The purpose of pysparkling is not to reproduce Spark on a
> cluster, but to have a lightweight implementation with the same interface
> that runs locally or on an API server. I still run PySpark on a cluster
> to preprocess a large number of documents to train a scikit-learn
> classifier, but use pysparkling to preprocess single documents before
> applying that classifier in API calls. The only dependencies of
> pysparkling are "boto" and "requests", used to access files via "s3://"
> or "http://", whereas DPark needs a Mesos cluster.
>
> On Fri, May 29, 2015 at 2:46 PM Davies Liu <dav...@databricks.com> wrote:
>>
>> There is another implementation of the RDD interface in Python, called
>> DPark [1]. Could you say a few words comparing the two?
>>
>> [1] https://github.com/douban/dpark/
>>
>> On Fri, May 29, 2015 at 8:29 AM, Sven Kreiss <s...@svenkreiss.com> wrote:
>> > I wanted to share a Python implementation of RDDs: pysparkling.
>> >
>> > http://trivial.io/post/120179819751/pysparkling-is-a-native-implementation-of-the
>> >
>> > The benefit is that you can apply the same code that you use in
>> > PySpark on large datasets in pysparkling on small datasets or single
>> > documents. When running with pysparkling, there is no dependency on
>> > the Java Virtual Machine or Hadoop.
>> >
>> > Sven
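For concreteness, here is a minimal sketch of the "same interface" point Sven makes: the usual RDD-style pipeline running under pysparkling with no JVM or Hadoop. It assumes pysparkling's Context entry point and the core RDD methods (parallelize, flatMap, map, reduceByKey, collect) as described in the project's README; check the names against the release you actually use.

    # A minimal sketch, assuming pysparkling.Context mirrors the
    # PySpark SparkContext interface (per the project README).
    from pysparkling import Context

    sc = Context()

    # The same RDD-style code that would run under PySpark:
    counts = (
        sc.parallelize(['a cat', 'a dog', 'a cat'])
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)
          .collect()
    )
    print(counts)  # pairs like ('a', 3), ('cat', 2), ('dog', 1)

In principle, swapping Context() for a real SparkContext is the only change needed to move the same pipeline back onto a cluster.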
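And for the local DPark modes Davies mentions, a rough sketch of a word count in DPark's style, assuming the DparkContext entry point and the -m launch flag shown in the douban/dpark README (both are assumptions here, not verified against a release):

    # wordcount.py -- a sketch of DPark usage; DparkContext and the
    # method names below follow the douban/dpark README examples.
    from dpark import DparkContext

    dc = DparkContext()

    counts = (
        dc.textFile('/path/to/file.txt')
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)
          .collectAsMap()
    )
    print(counts)

If the README is current, the run mode is picked at launch time: single-threaded local execution by default, "python wordcount.py -m process" for local multiprocessing, and a Mesos master for cluster mode, all with the same script.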