I have to admit that I have never run DPark, but I think the goals are
very different. The purpose of pysparkling is not to reproduce Spark on a
cluster, but to provide a lightweight implementation with the same
interface that runs locally or on an API server. I still run PySpark on a
cluster to preprocess a large number of documents when training a
scikit-learn classifier, but I use pysparkling to preprocess single
documents before applying that classifier in API calls. The only
dependencies of pysparkling are "boto" and "requests", which it uses to
access files via "s3://" and "http://", whereas DPark needs a Mesos
cluster.
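
For a sense of the shared interface, here is a minimal sketch of the kind
of single-document preprocessing described above (the tokenize function
and file name are hypothetical; pysparkling's Context mirrors PySpark's
SparkContext, so swapping the two gives the cluster version):

    from pysparkling import Context

    # Context stands in for PySpark's SparkContext, so the same
    # pipeline runs locally without the JVM or Hadoop.
    sc = Context()

    # Hypothetical tokenizer for the preprocessing step.
    def tokenize(line):
        return line.lower().split()

    # textFile() also accepts "s3://" (via boto) and "http://"
    # (via requests) paths; "documents.txt" is a hypothetical file.
    tokens = sc.textFile('documents.txt').map(tokenize).collect()
    print(tokens)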

On Fri, May 29, 2015 at 2:46 PM Davies Liu <dav...@databricks.com> wrote:

> There is another implementation of the RDD interface in Python, called
> DPark [1]. Could you say a few words comparing the two?
>
> [1] https://github.com/douban/dpark/
>
> On Fri, May 29, 2015 at 8:29 AM, Sven Kreiss <s...@svenkreiss.com> wrote:
> > I wanted to share a Python implementation of RDDs: pysparkling.
> >
> >
> > http://trivial.io/post/120179819751/pysparkling-is-a-native-implementation-of-the
> >
> > The benefit is that the same code you use in PySpark on large datasets
> > can be applied with pysparkling to small datasets or single documents.
> > When running with pysparkling, there is no dependency on the Java
> > Virtual Machine or Hadoop.
> >
> > Sven
>
