DPark can also run locally without a Mesos cluster (single-threaded or multi-process).
I also think that running PySpark without the JVM in local mode will help development, so pysparkling and DPark are both useful.

On Fri, May 29, 2015 at 1:36 PM, Sven Kreiss <s...@svenkreiss.com> wrote:
> I have to admit that I never ran DPark. I think the goals are very
> different. The purpose of pysparkling is not to reproduce Spark on a
> cluster, but to have a lightweight implementation with the same interface
> that runs locally or on an API server. I still run PySpark on a cluster
> to preprocess a large number of documents to train a scikit-learn
> classifier, but use pysparkling to preprocess single documents before
> applying that classifier in API calls. The only dependencies of
> pysparkling are "boto" and "requests", used to access files via "s3://"
> or "http://", whereas DPark needs a Mesos cluster.
>
> On Fri, May 29, 2015 at 2:46 PM Davies Liu <dav...@databricks.com> wrote:
>>
>> There is another implementation of the RDD interface in Python, called
>> DPark [1]. Could you say a few words comparing the two?
>>
>> [1] https://github.com/douban/dpark/
>>
>> On Fri, May 29, 2015 at 8:29 AM, Sven Kreiss <s...@svenkreiss.com> wrote:
>> > I wanted to share a Python implementation of RDDs: pysparkling.
>> >
>> > http://trivial.io/post/120179819751/pysparkling-is-a-native-implementation-of-the
>> >
>> > The benefit is that you can apply the same code that you use in
>> > PySpark on large datasets in pysparkling on small datasets or single
>> > documents. When running with pysparkling, there is no dependency on
>> > the Java Virtual Machine or Hadoop.
>> >
>> > Sven
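For concreteness, here is a minimal sketch of the "same interface" point Sven makes: the usual RDD-style pipeline running under pysparkling with no JVM or Hadoop. It assumes pysparkling's Context entry point and the core RDD methods (parallelize, flatMap, map, reduceByKey, collect) as described in the project's README; check the names against the release you actually use.

    # A minimal sketch, assuming pysparkling.Context mirrors the
    # PySpark SparkContext interface (per the project README).
    from pysparkling import Context

    sc = Context()

    # The same RDD-style code that would run under PySpark:
    counts = (
        sc.parallelize(['a cat', 'a dog', 'a cat'])
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)
          .collect()
    )
    print(counts)  # pairs like ('a', 3), ('cat', 2), ('dog', 1)

In principle, swapping Context() for a real SparkContext is the only change needed to move the same pipeline back onto a cluster.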
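And for the local DPark modes Davies mentions, a rough sketch of a word count in DPark's style, assuming the DparkContext entry point and the -m launch flag shown in the douban/dpark README (both are assumptions here, not verified against a release):

    # wordcount.py -- a sketch of DPark usage; DparkContext and the
    # method names below follow the douban/dpark README examples.
    from dpark import DparkContext

    dc = DparkContext()

    counts = (
        dc.textFile('/path/to/file.txt')
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)
          .collectAsMap()
    )
    print(counts)

If the README is current, the run mode is picked at launch time: single-threaded local execution by default, "python wordcount.py -m process" for local multiprocessing, and a Mesos master for cluster mode, all with the same script.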