Re: deep learning with heterogeneous cloud computing using spark
Thanks Nick :) Abid, you may also want to check out http://conferences.oreilly.com/strata/big-data-conference-ny-2015/public/schedule/detail/43484, which describes our work on a combination of Spark and Tachyon for Deep Learning. We found significant gains in using Tachyon (with co-processing) for the "descent" step while Spark computes the gradients. The video was recently uploaded here: http://bit.ly/1JnvQAO.

Regards,
--
Algorithms of the Mind: http://bit.ly/1ReQvEW
Christopher Nguyen
CEO & Co-Founder
www.Arimo.com (née Adatao)
linkedin.com/in/ctnguyen
Re: Support R in Spark
Hi Kui, DDF (open sourced) also aims to do something similar, adding RDBMS idioms, and is already implemented on top of Spark. One philosophy is that the DDF API aggressively hides the notion of parallel datasets, exposing only (mutable) tables to users, who can apply R and other familiar data mining/machine learning idioms to them without having to know about the distributed representation underneath. You can still get to the underlying RDDs if you want to, simply by asking. This was launched at the July Spark Summit; see http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us.

Sent while mobile. Please excuse typos etc.

On Sep 4, 2014 1:59 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Thanks Kui. SparkR is a pretty young project, but there are a bunch of things we are working on. One of the main features is to expose a data frame API (https://sparkr.atlassian.net/browse/SPARKR-1), which we will be integrating with Spark's MLlib. At a high level this will allow R users to use a familiar API while making use of MLlib's efficient distributed implementation. This is the same strategy used for Python as well. We also do hope to merge SparkR with mainline Spark -- we have a few features to complete before that and plan to shoot for integration by Spark 1.3. Thanks, Shivaram

On Wed, Sep 3, 2014 at 9:24 PM, oppokui oppo...@gmail.com wrote: Thanks, Shivaram. No specific use case yet. We are trying to use R in our project, as our data scientists all know R. We had a concern about how R handles mass data. Spark does a better job in the big data area, and Spark ML focuses on predictive analysis, so we are thinking about whether we can merge R and Spark together. We tried SparkR and it is pretty easy to use, but we didn't see any feedback on this package from industry. It would be better if the Spark team had R support just like Scala/Java/Python.
Another question: if MLlib will re-implement all the well-known data mining algorithms in Spark, what is the purpose of using R? There is another option for us, H2O, which supports R natively. H2O is more friendly to data scientists, and I saw that H2O can also work on Spark (Sparkling Water). Is that better than using SparkR? Thanks and regards, Kui

On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Hi, do you have a specific use case where SparkR doesn't work well? We'd love to hear more about use cases and features that can be improved in SparkR. Thanks, Shivaram

On Wed, Sep 3, 2014 at 3:19 AM, oppokui oppo...@gmail.com wrote: Does the Spark ML team have a plan to support R scripts natively? There is a SparkR project, but it is not from the Spark team. Spark ML uses netlib-java to talk to native Fortran routines, or uses NumPy, so why not try to use R in some sense? R has lots of useful packages. If the Spark ML team can include R support, it will be very powerful. Any comment?
Re: Support R in Spark
Hi Kui, sorry about that. The link you mentioned is probably the one for the products; we don't have one pointing from adatao.com to ddf.io, but maybe we'll add it. As for access to the code base itself, I think the team has already created a GitHub repo for it and should open it up within a few weeks. There's some debate about whether to put out the implementation with Shark dependencies now, or the SparkSQL one, which has somewhat limited functionality and is not as well tested. I'll check and ping you when it is opened up. The license is Apache.

Sent while mobile. Please excuse typos etc.

On Sep 6, 2014 1:39 PM, oppokui oppo...@gmail.com wrote: Thanks, Christopher. I saw it before; it is amazing. Last time I tried to download it from Adatao, but there was no response after filling in the form. How can I download it or its source code? What is the license? Kui
Re: First Bay Area Tachyon meetup: August 25th, hosted by Yahoo! (Limited Space)
Fantastic!

Sent while mobile. Pls excuse typos etc.

On Aug 19, 2014 4:09 PM, Haoyuan Li haoyuan...@gmail.com wrote: Hi folks, We've posted the first Tachyon meetup, which will be on August 25th and is hosted by Yahoo! (limited space): http://www.meetup.com/Tachyon/events/200387252/. Hope to see you there! Best, Haoyuan
--
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
Re: How to save mllib model to hdfs and reload it
Hi Hoai-Thu, the issue of the private default constructor is unlikely to be the cause here, since Lance was already able to load/deserialize the model object. And on that side topic, I wish all serdes libraries would just use constructor.setAccessible(true) by default :-) Most of the time that privacy is not about serdes reflection restrictions.

Sent while mobile. Pls excuse typos etc.

On Aug 14, 2014 1:58 AM, Hoai-Thu Vuong thuv...@gmail.com wrote: A man in this community gave me a video: https://www.youtube.com/watch?v=sPhyePwo7FA. I asked the same question in this community and other guys helped me solve the problem. I was trying to load a MatrixFactorizationModel from an object file, but the compiler said that I could not create the object because the constructor is private. To solve this, I put my new object in the same package as MatrixFactorizationModel. Luckily, it works.

On Wed, Aug 13, 2014 at 9:20 PM, Christopher Nguyen c...@adatao.com wrote: Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to isolate the cause to serialization of the loaded model. And also try to serialize the deserialized (loaded) model manually, to see if that throws any visible exceptions.

On Aug 13, 2014 7:03 AM, lancezhange lancezha...@gmail.com wrote: My prediction code is simple enough, as follows:

  val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }

When the model is the loaded one, the above code just doesn't work. Can you catch the error? Thanks. PS: I use spark-shell in standalone mode, version 1.0.0.

View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12035.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-- Thu.
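The constructor.setAccessible(true) aside above can be sketched in a few lines of plain Scala. The PrivateModel class below is hypothetical and merely stands in for any class (such as MatrixFactorizationModel) whose constructor is private; nothing here is from the actual MLlib code:

```scala
import java.lang.reflect.Constructor

// Hypothetical stand-in for a class with a private constructor.
class PrivateModel private (val rank: Int)

// What a serdes library could do: reach the private constructor via
// reflection and call setAccessible(true) before instantiating.
val ctor: Constructor[PrivateModel] =
  classOf[PrivateModel].getDeclaredConstructor(classOf[Int])
ctor.setAccessible(true) // bypasses the private modifier
val m = ctor.newInstance(Integer.valueOf(10))
assert(m.rank == 10) // instantiated despite the private constructor
```

This is why serdes reflection usually isn't blocked by constructor privacy, whereas ordinary user code (as in Hoai-Thu's case) is.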
Re: How to save mllib model to hdfs and reload it
+1 to what Sean said. And if there are too many state/argument parameters for your taste, you can always create a dedicated (serializable) class to encapsulate them.

Sent while mobile. Pls excuse typos etc.

On Aug 13, 2014 6:58 AM, Sean Owen so...@cloudera.com wrote: PS: I think that solving NotSerializableException by adding 'transient' is usually a mistake. It's a band-aid on a design problem. 'transient' causes the default serialization mechanism not to serialize the field when the object is serialized. When deserialized, this field will be null, which often compromises the class's assumptions about its state. The keyword is only appropriate when the field can safely be recreated at any time -- things like cached values. In Java, this commonly comes up when declaring anonymous (therefore non-static) inner classes, which hold an invisible reference to the containing instance and can easily cause the enclosing class to be serialized when that's not necessary at all. Inner classes should be static in this case, if possible. Passing values as constructor params takes more code but lets you tightly control what the function references.

On Wed, Aug 13, 2014 at 2:47 PM, Jaideep Dhok jaideep.d...@inmobi.com wrote: Hi, I faced a similar issue when trying to run a map function with predict. In my case I had some non-serializable fields in my calling class; after making those fields transient, the error went away.
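Sean's warning can be demonstrated in plain Scala without Spark. The Scorer class and its field names below are made up for illustration: a @transient field comes back null after a serialization round trip (the same round trip a closure undergoes when shipped to an executor), while constructor-supplied state survives:

```scala
import java.io._

// Illustrative class: 'weights' travels via the constructor; 'cachedNorm'
// is a recomputable cache marked @transient, so it is skipped by
// serialization and comes back null.
class Scorer(val weights: Array[Double]) extends Serializable {
  @transient var cachedNorm: java.lang.Double = weights.map(w => w * w).sum
}

// Round-trip an object through Java serialization.
def roundTrip[T](obj: T): T = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(obj); oos.close()
  new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    .readObject().asInstanceOf[T]
}

val s = roundTrip(new Scorer(Array(3.0, 4.0)))
assert(s.weights.sameElements(Array(3.0, 4.0))) // constructor state survives
assert(s.cachedNorm == null)                    // transient cache is gone
```

The null cachedNorm is exactly the "compromised assumptions about state" Sean describes; it is only safe because it can be recomputed from weights at any time.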
Re: How to save mllib model to hdfs and reload it
Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to isolate the cause to serialization of the loaded model. And also try to serialize the deserialized (loaded) model manually, to see if that throws any visible exceptions.

Sent while mobile. Pls excuse typos etc.

On Aug 13, 2014 7:03 AM, lancezhange lancezha...@gmail.com wrote: My prediction code is simple enough, as follows:

  val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }

When the model is the loaded one, the above code just doesn't work. Can you catch the error? Thanks. PS: I use spark-shell in standalone mode, version 1.0.0.
Re: How to parallelize model fitting with different cross-validation folds?
Hi sparkuser2345, I'm inferring the problem statement is something like "how do I make this complete faster (given my compute resources)?" Several comments.

First, Spark only allows launching parallel tasks from the driver, not from workers, which is why you're seeing the exception when you try. Whether the latter is a sensible/doable idea is another discussion, but I can appreciate why many people assume it should be possible.

Second, on optimization: you may be able to apply Sean's idea about (thread) parallelism at the driver, combined with the knowledge that these cluster tasks often bottleneck while competing for the same resources at the same time (CPU vs. disk vs. network, etc.). You may be able to achieve some performance gain by randomizing these timings. This is not unlike GMail randomizing user storage locations around the world for load balancing. Here, you would partition each of your RDDs into a different number of partitions, making some tasks larger than others, so that some may be in a CPU-intensive map while others are shuffling data around the network. This is rather cluster-specific; I'd be interested in what you learn from such an exercise.

Third, I always find it useful to consider doing as much as possible in one pass, subject to memory limits (e.g., mapPartitions() vs. map()), thus minimizing map/shuffle/reduce boundaries with their context switches and data shuffling. In this case, notice how you're running training+prediction k times over mostly the same rows, with map/reduce boundaries in between. While the training phase is sealed in this context, you may be able to improve performance by collecting all k models together and doing an [m x k] prediction all at once, which may end up being faster.
Finally, as implied by the above, for the very common k-fold cross-validation pattern, the algorithm itself might be written to be smart enough to take both train and test data and do the right thing within itself, thus obviating the need for the user to prepare k data sets and run over them serially, and likely saving a lot of repeated computation in the right internal places. Enjoy,
--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Sat, Jul 5, 2014 at 1:50 AM, Sean Owen so...@cloudera.com wrote: If you call .par on data_kfolded it will become a parallel collection in Scala, and so the maps will happen in parallel.

On Jul 5, 2014 9:35 AM, sparkuser2345 hm.spark.u...@gmail.com wrote: Hi, I am trying to fit a logistic regression model with cross-validation in Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded where each element is a pair of RDDs containing the training and test data:

  (training_data: RDD[org.apache.spark.mllib.regression.LabeledPoint],
   test_data: RDD[org.apache.spark.mllib.regression.LabeledPoint])

scala> data_kfolded
res21: Array[(org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint], org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint])] = Array((MappedRDD[9] at map at console:24,MappedRDD[7] at map at console:23), (MappedRDD[13] at map at console:24,MappedRDD[11] at map at console:23), (MappedRDD[17] at map at console:24,MappedRDD[15] at map at console:23))

Everything works fine when using data_kfolded:

  val validationErrors = data_kfolded.map { datafold =>
    val svmAlg = new SVMWithSGD()
    val model_reg = svmAlg.run(datafold._1)
    val labelAndPreds = datafold._2.map { point =>
      val prediction = model_reg.predict(point.features)
      (point.label, prediction)
    }
    val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / datafold._2.count
    trainErr.toDouble
  }

scala> validationErrors
res1: Array[Double] = Array(0.8819836785938481, 0.07082521117608837, 0.29833546734955185)

However, I have understood that the models are not fitted in parallel, as data_kfolded is not an RDD (although it's an array of pairs of RDDs). When running the same code with data_kfolded replaced by sc.parallelize(data_kfolded), I get a null pointer exception from the line where the run method of the SVMWithSGD object is called with the training data. I guess this is somehow related to the fact that RDDs can't be accessed from inside a closure. I fail to understand, though, why the first version works and the second doesn't. Most importantly, is there a way to fit the models in parallel? I would really appreciate your help.

  val validationErrors = sc.parallelize(data_kfolded).map { datafold =>
    val svmAlg = new SVMWithSGD()
    val model_reg = svmAlg.run(datafold._1) // This line gives a null pointer exception
    val labelAndPreds = datafold._2.map { point =>
      val prediction = model_reg.predict(point.features)
      (point.label, prediction)
    }
    val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / datafold._2.count
    trainErr.toDouble
  }
  validationErrors.collect
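For completeness, a minimal sketch of the driver-side thread parallelism Sean suggests, written with plain Futures (Sean's .par achieves the same effect via a fork-join pool) and with a trivial computation standing in for SVMWithSGD training; everything here is illustrative and Spark-free:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Each "fold" gets its own Future on the driver, so the k trainings
// run concurrently instead of serially.
val folds = Vector(1, 2, 3, 4)

val futures = folds.map { k =>
  Future {
    k * k // stand-in for: train on fold k, return its validation error
  }
}
val results = Await.result(Future.sequence(futures), 30.seconds)
assert(results == Vector(1, 4, 9, 16)) // results preserve fold order
```

In the real version, the body of each Future would call svmAlg.run(trainingRdd) and compute the fold's error; each such call still launches its own distributed Spark job from the driver, which is what makes this legal where sc.parallelize(data_kfolded) is not.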
Re: initial basic question from new user
Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want for your use case. As for Parquet support, that newly arrived in Spark 1.0.0 together with SparkSQL, so continue to watch this space. Gerard's suggestion to look at JobServer, which you can generalize as building a long-running application that allows multiple clients to load/share/persist/save/collaborate on RDDs, addresses a larger, more complex use case. That is indeed the job of a higher-level application, subject to a wide variety of higher-level design choices. A number of us have successfully built Spark-based apps around that model.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Thu, Jun 12, 2014 at 4:35 AM, Toby Douglass t...@avocet.io wrote:

On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas gerard.m...@gmail.com wrote: The goal of rdd.persist is to create a cached RDD that breaks the DAG lineage. Therefore, computations *in the same job* that use that RDD can re-use the intermediate result, but it's not meant to survive between job runs.

As I understand it, Spark is designed for interactive querying, in the sense that the caching of intermediate results eliminates the need to recompute those results. However, if intermediate results last only for the duration of a job (say, a Python script), how exactly is interactive querying actually performed? A script is not an interactive medium. Is the shell the only medium for interactive querying? Consider a common usage case: a web site which offers reporting over a large data set, where users issue arbitrary queries. A few queries (just with different arguments) dominate the query load, so we thought to create intermediate RDDs to service those queries, so that only those order-of-magnitude-smaller RDDs would need to be processed. Where this is not possible, we can only use Spark for reporting by issuing each query over the whole data set -- e.g. Spark is just like Impala, just like Presto, just like [nnn]. The enormous benefit of RDDs -- the entire point of Spark, so profoundly useful here -- is not available. What a huge and unexpected loss! Spark seemingly renders itself ordinary. It is for this reason I am surprised to find this functionality is not available.

If you need to ad-hoc persist to files, you can save RDDs using rdd.saveAsObjectFile(...) [1] and load them afterwards using sparkContext.objectFile(...)

I've been using this site for docs: http://spark.apache.org. There we find, through the top-of-the-page menus, the link "API Docs -> Python API", which brings us to http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html, where the page does not show the function saveAsObjectFile(). From your link, https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD, I now find what appears to be a second and more complete set of the same documentation, using a different web interface to boot. It appears that there are two sets of documentation for the same APIs, where one set is out of date and the other is not, and the out-of-date set is the one linked from the main site.

Given that our agg sizes will exceed memory, we expect to cache them to disk, so save-as-object (assuming there are no out-of-the-ordinary performance issues) may solve the problem, but I was hoping to store data in a column-oriented format. However, I think this in general is not possible: Spark can *read* Parquet, but I think it cannot write Parquet as a disk-based RDD format.

If you want to preserve the RDDs in memory between job runs, you should look at the Spark-JobServer [3]

Thank you. I view this with some trepidation.
It took two man-days to get Spark running (and I've spent another man-day now trying to get a map/reduce job to run; I'm getting there, but not there yet) -- the bring-up/config experience for end users is not tested or accurately documented (although, to be clear, no better and no worse than is normal for open source; Spark is not exceptional). Having to bring up another open-source project is a significant barrier to entry; it's always such a headache. The save-to-disk function you mentioned earlier will allow intermediate RDDs to go to disk, but we do in fact have a use case where in-memory would be useful; it might allow us to ditch Cassandra, which would be wonderful, since it would reduce the system count by one. I have to say, having to install JobServer to achieve this one end seems an extraordinarily heavyweight solution -- a whole new application, when all that is wished for is that Spark persist RDDs across jobs; so small a feature seems to open the door to so much functionality.
Re: Can this be done in map-reduce technique (in parallel)
Lakshmi, this is orthogonal to your question, but in case it's useful: it sounds like you're trying to determine the home location of a user, or something similar. If that's the problem statement, the data pattern may suggest a far more computationally efficient approach. For example, first map all (lat, long) pairs into geocells of a desired resolution (e.g., 10m or 100m), then count occurrences of geocells instead. There are simple libraries that map any (lat, long) pair to a geocell (string) ID very efficiently.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Wed, Jun 4, 2014 at 3:49 AM, lmk lakshmi.muralikrish...@gmail.com wrote: Hi, I am a new Spark user. Please let me know how to handle the following scenario. I have a data set with the following fields:

1. DeviceId
2. Latitude
3. Longitude
4. IP address
5. Datetime
6. Mobile application name

With the above data, I would like to perform the following steps:

1. Collect all lat and lon pairs for each IP address: (ip1, (lat1,lon1), (lat2,lon2)), (ip2, (lat3,lon3), (lat4,lon4))
2. For each IP:
   1. Find the distance between each lat/lon coordinate pair and all the other pairs under the same IP.
   2. Select those coordinates whose distances fall under a specific threshold (say 100m).
   3. Find the coordinate pair with the maximum occurrences.

In this case, how can I iterate over and compare each coordinate pair with all the other pairs? Can this be done in a distributed manner, as this data set is going to have a few million records? Can we do this with map/reduce commands? Thanks.
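A minimal, self-contained sketch of the geocell idea: the geocellId function below is a made-up stand-in for a real geocell library, using the rough fact that 0.001 degree of latitude is about 111 m; the coordinates are invented for illustration:

```scala
// Snap each (lat, lon) to a ~100 m grid cell and count cell occurrences,
// instead of computing pairwise distances between all points.
def geocellId(lat: Double, lon: Double, cellDeg: Double = 0.001): String = {
  val latCell = math.floor(lat / cellDeg).toLong
  val lonCell = math.floor(lon / cellDeg).toLong
  s"$latCell:$lonCell"
}

// Three sightings within ~100 m of each other, and one elsewhere:
val points = Seq(
  (12.97170, 77.59460),
  (12.97175, 77.59465),
  (12.97172, 77.59468),
  (12.98152, 77.60149)
)

// Count occurrences per cell; the most-visited cell is the "home" candidate.
val topCell = points
  .groupBy { case (lat, lon) => geocellId(lat, lon) }
  .maxBy(_._2.size)

assert(topCell._2.size == 3)
```

In Spark this becomes a map to (ip, cellId) keys followed by a count and a per-IP max, turning an O(n^2) pairwise-distance step into a linear pass.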
Re: Announcing Spark 1.0.0
Awesome work, Pat et al.!
--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell pwend...@gmail.com wrote: I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 is a milestone release as the first in the 1.0 line of releases, providing API stability for Spark's core interfaces. Spark 1.0.0 is Spark's largest release ever, with contributions from 117 developers. I'd like to thank everyone involved in this release; it was truly a community effort, with fixes, features, and optimizations contributed by dozens of organizations. This release expands Spark's standard libraries, introducing a new SQL package (SparkSQL) which lets users integrate SQL queries into existing Spark workflows. MLlib, Spark's machine learning library, is expanded with sparse vector support and several new algorithms. The GraphX and Streaming libraries also introduce new features and optimizations. Spark's core engine adds support for secured YARN clusters, a unified tool for submitting Spark applications, and several performance and stability improvements. Finally, Spark adds support for Java 8 lambda syntax and improves coverage of the Java and Python APIs. Those features only scratch the surface; check out the release notes here: http://spark.apache.org/releases/spark-release-1-0-0.html. Note that since the release artifacts were posted recently, certain mirrors may not have working downloads for a few hours. - Patrick
Re: Spark Memory Bounds
Keith, do you mean bound as in (a) strictly control to some quantifiable limit, or (b) try to minimize the amount used by each task?

If (a), then that is outside the scope of Spark's memory management, which you should think of as an application-level (that is, above-JVM) mechanism. In this scope, Spark voluntarily tracks and limits the amount of memory it uses for explicitly known data structures, such as RDDs. What Spark cannot do is, e.g., control or manage the amount of JVM memory that a given piece of user code might take up. For example, I might write some closure code that allocates a large array of doubles unbeknownst to Spark.

If (b), then your thinking is in the right direction, although quite imperfect, because of things like the example above. We often experience OOME if we're not careful with job partitioning. What I think Spark needs to evolve to include is at least a mechanism for application-level hints about task memory requirements. We might work on this and submit a PR for it.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Tue, May 27, 2014 at 5:33 PM, Keith Simmons ke...@pulse.io wrote: I'm trying to determine how to bound my memory use in a job working with more data than can simultaneously fit in RAM. From reading the tuning guide, my impression is that Spark's memory usage is roughly the following:

(A) in-memory RDD use + (B) in-memory shuffle use + (C) transient memory used by all currently running tasks

I can bound A with spark.storage.memoryFraction, and I can bound B with spark.shuffle.memoryFraction. I'm wondering how to bound C. It's been hinted a few times on this mailing list that you can reduce memory use by increasing the number of partitions. That leads me to believe that the amount of transient memory roughly follows:

total_data_set_size / number_of_partitions * number_of_tasks_simultaneously_running_per_machine

Does this sound right?
In other words, as I increase the number of partitions, the size of each partition decreases, and since each task processes a single partition and there is a bounded number of tasks in flight, my memory use has a rough upper limit. Keith
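Keith's estimate for (C) is easy to sanity-check with a toy calculation; all figures below are hypothetical:

```scala
// Transient task memory per machine, per the formula in the thread:
// dataset_size / num_partitions * concurrent_tasks_per_machine.
def transientBytesPerMachine(datasetBytes: Long,
                             numPartitions: Int,
                             concurrentTasksPerMachine: Int): Long =
  datasetBytes / numPartitions * concurrentTasksPerMachine

// 64 GB dataset, 512 partitions, 8 concurrent tasks (cores) per machine:
val est = transientBytesPerMachine(64L << 30, 512, 8)
assert(est == (1L << 30)) // roughly 1 GB of transient memory per machine
```

Doubling the partition count halves the estimate, which matches the mailing-list hint that more partitions reduce memory pressure; per Christopher's caveat above, user code can still allocate arbitrarily beyond this bound.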
Re: is Mesos falling out of favor?
Paco, that's a great video reference, thanks. To be fair to our friends at Yahoo, who have done a tremendous amount to help advance the cause of the BDAS stack, it's not FUD coming from them, certainly not in any organized or intentional manner. In vacuo we prefer Mesos ourselves, but we also can't ignore the fact that in the larger market, many enterprise technology-stack decisions are made based on existing vendor support relationships. And with a view to Mesos, super happy to see Mesosphere growing!

Sent while mobile. Pls excuse typos etc.

That's FUD. Tracking the Mesos and Spark use cases, there are very large production deployments of the two together. Some are rather private, but others are being surfaced. IMHO, one of the most amazing case studies is from Christina Delimitrou: http://youtu.be/YpmElyi94AA. For a tutorial, use the following, but upgrade it to the latest production Spark; there was a related O'Reilly webcast and Strata tutorial as well: http://mesosphere.io/learn/run-spark-on-mesos/. FWIW, I teach Intro to Spark with sections on CM4, YARN, Mesos, etc. Based on lots of student experiences, Mesos is clearly the shortest path to deploying a Spark cluster if you want to leverage the robustness, multi-tenancy for mixed workloads, lower ops overhead, etc., that show up repeatedly in the use-case analyses. My opinion only, and not that of any of my clients: don't believe the FUD from YHOO unless you really want to be stuck in 2009.

On Wed, May 7, 2014 at 8:30 AM, deric barton.to...@gmail.com wrote: I'm also using SPARK_EXECUTOR_URI right now, though I would prefer distributing Spark as a binary package. For running examples with `./bin/run-example ...` it works fine; however, tasks from spark-shell are getting lost:

  Error: Could not find or load main class org.apache.spark.executor.MesosExecutorBackend

which looks more like a problem with sbin/spark-executor and missing paths to the jar. Has anyone encountered this error before?
I guess Yahoo invested quite a lot of effort into YARN and Spark integration (moreover, with Mahout migrating to Spark, there's much more interest in Hadoop and Spark integration). If some Mesos company were working on Spark-Mesos integration, it could be at least on the same level. I don't see any other reason why YARN would be better than Mesos; personally I like the latter. However, I haven't checked YARN for a while; maybe they've made significant progress. I think Mesos is more universal and flexible than YARN.
Re: Opinions stratosphere
Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted such a comparative study as a Masters thesis: http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf

According to this snapshot (c. 2013), Stratosphere differs from Spark in not having an explicit concept of an in-memory dataset (e.g., the RDD). In principle this could be argued to be an implementation detail; the operators and execution plan/data flow are of primary concern in the API, and the data representations/materializations are otherwise unspecified. But in practice, for long-running interactive applications, I consider RDDs to be of fundamental, first-class-citizen importance, and the key distinguishing feature of Spark's model vs. other in-memory approaches that treat memory merely as an implicit cache.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Tue, Nov 26, 2013 at 1:26 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I don't know a lot about it except from the research side, where the team has done interesting optimization work for these types of applications. In terms of the engine, one thing I'm not sure of is whether Stratosphere allows explicit caching of datasets (similar to RDD.cache()) and interactive queries (similar to spark-shell). But it's definitely an interesting project to watch. Matei

On Nov 22, 2013, at 4:17 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, that's what I thought, but as per the slides on http://www.stratosphere.eu they seem to know about Spark, and the Scala API does look similar. I found the PACT model interesting. I would like to know if Matei or other core committers have something to weigh in on. -- Ankur

On 22 Nov 2013, at 16:05, Patrick Wendell pwend...@gmail.com wrote: I've never seen that project before; it would be interesting to get a comparison. It seems to offer a much lower-level API.
For instance, this is a word-count program: https://github.com/stratosphere/stratosphere/blob/master/pact/pact-examples/src/main/java/eu/stratosphere/pact/example/wordcount/WordCount.java On Thu, Nov 21, 2013 at 3:15 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, I was just curious about https://github.com/stratosphere/stratosphere and how Spark compares to it. Does anyone have any experience with it and comments to share? -- Ankur
Re: Spark and HBase
Flavio, the two are best at two orthogonal use cases: HBase on the transactional side, and Spark on the analytic side. Spark is not intended for row-based random-access updates, but it is far more flexible and efficient for dataset-scale aggregations and general computations. So yes, you can easily see them deployed side by side in a given enterprise. Sent while mobile. Pls excuse typos etc. On Apr 8, 2014 5:58 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi everybody, in recent days I have looked a bit at the evolution of the big data stacks, and it seems that HBase is somehow fading away in favour of Spark+HDFS. Am I correct? Do you think that Spark and HBase should work together or not? Best regards, Flavio
Re: Spark on other parallel filesystems
Avati, depending on your specific deployment config, there can be up to a 10x difference in data loading time. For example, we routinely parallel-load 10+GB data files across small 8-node clusters in 10-20 seconds, which would take about 100s if bottlenecked over a 1GigE network. That's about the maximum difference for that config. If you use multiple local SSDs the difference can be correspondingly greater, and likewise up to 10x smaller for 10GigE networks. Lastly, an interesting dimension to consider is that the difference diminishes as your data size gets much larger relative to your cluster size, since the load ops have to be serialized in time anyway. There is no difference after loading. Sent while mobile. Pls excuse typos etc. On Apr 4, 2014 10:45 PM, Anand Avati av...@gluster.org wrote: On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia matei.zaha...@gmail.com wrote: As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won't expose data locality info to Spark, the way HDFS does. That may not matter if it's a network-mounted file system though. Is the locality querying mechanism specific to HDFS mode, or is it possible to implement plugins in Spark to query location in other ways on other filesystems? I ask because glusterfs can expose the data location of a file through virtual extended attributes, and I would be interested in making Spark exploit that locality when the file location is specified as glusterfs:// (or querying the xattr blindly for file://). How much of a difference does data locality make for Spark use cases anyway (since most of the computation happens in memory)? Any sort of numbers? Thanks!
Avati Matei On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy ven...@yarcdata.com wrote: All Are there any drawbacks or technical challenges (or any information, really) related to using Spark directly on a global parallel filesystem like Lustre/GPFS? Any idea of what would be involved in doing a minimal proof of concept? Is it just possible to run Spark unmodified (without the HDFS substrate) for a start, or will that not work at all? I do know that it's possible to implement Tachyon on Lustre and get the HDFS interface - just looking at other options. Venkat
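For intuition on where the numbers in that thread come from, here is a back-of-envelope calculation in plain Java. The node count, per-disk throughput, and link speed are illustrative assumptions (not measurements from the thread); it just shows why an 8-node parallel local load lands around 12-13s while a single 1GigE link serializes the same 10GB to roughly 80s.

```java
public class LoadTimeEstimate {
    // Seconds to load `bytes` of data when each of `nodes` nodes
    // reads its own share at `bytesPerSec` (e.g., from local disk).
    static double parallelLoadSeconds(long bytes, int nodes, double bytesPerSec) {
        return (bytes / (double) nodes) / bytesPerSec;
    }

    // Seconds to pull the whole file through one shared network link.
    static double networkBottleneckSeconds(long bytes, double linkBytesPerSec) {
        return bytes / linkBytesPerSec;
    }

    public static void main(String[] args) {
        long tenGb = 10L * 1_000_000_000L; // 10 GB file
        double localDisk = 100e6;          // ~100 MB/s per local disk (assumed)
        double oneGigE = 125e6;            // 1 GigE is ~125 MB/s of payload

        // 8 nodes each reading 1/8 of the file from local disk: ~12-13 s
        System.out.printf("parallel local: ~%.0f s%n",
                parallelLoadSeconds(tenGb, 8, localDisk));
        // the same file bottlenecked over a single 1 GigE link: ~80 s
        System.out.printf("1GigE bottleneck: ~%.0f s%n",
                networkBottleneckSeconds(tenGb, oneGigE));
    }
}
```

With faster local storage (SSDs) the parallel number shrinks further, while the network number is fixed by the link, which is the "up to 10x" gap described above.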
Re: Cross validation is missing in machine learning examples
Aureliano, you're correct that this is not validation error, which is computed as the residuals on out-of-training-sample data and helps minimize overfit variance. However, in this example the errors are correctly referred to as training error, which is what you might compute on a per-iteration basis in a gradient-descent optimizer, in order to see how you're doing with respect to minimizing the in-sample residuals. There's nothing special about Spark ML algorithms that claims to escape these bias-variance considerations. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Sat, Mar 29, 2014 at 10:25 PM, Aureliano Buendia buendia...@gmail.com wrote: Hi, I noticed the Spark machine learning examples use training data to validate regression models. For instance, in the linear regression example (http://spark.apache.org/docs/0.9.0/mllib-guide.html):

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

... Here training data was used to validate a model which was created from the very same training data. This gives a biased estimate, and cross validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29) is missing here. In order to cross validate, we need to partition the data into in-sample for training and out-of-sample for validation. Please correct me if this does not apply to the ML algorithms implemented in Spark.
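To make the in-sample vs. out-of-sample distinction concrete, here is a minimal holdout-split sketch in plain Java (not MLlib; the 70/30 ratio, the fixed seed, and the trivial mean-of-labels "model" are illustrative assumptions). Later Spark versions provide a similar idiom directly on RDDs via randomSplit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class HoldoutSplit {
    // Split {feature, label} rows into ~70% training / ~30% validation
    // using a seeded shuffle for reproducibility.
    static List<List<double[]>> split(List<double[]> data, long seed) {
        List<double[]> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) (shuffled.size() * 0.7);
        return Arrays.asList(shuffled.subList(0, cut),
                             shuffled.subList(cut, shuffled.size()));
    }

    // Mean squared error of a constant prediction over {feature, label} rows.
    static double mse(List<double[]> rows, double prediction) {
        double sum = 0;
        for (double[] r : rows) { double e = r[1] - prediction; sum += e * e; }
        return sum / rows.size();
    }

    public static void main(String[] args) {
        List<double[]> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add(new double[] { i, 2.0 * i });

        List<List<double[]>> parts = split(data, 42L);
        List<double[]> train = parts.get(0), validation = parts.get(1);

        // "Fit" the trivial model on the training split only...
        double meanLabel = 0;
        for (double[] r : train) meanLabel += r[1];
        meanLabel /= train.size();

        // ...then report error on both splits; only the second number
        // is validation error in the sense discussed above.
        System.out.println("training MSE:   " + mse(train, meanLabel));
        System.out.println("validation MSE: " + mse(validation, meanLabel));
    }
}
```

The same holdout discipline carries over unchanged when the constant model is replaced by a trained regression model: fit on the first split, report error on the second.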
Re: Mutable tagging RDD rows ?
Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to get what you want is to transform to another RDD. But you might look at MutablePair ( https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala) to see if the semantics meet your needs. Alternatively you can consider: 1. Build or provide a fast lookup service that stores and returns the mutable information keyed by the RDD row IDs, or 2. Use DDF (Distributed DataFrame), which we'll make available in the near future, and which will give you fully mutable-row table semantics. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Fri, Mar 28, 2014 at 5:16 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Hey guys, I need to tag individual RDD lines with some values. This tag value would change at every iteration. Is this possible with RDD (I suppose this is sort of like a mutable RDD, but it's more)? If not, what would be the best way to do something like this? Basically, we need to keep mutable information per data row (this would be something much smaller than the actual data row, however). Thanks
Re: Mutable tagging RDD rows ?
Sung Hwan, yes, I'm saying exactly what you interpreted, including that if you tried it, it would (mostly) work, and my uncertainty with respect to guarantees on the semantics. Definitely there would be no fault tolerance if the mutations depend on state that is not captured in the RDD lineage. DDF is to RDD as RDD is to HDFS. Not a perfect analogy, but the point is that it's an abstraction above, with all attendant implications, pluses and minuses. With DDFs you get to think of everything as tables with schemas, while the underlying complexity of mutability and data representation is hidden away. You also get rich idioms to operate on those tables like filtering, projection, subsetting, handling of missing data (NAs), dummy-column generation, data mining statistics and machine learning, etc. In some aspects it replaces a lot of boilerplate analytics that you don't want to re-invent over and over again, e.g., FiveNum or XTabs. So instead of 100 lines of code, it's 4. In other aspects it allows you to easily apply arbitrary machine learning algorithms without having to think too hard about getting the data types just right. Etc. But you would also find yourself wanting access to the underlying RDDs for their full semantic flexibility. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Fri, Mar 28, 2014 at 8:46 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Thanks Chris, I'm not exactly sure what you mean with MutablePair, but are you saying that we could create RDD[MutablePair] and modify individual rows? If so, will that play nicely with RDD's lineage and fault tolerance? As for the alternatives, I don't think 1 is something we want to do, since that would require another complex system we'll have to implement. Is DDF going to be an alternative to RDD? Thanks again!
On Fri, Mar 28, 2014 at 7:02 PM, Christopher Nguyen c...@adatao.com wrote: Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to get what you want is to transform to another RDD. But you might look at MutablePair ( https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala) to see if the semantics meet your needs. Alternatively you can consider: 1. Build or provide a fast lookup service that stores and returns the mutable information keyed by the RDD row IDs, or 2. Use DDF (Distributed DataFrame), which we'll make available in the near future, and which will give you fully mutable-row table semantics. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Fri, Mar 28, 2014 at 5:16 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Hey guys, I need to tag individual RDD lines with some values. This tag value would change at every iteration. Is this possible with RDD (I suppose this is sort of like a mutable RDD, but it's more)? If not, what would be the best way to do something like this? Basically, we need to keep mutable information per data row (this would be something much smaller than the actual data row, however). Thanks
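Option 1 in the thread above (a fast lookup service keyed by row IDs) can be sketched in-process with a concurrent map. In a real cluster this would be an external service reachable from every executor; the class and method names below are illustrative, not a Spark or DDF API. The point is that the mutable tags live outside the immutable RDD, keyed by a stable row ID.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for a per-row tag store kept outside the immutable RDD.
// Thread-safe, so concurrent tasks can read/write tags by row ID.
public class TagStore {
    private final Map<Long, Double> tags = new ConcurrentHashMap<>();

    public void setTag(long rowId, double tag) {
        tags.put(rowId, tag);
    }

    public double getTag(long rowId, double defaultTag) {
        return tags.getOrDefault(rowId, defaultTag);
    }

    public static void main(String[] args) {
        TagStore store = new TagStore();
        // Iteration 1: initialize tags for rows 0..4
        for (long id = 0; id < 5; id++) store.setTag(id, 0.0);
        // Iteration 2: mutate the tag for row 3 without touching the data rows
        store.setTag(3L, 1.5);
        System.out.println(store.getTag(3L, -1.0)); // prints 1.5
    }
}
```

Note the caveat from the thread still applies: state held outside the RDD is not captured in the lineage, so it is not recovered automatically on task failure.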
Re: Announcing Spark SQL
+1 Michael, Reynold et al. This is key to some of the things we're doing. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Wed, Mar 26, 2014 at 2:58 PM, Michael Armbrust mich...@databricks.comwrote: Hey Everyone, This already went out to the dev list, but I wanted to put a pointer here as well to a new feature we are pretty excited about for Spark 1.0. http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html Michael
Re: Running a task once on each executor
Deenar, when you say just once, have you defined across what (e.g., across multiple threads in the same JVM on the same machine)? In principle you can have multiple executors on the same machine. In any case, assuming it's the same JVM, have you considered using a singleton that maintains done/not-done state, and that is invoked by each of the instances (TaskNonce.getSingleton().doThisOnce())? You can, e.g., mark the state boolean transient to prevent it from going through serdes. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Tue, Mar 25, 2014 at 10:03 AM, deenar.toraskar deenar.toras...@db.com wrote: Hi Is there a way in Spark to run a function on each executor just once? I have a couple of use cases. a) I use an external library that is a singleton. It keeps some global state and provides some functions to manipulate it (e.g. reclaim memory, etc.). I want to check the global state of this library on each executor. b) To get JVM stats or instrumentation on each executor. Currently I have a crude way of achieving something similar: I just run a map on a large RDD that is hash partitioned, but this does not guarantee that the job would run just once. Deenar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Running a task once on each executor
Deenar, the singleton pattern I'm suggesting would look something like this:

public class TaskNonce {
  private transient boolean mIsAlreadyDone;
  private static transient TaskNonce mSingleton = new TaskNonce();
  private transient Object mSyncObject = new Object();

  public static TaskNonce getSingleton() { return mSingleton; }

  public void doThisOnce() {
    if (mIsAlreadyDone) return;
    synchronized (mSyncObject) {
      if (mIsAlreadyDone) return; // double-check under the lock
      mIsAlreadyDone = true;
      ...
    }
  }
}

which you would invoke as TaskNonce.getSingleton().doThisOnce() from within the map closure. If you're using the Spark Java API, you can put all this code in the mapper class itself. There is no need to require one-row RDD partitions to achieve what you want, if I understand your problem statement correctly. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Tue, Mar 25, 2014 at 11:07 AM, deenar.toraskar deenar.toras...@db.com wrote: Christopher It is once per JVM. TaskNonce would meet my needs. I guess if I want it once per thread, then a ThreadLocal would do the same. But how do I invoke TaskNonce? What is the best way to generate an RDD to ensure that there is one element per executor? Deenar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3208.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
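Here is a runnable, self-contained sketch of the same run-once idiom exercised from multiple threads (the class and field names are illustrative, not a Spark API). Note the double-check of the flag inside the synchronized block: without it, two threads that both pass the unsynchronized fast path could each execute the body.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Demo: many threads call doOnce(), but the guarded body runs exactly
// once per JVM, mirroring the TaskNonce pattern from the thread above.
public class RunOnceDemo {
    static final AtomicInteger runs = new AtomicInteger();
    private static volatile boolean done;
    private static final Object lock = new Object();

    static void doOnce() {
        if (done) return;                 // fast path, no locking
        synchronized (lock) {
            if (done) return;             // double-check under the lock
            runs.incrementAndGet();       // ...the once-per-JVM work...
            done = true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[10];     // simulate 10 parallel task threads
        for (int i = 0; i < ts.length; i++) ts[i] = new Thread(RunOnceDemo::doOnce);
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        System.out.println("body ran " + runs.get() + " time(s)");
    }
}
```

In the mailing-list version the flag is additionally marked transient so that a serialized copy of the closure does not carry a stale "already done" state to another JVM; that concern does not arise in this single-JVM demo.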
Re: Spark enables us to process Big Data on an ARM cluster !!
Chanwit, that is awesome! Improvements in shuffle operations should help improve life even more for you. Great to see a data point on ARM. Sent while mobile. Pls excuse typos etc. On Mar 18, 2014 7:36 PM, Chanwit Kaewkasi chan...@gmail.com wrote: Hi all, We are a small team doing research on low-power (and low-cost) ARM clusters. We built a 20-node ARM cluster that is able to run Hadoop. But as all of you know, Hadoop performs on-disk operations, so it's not suitable for a constrained machine powered by ARM. We then switched to Spark and have to say wow!! Spark / HDFS enables us to crunch Wikipedia articles (of year 2012), 34GB in size, in 1h50m. We have identified the bottleneck, and it's our 100 Mbit network. Here's the cluster: https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/Mk-I_SSD.png And this is what we got from Spark's shell: https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/result_00.png I think it's the first ARM cluster that can process a non-trivial size of Big Data. (Please correct me if I'm wrong) I really want to thank the Spark team that makes this possible !! Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit
Re: major Spark performance problem
Dana, When you run multiple applications under Spark, and if each application takes up the entire cluster resources, it is expected that one will block the other completely; thus you're seeing the wall times add together sequentially. In addition there is some overhead associated with starting up a new application/SparkContext. Your other mode of sharing a single SparkContext, if your use case allows it, is more promising in that workers are available to work on tasks in parallel (but ultimately still subject to maximum resource limits). Without knowing what your actual workload is, it's hard to tell in absolute terms whether 12 seconds is reasonable or not. One reason for the jump from 12s in local mode to 40s in cluster mode would be the HBase bottleneck: you apparently would have 2x10=20 clients going against the HBase data source instead of 1 (or however many local threads you have). Assuming this is an increase of useful work output by a factor of 20x, a jump from 12s to 40s wall time is actually quite attractive. NB: given my assumption that the HBase data source is not parallelized along with the Spark cluster, you would run into sublinear performance issues (HBase-perf-limited or network-bandwidth-limited) as you scale out your cluster size. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Thu, Mar 6, 2014 at 11:49 AM, Livni, Dana dana.li...@intel.com wrote: Hi all, We have a big issue and would like to know if anyone has any insights or ideas. The problem is composed of two connected problems. 1. Run time of a single application. 2. Run time of multiple applications in parallel is almost linear with the run time of a single application. We have written a Spark application fetching its data from HBase. We are running the application using the YARN-client resource manager. The cluster has 2 nodes (both used as HBase data nodes and Spark/YARN processing nodes).
We have a few Spark steps in our app; the heaviest and longest of all is described by this flow: 1. flatMap - converting the HBase RDD to an objects RDD. 2. Group by key. 3. Map - making the calculations we need (checking a set of basic mathematical conditions). When running a single instance of this step, working on only 2000 records, it takes around 13s (all records are related to one key). The HBase table we fetch the data from has 5 regions. The implementation we have made uses a REST service which creates one SparkContext. Each request we make to this service runs an instance of the application (but again, all use the same SparkContext). Each request creates multiple threads which run all the application steps. When running one request (with 10 parallel threads) the relevant stage takes about 40s for all the threads - each one of them takes 40s itself, but they run almost completely in parallel, so the total run time of one request is also 40s. We have allocated 10 workers, each with 512M memory (no need for more; it looks like the whole RDD is cached). So the first question: does this run time make sense? To us it seems too long. Do you have an idea of what we are doing wrong? The second problem, and the more serious one: we need to run multiple parallel requests of this kind. When doing so, the run time spikes again, and instead of one request that runs in about 1m (40s is only the main stage), we get 2 applications both running almost in parallel, both running for 2m. This also happens if we use 2 different services and send each of them 1 request. These running times grow as we send more requests. We have also monitored the CPU usage of the node, and each request makes it jump to 90%. If we reduce the number of workers to 2, the CPU usage jumps only to about 35%, but the run time increases significantly. This seems very unlikely to us. Are there any Spark parameters we should consider changing? Any other ideas? We are quite stuck on this.
Thanks in advance Dana - Intel Electronics Ltd. This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.