Re: deep learning with heterogeneous cloud computing using spark

2016-01-30 Thread Christopher Nguyen
) for the "descent" step while Spark computes the gradients. The video was recently uploaded here http://bit.ly/1JnvQAO. Regards, -- Algorithms of the Mind http://bit.ly/1ReQvEW Christopher Nguyen CEO & Co-Founder www.Arimo.com (née Adatao) linkedin.com/in/ctnguyen

Re: Support R in Spark

2014-09-06 Thread Christopher Nguyen
Hi Kui, DDF (open sourced) also aims to do something similar, adding RDBMS idioms, and is already implemented on top of Spark. One philosophy is that the DDF API aggressively hides the notion of parallel datasets, exposing only (mutable) tables to users, on which they can apply R and other

Re: Support R in Spark

2014-09-06 Thread Christopher Nguyen
PM, oppokui oppo...@gmail.com wrote: Thanks, Christopher. I saw it before; it is amazing. Last time I tried to download it from adatao, but there was no response after I filled out the form. How can I download it or its source code? What is the license? Kui On Sep 6, 2014, at 8:08 PM, Christopher Nguyen c

Re: First Bay Area Tachyon meetup: August 25th, hosted by Yahoo! (Limited Space)

2014-08-19 Thread Christopher Nguyen
Fantastic! Sent while mobile. Pls excuse typos etc. On Aug 19, 2014 4:09 PM, Haoyuan Li haoyuan...@gmail.com wrote: Hi folks, We've posted the first Tachyon meetup, which will be on August 25th and is hosted by Yahoo! (Limited Space): http://www.meetup.com/Tachyon/events/200387252/ . Hope

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Christopher Nguyen
at 9:20 PM, Christopher Nguyen c...@adatao.com wrote: Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to isolate the cause to serialization of the loaded model. And also try to serialize the deserialized (loaded) model manually to see if that throws any visible exceptions

Re: How to save mllib model to hdfs and reload it

2014-08-13 Thread Christopher Nguyen
+1 what Sean said. And if there are too many state/argument parameters for your taste, you can always create a dedicated (serializable) class to encapsulate them. Sent while mobile. Pls excuse typos etc. On Aug 13, 2014 6:58 AM, Sean Owen so...@cloudera.com wrote: PS I think that solving not
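The suggestion above — bundling many state/argument parameters into one dedicated serializable class — can be sketched as follows. This is an illustrative value object, not code from the original thread; the class and field names are assumptions:

```java
import java.io.Serializable;

// Bundles loose training parameters into a single Serializable value object,
// so a closure shipped to executors captures one small object instead of many
// separate variables. Names here are illustrative only.
class TrainingParams implements Serializable {
    private static final long serialVersionUID = 1L;

    final int numIterations;
    final double stepSize;
    final double regParam;

    TrainingParams(int numIterations, double stepSize, double regParam) {
        this.numIterations = numIterations;
        this.stepSize = stepSize;
        this.regParam = regParam;
    }
}
```

Because the class implements `Serializable` and holds only primitives, Spark's closure serializer can ship one instance to every task cleanly.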

Re: How to save mllib model to hdfs and reload it

2014-08-13 Thread Christopher Nguyen
Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to isolate the cause to serialization of the loaded model. And also try to serialize the deserialized (loaded) model manually to see if that throws any visible exceptions. Sent while mobile. Pls excuse typos etc. On Aug 13,
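The "serialize the deserialized model manually" tip can be done with a small plain-Java helper, sketched below. This is a generic debugging aid in the spirit of the suggestion, not Spark API:

```java
import java.io.*;

// Round-trips an object through Java serialization locally, so any
// NotSerializableException or InvalidClassException surfaces directly
// rather than deep inside a Spark job's task serialization.
class SerdeCheck {
    static <T extends Serializable> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(obj);  // fails here if obj or any field is not serializable
            }
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) ois.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Calling `SerdeCheck.roundTrip(loadedModel)` on the driver isolates whether the loaded model itself serializes, before involving any cluster machinery.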

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread Christopher Nguyen
Hi sparkuser2345, I'm inferring the problem statement is something like how do I make this complete faster (given my compute resources)? Several comments. First, Spark only allows launching parallel tasks from the driver, not from workers, which is why you're seeing the exception when you try.
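Since parallel jobs can only be launched from the driver, one common pattern is a driver-side thread pool that submits one job per cross-validation fold. A Spark-free sketch, with `trainFold` standing in for a real per-fold Spark job (names and the placeholder error values are assumptions):

```java
import java.util.*;
import java.util.concurrent.*;

// Driver-side pattern: launch one (simulated) training job per CV fold from
// a fixed thread pool on the driver, since workers cannot themselves launch
// Spark jobs. trainFold is a placeholder for an actual per-fold Spark job.
class ParallelFolds {
    static double trainFold(int fold) {
        return 1.0 / (fold + 1);  // placeholder per-fold "validation error"
    }

    static List<Double> run(int numFolds, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (int f = 0; f < numFolds; f++) {
                final int fold = f;
                futures.add(pool.submit(() -> trainFold(fold)));
            }
            List<Double> errors = new ArrayList<>();
            for (Future<Double> fut : futures) {
                try {
                    errors.add(fut.get());  // block until each fold's job finishes
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return errors;
        } finally {
            pool.shutdown();
        }
    }
}
```

With a real SparkContext this works because the context is thread-safe for job submission from the driver; each pool thread triggers its own action.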

Re: initial basic question from new user

2014-06-12 Thread Christopher Nguyen
Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want for your use case. As for Parquet support, that's newly arrived in Spark 1.0.0 together with SparkSQL so continue to watch this space. Gerard's suggestion to look at JobServer, which you can generalize as building a

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread Christopher Nguyen
Lakshmi, this is orthogonal to your question, but in case it's useful. It sounds like you're trying to determine the home location of a user, or something similar. If that's the problem statement, the data pattern may suggest a far more computationally efficient approach. For example, first map
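The "first map" approach being suggested reads like map-then-count: map each event to a (user, location) pair, count occurrences, and keep each user's most frequent location. A plain-Java sketch of that shape (the data layout is an assumption):

```java
import java.util.*;
import java.util.stream.*;

// Spark-free sketch of the map-and-count idea: group (user, location) events,
// count per pair, then keep each user's modal (most frequent) location.
class HomeLocation {
    static Map<String, String> infer(List<String[]> events) {
        // events: each entry is {user, location}
        Map<String, Map<String, Long>> counts = events.stream().collect(
            Collectors.groupingBy(e -> e[0],
                Collectors.groupingBy(e -> e[1], Collectors.counting())));
        Map<String, String> home = new HashMap<>();
        counts.forEach((user, locs) -> home.put(user,
            locs.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey()));
        return home;
    }
}
```

In Spark this becomes a `map` to pairs followed by a `reduceByKey`-style count, which is a single shuffle rather than an all-pairs computation.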

Re: Announcing Spark 1.0.0

2014-05-30 Thread Christopher Nguyen
Awesome work, Pat et al.! -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell pwend...@gmail.com wrote: I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 is a milestone release as

Re: Spark Memory Bounds

2014-05-27 Thread Christopher Nguyen
Keith, do you mean bound as in (a) strictly control to some quantifiable limit, or (b) try to minimize the amount used by each task? If a, then that is outside the scope of Spark's memory management, which you should think of as an application-level (that is, above JVM) mechanism. In this scope,

Re: is Mesos falling out of favor?

2014-05-16 Thread Christopher Nguyen
Paco, that's a great video reference, thanks. To be fair to our friends at Yahoo, who have done a tremendous amount to help advance the cause of the BDAS stack, it's not FUD coming from them, certainly not in any organized or intentional manner. In vacuo we prefer Mesos ourselves, but also can't

Re: Opinions stratosphere

2014-05-01 Thread Christopher Nguyen
Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted such a comparative study as a Masters thesis: http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf According to this snapshot (c. 2013), Stratosphere is different from Spark in not having an explicit concept of

Re: Spark and HBase

2014-04-08 Thread Christopher Nguyen
Flavio, the two are best at two orthogonal use cases, HBase on the transactional side, and Spark on the analytic side. Spark is not intended for row-based random-access updates, while far more flexible and efficient in dataset-scale aggregations and general computations. So yes, you can easily

Re: Spark on other parallel filesystems

2014-04-05 Thread Christopher Nguyen
Avati, depending on your specific deployment config, there can be up to a 10X difference in data loading time. For example, we routinely parallel load 10+GB data files across small 8-node clusters in 10-20 seconds, which would take about 100s if bottlenecked over a 1GigE network. That's about the
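The ~100 s figure follows from link bandwidth alone: 1 GigE tops out around 125 MB/s, so 10 GB through a single link takes on the order of 80-100 seconds, versus 10-20 s when 8 nodes each load local data in parallel. A back-of-envelope check:

```java
// Back-of-envelope transfer-time estimate: bytes divided by link bandwidth.
// Assumes 1 GB = 1024 MB and ~125 MB/s theoretical peak for 1 GigE.
class LoadTime {
    static double seconds(double gigabytes, double mbPerSec) {
        return gigabytes * 1024.0 / mbPerSec;
    }
}
```

`LoadTime.seconds(10, 125)` gives roughly 82 seconds, consistent with the "about 100s" figure once protocol overhead is included.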

Re: Cross validation is missing in machine learning examples

2014-03-30 Thread Christopher Nguyen
Aureliano, you're correct that this is not validation error, which is computed as the residuals on out-of-training-sample data, and helps minimize overfit variance. However, in this example, the errors are correctly referred to as training error, which is what you might compute on a per-iteration
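The distinction above, in code: both errors use the same metric; what differs is whether it is evaluated on the data used to fit (training error) or on held-out data (validation error). A minimal mean-squared-error sketch, not taken from the MLlib example:

```java
// Mean squared error over paired arrays. Evaluated on the fitting data it is
// the training error discussed above; on held-out data, the validation error.
class Errors {
    static double mse(double[] yTrue, double[] yPred) {
        double sum = 0;
        for (int i = 0; i < yTrue.length; i++) {
            double d = yTrue[i] - yPred[i];
            sum += d * d;
        }
        return sum / yTrue.length;
    }
}
```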

Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to get what you want is to transform to another RDD. But you might look at MutablePair ( https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala)
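The canonical "transform to another RDD" approach amounts to a pure map from row to (tag, row), leaving the original dataset untouched. Sketched on a plain list (the even/odd tagging rule is just an illustration):

```java
import java.util.*;
import java.util.stream.*;

// Immutable tagging as a transformation: map each row to a (tag, row) pair;
// the source collection is never mutated, mirroring an RDD map to pairs.
class TagRows {
    static List<Map.Entry<String, Integer>> tag(List<Integer> rows) {
        return rows.stream()
            .<Map.Entry<String, Integer>>map(r ->
                new AbstractMap.SimpleEntry<>(r % 2 == 0 ? "even" : "odd", r))
            .collect(Collectors.toList());
    }
}
```

The MutablePair route trades this immutability for reduced allocation, which is why it is an optimization rather than the canonical pattern.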

Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
have to implement. Is DDF going to be an alternative to RDD? Thanks again! On Fri, Mar 28, 2014 at 7:02 PM, Christopher Nguyen c...@adatao.com wrote: Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to get what you want is to transform to another RDD. But you might look

Re: Announcing Spark SQL

2014-03-26 Thread Christopher Nguyen
+1 Michael, Reynold et al. This is key to some of the things we're doing. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Wed, Mar 26, 2014 at 2:58 PM, Michael Armbrust mich...@databricks.com wrote: Hey Everyone, This already went out to the

Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, when you say just once, have you defined across multiple what (e.g., across multiple threads in the same JVM on the same machine)? In principle you can have multiple executors on the same machine. In any case, assuming it's the same JVM, have you considered using a singleton that

Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, the singleton pattern I'm suggesting would look something like this: public class TaskNonce { private transient boolean mIsAlreadyDone; private static transient TaskNonce mSingleton = new TaskNonce(); private transient Object mSyncObject = new Object(); public TaskNonce
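The snippet is cut off by the archive; a reconstruction of the suggested pattern follows. The fields match the quoted fragment, while the `doOnce` method and the `volatile` on the flag are assumptions added to complete the sketch safely:

```java
// Reconstruction of the truncated TaskNonce fragment quoted above: a JVM-wide
// singleton that runs a given task at most once per JVM, hence once per
// executor when each executor runs in its own JVM.
class TaskNonce {
    private transient volatile boolean mIsAlreadyDone;  // volatile added for safe double-checked locking
    private static transient TaskNonce mSingleton = new TaskNonce();
    private transient Object mSyncObject = new Object();

    public static TaskNonce getInstance() { return mSingleton; }

    public void doOnce(Runnable task) {
        if (!mIsAlreadyDone) {              // fast path: no lock once done
            synchronized (mSyncObject) {
                if (!mIsAlreadyDone) {      // re-check under the lock
                    task.run();
                    mIsAlreadyDone = true;
                }
            }
        }
    }
}
```

Each task would call `TaskNonce.getInstance().doOnce(...)`; only the first call in a given JVM executes the body, which gives once-per-executor behavior under the one-JVM-per-executor assumption discussed above.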

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Christopher Nguyen
Chanwit, that is awesome! Improvements in shuffle operations should help improve life even more for you. Great to see a data point on ARM. Sent while mobile. Pls excuse typos etc. On Mar 18, 2014 7:36 PM, Chanwit Kaewkasi chan...@gmail.com wrote: Hi all, We are a small team doing a research

Re: major Spark performance problem

2014-03-06 Thread Christopher Nguyen
Dana, When you run multiple applications under Spark, and if each application takes up the entire cluster resources, it is expected that one will block the other completely, thus you're seeing that the wall time add together sequentially. In addition there is some overhead associated with