Re: deep learning with heterogeneous cloud computing using spark
Thanks Nick :) Abid, you may also want to check out http://conferences.oreilly.com/strata/big-data-conference-ny-2015/public/schedule/detail/43484, which describes our work on a combination of Spark and Tachyon for Deep Learning. We found significant gains in using Tachyon (with co-processing) for the "descent" step while Spark computes the gradients. The video was recently uploaded here: http://bit.ly/1JnvQAO.

Regards,
--
Algorithms of the Mind: http://bit.ly/1ReQvEW
Christopher Nguyen
CEO & Co-Founder
www.Arimo.com (née Adatao)
linkedin.com/in/ctnguyen
Re: Support R in Spark
Hi Kui, DDF (open sourced) also aims to do something similar, adding RDBMS idioms, and is already implemented on top of Spark. One philosophy is that the DDF API aggressively hides the notion of parallel datasets, exposing only (mutable) tables to users, who can apply R and other familiar data mining/machine learning idioms to them without having to know about the distributed representation underneath. You can still get to the underlying RDDs if you want to, simply by asking. This was launched at the July Spark Summit; see http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us.

Sent while mobile. Please excuse typos etc.

On Sep 4, 2014 1:59 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Thanks Kui. SparkR is a pretty young project, but there are a bunch of things we are working on. One of the main features is to expose a data frame API (https://sparkr.atlassian.net/browse/SPARKR-1), which we will be integrating with Spark's MLlib. At a high level this will allow R users to use a familiar API while making use of MLlib's efficient distributed implementation. This is the same strategy used for Python as well. We also do hope to merge SparkR with mainline Spark -- we have a few features to complete before that and plan to shoot for integration by Spark 1.3. Thanks, Shivaram

On Wed, Sep 3, 2014 at 9:24 PM, oppokui oppo...@gmail.com wrote: Thanks, Shivaram. No specific use case yet. We are trying to use R in our project, as our data scientists all know R. We had a concern about how R handles mass data. Spark does a better job in the big data area, and Spark ML focuses on predictive analysis, so we are thinking about whether we can merge R and Spark together. We tried SparkR and it is pretty easy to use, but we didn't see any feedback on this package from industry. It would be better if the Spark team had R support just like Scala/Java/Python.
Another question: if MLlib will re-implement all the well-known data mining algorithms in Spark, what is the purpose of using R? There is another option for us, H2O, which supports R natively. H2O is more friendly to data scientists, and I saw that H2O can also work on Spark (Sparkling Water). Is that better than using SparkR? Thanks and regards, Kui

On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Hi, do you have a specific use case where SparkR doesn't work well? We'd love to hear more about use cases and features that can be improved in SparkR. Thanks, Shivaram

On Wed, Sep 3, 2014 at 3:19 AM, oppokui oppo...@gmail.com wrote: Does the Spark ML team have a plan to support R scripts natively? There is a SparkR project, but it is not from the Spark team. Spark ML uses netlib-java to talk to native Fortran routines, or uses NumPy, so why not try to use R in some sense? R has lots of useful packages. If the Spark ML team can include R support, it will be very powerful. Any comment?
Re: Support R in Spark
Hi Kui, sorry about that. The link you mentioned is probably the one for the products; we don't have one pointing from adatao.com to ddf.io, but maybe we'll add it. As for access to the code base itself, I think the team has already created a GitHub repo for it and should open it up within a few weeks. There's some debate about whether to put out the implementation with Shark dependencies now, or the SparkSQL one, which has somewhat limited functionality and is not as well tested. I'll check and ping you when it is opened up. The license is Apache.

Sent while mobile. Please excuse typos etc.

On Sep 6, 2014 1:39 PM, oppokui oppo...@gmail.com wrote: Thanks, Christopher. I saw it before; it is amazing. Last time I tried to download it from Adatao, but there was no response after filling in the form. How can I download it or its source code? What is the license? Kui
Re: First Bay Area Tachyon meetup: August 25th, hosted by Yahoo! (Limited Space)
Fantastic!

Sent while mobile. Pls excuse typos etc.

On Aug 19, 2014 4:09 PM, Haoyuan Li haoyuan...@gmail.com wrote: Hi folks, We've posted the first Tachyon meetup, which will be on August 25th and is hosted by Yahoo! (limited space): http://www.meetup.com/Tachyon/events/200387252/. Hope to see you there! Best, Haoyuan
--
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
Re: How to save mllib model to hdfs and reload it
Hi Hoai-Thu, the issue of the private default constructor is unlikely to be the cause here, since Lance was already able to load/deserialize the model object. And on that side topic, I wish all serdes libraries would just use constructor.setAccessible(true) by default :-) Most of the time that privacy is not about serdes reflection restrictions.

Sent while mobile. Pls excuse typos etc.

On Aug 14, 2014 1:58 AM, Hoai-Thu Vuong thuv...@gmail.com wrote: A man in this community gave me a video: https://www.youtube.com/watch?v=sPhyePwo7FA. I asked the same question in this community and other guys helped me solve the problem. I was trying to load a MatrixFactorizationModel from an object file, but the compiler said that I could not create the object because the constructor is private. To solve this, I put my new object in the same package as MatrixFactorizationModel. Luckily, it works.

On Wed, Aug 13, 2014 at 9:20 PM, Christopher Nguyen c...@adatao.com wrote: Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to isolate the cause to serialization of the loaded model. And also try to serialize the deserialized (loaded) model manually, to see if that throws any visible exceptions.

On Aug 13, 2014 7:03 AM, lancezhange lancezha...@gmail.com wrote: My prediction code is simple enough, as follows:

  val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }

When the model is the loaded one, the above code just doesn't work. Can you catch the error? Thanks. PS: I use spark-shell in standalone mode, version 1.0.0.

View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12035.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-- Thu.
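The constructor.setAccessible(true) aside above can be sketched in a few lines of plain Scala. The PrivateModel class below is hypothetical and merely stands in for any class (such as MatrixFactorizationModel) whose constructor is private; nothing here is from the actual MLlib code:

```scala
import java.lang.reflect.Constructor

// Hypothetical stand-in for a class with a private constructor.
class PrivateModel private (val rank: Int)

// What a serdes library could do: reach the private constructor via
// reflection and call setAccessible(true) before instantiating.
val ctor: Constructor[PrivateModel] =
  classOf[PrivateModel].getDeclaredConstructor(classOf[Int])
ctor.setAccessible(true) // bypasses the private modifier
val m = ctor.newInstance(Integer.valueOf(10))
assert(m.rank == 10) // instantiated despite the private constructor
```

This is why serdes reflection usually isn't blocked by constructor privacy, whereas ordinary user code (as in Hoai-Thu's case) is.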
Re: How to save mllib model to hdfs and reload it
+1 to what Sean said. And if there are too many state/argument parameters for your taste, you can always create a dedicated (serializable) class to encapsulate them.

Sent while mobile. Pls excuse typos etc.

On Aug 13, 2014 6:58 AM, Sean Owen so...@cloudera.com wrote: PS: I think that solving NotSerializableException by adding 'transient' is usually a mistake. It's a band-aid on a design problem. 'transient' causes the default serialization mechanism not to serialize the field when the object is serialized. When deserialized, this field will be null, which often compromises the class's assumptions about its state. The keyword is only appropriate when the field can safely be recreated at any time -- things like cached values. In Java, this commonly comes up when declaring anonymous (therefore non-static) inner classes, which hold an invisible reference to the containing instance and can easily cause the enclosing class to be serialized when that's not necessary at all. Inner classes should be static in this case, if possible. Passing values as constructor params takes more code but lets you tightly control what the function references.

On Wed, Aug 13, 2014 at 2:47 PM, Jaideep Dhok jaideep.d...@inmobi.com wrote: Hi, I faced a similar issue when trying to run a map function with predict. In my case I had some non-serializable fields in my calling class; after making those fields transient, the error went away.
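Sean's warning can be demonstrated in plain Scala without Spark. The Scorer class and its field names below are made up for illustration: a @transient field comes back null after a serialization round trip (the same round trip a closure undergoes when shipped to an executor), while constructor-supplied state survives:

```scala
import java.io._

// Illustrative class: 'weights' travels via the constructor; 'cachedNorm'
// is a recomputable cache marked @transient, so it is skipped by
// serialization and comes back null.
class Scorer(val weights: Array[Double]) extends Serializable {
  @transient var cachedNorm: java.lang.Double = weights.map(w => w * w).sum
}

// Round-trip an object through Java serialization.
def roundTrip[T](obj: T): T = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(obj); oos.close()
  new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    .readObject().asInstanceOf[T]
}

val s = roundTrip(new Scorer(Array(3.0, 4.0)))
assert(s.weights.sameElements(Array(3.0, 4.0))) // constructor state survives
assert(s.cachedNorm == null)                    // transient cache is gone
```

The null cachedNorm is exactly the "compromised assumptions about state" Sean describes; it is only safe because it can be recomputed from weights at any time.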
Re: How to save mllib model to hdfs and reload it
Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to isolate the cause to serialization of the loaded model. And also try to serialize the deserialized (loaded) model manually, to see if that throws any visible exceptions.

Sent while mobile. Pls excuse typos etc.

On Aug 13, 2014 7:03 AM, lancezhange lancezha...@gmail.com wrote: My prediction code is simple enough, as follows:

  val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }

When the model is the loaded one, the above code just doesn't work. Can you catch the error? Thanks. PS: I use spark-shell in standalone mode, version 1.0.0.
Re: How to parallelize model fitting with different cross-validation folds?
Hi sparkuser2345, I'm inferring the problem statement is something like "how do I make this complete faster (given my compute resources)?" Several comments.

First, Spark only allows launching parallel tasks from the driver, not from workers, which is why you're seeing the exception when you try. Whether the latter is a sensible/doable idea is another discussion, but I can appreciate why many people assume it should be possible.

Second, on optimization: you may be able to apply Sean's idea about (thread) parallelism at the driver, combined with the knowledge that these cluster tasks often bottleneck while competing for the same resources at the same time (CPU vs. disk vs. network, etc.). You may be able to achieve some performance gain by randomizing these timings. This is not unlike GMail randomizing user storage locations around the world for load balancing. Here, you would partition each of your RDDs into a different number of partitions, making some tasks larger than others, so that some may be in a CPU-intensive map while others are shuffling data around the network. This is rather cluster-specific; I'd be interested in what you learn from such an exercise.

Third, I always find it useful to consider doing as much as possible in one pass, subject to memory limits (e.g., mapPartitions() vs. map()), thus minimizing map/shuffle/reduce boundaries with their context switches and data shuffling. In this case, notice how you're running training+prediction k times over mostly the same rows, with map/reduce boundaries in between. While the training phase is sealed in this context, you may be able to improve performance by collecting all k models together and doing an [m x k] prediction all at once, which may end up being faster.
Finally, as implied by the above, for the very common k-fold cross-validation pattern, the algorithm itself might be written to be smart enough to take both train and test data and do the right thing within itself, thus obviating the need for the user to prepare k data sets and run over them serially, and likely saving a lot of repeated computation in the right internal places. Enjoy,
--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Sat, Jul 5, 2014 at 1:50 AM, Sean Owen so...@cloudera.com wrote: If you call .par on data_kfolded it will become a parallel collection in Scala, and so the maps will happen in parallel.

On Jul 5, 2014 9:35 AM, sparkuser2345 hm.spark.u...@gmail.com wrote: Hi, I am trying to fit a logistic regression model with cross-validation in Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded where each element is a pair of RDDs containing the training and test data:

  (training_data: RDD[org.apache.spark.mllib.regression.LabeledPoint],
   test_data: RDD[org.apache.spark.mllib.regression.LabeledPoint])

scala> data_kfolded
res21: Array[(org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint], org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint])] = Array((MappedRDD[9] at map at console:24,MappedRDD[7] at map at console:23), (MappedRDD[13] at map at console:24,MappedRDD[11] at map at console:23), (MappedRDD[17] at map at console:24,MappedRDD[15] at map at console:23))

Everything works fine when using data_kfolded:

  val validationErrors = data_kfolded.map { datafold =>
    val svmAlg = new SVMWithSGD()
    val model_reg = svmAlg.run(datafold._1)
    val labelAndPreds = datafold._2.map { point =>
      val prediction = model_reg.predict(point.features)
      (point.label, prediction)
    }
    val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / datafold._2.count
    trainErr.toDouble
  }

scala> validationErrors
res1: Array[Double] = Array(0.8819836785938481, 0.07082521117608837, 0.29833546734955185)

However, I have understood that the models are not fitted in parallel, as data_kfolded is not an RDD (although it's an array of pairs of RDDs). When running the same code with data_kfolded replaced by sc.parallelize(data_kfolded), I get a null pointer exception from the line where the run method of the SVMWithSGD object is called with the training data. I guess this is somehow related to the fact that RDDs can't be accessed from inside a closure. I fail to understand, though, why the first version works and the second doesn't. Most importantly, is there a way to fit the models in parallel? I would really appreciate your help.

  val validationErrors = sc.parallelize(data_kfolded).map { datafold =>
    val svmAlg = new SVMWithSGD()
    val model_reg = svmAlg.run(datafold._1) // This line gives a null pointer exception
    val labelAndPreds = datafold._2.map { point =>
      val prediction = model_reg.predict(point.features)
      (point.label, prediction)
    }
    val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / datafold._2.count
    trainErr.toDouble
  }
  validationErrors.collect
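For completeness, a minimal sketch of the driver-side thread parallelism Sean suggests, written with plain Futures (Sean's .par achieves the same effect via a fork-join pool) and with a trivial computation standing in for SVMWithSGD training; everything here is illustrative and Spark-free:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Each "fold" gets its own Future on the driver, so the k trainings
// run concurrently instead of serially.
val folds = Vector(1, 2, 3, 4)

val futures = folds.map { k =>
  Future {
    k * k // stand-in for: train on fold k, return its validation error
  }
}
val results = Await.result(Future.sequence(futures), 30.seconds)
assert(results == Vector(1, 4, 9, 16)) // results preserve fold order
```

In the real version, the body of each Future would call svmAlg.run(trainingRdd) and compute the fold's error; each such call still launches its own distributed Spark job from the driver, which is what makes this legal where sc.parallelize(data_kfolded) is not.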
Re: initial basic question from new user
Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want for your use case. As for Parquet support, that newly arrived in Spark 1.0.0 together with SparkSQL, so continue to watch this space. Gerard's suggestion to look at JobServer, which you can generalize as building a long-running application that allows multiple clients to load/share/persist/save/collaborate on RDDs, addresses a larger, more complex use case. That is indeed the job of a higher-level application, subject to a wide variety of higher-level design choices. A number of us have successfully built Spark-based apps around that model.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Thu, Jun 12, 2014 at 4:35 AM, Toby Douglass t...@avocet.io wrote:

On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas gerard.m...@gmail.com wrote: The goal of rdd.persist is to create a cached RDD that breaks the DAG lineage. Therefore, computations *in the same job* that use that RDD can re-use the intermediate result, but it's not meant to survive between job runs.

As I understand it, Spark is designed for interactive querying, in the sense that the caching of intermediate results eliminates the need to recompute those results. However, if intermediate results last only for the duration of a job (say, a Python script), how exactly is interactive querying actually performed? A script is not an interactive medium. Is the shell the only medium for interactive querying? Consider a common usage case: a web site which offers reporting over a large data set, where users issue arbitrary queries. A few queries (just with different arguments) dominate the query load, so we thought to create intermediate RDDs to service those queries, so that only those order-of-magnitude-smaller RDDs would need to be processed. Where this is not possible, we can only use Spark for reporting by issuing each query over the whole data set -- e.g. Spark is just like Impala, just like Presto, just like [nnn]. The enormous benefit of RDDs -- the entire point of Spark, so profoundly useful here -- is not available. What a huge and unexpected loss! Spark seemingly renders itself ordinary. It is for this reason I am surprised to find this functionality is not available.

If you need to ad-hoc persist to files, you can save RDDs using rdd.saveAsObjectFile(...) [1] and load them afterwards using sparkContext.objectFile(...)

I've been using this site for docs: http://spark.apache.org. There we find, through the top-of-the-page menus, the link "API Docs -> Python API", which brings us to http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html, where the page does not show the function saveAsObjectFile(). From your link, https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD, I now find what appears to be a second and more complete set of the same documentation, using a different web interface to boot. It appears that there are two sets of documentation for the same APIs, where one set is out of date and the other is not, and the out-of-date set is the one linked from the main site.

Given that our agg sizes will exceed memory, we expect to cache them to disk, so save-as-object (assuming there are no out-of-the-ordinary performance issues) may solve the problem, but I was hoping to store data in a column-oriented format. However, I think this in general is not possible: Spark can *read* Parquet, but I think it cannot write Parquet as a disk-based RDD format.

If you want to preserve the RDDs in memory between job runs, you should look at the Spark-JobServer [3]

Thank you. I view this with some trepidation.
It took two man-days to get Spark running (and I've spent another man-day now trying to get a map/reduce job to run; I'm getting there, but not there yet) -- the bring-up/config experience for end users is not tested or accurately documented (although, to be clear, no better and no worse than is normal for open source; Spark is not exceptional). Having to bring up another open-source project is a significant barrier to entry; it's always such a headache. The save-to-disk function you mentioned earlier will allow intermediate RDDs to go to disk, but we do in fact have a use case where in-memory would be useful; it might allow us to ditch Cassandra, which would be wonderful, since it would reduce the system count by one. I have to say, having to install JobServer to achieve this one end seems an extraordinarily heavyweight solution -- a whole new application, when all that is wished for is that Spark persist RDDs across jobs; so small a feature seems to open the door to so much functionality.
Re: Can this be done in map-reduce technique (in parallel)
Lakshmi, this is orthogonal to your question, but in case it's useful: it sounds like you're trying to determine the home location of a user, or something similar. If that's the problem statement, the data pattern may suggest a far more computationally efficient approach. For example, first map all (lat, long) pairs into geocells of a desired resolution (e.g., 10m or 100m), then count occurrences of geocells instead. There are simple libraries that map any (lat, long) pair to a geocell (string) ID very efficiently.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Wed, Jun 4, 2014 at 3:49 AM, lmk lakshmi.muralikrish...@gmail.com wrote: Hi, I am a new Spark user. Please let me know how to handle the following scenario. I have a data set with the following fields:

1. DeviceId
2. Latitude
3. Longitude
4. IP address
5. Datetime
6. Mobile application name

With the above data, I would like to perform the following steps:

1. Collect all lat and lon pairs for each IP address: (ip1, (lat1,lon1), (lat2,lon2)), (ip2, (lat3,lon3), (lat4,lon4))
2. For each IP:
   1. Find the distance between each lat/lon coordinate pair and all the other pairs under the same IP.
   2. Select those coordinates whose distances fall under a specific threshold (say 100m).
   3. Find the coordinate pair with the maximum occurrences.

In this case, how can I iterate over and compare each coordinate pair with all the other pairs? Can this be done in a distributed manner, as this data set is going to have a few million records? Can we do this with map/reduce commands? Thanks.
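A minimal, self-contained sketch of the geocell idea: the geocellId function below is a made-up stand-in for a real geocell library, using the rough fact that 0.001 degree of latitude is about 111 m; the coordinates are invented for illustration:

```scala
// Snap each (lat, lon) to a ~100 m grid cell and count cell occurrences,
// instead of computing pairwise distances between all points.
def geocellId(lat: Double, lon: Double, cellDeg: Double = 0.001): String = {
  val latCell = math.floor(lat / cellDeg).toLong
  val lonCell = math.floor(lon / cellDeg).toLong
  s"$latCell:$lonCell"
}

// Three sightings within ~100 m of each other, and one elsewhere:
val points = Seq(
  (12.97170, 77.59460),
  (12.97175, 77.59465),
  (12.97172, 77.59468),
  (12.98152, 77.60149)
)

// Count occurrences per cell; the most-visited cell is the "home" candidate.
val topCell = points
  .groupBy { case (lat, lon) => geocellId(lat, lon) }
  .maxBy(_._2.size)

assert(topCell._2.size == 3)
```

In Spark this becomes a map to (ip, cellId) keys followed by a count and a per-IP max, turning an O(n^2) pairwise-distance step into a linear pass.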
Re: Announcing Spark 1.0.0
Awesome work, Pat et al.!
--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell pwend...@gmail.com wrote: I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 is a milestone release as the first in the 1.0 line of releases, providing API stability for Spark's core interfaces. Spark 1.0.0 is Spark's largest release ever, with contributions from 117 developers. I'd like to thank everyone involved in this release; it was truly a community effort, with fixes, features, and optimizations contributed by dozens of organizations. This release expands Spark's standard libraries, introducing a new SQL package (SparkSQL) which lets users integrate SQL queries into existing Spark workflows. MLlib, Spark's machine learning library, is expanded with sparse vector support and several new algorithms. The GraphX and Streaming libraries also introduce new features and optimizations. Spark's core engine adds support for secured YARN clusters, a unified tool for submitting Spark applications, and several performance and stability improvements. Finally, Spark adds support for Java 8 lambda syntax and improves coverage of the Java and Python APIs. Those features only scratch the surface; check out the release notes here: http://spark.apache.org/releases/spark-release-1-0-0.html. Note that since the release artifacts were posted recently, certain mirrors may not have working downloads for a few hours. - Patrick
Re: Spark Memory Bounds
Keith, do you mean bound as in (a) strictly control to some quantifiable limit, or (b) try to minimize the amount used by each task?

If (a), then that is outside the scope of Spark's memory management, which you should think of as an application-level (that is, above-JVM) mechanism. In this scope, Spark voluntarily tracks and limits the amount of memory it uses for explicitly known data structures, such as RDDs. What Spark cannot do is, e.g., control or manage the amount of JVM memory that a given piece of user code might take up. For example, I might write some closure code that allocates a large array of doubles unbeknownst to Spark.

If (b), then your thinking is in the right direction, although quite imperfect, because of things like the example above. We often experience OOME if we're not careful with job partitioning. What I think Spark needs to evolve to include is at least a mechanism for application-level hints about task memory requirements. We might work on this and submit a PR for it.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Tue, May 27, 2014 at 5:33 PM, Keith Simmons ke...@pulse.io wrote: I'm trying to determine how to bound my memory use in a job working with more data than can simultaneously fit in RAM. From reading the tuning guide, my impression is that Spark's memory usage is roughly the following:

(A) in-memory RDD use + (B) in-memory shuffle use + (C) transient memory used by all currently running tasks

I can bound A with spark.storage.memoryFraction, and I can bound B with spark.shuffle.memoryFraction. I'm wondering how to bound C. It's been hinted a few times on this mailing list that you can reduce memory use by increasing the number of partitions. That leads me to believe that the amount of transient memory roughly follows:

total_data_set_size / number_of_partitions * number_of_tasks_simultaneously_running_per_machine

Does this sound right?
In other words, as I increase the number of partitions, the size of each partition decreases, and since each task processes a single partition and there is a bounded number of tasks in flight, my memory use has a rough upper limit. Keith
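Keith's estimate for (C) is easy to sanity-check with a toy calculation; all figures below are hypothetical:

```scala
// Transient task memory per machine, per the formula in the thread:
// dataset_size / num_partitions * concurrent_tasks_per_machine.
def transientBytesPerMachine(datasetBytes: Long,
                             numPartitions: Int,
                             concurrentTasksPerMachine: Int): Long =
  datasetBytes / numPartitions * concurrentTasksPerMachine

// 64 GB dataset, 512 partitions, 8 concurrent tasks (cores) per machine:
val est = transientBytesPerMachine(64L << 30, 512, 8)
assert(est == (1L << 30)) // roughly 1 GB of transient memory per machine
```

Doubling the partition count halves the estimate, which matches the mailing-list hint that more partitions reduce memory pressure; per Christopher's caveat above, user code can still allocate arbitrarily beyond this bound.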
Re: is Mesos falling out of favor?
Paco, that's a great video reference, thanks. To be fair to our friends at Yahoo, who have done a tremendous amount to help advance the cause of the BDAS stack, it's not FUD coming from them, certainly not in any organized or intentional manner. In vacuo we prefer Mesos ourselves, but we also can't ignore the fact that in the larger market, many enterprise technology-stack decisions are made based on existing vendor support relationships. And with a view to Mesos, super happy to see Mesosphere growing!

Sent while mobile. Pls excuse typos etc.

That's FUD. Tracking the Mesos and Spark use cases, there are very large production deployments of the two together. Some are rather private, but others are being surfaced. IMHO, one of the most amazing case studies is from Christina Delimitrou: http://youtu.be/YpmElyi94AA. For a tutorial, use the following, but upgrade it to the latest production Spark; there was a related O'Reilly webcast and Strata tutorial as well: http://mesosphere.io/learn/run-spark-on-mesos/. FWIW, I teach Intro to Spark with sections on CM4, YARN, Mesos, etc. Based on lots of student experiences, Mesos is clearly the shortest path to deploying a Spark cluster if you want to leverage the robustness, multi-tenancy for mixed workloads, lower ops overhead, etc., that show up repeatedly in the use-case analyses. My opinion only, and not that of any of my clients: don't believe the FUD from YHOO unless you really want to be stuck in 2009.

On Wed, May 7, 2014 at 8:30 AM, deric barton.to...@gmail.com wrote: I'm also using SPARK_EXECUTOR_URI right now, though I would prefer distributing Spark as a binary package. For running examples with `./bin/run-example ...` it works fine; however, tasks from spark-shell are getting lost:

  Error: Could not find or load main class org.apache.spark.executor.MesosExecutorBackend

which looks more like a problem with sbin/spark-executor and missing paths to the jar. Has anyone encountered this error before?
I guess Yahoo invested quite a lot of effort into YARN and Spark integration (moreover, with Mahout migrating to Spark, there's much more interest in Hadoop and Spark integration). If some Mesos company were working on Spark-Mesos integration, it could be at least on the same level. I don't see any other reason why YARN would be better than Mesos; personally I like the latter. However, I haven't checked YARN for a while; maybe they've made significant progress. I think Mesos is more universal and flexible than YARN.
Re: Opinions stratosphere
Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted such a comparative study as a Masters thesis: http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf

According to this snapshot (c. 2013), Stratosphere differs from Spark in not having an explicit concept of an in-memory dataset (e.g., the RDD). In principle this could be argued to be an implementation detail; the operators and execution plan/data flow are of primary concern in the API, and the data representations/materializations are otherwise unspecified. But in practice, for long-running interactive applications, I consider RDDs to be of fundamental, first-class-citizen importance, and the key distinguishing feature of Spark's model vs. other in-memory approaches that treat memory merely as an implicit cache.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Tue, Nov 26, 2013 at 1:26 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I don't know a lot about it except from the research side, where the team has done interesting optimization work for these types of applications. In terms of the engine, one thing I'm not sure of is whether Stratosphere allows explicit caching of datasets (similar to RDD.cache()) and interactive queries (similar to spark-shell). But it's definitely an interesting project to watch. Matei

On Nov 22, 2013, at 4:17 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, that's what I thought, but as per the slides on http://www.stratosphere.eu they seem to know about Spark, and the Scala API does look similar. I found the PACT model interesting. I would like to know if Matei or other core committers have something to weigh in on. -- Ankur

On 22 Nov 2013, at 16:05, Patrick Wendell pwend...@gmail.com wrote: I've never seen that project before; it would be interesting to get a comparison. It seems to offer a much lower-level API.
For instance, this is a word-count program: https://github.com/stratosphere/stratosphere/blob/master/pact/pact-examples/src/main/java/eu/stratosphere/pact/example/wordcount/WordCount.java On Thu, Nov 21, 2013 at 3:15 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, I was just curious about https://github.com/stratosphere/stratosphere and how Spark compares to it. Does anyone have any experience with it and comments to share? -- Ankur
Re: Spark and HBase
Flavio, the two are best at two orthogonal use cases: HBase on the transactional side, and Spark on the analytic side. Spark is not intended for row-based random-access updates, but it is far more flexible and efficient for dataset-scale aggregations and general computations. So yes, you can easily see them deployed side by side in a given enterprise. Sent while mobile. Pls excuse typos etc. On Apr 8, 2014 5:58 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi everybody, in recent days I have looked a bit at the evolution of the big data stacks, and it seems that HBase is somehow fading away in favour of Spark+HDFS. Am I correct? Do you think that Spark and HBase should work together or not? Best regards, Flavio
Re: Spark on other parallel filesystems
Avati, depending on your specific deployment config, there can be up to a 10x difference in data loading time. For example, we routinely parallel-load 10+GB data files across small 8-node clusters in 10-20 seconds, which would take about 100s if bottlenecked over a 1GigE network. That's about the maximum difference for that config. If you use multiple local SSDs the difference can be correspondingly greater, and likewise up to 10x smaller for 10GigE networks. Lastly, an interesting dimension to consider is that the difference diminishes as your data size gets much larger relative to your cluster size, since the load ops have to be serialized in time anyway. There is no difference after loading. Sent while mobile. Pls excuse typos etc. On Apr 4, 2014 10:45 PM, Anand Avati av...@gluster.org wrote: On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia matei.zaha...@gmail.com wrote: As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won't expose data locality info to Spark, the way HDFS does. That may not matter if it's a network-mounted file system though. Is the locality querying mechanism specific to HDFS mode, or is it possible to implement plugins in Spark to query location in other ways on other filesystems? I ask because glusterfs can expose the data location of a file through virtual extended attributes, and I would be interested in making Spark exploit that locality when the file location is specified as glusterfs:// (or querying the xattr blindly for file://). How much of a difference does data locality make for Spark use cases anyway (since most of the computation happens in memory)? Any sort of numbers? Thanks!
Avati Matei On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy ven...@yarcdata.com wrote: All Are there any drawbacks or technical challenges (or any information, really) related to using Spark directly on a global parallel filesystem like Lustre/GPFS? Any idea of what would be involved in doing a minimal proof of concept? Is it just possible to run Spark unmodified (without the HDFS substrate) for a start, or will that not work at all? I do know that it's possible to implement Tachyon on Lustre and get the HDFS interface - just looking at other options. Venkat
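For intuition on where the numbers in that thread come from, here is a back-of-envelope calculation in plain Java. The node count, per-disk throughput, and link speed are illustrative assumptions (not measurements from the thread); it just shows why an 8-node parallel local load lands around 12-13s while a single 1GigE link serializes the same 10GB to roughly 80s.

```java
public class LoadTimeEstimate {
    // Seconds to load `bytes` of data when each of `nodes` nodes
    // reads its own share at `bytesPerSec` (e.g., from local disk).
    static double parallelLoadSeconds(long bytes, int nodes, double bytesPerSec) {
        return (bytes / (double) nodes) / bytesPerSec;
    }

    // Seconds to pull the whole file through one shared network link.
    static double networkBottleneckSeconds(long bytes, double linkBytesPerSec) {
        return bytes / linkBytesPerSec;
    }

    public static void main(String[] args) {
        long tenGb = 10L * 1_000_000_000L; // 10 GB file
        double localDisk = 100e6;          // ~100 MB/s per local disk (assumed)
        double oneGigE = 125e6;            // 1 GigE is ~125 MB/s of payload

        // 8 nodes each reading 1/8 of the file from local disk: ~12-13 s
        System.out.printf("parallel local: ~%.0f s%n",
                parallelLoadSeconds(tenGb, 8, localDisk));
        // the same file bottlenecked over a single 1 GigE link: ~80 s
        System.out.printf("1GigE bottleneck: ~%.0f s%n",
                networkBottleneckSeconds(tenGb, oneGigE));
    }
}
```

With faster local storage (SSDs) the parallel number shrinks further, while the network number is fixed by the link, which is the "up to 10x" gap described above.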
Re: Cross validation is missing in machine learning examples
Aureliano, you're correct that this is not validation error, which is computed as the residuals on out-of-training-sample data and helps minimize overfit variance. However, in this example the errors are correctly referred to as training error, which is what you might compute on a per-iteration basis in a gradient-descent optimizer, in order to see how you're doing with respect to minimizing the in-sample residuals. There's nothing special about Spark ML algorithms that claims to escape these bias-variance considerations. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Sat, Mar 29, 2014 at 10:25 PM, Aureliano Buendia buendia...@gmail.com wrote: Hi, I noticed the Spark machine learning examples use training data to validate regression models. For instance, in the linear regression example (http://spark.apache.org/docs/0.9.0/mllib-guide.html):

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

... Here training data was used to validate a model which was created from the very same training data. This gives a biased estimate, and cross validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29) is missing here. In order to cross validate, we need to partition the data into in-sample for training and out-of-sample for validation. Please correct me if this does not apply to the ML algorithms implemented in Spark.
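To make the in-sample vs. out-of-sample distinction concrete, here is a minimal holdout-split sketch in plain Java (not MLlib; the 70/30 ratio, the fixed seed, and the trivial mean-of-labels "model" are illustrative assumptions). Later Spark versions provide a similar idiom directly on RDDs via randomSplit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class HoldoutSplit {
    // Split {feature, label} rows into ~70% training / ~30% validation
    // using a seeded shuffle for reproducibility.
    static List<List<double[]>> split(List<double[]> data, long seed) {
        List<double[]> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) (shuffled.size() * 0.7);
        return Arrays.asList(shuffled.subList(0, cut),
                             shuffled.subList(cut, shuffled.size()));
    }

    // Mean squared error of a constant prediction over {feature, label} rows.
    static double mse(List<double[]> rows, double prediction) {
        double sum = 0;
        for (double[] r : rows) { double e = r[1] - prediction; sum += e * e; }
        return sum / rows.size();
    }

    public static void main(String[] args) {
        List<double[]> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add(new double[] { i, 2.0 * i });

        List<List<double[]>> parts = split(data, 42L);
        List<double[]> train = parts.get(0), validation = parts.get(1);

        // "Fit" the trivial model on the training split only...
        double meanLabel = 0;
        for (double[] r : train) meanLabel += r[1];
        meanLabel /= train.size();

        // ...then report error on both splits; only the second number
        // is validation error in the sense discussed above.
        System.out.println("training MSE:   " + mse(train, meanLabel));
        System.out.println("validation MSE: " + mse(validation, meanLabel));
    }
}
```

The same holdout discipline carries over unchanged when the constant model is replaced by a trained regression model: fit on the first split, report error on the second.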
Re: Mutable tagging RDD rows ?
Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to get what you want is to transform to another RDD. But you might look at MutablePair ( https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala) to see if the semantics meet your needs. Alternatively you can consider: 1. Build or provide a fast lookup service that stores and returns the mutable information keyed by the RDD row IDs, or 2. Use DDF (Distributed DataFrame), which we'll make available in the near future, and which will give you fully mutable-row table semantics. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Fri, Mar 28, 2014 at 5:16 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Hey guys, I need to tag individual RDD lines with some values. This tag value would change at every iteration. Is this possible with RDD (I suppose this is sort of like a mutable RDD, but it's more)? If not, what would be the best way to do something like this? Basically, we need to keep mutable information per data row (this would be something much smaller than the actual data row, however). Thanks
Re: Mutable tagging RDD rows ?
Sung Hwan, yes, I'm saying exactly what you interpreted, including that if you tried it, it would (mostly) work, and my uncertainty with respect to guarantees on the semantics. Definitely there would be no fault tolerance if the mutations depend on state that is not captured in the RDD lineage. DDF is to RDD as RDD is to HDFS. Not a perfect analogy, but the point is that it's an abstraction above, with all attendant implications, pluses and minuses. With DDFs you get to think of everything as tables with schemas, while the underlying complexity of mutability and data representation is hidden away. You also get rich idioms to operate on those tables like filtering, projection, subsetting, handling of missing data (NAs), dummy-column generation, data mining statistics and machine learning, etc. In some aspects it replaces a lot of boilerplate analytics that you don't want to re-invent over and over again, e.g., FiveNum or XTabs. So instead of 100 lines of code, it's 4. In other aspects it allows you to easily apply arbitrary machine learning algorithms without having to think too hard about getting the data types just right. Etc. But you would also find yourself wanting access to the underlying RDDs for their full semantic flexibility. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Fri, Mar 28, 2014 at 8:46 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Thanks Chris, I'm not exactly sure what you mean with MutablePair, but are you saying that we could create RDD[MutablePair] and modify individual rows? If so, will that play nicely with RDD's lineage and fault tolerance? As for the alternatives, I don't think 1 is something we want to do, since that would require another complex system we'll have to implement. Is DDF going to be an alternative to RDD? Thanks again!
On Fri, Mar 28, 2014 at 7:02 PM, Christopher Nguyen c...@adatao.com wrote: Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to get what you want is to transform to another RDD. But you might look at MutablePair ( https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala) to see if the semantics meet your needs. Alternatively you can consider: 1. Build or provide a fast lookup service that stores and returns the mutable information keyed by the RDD row IDs, or 2. Use DDF (Distributed DataFrame), which we'll make available in the near future, and which will give you fully mutable-row table semantics. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Fri, Mar 28, 2014 at 5:16 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Hey guys, I need to tag individual RDD lines with some values. This tag value would change at every iteration. Is this possible with RDD (I suppose this is sort of like a mutable RDD, but it's more)? If not, what would be the best way to do something like this? Basically, we need to keep mutable information per data row (this would be something much smaller than the actual data row, however). Thanks
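Option 1 in the thread above (a fast lookup service keyed by row IDs) can be sketched in-process with a concurrent map. In a real cluster this would be an external service reachable from every executor; the class and method names below are illustrative, not a Spark or DDF API. The point is that the mutable tags live outside the immutable RDD, keyed by a stable row ID.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for a per-row tag store kept outside the immutable RDD.
// Thread-safe, so concurrent tasks can read/write tags by row ID.
public class TagStore {
    private final Map<Long, Double> tags = new ConcurrentHashMap<>();

    public void setTag(long rowId, double tag) {
        tags.put(rowId, tag);
    }

    public double getTag(long rowId, double defaultTag) {
        return tags.getOrDefault(rowId, defaultTag);
    }

    public static void main(String[] args) {
        TagStore store = new TagStore();
        // Iteration 1: initialize tags for rows 0..4
        for (long id = 0; id < 5; id++) store.setTag(id, 0.0);
        // Iteration 2: mutate the tag for row 3 without touching the data rows
        store.setTag(3L, 1.5);
        System.out.println(store.getTag(3L, -1.0)); // prints 1.5
    }
}
```

Note the caveat from the thread still applies: state held outside the RDD is not captured in the lineage, so it is not recovered automatically on task failure.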
Re: Announcing Spark SQL
+1 Michael, Reynold et al. This is key to some of the things we're doing. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Wed, Mar 26, 2014 at 2:58 PM, Michael Armbrust mich...@databricks.comwrote: Hey Everyone, This already went out to the dev list, but I wanted to put a pointer here as well to a new feature we are pretty excited about for Spark 1.0. http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html Michael
Re: Running a task once on each executor
Deenar, when you say just once, have you defined across what (e.g., across multiple threads in the same JVM on the same machine)? In principle you can have multiple executors on the same machine. In any case, assuming it's the same JVM, have you considered using a singleton that maintains done/not-done state, and that is invoked by each of the instances (TaskNonce.getSingleton().doThisOnce())? You can, e.g., mark the state boolean transient to prevent it from going through serdes. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Tue, Mar 25, 2014 at 10:03 AM, deenar.toraskar deenar.toras...@db.com wrote: Hi Is there a way in Spark to run a function on each executor just once? I have a couple of use cases. a) I use an external library that is a singleton. It keeps some global state and provides some functions to manipulate it (e.g. reclaim memory, etc.). I want to check the global state of this library on each executor. b) To get JVM stats or instrumentation on each executor. Currently I have a crude way of achieving something similar: I just run a map on a large RDD that is hash partitioned, but this does not guarantee that the job would run just once. Deenar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Running a task once on each executor
Deenar, the singleton pattern I'm suggesting would look something like this:

public class TaskNonce {
  private transient boolean mIsAlreadyDone;
  private static transient TaskNonce mSingleton = new TaskNonce();
  private transient Object mSyncObject = new Object();

  public static TaskNonce getSingleton() { return mSingleton; }

  public void doThisOnce() {
    if (mIsAlreadyDone) return;
    synchronized (mSyncObject) {
      if (mIsAlreadyDone) return; // double-check under the lock
      mIsAlreadyDone = true;
      ...
    }
  }
}

which you would invoke as TaskNonce.getSingleton().doThisOnce() from within the map closure. If you're using the Spark Java API, you can put all this code in the mapper class itself. There is no need to require one-row RDD partitions to achieve what you want, if I understand your problem statement correctly. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Tue, Mar 25, 2014 at 11:07 AM, deenar.toraskar deenar.toras...@db.com wrote: Christopher It is once per JVM. TaskNonce would meet my needs. I guess if I want it once per thread, then a ThreadLocal would do the same. But how do I invoke TaskNonce? What is the best way to generate an RDD to ensure that there is one element per executor? Deenar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3208.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
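Here is a runnable, self-contained sketch of the same run-once idiom exercised from multiple threads (the class and field names are illustrative, not a Spark API). Note the double-check of the flag inside the synchronized block: without it, two threads that both pass the unsynchronized fast path could each execute the body.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Demo: many threads call doOnce(), but the guarded body runs exactly
// once per JVM, mirroring the TaskNonce pattern from the thread above.
public class RunOnceDemo {
    static final AtomicInteger runs = new AtomicInteger();
    private static volatile boolean done;
    private static final Object lock = new Object();

    static void doOnce() {
        if (done) return;                 // fast path, no locking
        synchronized (lock) {
            if (done) return;             // double-check under the lock
            runs.incrementAndGet();       // ...the once-per-JVM work...
            done = true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[10];     // simulate 10 parallel task threads
        for (int i = 0; i < ts.length; i++) ts[i] = new Thread(RunOnceDemo::doOnce);
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        System.out.println("body ran " + runs.get() + " time(s)");
    }
}
```

In the mailing-list version the flag is additionally marked transient so that a serialized copy of the closure does not carry a stale "already done" state to another JVM; that concern does not arise in this single-JVM demo.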
Re: Spark enables us to process Big Data on an ARM cluster !!
Chanwit, that is awesome! Improvements in shuffle operations should help improve life even more for you. Great to see a data point on ARM. Sent while mobile. Pls excuse typos etc. On Mar 18, 2014 7:36 PM, Chanwit Kaewkasi chan...@gmail.com wrote: Hi all, We are a small team doing research on low-power (and low-cost) ARM clusters. We built a 20-node ARM cluster that is able to run Hadoop. But as all of you know, Hadoop performs on-disk operations, so it's not suitable for a constrained machine powered by ARM. We then switched to Spark and have to say wow!! Spark / HDFS enables us to crunch Wikipedia articles (of year 2012), 34GB in size, in 1h50m. We have identified the bottleneck, and it's our 100 Mbit network. Here's the cluster: https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/Mk-I_SSD.png And this is what we got from Spark's shell: https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/result_00.png I think it's the first ARM cluster that can process a non-trivial size of Big Data. (Please correct me if I'm wrong) I really want to thank the Spark team that makes this possible !! Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit
Re: major Spark performance problem
Dana, When you run multiple applications under Spark, and if each application takes up the entire cluster resources, it is expected that one will block the other completely; thus you're seeing the wall times add together sequentially. In addition there is some overhead associated with starting up a new application/SparkContext. Your other mode of sharing a single SparkContext, if your use case allows it, is more promising in that workers are available to work on tasks in parallel (but ultimately still subject to maximum resource limits). Without knowing what your actual workload is, it's hard to tell in absolute terms whether 12 seconds is reasonable or not. One reason for the jump from 12s in local mode to 40s in cluster mode would be the HBase bottleneck: you apparently would have 2x10=20 clients going against the HBase data source instead of 1 (or however many local threads you have). Assuming this is an increase of useful work output by a factor of 20x, a jump from 12s to 40s wall time is actually quite attractive. NB: given my assumption that the HBase data source is not parallelized along with the Spark cluster, you would run into sublinear performance issues (HBase-perf-limited or network-bandwidth-limited) as you scale out your cluster size. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Thu, Mar 6, 2014 at 11:49 AM, Livni, Dana dana.li...@intel.com wrote: Hi all, We have a big issue and would like to know if anyone has any insights or ideas. The problem is composed of two connected problems. 1. Run time of a single application. 2. Run time of multiple applications in parallel is almost linear with the run time of a single application. We have written a Spark application fetching its data from HBase. We are running the application using the YARN-client resource manager. The cluster has 2 nodes (both used as HBase data nodes and Spark/YARN processing nodes).
We have a few Spark steps in our app; the heaviest and longest of all is described by this flow: 1. flatMap - converting the HBase RDD to an objects RDD. 2. Group by key. 3. Map - making the calculations we need (checking a set of basic mathematical conditions). When running a single instance of this step, working on only 2000 records, it takes around 13s (all records are related to one key). The HBase table we fetch the data from has 5 regions. The implementation we have made uses a REST service which creates one SparkContext. Each request we make to this service runs an instance of the application (but again, all use the same SparkContext). Each request creates multiple threads which run all the application steps. When running one request (with 10 parallel threads) the relevant stage takes about 40s for all the threads - each one of them takes 40s itself, but they run almost completely in parallel, so the total run time of one request is also 40s. We have allocated 10 workers, each with 512M memory (no need for more; it looks like the whole RDD is cached). So the first question: does this run time make sense? To us it seems too long. Do you have an idea of what we are doing wrong? The second problem, and the more serious one: we need to run multiple parallel requests of this kind. When doing so, the run time spikes again, and instead of one request that runs in about 1m (40s is only the main stage), we get 2 applications both running almost in parallel, both running for 2m. This also happens if we use 2 different services and send each of them 1 request. These running times grow as we send more requests. We have also monitored the CPU usage of the node, and each request makes it jump to 90%. If we reduce the number of workers to 2, the CPU usage jumps only to about 35%, but the run time increases significantly. This seems very unlikely to us. Are there any Spark parameters we should consider changing? Any other ideas? We are quite stuck on this.
Thanks in advance Dana - Intel Electronics Ltd. This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.