MLlib NNLS implementation is buggy, returning wrong solutions

2014-07-27 Thread Aureliano Buendia
Hi, The recently added NNLS implementation in MLlib returns wrong solutions. This is not data specific: try any data in R's nnls, and then the same data in MLlib's NNLS. The results are very different. Also, the selected algorithm, Polyak (1969), is not the best one around. The most popular one
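When two NNLS solvers disagree, one way to tell which output is wrong is to check each candidate against the KKT conditions of the problem min ||Ax - b||^2 subject to x >= 0. Because the problem is convex, a point that satisfies them is optimal. The sketch below is a plain-Python checker and is not taken from MLlib or R; the function name `nnls_kkt_violations` and the absolute tolerance `tol` are assumptions made for illustration.

```python
def nnls_kkt_violations(A, b, x, tol=1e-8):
    """Return KKT violations of x for min ||Ax - b||^2 subject to x >= 0.

    A is a list of rows, b and x are lists. An empty result means x is
    optimal up to the (hypothetical, absolute) tolerance tol.
    """
    m, n = len(A), len(x)
    # Residual r = A x - b
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
    # Gradient of 0.5 * ||Ax - b||^2 is A^T r
    g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
    violations = []
    for j in range(n):
        if x[j] < -tol:
            # Feasibility: every coordinate must be nonnegative
            violations.append((j, "negative coordinate"))
        elif x[j] > tol and abs(g[j]) > tol:
            # Free coordinates must have zero gradient
            violations.append((j, "nonzero gradient at positive coordinate"))
        elif x[j] <= tol and g[j] < -tol:
            # Coordinates at the bound must not want to increase
            violations.append((j, "negative gradient at zero coordinate"))
    return violations
```

Running the same check on the output of both solvers for identical `A` and `b` shows which of the two (if either) is actually a minimizer, independent of which algorithm produced it.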

NMF implementation in Spark

2014-07-25 Thread Aureliano Buendia
Hi, Is there an implementation for Nonnegative Matrix Factorization in Spark? I understand that MLlib comes with matrix factorization, but it does not seem to cover the nonnegative case.

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Aureliano Buendia
On Thu, Jun 26, 2014 at 9:42 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That’s technically true, but I’d be surprised if there wasn’t a lot of room for improvement in spark-ec2 regarding cluster launch+config times. Unfortunately, this is not a spark support issue, but an AWS one.

Spark vs Google cloud dataflow

2014-06-25 Thread Aureliano Buendia
Hi, Today Google announced their cloud dataflow, which is very similar to spark in performing batch processing and stream processing. How does spark compare to Google cloud dataflow? Are they solutions aimed at the same problem?

Re: Spark GCE Script

2014-05-23 Thread Aureliano Buendia
Regards On Thu, May 8, 2014 at 2:58 AM, Aureliano Buendia buendia...@gmail.com wrote: Please send a pull request, this should be maintained by the community, just in case you do not feel like continuing to maintain it. Also, nice to see that the gce version is shorter than the aws version

Re: Spark runs applications in an inconsistent way

2014-04-23 Thread Aureliano Buendia
Yes, things get more unstable with larger data. But that's the whole point of my question: Why should spark get unstable when data gets larger? When data gets larger, spark should get *slower*, not more unstable. Lack of stability makes parameter tuning very difficult, time consuming and a

Spark runs applications in an inconsistent way

2014-04-22 Thread Aureliano Buendia
Hi, Sometimes the very same spark application binary behaves differently with every execution. - The Ganglia profile is different with every execution: sometimes it takes 0.5 TB of memory, the next time it takes 1 TB of memory, the next time it is 0.75 TB... - Spark UI shows

Re: sc.makeRDD bug with NumericRange

2014-04-18 Thread Aureliano Buendia
:01 PM, Mark Hamstra m...@clearstorydata.com wrote: Please file an issue: Spark Project JIRA https://issues.apache.org/jira/browse/SPARK On Fri, Apr 18, 2014 at 10:25 AM, Aureliano Buendia buendia...@gmail.com wrote: Hi, I just noticed that sc.makeRDD() does not make all values given

Spark-ec2 asks for password

2014-04-18 Thread Aureliano Buendia
Hi, Since 0.9.0 spark-ec2 has gone unstable. During launch it throws many errors like: ssh: connect to host ec-xx-xx-xx-xx.compute-1.amazonaws.com port 22: Connection refused Error 255 while executing remote command, retrying after 30 seconds .. and recently, it prompts for passwords!:

Re: Spark-ec2 asks for password

2014-04-18 Thread Aureliano Buendia
about the password request; I haven't seen that on my end. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Fri, Apr 18, 2014 at 8:57 PM, Aureliano Buendia buendia...@gmail.com wrote: Hi, Since 0.9.0 spark-ec2 has gone unstable. During

Spark-ec2 setup is getting slower and slower

2014-03-30 Thread Aureliano Buendia
Hi, Spark-ec2 uses rsync to deploy many applications. It seems that over time more and more applications have been added to the script, which has significantly slowed down the setup time. Perhaps the script could be restructured this way: Instead of rsyncing N times per application, we could have

Cross validation is missing in machine learning examples

2014-03-29 Thread Aureliano Buendia
Hi, I noticed that the spark machine learning examples use training data to validate regression models. For instance, in the linear regression http://spark.apache.org/docs/0.9.0/mllib-guide.html example: // Evaluate model on training examples and compute training error val valuesAndPreds = parsedData.map {
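The point raised here is that error measured on the training data says little about how a model generalizes. A minimal sketch of the missing step, k-fold index splitting, is given below in plain Python. It is not part of the MLlib guide; the function name `k_fold_splits` and its `seed` parameter are assumptions chosen for illustration.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Every sample index appears in exactly one test fold, and never in the
    training set of the fold that tests it.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle with a local generator so the split is reproducible
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

For each pair, the model would be fitted on the `train` indices only and its error computed on the `test` indices; averaging the k test errors gives the cross-validated estimate.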

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-21 Thread Aureliano Buendia
On Tue, Mar 18, 2014 at 12:56 PM, Ognen Duzlevski og...@plainvanillagames.com wrote: On 3/18/14, 4:49 AM, dmpou...@gmail.com wrote: On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia wrote: Is there a reason for spark using the older akka? On Sun, Mar 2, 2014 at 1:53 PM, 1esha

How to save as a single file efficiently?

2014-03-21 Thread Aureliano Buendia
Hi, Our spark app reduces a few 100 gb of data to a few 100 kb of csv. We found that a partition number of 1000 is a good number to speed the process up. However, it does not make sense to have 1000 pieces of csv files each less than 1 kb. We used RDD.coalesce(1) to get only 1 csv file, but
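One alternative, which is an assumption here and not something proposed in the thread, is to keep the 1000 partitions for the computation, let Spark write its usual `part-NNNNN` files, and concatenate those small files afterwards. The sketch below does this for a directory on the local filesystem; `merge_part_files` and its parameters are hypothetical names, and output stored on HDFS would first have to be copied locally.

```python
import glob
import os
import shutil


def merge_part_files(part_dir, merged_path, pattern="part-*"):
    """Concatenate the part files in part_dir, in name order, into merged_path.

    Returns the number of part files merged. Files are copied byte for byte,
    so each part is expected to end with a newline already.
    """
    # Sorting by name keeps part-00000, part-00001, ... in partition order
    part_paths = sorted(glob.glob(os.path.join(part_dir, pattern)))
    with open(merged_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                shutil.copyfileobj(part, merged)
    return len(part_paths)
```

Since the total output is only a few 100 kb, the concatenation itself is cheap; the trade-off is an extra step outside the Spark job.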

Re: How to save as a single file efficiently?

2014-03-21 Thread Aureliano Buendia
through one reduce node for writing it out. That's probably the fastest it will get. No need to cache if you do that. Matei On Mar 21, 2014, at 4:04 PM, Aureliano Buendia buendia...@gmail.com wrote: Hi, Our spark app reduces a few 100 gb of data to a few 100 kb of csv. We found

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-21 Thread Aureliano Buendia
I think you bumped the wrong thread. As I mentioned in the other thread: saveAsHadoopFile only applies compression when the codec is available, and it does not seem to respect the global hadoop compression properties. I'm not sure if this is a feature, or a bug in spark. If this is a feature,

Reading back a sorted RDD

2014-03-13 Thread Aureliano Buendia
Hi, After sorting an RDD and writing to hadoop, would the RDD still be sorted when reading it back? Can sorting be guaranteed after reading back, when the RDD was written as 1 partition with rdd.coalesce(1)?

State of spark docker script

2014-03-09 Thread Aureliano Buendia
Hi, Is the spark docker script now mature enough to replace the spark-ec2 script? Is anyone here using the docker script in production?

Re: Spark streaming on ec2

2014-02-28 Thread Aureliano Buendia
Also, in this talk http://www.youtube.com/watch?v=OhpjgaBVUtU on using spark streaming in production, the author seems to have missed the topic of how to manage cloud instances. On Fri, Feb 28, 2014 at 6:48 PM, Aureliano Buendia buendia...@gmail.com wrote: What's the updated way of deploying

Spark stream example SimpleZeroMQPublisher high cpu usage

2014-02-28 Thread Aureliano Buendia
Hi, Running: ./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher tcp://127.0.1.1:1234 foo causes over 100% cpu usage on os x. Given that it's just a simple zmq publisher, this shouldn't be expected. Is there something wrong with that example?

Re: Spark app gets slower as it gets executed more times

2014-02-27 Thread Aureliano Buendia
the spark app, or the spark cluster? How is it possible to gracefully shut down a spark app? (2) buildup of logs in the work/ directory or files in the Spark tmp directory, and (3) bug in Spark (woo!). On Tue, Feb 4, 2014 at 5:58 AM, Aureliano Buendia buendia...@gmail.com wrote: On Mon

Re: Spark streaming on ec2

2014-02-27 Thread Aureliano Buendia
source support as well? (E.g. kafka requires setting up zookeeper). On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia buendia...@gmail.com wrote: Hi, Does the ec2 support for spark 0.9 also include spark streaming? If not, is there an equivalent?