Re: Spark MLlib vs BIDMach Benchmark

2014-07-27 Thread Matei Zaharia
These numbers are from GPUs and Intel MKL (a closed-source math library for Intel processors), where for CPU-bound algorithms you are going to get faster speeds than MLlib's JBLAS. However, there's in theory nothing preventing the use of these in MLlib (e.g. if you have a faster BLAS locally;

Re: Spilling in-memory... messages in log even with MEMORY_ONLY

2014-07-27 Thread lokesh.gidra
I am comparing the total time spent finishing the job. To be precise, the comparison is on a 48-core machine: local[48] vs. standalone mode with 8 nodes of 6 cores each (totalling 48 cores) on localhost. In this comparison, the standalone mode

Re: Spilling in-memory... messages in log even with MEMORY_ONLY

2014-07-27 Thread Aaron Davidson
I see. There should not be a significant algorithmic difference between those two cases, as far as I can think, but there is a good bit of local-mode-only logic in Spark. One typical problem we see on large-heap, many-core JVMs, though, is much more time spent in garbage collection. I'm not sure

MLlib NNLS implementation is buggy, returning wrong solutions

2014-07-27 Thread Aureliano Buendia
Hi, The recently added NNLS implementation in MLlib returns wrong solutions. This is not data specific: just try any data in R's nnls, and then the same data in MLlib's NNLS. The results are very different. Also, the selected algorithm, Polyak (1969), is not the best one around. The most popular one
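
For anyone who wants to cross-check NNLS results on small inputs, here is a minimal sketch of a reference solver. It is plain projected gradient descent written for this note: it is not MLlib's implementation and not the Lawson-Hanson method used by R's nnls. The function name `nnls_projected_gradient` and the iteration count are assumptions made for illustration, and the input matrix is assumed to be non-zero.

```python
def nnls_projected_gradient(A, b, iterations=2000):
    """Approximately minimise ||A x - b||^2 subject to x >= 0.

    A is a list of rows (must contain at least one non-zero entry),
    b is a list of floats. Hypothetical helper, for cross-checking only.
    """
    rows, cols = len(A), len(A[0])
    # The squared Frobenius norm bounds the largest eigenvalue of A^T A,
    # so 1 / ||A||_F^2 is a safe (conservative) step size.
    step = 1.0 / sum(v * v for row in A for v in row)
    x = [0.0] * cols
    for _ in range(iterations):
        # residual = A x - b
        residual = [
            sum(A[i][j] * x[j] for j in range(cols)) - b[i]
            for i in range(rows)
        ]
        # gradient of 0.5 * ||A x - b||^2 is A^T (A x - b)
        gradient = [
            sum(A[i][j] * residual[i] for i in range(rows))
            for j in range(cols)
        ]
        # gradient step, then projection onto the non-negative orthant
        x = [max(0.0, x[j] - step * gradient[j]) for j in range(cols)]
    return x


if __name__ == "__main__":
    # The unconstrained solution would be (1, -1); the constraint clamps
    # the second coordinate at zero.
    print(nnls_projected_gradient([[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0]))
```

The fixed step size makes this slow on ill-conditioned problems, so it is only meant for comparing small cases against the output of another solver.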

Re: SparkSQL extensions

2014-07-27 Thread Michael Armbrust
Ah, I understand now. That sounds pretty useful and is something we would currently plan very inefficiently. On Sun, Jul 27, 2014 at 1:07 AM, Christos Kozanitis kozani...@berkeley.edu wrote: Thanks Michael for the recommendations. Actually the region-join (or I could name it range-join or

Re: Kmeans: set initial centers explicitly

2014-07-27 Thread Xiangrui Meng
I think this is nice to have. Feel free to create a JIRA for it and it would be great if you can send a PR. Thanks! -Xiangrui On Thu, Jul 24, 2014 at 12:39 PM, SK skrishna...@gmail.com wrote: Hi, The mllib.clustering.kmeans implementation supports a random or parallel initialization mode to
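
To illustrate what the request amounts to, here is a minimal sketch of k-means (Lloyd's algorithm) that starts from caller-supplied centers instead of a random or parallel initialization. It is plain Python rather than the MLlib API; the function name `kmeans_with_initial_centers` and its parameters are hypothetical and chosen only for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_with_initial_centers(points, initial_centers, max_iterations=20):
    """Run Lloyd's algorithm starting from explicitly given centers.

    points and initial_centers are sequences of equal-length numeric tuples.
    Hypothetical helper, not part of any library.
    """
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: squared_distance(p, centers[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(center))
                ))
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(center)
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 0), (10, 2)]
    print(kmeans_with_initial_centers(data, [(1, 1), (9, 1)]))
```

Because the starting centers are fixed, repeated runs on the same data give the same result, which is the main practical reason to want explicit initialization.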

Spark as a application library vs infra

2014-07-27 Thread Mayur Rustagi
Based on some discussions with my application users, I have been trying to come up with a standard way to deploy applications built on Spark: 1. Bundle the version of Spark with your application and ask users to store it in HDFS before referring to it in YARN to boot your application 2. Provide ways

RE: Strange exception on coalesce()

2014-07-27 Thread innowireless TaeYun Kim
Thank you. It works. (I've applied the changed source code to my local 1.0.0 source) -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Friday, July 25, 2014 11:47 PM To: user@spark.apache.org Subject: Re: Strange exception on coalesce() I'm pretty sure this was

Re: Spark as a application library vs infra

2014-07-27 Thread Tobias Pfeiffer
Mayur, I don't know if I exactly understand the context of what you are asking, but let me just mention issues I had with deploying. * As my application is a streaming application, it doesn't read any files from disk, so I have no Hadoop/HDFS in place and there is no need for it,

spark checkpoint details

2014-07-27 Thread Madabhattula Rajesh Kumar
Hi Team, Could you please help me with the query below? I'm using JavaStreamingContext to read streaming files from an HDFS shared directory. When I start the Spark Streaming job, it reads files from the HDFS shared directory and does some processing. When I stop and restart the job, it is again reading old

Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-27 Thread Jianshi Huang
Hi Andrew, Thanks for the reply, I figured out the cause of the issue. Some resource files were missing in JARs. A class initialization depends on the resource files so it got that exception. I appended the resource files explicitly to --jars option and it worked fine. The Caused by... messages