Spark's Behavior 2

2014-05-13 Thread Eduardo Costa Alfaia
Hi TD, I have sent more information, now using 8 workers. The gap is now 27 seconds. Have you seen it? Thanks, BR

Re: Accuracy in mllib BinaryClassificationMetrics

2014-05-13 Thread Xiangrui Meng
Hi Deb, feel free to add accuracy along with precision and recall. -Xiangrui On Mon, May 12, 2014 at 1:26 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I see precision and recall but no accuracy in mllib.evaluation.binary. Is it already under development, or does it need to be added?

Re: Dead lock running multiple Spark jobs on Mesos

2014-05-13 Thread Andrew Ash
Are you setting a core limit with spark.cores.max? If you don't, in coarse mode each Spark job uses all available cores on Mesos and doesn't release them until the job terminates, at which point the other job can access the cores. https://spark.apache.org/docs/latest/running-on-mesos.html

something about pipeline

2014-05-13 Thread wxhsdp
Dear all, definition of fetch wait time: Time the task spent waiting for remote shuffle blocks. This only includes the time blocking on shuffle input data. For instance if block B is being fetched while the task is still not finished processing block A, it is not considered to

Re: How to read a multipart s3 file?

2014-05-13 Thread kamatsuoka
Thanks Nicholas! I looked at those docs several times without noticing that critical part you highlighted.

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Mayur Rustagi
We are running into the same issue. After 700 or so files the stack overflows; cache, persist, and checkpointing don't help. Basically, checkpointing only saves the RDD when it is materialized, and it only materializes at the end, by which point it runs out of stack. Regards, Mayur

Re: Doubts regarding Shark

2014-05-13 Thread Mayur Rustagi
The table will be cached but 10GB (most likely more) would be on disk. You can check that in the Storage tab of the Shark application. The Java out-of-memory error could be because your worker memory is too low or the memory allocated to Shark is too low. Mayur

Re: Caching in graphX

2014-05-13 Thread ankurdave
Unfortunately it's very difficult to get uncaching right with GraphX due to the complicated internal dependency structure that it creates. It's necessary to know exactly what operations you're doing on the graph in order to unpersist correctly (i.e., in a way that avoids recomputation). I have a

Re: Is any idea on architecture based on Spark + Spray + Akka

2014-05-13 Thread Chester At Yahoo
We are using a Spray + Akka + Spark stack at Alpine Data Labs. Chester On May 4, 2014, at 8:37 PM, ZhangYi yizh...@thoughtworks.com wrote: Hi all, Currently, our project is planning to adopt Spark as its big data platform. For the client side, we decided to expose REST

no subject

2014-05-13 Thread Herman, Matt (CORP)
unsubscribe

Re: Is there any problem on the spark mailing list?

2014-05-13 Thread wxhsdp
I think so; there have been fewer questions and answers these past three days.

Re: Variables outside of mapPartitions scope

2014-05-13 Thread ankurdave
In general, you can find out exactly what's not serializable by adding -Dsun.io.serialization.extendedDebugInfo=true to SPARK_JAVA_OPTS. Since a this reference to the enclosing class is often what's causing the problem, a general workaround is to move the mapPartitions call to a static method
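To illustrate the 'this' capture mentioned above, here is a minimal sketch in plain Scala, with no Spark involved. The names Job, JobFunctions, and ClosureCheck are hypothetical and exist only for this example. The local-copy variant is a commonly used alternative, not something proposed in this thread. Scala function literals are themselves serializable, so whether Java serialization succeeds depends on what the closure captures.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical driver-side class that is NOT Serializable.
class Job(val offset: Int) {
  // Reads the field `offset`, so the closure captures `this` (the whole Job).
  def capturing: Int => Int = x => x + offset

  // Alternative: copy the field into a local val so only the Int is captured.
  def localCopy: Int => Int = {
    val off = offset
    x => x + off
  }
}

// A standalone object plays the role of a "static method" holder in Scala:
// the closure built here only captures its Int argument.
object JobFunctions {
  def addOffset(off: Int): Int => Int = x => x + off
}

object ClosureCheck {
  // Attempts Java serialization, which is what Spark does when shipping a task.
  def isSerializable(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

In this sketch, ClosureCheck.isSerializable(new Job(1).capturing) fails because serializing the closure drags in the enclosing Job, while the other two variants only carry an Int. The same reasoning applies to a function passed to mapPartitions.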

Re: A new resource for getting examples of Spark RDD API calls

2014-05-13 Thread Gerard Maas
Hi Zhen, Thanks a lot for sharing. I'm sure it will be useful for new users. A small note on the 'checkpoint' explanation, sc.setCheckpointDir(my_directory_name): it would be useful to specify that 'my_directory_name' should exist on all slaves. As an alternative you could use an HDFS directory

Caching in graphX

2014-05-13 Thread Franco Avi
Hi, I'm writing this post because I would like to know a caching approach for iterative algorithms in GraphX. So far I have not been able to keep the execution time of each iteration stable. Is it possible to achieve this? The code I used is this: var g = ... // my graph var prevG: Graph[VD, ED] = null

Re: A new resource for getting examples of Spark RDD API calls

2014-05-13 Thread Flavio Pompermaier
Great work! Thanks! On May 13, 2014 3:16 AM, zhen z...@latrobe.edu.au wrote: Hi Everyone, I found it quite difficult to find good examples for Spark RDD API calls. So my student and I decided to go through the entire API and write examples for the vast majority of API calls (basically

Re: Variables outside of mapPartitions scope

2014-05-13 Thread DB Tsai
Scala's for-loop is not just looping; it's not native looping at the bytecode level. It creates a couple of objects at runtime and performs a truckload of method calls on them. As a result, if you are referring to variables outside the for-loop, the whole for-loop object and any variable inside
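To make the point about for-loops concrete, here is a small sketch in plain Scala. The object name LoopForms is made up for illustration. A for over a Range is rewritten by the compiler into a foreach call that takes a closure, so sumFor and sumForeach are roughly the same program, and the closure is the object that ends up holding references to whatever outside variables the loop body touches. sumWhile is the form that compiles to a plain bytecode loop.

```scala
object LoopForms {
  // for-comprehension: rewritten by the compiler into Range.foreach(closure).
  def sumFor(n: Int): Long = {
    var sum = 0L
    for (i <- 0 until n) sum += i
    sum
  }

  // Roughly what the for-loop above expands to: a Range object plus a
  // closure object that captures `sum`.
  def sumForeach(n: Int): Long = {
    var sum = 0L
    (0 until n).foreach(i => sum += i)
    sum
  }

  // while loop: no Range and no closure, just a counter and a jump.
  def sumWhile(n: Int): Long = {
    var sum = 0L
    var i = 0
    while (i < n) {
      sum += i
      i += 1
    }
    sum
  }
}
```

All three return the same value for a given n. The difference is only in what gets allocated and captured, which is what matters when the loop body ends up inside a closure that Spark has to serialize.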

Re: Turn BLAS on MacOSX

2014-05-13 Thread DB Tsai
Hi wxhsdp, See https://github.com/scalanlp/breeze/issues/142 and https://github.com/fommil/netlib-java/issues/60 for details. Sincerely, DB Tsai On Tue, May

Re: Reading from .bz2 files with Spark

2014-05-13 Thread Xiangrui Meng
Which Hadoop version did you use? I'm not sure whether Hadoop v2 fixes the problem you described, but it does contain several fixes to the bzip2 format. -Xiangrui On Wed, May 7, 2014 at 9:19 PM, Andrew Ash and...@andrewash.com wrote: Hi all, Is anyone reading and writing to .bz2 files stored in

Re: 1.0.0 Release Date?

2014-05-13 Thread Anurag Tangri
Hi All, We are also waiting for this. Does anyone know of a tentative date for this release? We are at Spark 0.8.0 right now. Should we wait for Spark 1.0 or upgrade to Spark 0.9.1? Thanks, Anurag Tangri On Tue, May 13, 2014 at 9:40 AM, bhusted brian.hus...@gmail.com wrote: Can anyone

Turn BLAS on MacOSX

2014-05-13 Thread Debasish Das
Hi, How do I load native BLAS libraries on a Mac? I am getting the following errors while running LR and SVM with SGD: 14/05/07 10:48:13 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 14/05/07 10:48:13 WARN BLAS: Failed to load implementation from:

Re: Spark to utilize HDFS's mmap caching

2014-05-13 Thread Marcelo Vanzin
On Mon, May 12, 2014 at 12:14 PM, Matei Zaharia matei.zaha...@gmail.com wrote: That API is something the HDFS administrator uses outside of any application to tell HDFS to cache certain files or directories. But once you’ve done that, any existing HDFS client accesses them directly from the

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Guanhua Yan
Thanks Xiangrui. After some debugging efforts, it turns out that the problem results from a bug in my code. But it's good to know that a long lineage could also lead to this problem. I will also try checkpointing to see whether the performance can be improved. Best regards, - Guanhua On 5/13/14

Re: java.lang.StackOverflowError when calling count()

2014-05-13 Thread Mayur Rustagi
Count causes the overall performance to drop drastically. In fact, beyond 50 files it starts to hang if I force materialization. Regards, Mayur On Tue, May 13, 2014 at 9:34 PM,

Re: Spark to utilize HDFS's mmap caching

2014-05-13 Thread Chanwit Kaewkasi
Great to know that! Thank you, Matei. Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit On Tue, May 13, 2014 at 2:14 AM, Matei Zaharia matei.zaha...@gmail.com wrote: That API is something the HDFS administrator uses outside of any application to tell HDFS to cache certain