breeze DGEMM slow in spark

2014-05-17 Thread wxhsdp
2014-05-17 Thread wxhsdp
Dear all, I'm testing double-precision matrix multiplication in Spark on EC2 m1.large machines. I use the Breeze linalg library, which internally calls a native library (OpenBLAS, Nehalem build, single-threaded). m1.large: model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz cpu MHz :
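
For reference, a minimal sketch of the kind of timing harness this likely involves, using Breeze directly (outside Spark); the matrix size is illustrative, and the note about native DGEMM dispatch assumes the netlib natives are actually on the classpath:

```scala
import breeze.linalg.DenseMatrix

object DgemmTiming {
  def main(args: Array[String]): Unit = {
    val n = 2000                               // matrix dimension; illustrative, not from the thread
    val a = DenseMatrix.rand(n, n)
    val b = DenseMatrix.rand(n, n)

    val start = System.nanoTime()
    val c = a * b                              // double-precision multiply; backed by BLAS DGEMM when natives are loaded
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"$n x $n DGEMM: $elapsedMs ms (c(0,0) = ${c(0, 0)})")
  }
}
```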

Apache Spark Throws java.lang.IllegalStateException: unread block data

2014-05-17 Thread sam
2014-05-17 Thread sam
What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with the CDH4 (and, on another cluster, CDH5) distros of Hadoop/HDFS. 2. Building a fat jar of a Spark app with sbt, then trying to run it on the cluster. I've also included code snippets, and sbt
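
A minimal build.sbt sketch of step 2, assuming sbt-assembly is used to build the fat jar; the version strings and the CDH artifact are illustrative, and marking Spark and Hadoop as "provided" is one common way to keep duplicate classes out of the assembly:

```scala
// build.sbt (sketch; versions are illustrative, not from the thread)
name := "spark-app"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  // "provided" keeps Spark's own classes out of the fat jar, so the cluster's copy is used
  "org.apache.spark" %% "spark-core" % "0.9.1" % "provided",
  // match the Hadoop client to the cluster's CDH distribution
  "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.6.0" % "provided"
)
```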

Re: Benchmarking Spark with YCSB

2014-05-17 Thread bhusted
2014-05-17 Thread bhusted
Thanks Jay. I honestly think I just had a senior moment or something; I was getting HiBench and YCSB confused. Has anyone attempted to port HiBench to Spark? HiBench performs a lot of map/reduce, and it would be a very interesting comparison for us.

Re: Using mongo with PySpark

2014-05-17 Thread Nicholas Chammas
2014-05-17 Thread Nicholas Chammas
Where's your driver code (the code interacting with the RDDs)? Are you getting serialization errors? On Saturday, May 17, 2014, Samarth Mailinglist mailinglistsama...@gmail.com wrote: Hi all, I am trying to store the results of a reduce into mongo. I want to share the variable collection in the

Re: Worker re-spawn and dynamic node joining

2014-05-17 Thread Nicholas Chammas
2014-05-17 Thread Nicholas Chammas
Thanks for the info about adding/removing nodes dynamically. That's valuable. On Friday, May 16, 2014, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Han :) 1. Is there a way to automatically re-spawn Spark workers? We have situations where an executor OOM causes the worker process to be DEAD, and it does

Re: Worker re-spawn and dynamic node joining

2014-05-17 Thread Mayur Rustagi
2014-05-17 Thread Mayur Rustagi
A better way would be to use Mesos (and quite possibly YARN in 1.0.0). That will allow you to add nodes on the fly and leverage them for Spark. Frankly, Standalone mode is not meant to handle those issues. That said, we use our deployment tool, as stopping the cluster to add nodes is not really an issue

Re: Using mongo with PySpark

2014-05-17 Thread Mayur Rustagi
2014-05-17 Thread Mayur Rustagi
Ideally you have to pass the MongoClient object along with your data in the mapper (Python should try to serialize your MongoClient, but explicit is better). If the client is serializable then all should end well; if not, then you are better off using mapPartitions, initializing the driver in each
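
A sketch of the mapPartitions-style approach Mayur describes, written in Scala against the Mongo Java driver (the thread itself is about PySpark, but the pattern is the same); the input path, host, database, and collection names are placeholders:

```scala
import com.mongodb.{BasicDBObject, MongoClient}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Assumes an existing SparkContext; all names and paths below are placeholders.
def saveCountsToMongo(sc: SparkContext): Unit = {
  val counts = sc.textFile("hdfs:///path/to/input")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  counts.foreachPartition { partition =>
    // One client per partition, created on the worker, so nothing has to be serialized
    val client = new MongoClient("mongo-host", 27017)
    try {
      val coll = client.getDB("results").getCollection("word_counts")
      partition.foreach { case (word, count) =>
        coll.insert(new BasicDBObject("word", word).append("count", count))
      }
    } finally {
      client.close()
    }
  }
}
```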

Re: Historical Data as Stream

2014-05-17 Thread Mayur Rustagi
2014-05-17 Thread Mayur Rustagi
The real question is why you are looking to consume the file as a stream: 1. Too big to load as an RDD. 2. To operate on it in a sequential manner. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, May 17, 2014 at 5:12 AM, Soumya Simanta

Re: Historical Data as Stream

2014-05-17 Thread Laeeq Ahmed
2014-05-17 Thread Laeeq Ahmed
@Soumya Simanta Right now it's just a proof of concept; later I will have a real stream. It's EEG files of the brain, and later it can be used for real-time analysis of EEG streams. @Mayur The size is huge, yes. So it's better to do it in a distributed manner, and as I said above, I want to read it as a stream

Configuring Spark for reduceByKey on massive data sets

2014-05-17 Thread Daniel Mahler
2014-05-17 Thread Daniel Mahler
I have had a lot of success with Spark on large datasets, both in terms of performance and flexibility. However, I hit a wall with reduceByKey when the RDD contains billions of items. I am reducing with simple functions like addition for building histograms, so the reduction process should be
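
One knob that often matters here is the number of reduce partitions, which reduceByKey accepts as a second argument. A sketch under assumed names and paths (the bucketing rule and the count of 2048 are illustrative, not values from the thread):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings reduceByKey into scope on pair RDDs (Spark 0.9.x)

def buildHistogram(sc: SparkContext): Unit = {
  val counts = sc.textFile("hdfs:///path/to/events")
    .map(line => (line.length / 10, 1L))     // toy bucketing rule, standing in for the real key
    .reduceByKey(_ + _, 2048)                // explicit partition count keeps each reduce task's shuffle data small
  counts.saveAsTextFile("hdfs:///path/to/histogram")
}
```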

Benchmarking Graphx

2014-05-17 Thread Hari
2014-05-17 Thread Hari
Hi, I want to do some benchmarking tests (run time and memory) for one of the GraphX examples, let's say PageRank, on my single-processor PC to start with. a) Is there a way to get the total time taken for the execution from start to finish? b) The log4j properties need to be modified to turn off logging,
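
A sketch of one way to address both points, wrapping GraphX's built-in PageRank in a wall-clock timer and quieting log4j programmatically; the edge-list path and the tolerance are placeholders:

```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader

// Assumes an existing SparkContext; path and tolerance are placeholders.
def timePageRank(sc: SparkContext): Unit = {
  Logger.getRootLogger.setLevel(Level.WARN)        // b) turn down logging without editing log4j.properties

  val start = System.nanoTime()
  val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt")
  val ranks = graph.pageRank(0.0001).vertices
  println(s"sample vertex: ${ranks.take(1).mkString}")
  val seconds = (System.nanoTime() - start) / 1e9  // a) total wall-clock time, start to finish
  println(f"PageRank took $seconds%.1f s")
}
```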

Re: Configuring Spark for reduceByKey on massive data sets

2014-05-17 Thread Madhu
2014-05-17 Thread Madhu
Daniel, how many partitions do you have? Are they more or less uniformly distributed? We have a similar data volume currently running well on Hadoop MapReduce with roughly 30 nodes. I was planning to test it with Spark. I'm very interested in your findings. - Madhu
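
For checking the two things Madhu asks about, a small hypothetical helper can report the partition count and per-partition record counts:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: prints the partition count and per-partition record counts,
// which makes skew easy to spot.
def describePartitions[T](rdd: RDD[T]): Unit = {
  println(s"partitions = ${rdd.partitions.length}")
  val sizes = rdd.mapPartitions(it => Iterator(it.size), preservesPartitioning = true).collect()
  sizes.zipWithIndex.foreach { case (n, i) => println(s"partition $i: $n records") }
}
```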

Re: Historical Data as Stream

2014-05-17 Thread Soumya Simanta
2014-05-17 Thread Soumya Simanta
@Laeeq - please see this example. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala#L47-L49 On Sat, May 17, 2014 at 2:06 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: @Soumya Simanta Right now it's just a proof of
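
For completeness, a minimal sketch of the pattern that example uses: ssc.textFileStream treats files dropped into a directory as a stream. The directory path, batch interval, and app name below are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object FileStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Every new file that appears in the directory becomes a batch of lines
    val lines = ssc.textFileStream("hdfs:///path/to/eeg-drop-dir")
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```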

Unsubscribe

2014-05-17 Thread A.Khanolkar

Re: breeze DGEMM slow in spark

2014-05-17 Thread wxhsdp
2014-05-17 Thread wxhsdp
I think maybe it's related to m1.large, because I also tested on my laptop and the two cases cost nearly the same amount of time. My laptop: model name : Intel(R) Core(TM) i5-3380M CPU @ 2.90GHz cpu MHz : 2893.549 os: Linux ubuntu 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC

Text file and shuffle

2014-05-17 Thread Puneet Lakhina
2014-05-17 Thread Puneet Lakhina
Hi, I'm new to Spark and I wanted to understand a few things conceptually so that I can optimize my Spark job. I have a large text file (~14 GB, 200k lines). This file is available on each worker node of my Spark cluster. The job I run calls sc.textFile(...).flatMap(...). The function that I
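
A sketch of the pipeline being described, assuming an existing SparkContext; the paths and token logic are placeholders. flatMap itself is a narrow transformation, so no shuffle happens until a wide operation such as reduceByKey runs:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

def tokenCounts(sc: SparkContext): Unit = {
  val lines = sc.textFile("file:///data/big-input.txt")     // the ~14 GB file present on every worker
  val tokens = lines.flatMap(_.split("\\s+"))               // narrow transformation: no shuffle yet
  val counts = tokens.map((_, 1)).reduceByKey(_ + _)        // reduceByKey is what triggers the shuffle
  counts.saveAsTextFile("file:///data/token-counts")
}
```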

Re: breeze DGEMM slow in spark

2014-05-17 Thread Xiangrui Meng
2014-05-17 Thread Xiangrui Meng
You need to include breeze-natives or netlib:all to load the native libraries. Check the log messages to ensure native libraries are used, especially on the worker nodes. The easiest way to use OpenBLAS is to copy the shared library to /usr/lib/libblas.so.3 and /usr/lib/liblapack.so.3. -Xiangrui
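
A build.sbt sketch of the first suggestion; the Breeze and netlib version numbers are illustrative:

```scala
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.7",
  // either of these pulls in netlib-java's native loaders so a system OpenBLAS can be picked up:
  "org.scalanlp" %% "breeze-natives" % "0.7"
  // or: "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
)
```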