Re: SparkSQL: Nested Query error

2014-10-30 Thread SK
Hi, I am getting an error in the Query Plan when I use the SQL statement exactly as you have suggested. Is that the exact SQL statement I should be using (I am not very familiar with SQL syntax)? I also tried using the SchemaRDD's subtract method to perform this query.

Spark Debugging

2014-10-30 Thread Naveen Kumar Pokala
Hi, I have installed a 2-node Hadoop cluster (for example, on Unix machines A and B: A is the master node and a data node, B is a data node). I am submitting my driver programs through Spark 1.1.0 with bin/spark-submit from a PuTTY client on my Windows machine. I want to debug my program from Eclipse

Re: Spark Debugging

2014-10-30 Thread Akhil Das
It's called remote debugging. You can read this article http://www.eclipse.org/jetty/documentation/current/enable-remote-debugging.html for more information. You will have to make sure that your cluster and your Windows machine can communicate with each other over the network. Thanks Best Regards
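
A minimal sketch of how this is usually wired up with the standard JDWP agent flags; the class name, jar, and port below are placeholders, not anything from this thread:

    ./bin/spark-submit \
      --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
      --class com.example.MyDriver my-app.jar

Eclipse can then attach a "Remote Java Application" debug configuration to port 5005 on the machine running the driver.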

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Sean Owen
Can you be more specific about the numbers? I am not sure that splitting helps much in the end, since it has the same effect as executing, a smaller number at a time, the large number of tasks that the full cartesian join would generate. The full join is probably intractable no matter what in

Re: Algebird using spark-shell

2014-10-30 Thread thadude
I've tried: ./bin/spark-shell --jars algebird-core_2.10-0.8.1.jar scala> import com.twitter.algebird._ import com.twitter.algebird._ scala> import HyperLogLog._ import HyperLogLog._ scala> import com.twitter.algebird.HyperLogLogMonoid import com.twitter.algebird.HyperLogLogMonoid scala> val

NonSerializable Exception in foreachRDD

2014-10-30 Thread Harold Nguyen
Hi all, In Spark Streaming, when I do foreachRDD on my DStreams, I get a NonSerializable exception when I try to do something like: DStream.foreachRDD( rdd => { var x = sc.parallelize(Seq(("test", "blah"))) }) Is there any way around that? Thanks, Harold
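
A minimal sketch of the usual workaround, assuming a (key, value) DStream and illustrative names: the function passed to foreachRDD runs on the driver, so the batch RDD it receives can be used directly; anything captured by closures that are shipped to executors (map, foreachPartition) must be serializable, and the SparkContext never is.

    dstream.foreachRDD { rdd =>
      // driver-side code: work with the provided rdd; no need to build a new RDD here
      rdd.foreachPartition { records =>
        // executor-side code: only reference serializable, per-partition state
        records.foreach { case (k, v) => println(s"$k -> $v") }
      }
    }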

Re: BUG: when running as extends App, closures don't capture variables

2014-10-30 Thread Sean Owen
Very coincidentally I ran into something equally puzzling yesterday where something was bizarrely null when it can't have been in a Spark program that extends App. I also changed to use main() and it works fine. So definitely some issue here. If nobody makes a JIRA before I get home I'll do it. On

sharing RDDs between PySpark and Scala

2014-10-30 Thread rok
I'm processing some data using PySpark and I'd like to save the RDDs to disk (they are (k,v) RDDs of strings and SparseVector types) and read them in using Scala to run them through some other analysis. Is this possible? Thanks, Rok

Re: Spark Debugging

2014-10-30 Thread Akhil Das
Thanks Best Regards On Thu, Oct 30, 2014 at 1:43 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Awesome. Thanks Best Regards On Thu, Oct 30, 2014 at 1:30 PM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Thanks Akhil, It is working. Regards, Naveen *From:* Akhil Das

Spark + Tableau

2014-10-30 Thread Bojan Kostic
I'm testing the beta driver from Databricks for Tableau. And unfortunately I encountered some issues. While the beeline connection works without problems, Tableau can't connect to the Spark thrift server. Error from the driver (Tableau): Unable to connect to the ODBC Data Source. Check that the necessary drivers
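
For reference, a hedged sketch of the working path via beeline (host names are placeholders; 10000 is the default thrift server port):

    ./sbin/start-thriftserver.sh --master spark://spark-master:7077
    ./bin/beeline -u jdbc:hive2://thrift-host:10000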

Getting vector values

2014-10-30 Thread Andrejs Abele
Hi, I'm new to MLlib and Spark. I'm trying to use tf-idf and use those values for term ranking. I'm getting tf values in vector format, but how can I get the values of the vector? val sc = new SparkContext(conf) val documents: RDD[Seq[String]] =
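
A small sketch of one way to get at the individual values, assuming the HashingTF/IDF pipeline from MLlib 1.1 and an illustrative input path:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val documents: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)
    val tf: RDD[Vector] = new HashingTF().transform(documents)
    tf.cache()
    val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

    // a Vector can be unpacked into plain doubles for ranking
    val values: Array[Double] = tfidf.first().toArray
    val topTerms = values.zipWithIndex.sortBy(-_._1).take(10)   // (weight, hashed term index)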

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Ilya Ganelin
The split is something like 30 million into 2 million partitions. The reason that it becomes tractable is that after I perform the Cartesian on the split data and operate on it I don't keep the full results - I actually only keep a tiny fraction of that generated dataset - making the overall

Returned type of Broadcast variable is byte array

2014-10-30 Thread Stephen Boesch
As a template for creating a broadcast variable, the following code snippet within mllib was used: val bcIdf = dataset.context.broadcast(idf) dataset.mapPartitions { iter => val thisIdf = bcIdf.value The new code follows that model: import org.apache.spark.mllib.linalg.{Vector =>

Re: Spark + Tableau

2014-10-30 Thread Jimmy
What ODBC driver are you using? We recently got the Hortonworks JODBC drivers working on a Windows box but were having issues with Mac Sent from my iPhone On Oct 30, 2014, at 4:23 AM, Bojan Kostic blood9ra...@gmail.com wrote: I'm testing the beta driver from Databricks for Tableau. And

Re: Spark + Tableau

2014-10-30 Thread Bojan Kostic
I use the beta SQL ODBC driver from Databricks.

Using a Database to persist and load data from

2014-10-30 Thread Asaf Lahav
Hi Ladies and Gents, I would like to know what options I have if I would like to leverage Spark code I have already written to use a DB (Vertica) as its store/datasource. The data is of a tabular nature, so any relational DB could essentially be used. Do I need to develop a context? If yes,

issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Hi, Previously we applied the SVM algorithm in MLlib to 5 million records (600 MB); it takes more than 25 minutes to finish. The Spark version we are using is 1.0 and we were running this program on a 4-node cluster. Each node has 4 CPU cores and 11 GB RAM. The 5 million records only have two

Re: Algebird using spark-shell

2014-10-30 Thread Ian O'Connell
What's the error with the 2.10 version of algebird? On Thu, Oct 30, 2014 at 12:49 AM, thadude ohpre...@yahoo.com wrote: I've tried: ./bin/spark-shell --jars algebird-core_2.10-0.8.1.jar scala> import com.twitter.algebird._ import com.twitter.algebird._ scala> import HyperLogLog._ import

Re: problem with start-slaves.sh

2014-10-30 Thread Yana Kadiyska
Roberto, I don't think shark is the issue -- I have the shark server running on a node that also acts as a worker. What you can do is turn off the shark server and just run start-all to start your spark cluster. Then you can try bin/spark-shell --master yourmasterip and see if you can successfully run some

RE: problem with start-slaves.sh

2014-10-30 Thread Pagliari, Roberto
I also didn't realize I was trying to bring up the 2nd NameNode as a slave... that might be an issue as well. Thanks, From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Thursday, October 30, 2014 11:27 AM To: Pagliari, Roberto Cc: user@spark.apache.org Subject: Re: problem with

Re: Spark + Tableau

2014-10-30 Thread Denny Lee
When you are starting the thrift server service - are you connecting to it locally or is this on a remote server when you use beeline and/or Tableau? On Thu, Oct 30, 2014 at 8:00 AM, Bojan Kostic blood9ra...@gmail.com wrote: I use beta driver SQL ODBC from Databricks. -- View this message

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Jimmy
Watch the app manager; it should tell you what's running and taking a while... My guess is it's a distinct function on the data. J Sent from my iPhone On Oct 30, 2014, at 8:22 AM, peng xia toxiap...@gmail.com wrote: Hi, Previously we applied the SVM algorithm in MLlib to 5 million records

Best way to partition RDD

2014-10-30 Thread shahab
Hi. I am running an application in Spark which first loads data from Cassandra and then performs some map/reduce jobs. val srdd = sqlContext.sql("select * from mydb.mytable") I noticed that the srdd has only one partition, no matter how big the data loaded from Cassandra is. So I perform

Re: Spark + Tableau

2014-10-30 Thread Bojan Kostic
I'm connecting to it remotely with tableau/beeline. On Thu Oct 30 16:51:13 2014 GMT+0100, Denny Lee [via Apache Spark User List] wrote: When you are starting the thrift server service - are you connecting to it locally or is this on a remote server when you use beeline and/or Tableau?

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Vladimir Rodionov
GC overhead limit exceeded is usually a sign of either an inadequate heap size (not the case here) or an application that produces garbage (temp objects) faster than the garbage collector can collect it - GC consumes most CPU cycles. 17G of Java heap is quite large for many applications and is above safe and

Re: Manipulating RDDs within a DStream

2014-10-30 Thread Harold Nguyen
Hi, Sorry, there's a typo there: val arr = rdd.toArray Harold On Thu, Oct 30, 2014 at 9:58 AM, Harold Nguyen har...@nexgate.com wrote: Hi all, I'd like to be able to modify values in a DStream, and then send it off to an external source like Cassandra, but I keep getting Serialization

Manipulating RDDs within a DStream

2014-10-30 Thread Harold Nguyen
Hi all, I'd like to be able to modify values in a DStream, and then send it off to an external source like Cassandra, but I keep getting Serialization errors and am not sure how to use the correct design pattern. I was wondering if you could help me. I'd like to be able to do the following:
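
One common pattern for this, sketched with a hypothetical client (ExternalClient and its methods are placeholders, not a real API): transform the DStream as usual, then open the connection inside foreachPartition so it is created on the executor and never has to be serialized.

    dstream.map(record => transform(record))            // transform(...) is illustrative
      .foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          val client = ExternalClient.connect("cassandra-host")   // hypothetical helper
          partition.foreach(row => client.write(row))
          client.close()
        }
      }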

Re: Algebird using spark-shell

2014-10-30 Thread Buntu Dev
Thanks.. I was using Scala 2.11.1 and was able to use algebird-core_2.10-0.1.11.jar with spark-shell. On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell i...@ianoconnell.com wrote: Whats the error with the 2.10 version of algebird? On Thu, Oct 30, 2014 at 12:49 AM, thadude ohpre...@yahoo.com

Re: Best way to partition RDD

2014-10-30 Thread shahab
Hi Helena, Well... I am just running a toy example. I have one Cassandra node co-located with the Spark Master and one of the Spark Workers, all on one machine. I have another node which runs the second Spark worker. /Shahab, On Thu, Oct 30, 2014 at 6:12 PM, Helena Edelson

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Xiangrui Meng
Did you cache the data and check the load balancing? How many features? Which API are you using, Scala, Java, or Python? -Xiangrui On Thu, Oct 30, 2014 at 9:13 AM, Jimmy ji...@sellpoints.com wrote: Watch the app manager; it should tell you what's running and taking a while... My guess is it's a

Doing RDD.count in parallel, or at least parallelize it as much as possible?

2014-10-30 Thread shahab
Hi, I noticed that the count (of an RDD) in many of my queries is the most time-consuming one, as it runs in the driver process rather than being done in parallel by the worker nodes. Is there any way to perform count in parallel, or at least parallelize it as much as possible? best, /Shahab

Re: Best way to partition RDD

2014-10-30 Thread Helena Edelson
Shahab, Regardless, WRT cassandra and spark when using the spark cassandra connector, ‘spark.cassandra.input.split.size’ passed into the SparkConf configures the approx number of Cassandra partitions in a Spark partition (default 10). No repartitioning should be necessary with what you
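
A short sketch of where that setting goes, with an explicit repartition shown as a fallback; the values below are only examples:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.cassandra.input.split.size", "10")   // approx Cassandra partitions per Spark partition
    // if the loaded RDD still has too few partitions, an explicit repartition is always possible:
    val srdd2 = srdd.repartition(16)                    // 16 is an illustrative target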

Re: MLLib: libsvm - default value initialization

2014-10-30 Thread Xiangrui Meng
You can remove 0.5 from all non-zeros. -Xiangrui On Wed, Oct 29, 2014 at 9:20 PM, Sameer Tilak ssti...@live.com wrote: Hi All, I have my sparse data in libsvm format. val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, mllib/data/sample_libsvm_data.txt) I am running Linear
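
A sketch of that suggestion, assuming the libsvm loader produced sparse feature vectors (as it does for this format); only the explicitly stored non-zero entries are shifted:

    import org.apache.spark.mllib.linalg.{SparseVector, Vectors}
    import org.apache.spark.mllib.regression.LabeledPoint

    val shifted = examples.map { lp =>
      val sv = lp.features.asInstanceOf[SparseVector]
      LabeledPoint(lp.label, Vectors.sparse(sv.size, sv.indices, sv.values.map(_ - 0.5)))
    }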

k-mean - result interpretation

2014-10-30 Thread mgCl2
Hello everyone, I'm trying to use MLlib's K-means algorithm. I tried it on raw data. Here is an example of a line contained in my input data set: 82.9817 3281.4495 with these parameters: *numClusters*=4 *numIterations*=20 results: *WSSSE = 6.375371241589461E9* Then I normalized my data:
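
For context, a minimal version of such a run (input path and parsing are illustrative); WSSSE comes from computeCost, so its magnitude depends directly on the scale of the input features, which matters once the data is normalized:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("data.txt")
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
    data.cache()
    val model = KMeans.train(data, 4, 20)    // numClusters = 4, numIterations = 20
    val wssse = model.computeCost(data)      // Within Set Sum of Squared Errors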

Re: Returned type of Broadcast variable is byte array

2014-10-30 Thread Stephen Boesch
The byte array turns out to be a serialized ObjectOutputStream that contains a Tuple2[ParallelCollectionRDD,Function2]. What then should be done differently in the broadcast code (which follows the structure of an example taken from mllib)? assert(crows.isInstanceOf[Array[MVector]]) val bcRows

stage failure: java.lang.IllegalStateException: unread block data

2014-10-30 Thread freedafeng
Hi, I got this error when running Spark 1.1.0 to read HBase 0.98.1 through simple Python code in an EC2 cluster. The same program runs correctly in local mode, so this error only happens when running in a real cluster. Here's what I got: 14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.1 in

Re: Doing RDD.count in parallel, or at least parallelize it as much as possible?

2014-10-30 Thread Sameer Farooqui
Hi Shahab, Are you running Spark in Local, Standalone, YARN or Mesos mode? If you're running in Standalone/YARN/Mesos, then the .count() action is indeed automatically parallelized across multiple Executors. When you run a .count() on an RDD, it is actually distributing tasks to different

Re: Doing RDD.count in parallel, or at least parallelize it as much as possible?

2014-10-30 Thread Sameer Farooqui
By the way, in case you haven't done so, do try to .cache() the RDD before running a .count() on it as that could make a big speed improvement. On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui same...@databricks.com wrote: Hi Shahab, Are you running Spark in Local, Standalone, YARN or
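
A tiny illustration of that advice (the path is a placeholder):

    val data = sc.textFile("hdfs:///path/to/data")
    data.cache()     // marks the RDD for caching; nothing is materialized yet
    data.count()     // first action scans the source and populates the cache
    data.count()     // later counts read the cached partitions and are much faster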

Re: Ambiguous references to id : what does it mean ?

2014-10-30 Thread Terry Siu
Found this as I am having the same issue. I have exactly the same usage as shown in Michael's join example. I tried executing a SQL statement against the joined data set with two columns that have the same name, and tried to disambiguate the column name with the table alias, but I would still get an

Re: stage failure: java.lang.IllegalStateException: unread block data

2014-10-30 Thread freedafeng
The worker side has error message as this, 14/10/30 18:29:00 INFO Worker: Asked to launch executor app-20141030182900-0006/0 for testspark_v1 14/10/30 18:29:01 INFO ExecutorRunner: Launch command: java -cp

does updateStateByKey accept a state that is a tuple?

2014-10-30 Thread spr
I'm trying to implement a Spark Streaming program to calculate the number of instances of a given key encountered and the minimum and maximum times at which it was encountered. updateStateByKey seems to be just the thing, but when I define the state to be a tuple, I get compile errors I'm not

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Thanks for all your help. I think I didn't cache the data. My previous cluster has expired and I don't have a chance to check the load balancing or the app manager. Below is my code. There are 18 features for each record and I am using the Scala API. import org.apache.spark.SparkConf import

Re: Algebird using spark-shell

2014-10-30 Thread Ian O'Connell
Algebird 0.8.0 has 2.11 support if you want to run in a 2.11 env. On Thu, Oct 30, 2014 at 10:08 AM, Buntu Dev buntu...@gmail.com wrote: Thanks.. I was using Scala 2.11.1 and was able to use algebird-core_2.10-0.1.11.jar with spark-shell. On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell

Re: Out of memory with Spark Streaming

2014-10-30 Thread Chris Fregly
Curious about why you're only seeing 50 records max per batch. How many receivers are you running? What is the rate at which you're putting data onto the stream? Per the default AWS Kinesis configuration, the producer can do 1000 PUTs per second with max 50 KB per PUT and max 1 MB per second per

Re: Do Spark executors restrict native heap vs JVM heap?

2014-10-30 Thread Sean Owen
No, but the JVM also does not allocate memory for native code on the heap. I don't think the heap has any bearing on whether your native code can allocate more memory, except that of course the heap is also taking memory. On Oct 30, 2014 6:43 PM, Paul Wais pw...@yelp.com wrote: Dear Spark List, I

Re: k-mean - result interpretation

2014-10-30 Thread Sean Owen
Yes that is exactly it. The values are not comparable since normalization is also shrinking all distances. Squared error is not an absolute metric. I haven't thought about this much but maybe you are looking for something like the silhouette coefficient? On Oct 30, 2014 5:35 PM, mgCl2

akka connection refused bug, fix?

2014-10-30 Thread freedafeng
Hi, I saw the same issue as this thread, http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-akka-connection-refused-td9864.html Anyone has a fix for this bug? Please?! The log info in my worker node is like, 14/10/30 20:15:18 INFO Worker: Asked to kill executor

Registering custom metrics

2014-10-30 Thread Gerard Maas
Hi, I've been exploring the metrics exposed by Spark and I'm wondering whether there's a way to register job-specific metrics that could be exposed through the existing metrics system. Would there be an example somewhere? BTW, documentation about how the metrics work could be improved. I

Creating a SchemaRDD from RDD of thrift classes

2014-10-30 Thread ankits
I have one job with spark that creates some RDDs of type X and persists them in memory. The type X is an auto generated Thrift java class (not a case class though). Now in another job, I want to convert the RDD to a SchemaRDD using sqlContext.applySchema(). Can I derive a schema from the thrift

Re: Creating a SchemaRDD from RDD of thrift classes

2014-10-30 Thread Michael Armbrust
That should be possible, although I'm not super familiar with thrift. You'll probably need access to the generated metadata http://people.apache.org/~thejas/thrift-0.9/javadoc/org/apache/thrift/meta_data/package-frame.html . Shameless plug: If you find yourself reading a lot of thrift data, you

Re: Best way to partition RDD

2014-10-30 Thread shahab
Thanks Helena, very useful comment. But is ‘spark.cassandra.input.split.size’ only effective in a cluster, not on a single node? best, /Shahab On Thu, Oct 30, 2014 at 6:26 PM, Helena Edelson helena.edel...@datastax.com wrote: Shahab, Regardless, WRT cassandra and spark when using the spark

Re: does updateStateByKey accept a state that is a tuple?

2014-10-30 Thread spr
I think I understand how to deal with this, though I don't have all the code working yet. The point is that the V of (K, V) can itself be a tuple. So the updateFunc prototype looks something like val updateDhcpState = (newValues: Seq[Tuple1[(Int, Time, Time)]], state: Option[Tuple1[(Int,
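
For illustration, a sketch with the tuple unwrapped (i.e. the state is (count, minTime, maxTime) directly rather than a Tuple1 around it); the pairs stream of (key, timestamp) and the Long timestamps are assumptions, not the poster's exact types:

    val updateFunc = (newValues: Seq[Long], state: Option[(Int, Long, Long)]) => {
      val (count, minT, maxT) = state.getOrElse((0, Long.MaxValue, Long.MinValue))
      if (newValues.isEmpty) Some((count, minT, maxT))
      else Some((count + newValues.size,
                 math.min(minT, newValues.min),
                 math.max(maxT, newValues.max)))
    }
    val stateStream = pairs.updateStateByKey[(Int, Long, Long)](updateFunc)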

how idf is calculated

2014-10-30 Thread Andrejs Abele
Hi, I'm writing a paper and I need to calculate tf-idf. With your help I managed to get the results I needed, but the problem is that I need to be able to explain how each number was obtained. So I tried to understand how idf was calculated, and the numbers I get don't correspond to those I should get.

RE: how idf is calculated

2014-10-30 Thread Ashic Mahtab
Hi Andrejs, The calculations are a bit different from what I've come across in Mining Massive Datasets (2nd Ed., Ullman et al., Cambridge Press), available here: http://www.mmds.org/ Their calculation of IDF is as follows: IDF_i = log2(N / n_i), where N is the number of documents and n_i is the number
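
If memory serves, MLlib's IDF differs from the MMDS formula in two ways -- natural log instead of log2, and add-one smoothing -- which would explain numbers that don't line up; a quick check with example values (please verify the formula against the MLlib docs):

    // MMDS:  idf(t) = log2(N / n_t)
    // MLlib: idf(t) = ln((m + 1) / (df(t) + 1))   -- believed formula, verify against the docs
    val m  = 4                                        // total documents (example)
    val df = 2                                        // documents containing the term (example)
    val mllibIdf = math.log((m + 1.0) / (df + 1.0))           // ~0.51
    val mmdsIdf  = math.log(m.toDouble / df) / math.log(2.0)  // = 1.0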

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Xiangrui Meng
Then caching should solve the problem. Otherwise, it is just loading and parsing data from disk for each iteration. -Xiangrui On Thu, Oct 30, 2014 at 11:44 AM, peng xia toxiap...@gmail.com wrote: Thanks for all your help. I think I didn't cache the data. My previous cluster was expired and I

SparkSQL + Hive Cached Table Exception

2014-10-30 Thread Jean-Pascal Billaud
Hi, While testing SparkSQL on top of our Hive metastore, I am getting some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD table. Basically, I have a table mtable partitioned by some date field in hive and below is the scala code I am running in spark-shell: val sqlContext =

Re: akka connection refused bug, fix?

2014-10-30 Thread freedafeng
Followed this http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Akka-Error-while-running-Spark-Jobs/td-p/18602 but the problem was not fixed.

SparkContext UI

2014-10-30 Thread Stuart Horsman
Hi All, When I load an RDD with: data = sc.textFile(somefile) I don't see the resulting RDD in the SparkContext GUI on localhost:4040 under /storage. Is there something special I need to do to allow me to view this? I tried both the scala and python shells, but same result. Thanks Stuart

Re: SparkContext UI

2014-10-30 Thread Sameer Farooqui
Hey Stuart, The RDD won't show up under the Storage tab in the UI until it's been cached. Basically Spark doesn't know what the RDD will look like until it's cached, b/c up until then the RDD is just on disk (external to Spark). If you launch some transformations + an action on an RDD that is

Re: SparkContext UI

2014-10-30 Thread Stuart Horsman
Sorry, too quick to pull the trigger on my original email. I should have added that I tried using persist() and cache(), but no joy. I'm doing this: data = sc.textFile(somedata) data.cache data.count() but I still can't see anything in the storage? On 31 October 2014 10:42, Sameer

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Hi Xiangrui, Can you give me some code example about caching, as I am new to Spark. Thanks, Best, Peng On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng men...@gmail.com wrote: Then caching should solve the problem. Otherwise, it is just loading and parsing data from disk for each iteration.

Scaladoc

2014-10-30 Thread Alessandro Baretta
How do I build the scaladoc html files from the spark source distribution? Alex Bareta

Re: SparkContext UI

2014-10-30 Thread Sameer Farooqui
Hi Stuart, You're close! Just add a () after the cache, like: data.cache() ...and then run the .count() action on it and you should be good to see it in the Storage UI! - Sameer On Thu, Oct 30, 2014 at 4:50 PM, Stuart Horsman stuart.hors...@gmail.com wrote: Sorry too quick to pull the

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Jimmy
sampleRDD.cache() Sent from my iPhone On Oct 30, 2014, at 5:01 PM, peng xia toxiap...@gmail.com wrote: Hi Xiangrui, Can you give me some code example about caching, as I am new to Spark. Thanks, Best, Peng On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng men...@gmail.com wrote:
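
Putting the pieces of this thread together, a hedged sketch (parsedData stands in for however the 5 million records are parsed into LabeledPoints):

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    val training: RDD[LabeledPoint] = parsedData   // placeholder for the parsed records
    training.cache()                               // cache before the iterative solver
    val model = SVMWithSGD.train(training, 100)    // numIterations = 100
    // without the cache, every iteration re-reads and re-parses the input from disk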

Confused about class paths in spark 1.1.0

2014-10-30 Thread Shay Seng
Hi, I've been trying to move up from spark 0.9.2 to 1.1.0. I'm getting a little confused with the setup for a few different use cases, grateful for any pointers... (1) spark-shell + with jars that are only required by the driver (1a) I added spark.driver.extraClassPath /mypath/to.jar to my

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Thanks Jimmy. I will have a try. Thanks very much for your guys' help. Best, Peng On Thu, Oct 30, 2014 at 8:19 PM, Jimmy ji...@sellpoints.com wrote: sampleRDD. cache() Sent from my iPhone On Oct 30, 2014, at 5:01 PM, peng xia toxiap...@gmail.com wrote: Hi Xiangrui, Can you give me some

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Matei Zaharia
Try using --jars instead of the driver-only options; they should work with spark-shell too but they may be less tested. Unfortunately, you do have to specify each JAR separately; you can maybe use a shell script to list a directory and get a big list, or set up a project that builds all of the
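
For reference, the two variants being compared (paths are placeholders):

    # ships the jars to the executors as well
    ./bin/spark-shell --jars /mypath/to.jar,/mypath/other.jar

    # driver-only classpath, set in conf/spark-defaults.conf (subject to the caveat above)
    spark.driver.extraClassPath /mypath/to.jar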

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Shay Seng
--jars does indeed work, but this causes the jars to also get shipped to the workers -- which I don't want to do for efficiency reasons. I think you are saying that setting spark.driver.extraClassPath in spark-defaults.conf ought to have the same behavior as providing --driver-class-path to

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Matei Zaharia
Yeah, I think you should file this as a bug. The problem is that JARs need to also be added into the Scala compiler and REPL class loader, and we probably don't do this for the ones in this driver config property. Matei On Oct 30, 2014, at 6:07 PM, Shay Seng s...@urbanengines.com wrote: --

Re: [scala-user] Why aggregate is inconsistent?

2014-10-30 Thread Xuefeng Wu
My other question is why Spark does not provide foldLeft: *def foldLeft[U](zeroValue: U)(op: (U, T) => U): U* but aggregate. The *def fold(zeroValue: T)(op: (T, T) => T): T* in Spark is not deterministic either. On Thu, Oct 30, 2014 at 3:50 PM, Jason Zaugg jza...@gmail.com wrote: On Thu, Oct 30,
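
A small illustration of why aggregate needs the extra combine function: per-partition results are merged in no guaranteed order, so the operations must effectively be associative and commutative for the result to be deterministic.

    val rdd = sc.parallelize(1 to 100, 4)
    val (sum, count) = rdd.aggregate((0, 0))(
      (acc, x) => (acc._1 + x, acc._2 + 1),       // seqOp: fold within one partition
      (a, b)   => (a._1 + b._1, a._2 + b._2))     // combOp: merge partition results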

Re: SparkSQL + Hive Cached Table Exception

2014-10-30 Thread Michael Armbrust
Hmmm, this looks like a bug. Can you file a JIRA? On Thu, Oct 30, 2014 at 4:04 PM, Jean-Pascal Billaud j...@tellapart.com wrote: Hi, While testing SparkSQL on top of our Hive metastore, I am getting some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD table.

Re: Use RDD like a Iterator

2014-10-30 Thread Zhan Zhang
RDD.toLocalIterator returns the partitions one by one, but with all elements in each partition, so it is not lazily computed within a partition. Given the design of Spark, it is very hard to maintain the state of an iterator across runJob calls. def toLocalIterator: Iterator[T] = { def collectPartition(p: Int): Array[T]
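
Usage-wise it looks like this (process is a placeholder); at most one partition's worth of data is resident on the driver at a time, but each partition is still fully materialized when its turn comes:

    rdd.toLocalIterator.foreach(record => process(record))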

SizeEstimator in Spark 1.1 and high load/object allocation when reading in data

2014-10-30 Thread Erik Freed
Hi All, We have recently moved to Spark 1.1 from 0.9 for an application handling a fair number of very large datasets partitioned across multiple nodes. About half of each of these large datasets is stored in off heap byte arrays and about half in the standard Java heap. While these datasets are

Spark Streaming Issue not running 24/7

2014-10-30 Thread sivarani
The problem is simple: I want to stream data 24/7, do some calculations, and save the results in a csv/json file so that I can use them for visualization with dc.js/d3.js. I opted for Spark Streaming on a YARN cluster with Kafka and tried running it 24/7. Using GroupByKey and updateStateByKey to

Re: Doing RDD.count in parallel, or at least parallelize it as much as possible?

2014-10-30 Thread Sonal Goyal
Hey Sameer, Wouldn't local[x] run count in parallel in each of the x threads? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Oct 30, 2014 at 11:42 PM, Sameer Farooqui same...@databricks.com wrote: Hi Shahab, Are you running Spark

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-30 Thread Jianshi Huang
Hi Preshant, Chester, Mohammed, I switched to Spark's Akka and now it works well. Thanks for the help! (Need to exclude Akka from Spray dependencies, or specify it as provided) Jianshi On Thu, Oct 30, 2014 at 3:17 AM, Mohammed Guller moham...@glassbeam.com wrote: I am not sure about that.
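
A sketch of that exclusion in sbt (artifact names and versions are illustrative and may need adjusting for your build):

    libraryDependencies ++= Seq(
      "io.spray" %% "spray-client" % "1.3.1"
        exclude("com.typesafe.akka", "akka-actor_2.10"),
      "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
    )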

Re: use additional ebs volumes for hdfs storage with spark-ec2

2014-10-30 Thread Daniel Mahler
Thanks Akhil. I tried changing /root/ephemeral-hdfs/conf/hdfs-site.xml to have <property> <name>dfs.data.dir</name> <value>/vol,/vol0,/vol1,/vol2,/vol3,/vol4,/vol5,/vol6,/vol7,/mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data</value> </property> and then running

Re: NonSerializable Exception in foreachRDD

2014-10-30 Thread Tobias Pfeiffer
Harold, just mentioning it in case you run into it: if you are in a separate thread, there are apparently stricter limits on what you can and cannot serialize: val someVal future { // be very careful with defining RDD operations using someVal here val myLocalVal = someVal // use myLocalVal

Re: Using a Database to persist and load data from

2014-10-30 Thread Yanbo Liang
AFAIK, you can read data from a DB with JdbcRDD, but there is no interface for writing to a DB. JdbcRDD has some restrictions, such as the SQL must contain a WHERE clause. For writing to a DB, you can use mapPartitions or foreachPartition to implement it. You can refer to this example:
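
A hedged sketch of the foreachPartition approach for the write side (the JDBC URL, credentials, table, and the (k, v) record shape are all placeholders):

    import java.sql.DriverManager

    rdd.foreachPartition { rows =>
      // one connection per partition, opened on the executor, never serialized
      val conn = DriverManager.getConnection("jdbc:vertica://host:5433/db", "user", "pass")
      val stmt = conn.prepareStatement("INSERT INTO mytable (k, v) VALUES (?, ?)")
      rows.foreach { case (k: String, v: String) =>
        stmt.setString(1, k)
        stmt.setString(2, v)
        stmt.addBatch()
      }
      stmt.executeBatch()
      conn.close()
    }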