Re: Is there a way to write spark RDD to Avro files

2014-08-01 Thread touchdown
Yes, I saw that after I looked at it more closely. Thanks! But I am running into a schema-not-set error: "Writer schema for output key was not set. Use AvroJob.setOutputKeySchema()". I am in the process of figuring out how to set the schema for an AvroJob from an HDFS file, but any pointer is much appreciated!
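For reference, a rough sketch of one way to do this, assuming the avro-mapred "new" mapreduce API and that one of the existing Avro files on HDFS can supply the writer schema (paths are placeholders):

    import org.apache.avro.file.DataFileStream
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.avro.mapreduce.AvroJob
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.mapreduce.Job

    // hypothetical path to one of the existing Avro files on HDFS
    val avroPath = new Path("hdfs:///data/input/part-00000.avro")
    val fs = FileSystem.get(new Configuration())

    // read the writer schema out of the file header
    val in = fs.open(avroPath)
    val fileReader = new DataFileStream[GenericRecord](in, new GenericDatumReader[GenericRecord]())
    val schema = fileReader.getSchema
    fileReader.close()

    // set it as the output key schema on the Hadoop Job used for the Avro output format
    val job = new Job(fs.getConf)
    AvroJob.setOutputKeySchema(job, schema)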

Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Kanwaldeep
Here is the log file: streaming.gz. There are quite a few AskTimeouts that kept happening for about 2 minutes, followed by block-not-found errors. Thanks Kanwal -- View this message in context: http://apach

Re: Computing mean and standard deviation by key

2014-08-01 Thread Ron Gonzalez
Can you share the mapValues approach you did? Thanks, Ron Sent from my iPhone > On Aug 1, 2014, at 3:00 PM, kriskalish wrote: > > Thanks for the help everyone. I got the mapValues approach working. I will > experiment with the reduceByKey approach later. > > <3 > > -Kris > > > > > -- >

Re: Is there a way to write spark RDD to Avro files

2014-08-01 Thread Ron Gonzalez
You have to import org.apache.spark.rdd._, which will automatically make this method available. Thanks, Ron Sent from my iPhone > On Aug 1, 2014, at 3:26 PM, touchdown wrote: > Hi, I am facing a similar dilemma. I am trying to aggregate a bunch of small > avro files into one avro file. I re
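For reference, a rough sketch of what the write side could look like once the pair-RDD implicits are in scope (paths are placeholders, and the schema is assumed to have been read from one of the input files as in the sketch earlier in this digest):

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.{AvroJob, AvroKeyInputFormat, AvroKeyOutputFormat}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext._   // brings in the pair-RDD functions (Spark 1.x)

    val job = new Job()
    AvroJob.setOutputKeySchema(job, schema)   // schema obtained from one of the input files

    val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
      AvroKeyInputFormat[GenericRecord]]("hdfs:///data/input/*.avro")   // hypothetical input path

    records
      .coalesce(1)   // merge the small files into a single output partition
      .saveAsNewAPIHadoopFile(
        "hdfs:///data/merged",               // hypothetical output directory
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable],
        classOf[AvroKeyOutputFormat[GenericRecord]],
        job.getConfiguration)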

Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Tathagata Das
Then could you try giving me a log? And as a workaround, set spark.streaming.unpersist = false. On Fri, Aug 1, 2014 at 4:10 PM, Kanwaldeep wrote: > Not at all. Don't have any such code. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Stre

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson
Ah, that's unfortunate, that definitely should be added. Using a pyspark-internal method, you could try something like javaIterator = rdd._jrdd.toLocalIterator() it = rdd._collect_iterator_through_file(javaIterator) On Fri, Aug 1, 2014 at 3:04 PM, Andrei wrote: > Thanks, Aaron, it should be fi

Re: How to read from OpenTSDB using PySpark (or Scala Spark)?

2014-08-01 Thread Mayur Rustagi
You can design a receiver that receives data every 5 sec (the batch size) and pulls the last 5 sec of data from the HTTP API; you can further shard the data by time within those 5 sec to distribute it. You can also bind TSDB nodes to each receiver to translate the HBase data and improve performance. Mayur Rustagi P
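A bare-bones sketch of such a polling receiver (the endpoint, parsing and error handling are placeholders; the real /api/query parameters would need to be filled in):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Hypothetical receiver: every interval it pulls the most recent data points
    // from the TSDB HTTP API and pushes them into Spark Streaming.
    class TsdbReceiver(endpoint: String, intervalMs: Long)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

      def onStart(): Unit = {
        new Thread("tsdb-poller") {
          override def run(): Unit = {
            while (!isStopped()) {
              // placeholder: fetch the last interval of data from the HTTP API
              val lines = scala.io.Source.fromURL(endpoint).getLines().toList
              lines.foreach(store)        // hand each record to Spark Streaming
              Thread.sleep(intervalMs)
            }
          }
        }.start()
      }

      def onStop(): Unit = { }            // the polling loop exits once isStopped() is true
    }

    // usage: ssc.receiverStream(new TsdbReceiver("http://tsdb-host:4242/api/query?...", 5000))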

Re: How to read from OpenTSDB using PySpark (or Scala Spark)?

2014-08-01 Thread bumble123
So is there no way to do this through SparkStreaming? Won't I have to do batch processing if I use the http api rather than getting it directly into Spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-from-OpenTSDB-using-PySpark-or-Scala-Spark

Re: How to read from OpenTSDB using PySpark (or Scala Spark)?

2014-08-01 Thread Mayur Rustagi
The HTTP API would be the best bet; I assume by graph you mean the charts created by TSDB frontends. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Aug 1, 2014 at 4:48 PM, bumble123 wrote: > I'm trying to get metrics

Re: How to read from OpenTSDB using PySpark (or Scala Spark)?

2014-08-01 Thread bumble123
I'm trying to get metrics out of TSDB so I can use Spark to do anomaly detection on graphs. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-from-OpenTSDB-using-PySpark-or-Scala-Spark-tp11211p11232.html Sent from the Apache Spark User List mailing

Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Kanwaldeep
Not at all. Don't have any such code. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186p11231.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Tathagata Das
I meant: are you using RDDs generated by DStreams in Spark jobs outside the DStream computation? Something like this: var globalRDD = null; dstream.foreachRDD(rdd => // keep a global pointer to the RDDs generated by the dstream if (runningFirstTime) globalRDD = rdd ); ssc.start() .

Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Kanwaldeep
All the operations being done use the dstream. I do read an RDD into memory, which is collected and converted into a map and used for lookups as part of DStream operations. This RDD is loaded only once and converted into a map that is then used on the streamed data. Do you mean non-streaming jobs on

Re: Accumulator and Accumulable vs classic MR

2014-08-01 Thread Mayur Rustagi
The only blocker is that an accumulator can only be "added" to from the slaves and only read on the master. If that constraint fits you well, you can fire away. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Aug 1, 2014 at 7:38 AM, Jul
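For reference, a minimal sketch illustrating that constraint (incremented inside tasks on the workers, read back only on the driver):

    val errorCount = sc.accumulator(0)

    sc.parallelize(1 to 1000).foreach { x =>
      if (x % 7 == 0) errorCount += 1   // "added" to from the slaves, inside a task
      // reading errorCount.value here, on a worker, would not be meaningful
    }

    println(errorCount.value)           // read only on the master/driver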

Re: RDD to DStream

2014-08-01 Thread Mayur Rustagi
Nice question :) Ideally you should use the queueStream interface to push RDDs into a queue, and then Spark Streaming can handle the rest. Though why are you looking to convert an RDD to a DStream? Another workaround folks use is to source the DStream from folders and move files that they need reprocessed back into
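A small sketch of the queueStream approach (the batch interval, source path and splitting strategy are arbitrary placeholders):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))

    // Spark Streaming dequeues one RDD per batch interval and treats it as that batch's data
    val queue = mutable.Queue[RDD[String]]()
    val dstream = ssc.queueStream(queue)
    dstream.count().print()

    // replay an existing RDD by splitting it into per-batch chunks (placeholder source)
    val historicalRdd = sc.textFile("hdfs:///data/history")
    historicalRdd.randomSplit(Array.fill(10)(0.1)).foreach(chunk => queue += chunk)

    ssc.start()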

Re: How to read from OpenTSDB using PySpark (or Scala Spark)?

2014-08-01 Thread Mayur Rustagi
What is the use case you are looking at? TSDB is not designed for you to query data directly from HBase; ideally you should use the REST API if you are looking to do thin analysis. Are you looking to do a whole reprocessing of TSDB? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @m

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Patrick Wendell
I've had intermittent access to the artifacts themselves, but for me the directory listing always 404's. I think if sbt hits a 404 on the directory, it sends a somewhat confusing error message that it can't download the artifact. - Patrick On Fri, Aug 1, 2014 at 3:28 PM, Shivaram Venkataraman <

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Shivaram Venkataraman
Thanks Patrick -- It does look like some maven misconfiguration as wget http://repo1.maven.org/maven2/org/scala-lang/scala-library/2.10.2/scala-library-2.10.2.pom works for me. Shivaram On Fri, Aug 1, 2014 at 3:27 PM, Patrick Wendell wrote: > This is a Scala bug - I filed something upstream

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Shivaram Venkataraman
This fails for me too. I have no idea why it happens as I can wget the pom from maven central. To work around this I just copied the ivy xmls and jars from this github repo https://github.com/peterklipfel/scala_koans/tree/master/ivyrepo/cache/org.scala-lang/scala-library and put it in /root/.ivy2/c

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Patrick Wendell
This is a Scala bug - I filed something upstream, hopefully they can fix it soon and/or we can provide a work around: https://issues.scala-lang.org/browse/SI-8772 - Patrick On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau wrote: > Currently scala 2.10.2 can't be pulled in from maven central it se

Re: Installing Spark 0.9.1 on EMR Cluster

2014-08-01 Thread Mayur Rustagi
Have you tried https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 ? There is also a 0.9.1 version they talked about in one of the meetups. Check out the S3 bucket in the guide; it should have a 0.9.1 version as well. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @

Re: creating a distributed index

2014-08-01 Thread Ankur Dave
At 2014-08-01 14:50:22 -0600, Philip Ogren wrote: > It seems that I could do this with mapPartition so that each element in a > partition gets added to an index for that partition. > [...] > Would it then be possible to take a string and query each partition's index > with it? Or better yet, take
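A toy sketch of that idea, with a plain Scala Map standing in for the Lucene index (one index is built per partition with mapPartitions, kept cached, and then probed with a broadcast query):

    // toy documents; in practice this would be the large text input
    val docs = sc.parallelize(Seq("spark builds rdds", "lucene builds indexes", "rdds are distributed"))

    // build one tiny inverted index (term -> matching documents) inside each partition
    val partitionIndexes = docs.mapPartitions { iter =>
      val index = scala.collection.mutable.Map[String, List[String]]().withDefaultValue(Nil)
      iter.foreach { doc =>
        doc.split("\\s+").foreach(term => index(term) = doc :: index(term))
      }
      Iterator(index.toMap)          // emit one index per partition
    }.cache()                        // keep the indexes around for repeated queries

    // query: broadcast the term and probe every partition's index
    val query = sc.broadcast("rdds")
    val hits = partitionIndexes.flatMap(_.getOrElse(query.value, Nil)).collect()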

Re: Is there a way to write spark RDD to Avro files

2014-08-01 Thread touchdown
Hi, I am facing a similar dilemma. I am trying to aggregate a bunch of small avro files into one avro file. I read it in with: sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](path) but I can't find saveAsHadoopFile or saveAsNewAPIHadoopFile. Can you ple

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Holden Karau
Currently scala 2.10.2 can't be pulled in from maven central it seems, however if you have it in your ivy cache it should work. On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau wrote: > Me 3 > > > On Fri, Aug 1, 2014 at 11:15 AM, nit wrote: > >> I also ran into same issue. What is the solution? >>

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Holden Karau
Me 3 On Fri, Aug 1, 2014 at 11:15 AM, nit wrote: > I also ran into same issue. What is the solution? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Compiling-Spark-master-284771ef-with-sbt-sbt-assembly-fails-on-EC2-tp11155p11189.html > Sent from

Re: creating a distributed index

2014-08-01 Thread andy petrella
Hey, There is some work that started on IndexedRDD (on master, I think). Meanwhile, checking what has been done in GraphX regarding the vertex index in partitions could be worthwhile, I guess. Hth Andy On 1 August 2014 at 22:50, "Philip Ogren" wrote: > > Suppose I want to take my large text data input and

Re: Iterator over RDD in PySpark

2014-08-01 Thread Andrei
Thanks, Aaron, it should be fine with partitions (I can repartition it anyway, right?). But rdd.toLocalIterator is a purely Java/Scala method. Is there a Python interface to it? I can get a Java iterator through rdd._jrdd, but it isn't converted to a Python iterator automatically. E.g.: >>> rdd = sc.para

Re: Computing mean and standard deviation by key

2014-08-01 Thread kriskalish
Thanks for the help everyone. I got the mapValues approach working. I will experiment with the reduceByKey approach later. <3 -Kris -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Computing-mean-and-standard-deviation-by-key-tp11192p11214.html Sent from t

Re: correct upgrade process

2014-08-01 Thread SK
Hi, So I again ran "sbt clean" followed by all of the steps listed above to rebuild the jars after cleaning. My compilation error still persists. Specifically, I am trying to extract an element from the feature vector that is part of a LabeledPoint as follows: data.features(i) This gives the fo

Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Tathagata Das
So you are not running non-streaming jobs using RDDs? That's disturbing. Can you provide me a log of the run in which you encountered this? You can also try setting spark.streaming.unpersist = false. All the blocks are going to be spilled to disk, and never unpersisted. To add to that you can set t
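For completeness, a sketch of where that setting would go, on the SparkConf before the StreamingContext is created:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("streaming-debug")                  // placeholder app name
      .set("spark.streaming.unpersist", "false")      // keep generated blocks/RDDs instead of clearing them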

How to read from OpenTSDB using PySpark (or Scala Spark)?

2014-08-01 Thread bumble123
Hi, I've seen many threads about reading from HBase into Spark, but none about how to read from OpenTSDB into Spark. Does anyone know anything about this? I tried looking into it, but I think OpenTSDB saves its information into HBase using hex and I'm not sure how to interpret the data. If you cou

Re: Computing mean and standard deviation by key

2014-08-01 Thread Evan R. Sparks
Ignoring my warning about overflow - even more functional - just use a reduceByKey. Since your main operation is just a bunch of summing, you've got a commutative-associative reduce operation, and Spark will do everything cluster-parallel, then shuffle the (small) result set and merge appro

Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Kanwaldeep
We are using Spark 1.0. I'm using DStream operations such as map, filter and reduceByKeyAndWindow, and doing a foreach operation on the DStream. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186p1

Re: Computing mean and standard deviation by key

2014-08-01 Thread Sean Owen
Here's the more functional programming-friendly take on the computation (but yeah, this is the naive formula): rdd.groupByKey.mapValues { mcs => val values = mcs.map(_.foo.toDouble) val n = values.size val sum = values.sum val sumSquares = values.map(x => x * x).sum math.sqrt(n * sumSqua

Re: Computing mean and standard deviation by key

2014-08-01 Thread kriskalish
So if I do something like this, spark handles the parallelization and recombination of sum and count on the cluster automatically? I started peeking into the source and see that foreach does submit a job to the cluster, but it looked like the inner function needed to return something to work proper

Re: Computing mean and standard deviation by key

2014-08-01 Thread Evan R. Sparks
Computing the variance is similar to this example; you just need to keep around the sum of squares as well. The formula for variance is (sumsq/n) - (sum/n)^2. But with big datasets or large values, you can quickly run into overflow issues - MLlib handles this by maintaining the average sum of
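A sketch of that naive single-pass version per key (ignoring the overflow caveat above; MyClass and its foo field are stand-ins for the classes in the original question):

    case class MyClass(foo: Double)                         // stand-in for the poster's class
    val rdd = sc.parallelize(Seq(("a", MyClass(1.0)), ("a", MyClass(3.0)), ("b", MyClass(2.0))))

    // (key, value) -> (key, (count, sum, sumOfSquares)) in one reduceByKey pass
    val stats = rdd
      .map { case (k, v) => (k, (1L, v.foo, v.foo * v.foo)) }
      .reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) => (n1 + n2, s1 + s2, q1 + q2) }
      .mapValues { case (n, sum, sumSq) =>
        val mean = sum / n
        val variance = sumSq / n - mean * mean              // E[X^2] - E[X]^2
        (mean, math.sqrt(variance))                         // (mean, standard deviation)
      }

    stats.collect().foreach(println)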

Re: persisting RDD in memory

2014-08-01 Thread Mayur Rustagi
Hi, yes, the data would be cached again in each Spark context. Regards Mayur On Friday, August 1, 2014, Sujee Maniyam wrote: > Hi all, > I have a scenario of a web application submitting multiple jobs to Spark. > These jobs may be operating on the same RDD. > > It is possible to cache() the RDD durin

creating a distributed index

2014-08-01 Thread Philip Ogren
Suppose I want to take my large text data input and create a distributed inverted index in Spark on each string in the input (imagine an in-memory lucene index - not what I'm doing but it's analogous). It seems that I could do this with mapPartition so that each element in a partition gets a

Re: Computing mean and standard deviation by key

2014-08-01 Thread Xu (Simon) Chen
I meant I'm not sure how to do variance in one shot :-) With the mean in hand, you can obviously broadcast the variable and do another map/reduce to calculate the variance per key. On Fri, Aug 1, 2014 at 4:39 PM, Xu (Simon) Chen wrote: > val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKey((x,y) => >

Re: Computing mean and standard deviation by key

2014-08-01 Thread Xu (Simon) Chen
val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKey((x,y) => (x._1+y._1, x._2+y._2)).collect This gives you a list of (key, (tot, count)), from which you can easily calculate the mean. Not sure about variance. On Fri, Aug 1, 2014 at 2:55 PM, kriskalish wrote: > I have what seems like a relati

spark sql

2014-08-01 Thread Madabhattula Rajesh Kumar
Hi Team, I'm not able to print the values from Spark Sql JavaSchemaRDD. Please find below my code JavaSQLContext sqlCtx = new JavaSQLContext(sc); NewHadoopRDD rdd = new NewHadoopRDD( JavaSparkContext.toSparkContext(sc), TableInputF

Re: Number of partitions and Number of concurrent tasks

2014-08-01 Thread Daniel Siegmann
It is definitely possible to run multiple workers on a single node and have each worker with the maximum number of cores (e.g. if you have 8 cores and 2 workers you'd have 16 cores per node). I don't know if it's possible with the out of the box scripts though. It's actually not really that diffic
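If it helps, the standalone-mode knobs for this live in conf/spark-env.sh on each node (the values below are just illustrative):

    # run two worker JVMs per node, each offering all 8 cores
    export SPARK_WORKER_INSTANCES=2
    export SPARK_WORKER_CORES=8
    # optionally cap memory per worker so both instances fit on the node
    export SPARK_WORKER_MEMORY=8g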

Re: Fwd: pyspark crash on mesos

2014-08-01 Thread Brad Miller
Hi Jia, Unfortunately, I did not ever find a solution. More recently, I tried running Spark 1.0.0 with (I think) Mesos 0.18.0; that likewise had stability issues. I've decided to make my peace with Standalone mode for now. -Brad On Thu, Jul 31, 2014 at 7:29 PM, daijia wrote: > I met the sam

Re: Hbase

2014-08-01 Thread Madabhattula Rajesh Kumar
Hi Akhil, Thank you very much for your help and support. Regards, Rajesh On Fri, Aug 1, 2014 at 7:57 PM, Akhil Das wrote: > Here's a piece of code. In your case, you are missing the call() method > inside the map function. > > > import java.util.Iterator; > > import java.util.List; > > import

Re: Computing mean and standard deviation by key

2014-08-01 Thread Sean Owen
You're certainly not iterating on the driver. The Iterable you process in your function is on the cluster and done in parallel. On Fri, Aug 1, 2014 at 8:36 PM, Kristopher Kalish wrote: > The reason I want an RDD is because I'm assuming that iterating the > individual elements of an RDD on the dri

Re: Computing mean and standard deviation by key

2014-08-01 Thread Kristopher Kalish
The reason I want an RDD is because I'm assuming that iterating the individual elements of an RDD on the driver of the cluster is much slower than coming up with the mean and standard deviation using a map-reduce-based algorithm. I don't know the intimate details of Spark's implementation, but it

Re: correct upgrade process

2014-08-01 Thread Matei Zaharia
This should be okay, but make sure that your cluster also has the right code deployed. Maybe you have the wrong one. If you built Spark from source multiple times, you may also want to try sbt clean before sbt assembly. Matei On August 1, 2014 at 12:00:07 PM, SK (skrishna...@gmail.com) wrote:

correct upgrade process

2014-08-01 Thread SK
Hi, I upgraded to 1.0.1 from 1.0 a couple of weeks ago and have been able to use some of the features advertised in 1.0.1. However, I get some compilation errors in some cases and based on user response, these errors have been addressed in the 1.0.1 version and so I should not be getting these er

Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Tathagata Das
Are you accessing the RDDs on raw data blocks and running independent Spark jobs on them (that is, outside the DStream)? In that case this may happen, as Spark Streaming will clean up the raw data based on the DStream operations (if there is a window op of 15 mins, it will keep the data around for 15 mins

Computing mean and standard deviation by key

2014-08-01 Thread kriskalish
I have what seems like a relatively straightforward task to accomplish, but I cannot seem to figure it out from the Spark documentation or searching the mailing list. I have an RDD[(String, MyClass)] that I would like to group by the key, and calculate the mean and standard deviation of the "foo"

Re: Issue using kryo serilization

2014-08-01 Thread gpatcham
any pointers to this issue. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-using-kryo-serilization-tp11129p11191.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Deploying spark applications from within Eclipse?

2014-08-01 Thread nunarob
Thanks much for your help Tobias. I am trying what you have suggested but am receiving error messages which I think are generally related to incorrect spark configuration. Will try and fix those and let you know how it goes. Cheers, -Rob On Wed, Jul 30, 2014 at 6:01 PM, Tobias Pfeiffer [via Ap

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread nit
I also ran into same issue. What is the solution? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Compiling-Spark-master-284771ef-with-sbt-sbt-assembly-fails-on-EC2-tp11155p11189.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: Example standalone app error!

2014-08-01 Thread Alex Minnaar
I think this is the problem. I was working in a project that inherited some other Akka dependencies (of a different version). I'm switching to a fresh new project which should solve the problem. Thanks, Alex From: Tathagata Das Sent: Thursday, July 31

persisting RDD in memory

2014-08-01 Thread Sujee Maniyam
Hi all, I have a scenario of a web application submitting multiple jobs to Spark. These jobs may be operating on the same RDD. It is possible to cache() the RDD during one call... And all subsequent calls can use the cached RDD? basically, during one invocation val rdd1 = sparkContext1.textFil
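A small sketch of the pattern being asked about: the cache only helps jobs that go through the same SparkContext (paths are placeholders):

    // within one long-lived SparkContext (e.g. held by the web application)
    val rdd1 = sc.textFile("hdfs:///data/events").cache()

    // first job: materializes the RDD and populates the cache
    val total = rdd1.count()

    // later jobs submitted through the same SparkContext reuse the cached blocks
    val errors = rdd1.filter(_.contains("ERROR")).count()

    // a separate SparkContext (another application) would re-read and re-cache the data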

Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Kanwaldeep
We are using Spark 1.0 for Spark Streaming on a Spark standalone cluster and are seeing the following error. Job aborted due to stage failure: Task 3475.0:15 failed 4 times, most recent failure: Exception failure in TID 216394 on host hslave33102.sjc9.service-now.com: java.lang.Exception: Could

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson
rdd.toLocalIterator will do almost what you want, but requires that each individual partition fits in memory (rather than each individual line). Hopefully that's sufficient, though. On Fri, Aug 1, 2014 at 1:38 AM, Andrei wrote: > Is there a way to get iterator from RDD? Something like rdd.colle
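For the Scala/Java side, the original use case (stream the whole RDD through the driver into a single local .gz file, one partition at a time) could look roughly like this sketch (paths are placeholders):

    import java.io.{BufferedOutputStream, FileOutputStream, PrintWriter}
    import java.util.zip.GZIPOutputStream

    val lines = sc.textFile("hdfs:///data/processed")     // placeholder input

    // toLocalIterator pulls one partition at a time to the driver,
    // so only a single partition has to fit in driver memory
    val out = new PrintWriter(new GZIPOutputStream(
      new BufferedOutputStream(new FileOutputStream("/tmp/output.gz"))))
    try {
      lines.toLocalIterator.foreach(line => out.println(line))
    } finally {
      out.close()
    }
    // the single /tmp/output.gz file can then be uploaded to S3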

Re: Extracting an element from the feature vector in LabeledPoint

2014-08-01 Thread Sean Owen
Oh I'm sorry, I somehow misread your email as looking for the label. I read too fast. That was pretty silly. This works for me though: scala> val point = LabeledPoint(1,Vectors.dense(2,3,4)) point: org.apache.spark.mllib.regression.LabeledPoint = (1.0,[2.0,3.0,4.0]) scala> point.features(1) res10

Re: access hdfs file name in map()

2014-08-01 Thread Xu (Simon) Chen
Hi Roberto, Ultimately, the info you need is set here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L69 Being a spark newbie, I extended org.apache.spark.rdd.HadoopRDD class as HadoopRDDWithEnv, which takes in an additional parameter (varnam

Re: Spark SQL, Parquet and Impala

2014-08-01 Thread Michael Armbrust
So is the only issue that impala does not see changes until you refresh the table? This sounds like a configuration that needs to be changed on the impala side. On Fri, Aug 1, 2014 at 7:20 AM, Patrick McGloin wrote: > Sorry, sent early, wasn't finished typing. > > CREATE EXTERNAL TABLE >

Re: Extracting an element from the feature vector in LabeledPoint

2014-08-01 Thread SK
I am using 1.0.1. It does not matter to me whether it is the first or second element. I would like to know how to extract the i-th element in the feature vector (not the label). data.features(i) gives the following error: method apply in trait Vector cannot be accessed in org.apache.spark.mllib.l

Re: RDD to DStream

2014-08-01 Thread Aniket Bhatnagar
Hi everyone I haven't been receiving replies to my queries in the distribution list. Not pissed but I am actually curious to know if my messages are actually going through or not. Can someone please confirm that my msgs are getting delivered via this distribution list? Thanks, Aniket On 1 Augus

Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-01 Thread Nicholas Chammas
On Fri, Aug 1, 2014 at 12:39 PM, Sean Owen wrote: Isn't this your worker running out of its memory for computations, > rather than for caching RDDs? > I’m not sure how to interpret the stack trace, but let’s say that’s true. I’m even seeing this with a simple a = sc.textFile().cache() and then a.

Spark SQL Query Plan optimization

2014-08-01 Thread N . Venkata Naga Ravi
Hi, I am trying to understand the query plan and the number of tasks / execution time for a joined query. Consider the following example, creating two tables, emp and sal, with 100 records each and a key for joining them. EmpRDDRelation.scala case class EmpRecord(key: I

Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-01 Thread Sean Owen
Isn't this your worker running out of its memory for computations, rather than for caching RDDs? So it has enough memory when you don't actually use a lot of the heap for caching, but when the cache uses its share, you actually run out of memory. If I'm right, and even I am not sure I have this str

Re: spark.shuffle.consolidateFiles seems not working

2014-08-01 Thread Jianshi Huang
I see. I'll try spark 1.1. On Fri, Aug 1, 2014 at 9:58 AM, Aaron Davidson wrote: > Make sure to set it before you start your SparkContext -- it cannot be > changed afterwards. Be warned that there are some known issues with shuffle > file consolidation, which should be fixed in 1.1. > > > On Th

RE: Data from Mysql using JdbcRDD

2014-08-01 Thread srinivas
Hi, thanks all. I have a few more questions on this: suppose I don't want to pass a WHERE clause in my SQL, is there a way I can do this? Right now I am trying to modify the JdbcRDD class by removing all the parameters for lower bound and upper bound, but I am getting runtime exceptions. Is thei
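For reference, the stock JdbcRDD constructor expects exactly two '?' placeholders in the query for the partition bounds; rather than removing those parameters, one common workaround is to bind them so the predicate is always true. A sketch (driver class, URL, table and columns are placeholders):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val jdbcRdd = new JdbcRDD(
      sc,
      () => {
        Class.forName("com.mysql.jdbc.Driver")                 // placeholder driver
        DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass")
      },
      // the two '?' are required by JdbcRDD; binding both to 1 makes the
      // predicate always true, so no real WHERE clause restricts the rows
      "SELECT id, name FROM mytable WHERE ? <= 1 AND 1 <= ?",
      1, 1,          // lowerBound, upperBound fed into the placeholders
      1,             // numPartitions (a single partition, since we are not range-partitioning)
      rs => (rs.getInt("id"), rs.getString("name")))

    jdbcRdd.collect().foreach(println)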

What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-01 Thread Nicholas Chammas
[Forking this thread.] According to the Spark Programming Guide, persisting RDDs with MEMORY_ONLY should not choke if the RDD cannot be held entirely in memory: If the RDD does not fit in memory, some partitions will not

Re: Number of partitions and Number of concurrent tasks

2014-08-01 Thread Nicholas Chammas
Darin, I think the number of cores in your cluster is a hard limit on how many concurrent tasks you can execute at one time. If you want more parallelism, I think you just need more cores in your cluster--that is, bigger nodes, or more nodes. Daniel, Have you been able to get around this limit?

Re: Number of partitions and Number of concurrent tasks

2014-08-01 Thread Daniel Siegmann
Sorry, but I haven't used Spark on EC2 and I'm not sure what the problem could be. Hopefully someone else will be able to help. The only thing I could suggest is to try setting both the worker instances and the number of cores (assuming spark-ec2 has such a parameter). On Thu, Jul 31, 2014 at 3:0

Re: how to publish spark inhouse?

2014-08-01 Thread Koert Kuipers
OK, on the positive side I managed to publish in-house snapshots with Maven. On the negative side, with the exclusions in dependencies getting lost when the pom gets read into sbt, I do not think the sbt build is currently usable. Well, at least not for building distributions; I tried that and jboss netty

Re: Readin from Amazon S3 behaves inconsistently: return different number of lines...

2014-08-01 Thread nit
@sean - I am using the latest code from the master branch, up to commit# a7d145e98c55fa66a541293930f25d9cdc25f3b4. In my case I have multiple directories with 1024 files (the sizes of the files may differ). For some directories I always get consistent results... and for others I can reproduce the inc

Accumulator and Accumulable vs classic MR

2014-08-01 Thread Julien Naour
Hi, My question is simple: could there be a performance issue using Accumulable/Accumulator instead of methods like map(), reduce()...? My use case: implementation of a clustering algorithm like k-means. At the beginning I used two steps, one to assign data to clusters and another to calculate new c

Re: streaming window not behaving as advertised (v1.0.1)

2014-08-01 Thread RodrigoB
Hi TD, I've also been fighting this issue, only to find the exact same solution you are suggesting. Too bad I didn't find either the post or the issue sooner. I'm using a 1-second batch with N Kafka events (1 to 1 with the state objects) per batch and only calling the updateStateByKey f

Tasks fail when ran in cluster but they work fine when submited using local local

2014-08-01 Thread salemi
Hi All, My application works when I use spark-submit with master=local[*]. But if I deploy the application to a standalone cluster with master=spark://master:7077, the application doesn't work and I get the following exception: 14/08/01 05:18:51 ERROR TaskSchedulerImpl: Lost executor 0 on dev

Re: Hbase

2014-08-01 Thread Akhil Das
Here's a piece of code. In your case, you are missing the call() method inside the map function. import java.util.Iterator; import java.util.List; import org.apache.commons.configuration.Configuration; import org.apache.hadoop.hbase.HBaseConfiguration; import org.apache.hadoop.hbase.KeyValue;

Re: streaming window not behaving as advertised (v1.0.1)

2014-08-01 Thread Venkat Subramanian
TD, We are seeing the same issue. We struggled through this until we found this post and the workaround. A quick fix in the Spark Streaming software would help a lot for others who are encountering this and pulling their hair out over why RDDs in some partitions are not computed (we ended up spending

Re: Spark SQL, Parquet and Impala

2014-08-01 Thread Patrick McGloin
Sorry, sent early, wasn't finished typing. CREATE EXTERNAL TABLE Then we can select the data using Impala. But this is registered as an external table and must be refreshed if new data is inserted. Obviously this doesn't seem good and doesn't seem like the correct solution. How should we

Spark SQL, Parquet and Impala

2014-08-01 Thread Patrick McGloin
Hi, We would like to use Spark SQL to store data in Parquet format and then query that data using Impala. We've tried to come up with a solution and it is working but it doesn't seem good. So I was wondering if you guys could tell us what is the correct way to do this. We are using Spark 1.0 an

Re: Issue with Spark on EC2 using spark-ec2 script

2014-08-01 Thread Dean Wampler
It looked like you were running in standalone mode (master set to local[4]). That's how I ran it. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler ht

Re: access hdfs file name in map()

2014-08-01 Thread Roberto Torella
Hi Simon, I'm trying to do the same but I'm quite lost. How did you do that? (Too direct? :) Thanks and ciao, r- -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/access-hdfs-file-name-in-map-tp6551p11160.html Sent from the Apache Spark User List mailing l

Spark 0.9.2 sbt build issue

2014-08-01 Thread Arun Kumar
Hi, While trying to build Spark 0.9.2 using sbt, the build is failing because most of the libraries do not resolve; sbt cannot fetch the libraries from the specified location. Please tell me what changes are required to build Spark using sbt. Regards Arun

Re: Hbase

2014-08-01 Thread Madabhattula Rajesh Kumar
Hi Akhil, Thank you for your response. I'm facing the below issue: I'm not able to print the values. Am I missing anything? Could you please look into this issue? JavaPairRDD hBaseRDD = sc.newAPIHadoopRDD( conf, TableInputFormat.class, ImmutableBytesWritable.class, Resu

Re: Readin from Amazon S3 behaves inconsistently: return different number of lines...

2014-08-01 Thread Sean Owen
See https://issues.apache.org/jira/browse/SPARK-2579 It was also mentioned on the mailing list a while ago, and I have heard tell of this from customers. I am trying to get to the bottom of it too. What version are you using, to start? I am wondering if it was fixed in 1.0.x, since I was not able to

Re: configuration needed to run twitter(25GB) dataset

2014-08-01 Thread Ankur Dave
At 2014-08-01 02:12:08 -0700, shijiaxin wrote: > When I use fewer partitions, (like 6) > It seems that all the task will be assigned to the same machine, because the > machine has more than 6 cores.But this will run out of memory. > How to set fewer partitions number and use all the machine at the

Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Ankur Dave
Attempting to build Spark from source on EC2 using sbt gives the error "sbt.ResolveException: unresolved dependency: org.scala-lang#scala-library;2.10.2: not found". This only seems to happen on EC2, not on my local machine. To reproduce, launch a cluster using spark-ec2, clone the Spark reposi

Re: Accessing spark context from executors?

2014-08-01 Thread Sean Owen
You should use sc.hadoopConfiguration to get the Hadoop configuration. Making a new one just gets you default values, which may work for your purposes but is probably not ideal. This configuration object should be something you can send in the closure. On Fri, Aug 1, 2014 at 2:16 AM, Sung Hwan Ch

Re: Extracting an element from the feature vector in LabeledPoint

2014-08-01 Thread Sean Owen
If you look at the class LabeledPoint, you'll see it has a field called "label": data.label data.features(1) would access the second element of features, which is not the same thing. On Fri, Aug 1, 2014 at 3:01 AM, SK wrote: > > Hi, > > I want to extract the individual elements of a feature vec

Reading from HDFS no faster than reading from S3 - how to tell if data locality respected?

2014-08-01 Thread Martin Goodson
Hi all, I'm consistently finding that reading from HDFS is not appreciably faster than reading from S3 using pyspark. How can I tell whether data locality is being respected? In this example, reading from HDFS is only about 10% faster than reading the same file from S3. The files were pulled from

Re: configuration needed to run twitter(25GB) dataset

2014-08-01 Thread shijiaxin
When I use fewer partitions (like 6), it seems that all the tasks will be assigned to the same machine, because the machine has more than 6 cores. But this will run out of memory. How can I set a smaller number of partitions and still use all the machines at the same time? -- View this message in context: http://

Environment Variables question

2014-08-01 Thread redocpot
Hi, According to the configuration guide (http://spark.apache.org/docs/latest/configuration.html#environment-variables), "Certain Spark settings can be configured through environment variables, which are read from the conf/spark-env.sh script in the directory where Spark is installed (or conf/spa

Spark-sql with Tachyon cache

2014-08-01 Thread Dariusz Kobylarz
Hi, I would like to ask whether caching spark-sql tables in Tachyon is a feature to be migrated from Shark. I imagine from the user perspective it would look like this: CREATE TABLE data TBLPROPERTIES("sparksql.cache" = "tachyon") AS SELECT a, b, c from data_on_disk WHERE month="May";

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-08-01 Thread chenjie
I used the Spark web UI and could see that the conf directory is in the CLASSPATH. An abnormal thing is that when I start spark-shell I always get the following info: WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable At first, I th

Iterator over RDD in PySpark

2014-08-01 Thread Andrei
Is there a way to get an iterator from an RDD? Something like rdd.collect(), but returning a lazy sequence and not a single array. Context: I need to GZip processed data to upload it to Amazon S3. Since the archive should be a single file, I want to iterate over the RDD, writing each line to a local .gz file. File

RDD to DStream

2014-08-01 Thread Aniket Bhatnagar
Sometimes it is useful to convert an RDD into a DStream for testing purposes (generating DStreams from historical data, etc). Is there an easy way to do this? I could come up with the following inefficient way but am not sure if there is a better way to achieve it. Thoughts? class RDDExtension[T](rd

Re: [GraphX] The best way to construct a graph

2014-08-01 Thread Ankur Dave
At 2014-08-01 11:23:49 +0800, Bin wrote: > I am wondering what is the best way to construct a graph? > > Say I have some attributes for each user, and specific weight for each user > pair. The way I am currently doing is first read user information and edge > triple into two arrays, then use sc.
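For reference, a minimal sketch of building a graph directly from two RDDs rather than arrays (the attribute types and values are placeholders):

    import org.apache.spark.graphx.{Edge, Graph}

    // (userId, attributes) pairs, e.g. parsed from the user file
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))

    // weighted edges, e.g. parsed from the user-pair file
    val relationships = sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 0.9)))

    // Graph() stitches the two RDDs together; vertices referenced by edges but
    // missing from `users` receive the supplied default attribute
    val graph: Graph[String, Double] = Graph(users, relationships, "unknown")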

Re: configuration needed to run twitter(25GB) dataset

2014-08-01 Thread Ankur Dave
At 2014-07-31 21:40:39 -0700, shijiaxin wrote: > Is it possible to reduce the number of edge partitions and exploit > parallelism fully at the same time? > For example, one partition per node, and the threads in the same node share > the same partition. It's theoretically possible to parallelize