Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Prasad
Hi, yes, I did: SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly. Further, when I use the spark-shell, I can read the same file and it works fine. Thanks, Prasad. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0

RDDToTable

2014-02-28 Thread subacini Arunkumar
Hi, I am able to create an RDD from a table using the code below in Shark: val rdd = sc.sql2rdd("SELECT * FROM TABLEXYZ"). Could you please tell me how to create a table from an RDD using Shark 0.8.1 (RDDToTable)? Thanks in advance, Subacini
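Not an authoritative answer, but the Shark 0.8.1 release notes describe an experimental saveAsTable for RDDs of tuples. A minimal sketch, assuming that API is available in the shark shell (table and column names are illustrative):

    // build an RDD of tuples in the shark shell
    val rdd = sc.parallelize(Seq((1, "one"), (2, "two"), (3, "three")))
    // experimental in Shark 0.8.1: materialize the RDD as a table,
    // naming one column per tuple field
    rdd.saveAsTable("table_from_rdd", Seq("id", "name"))
    // the new table should then be queryable like any other
    val back = sc.sql2rdd("SELECT * FROM table_from_rdd")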

Create a new object in pyspark map function

2014-02-28 Thread Kaden(Xiaozhe) Wang
Hi all, I am trying to create a new object in the map function, but pyspark reports a lot of errors. Is it legal to do so? Here is my code: class Node(object): def __init__(self, A, B, C): self.A = A self.B = B self.C = C def make_vertex(pair): A, (B, C) = pair retur

Re: error in streaming word count API?

2014-02-28 Thread Aaron Kimball
As a post-script, when running the example in precompiled form: ./bin/run-example org.apache.spark.streaming.examples.NetworkWordCount local[2] localhost ... I don't need to send a ^D to the netcat stream. It does print the batches to stdout in the manner I'd expect. So is this more repl wei

error in streaming word count API?

2014-02-28 Thread Aaron Kimball
Hi folks, I was trying to work through the streaming word count example at http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html and couldn't get the code as-written to run. In fairness, I was trying to do this inside the REPL rather than compiling a separate project; would
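For context, the example under discussion boils down to something like the following sketch of the guide's streaming word count for Spark 0.9, compiled as a standalone program rather than typed into the REPL:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._ // pair-DStream ops like reduceByKey

    object NetworkWordCount {
      def main(args: Array[String]) {
        // local[2]: one thread for the socket receiver, one for processing
        val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

Run nc -lk 9999 in another terminal and type words to see per-batch counts.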

Re: java.net.SocketException on reduceByKey() in pyspark

2014-02-28 Thread Nicholas Chammas
Even a count() on the result of the flatMap() fails with the same error. Somehow the formatting of the error output got messed up in my previous email, so here's a relevant snippet of the output again. 14/03/01 04:39:01 INFO scheduler.DAGScheduler: Failed to run count at :1 Traceback (most recent cal

Using jeromq instead of akka wrapped zeromq for spark streaming

2014-02-28 Thread Aureliano Buendia
Hi, It seems like a natural choice for spark streaming to go for the akka zeromq wrapper, when spark is already based on akka. However, akka-zeromq is not the fastest choice for working with zeromq: akka does not support zeromq 3, which has been out for a long time, and some people reported akka-zerom

How to provide a custom Comparator to sortByKey?

2014-02-28 Thread Tao Xiao
I am using Spark 0.9. I have an array of tuples, and I want to sort these tuples using the *sortByKey* API as follows in the Spark shell: val A: Array[(String, String)] = Array(("1", "One"), ("9", "Nine"), ("3", "three"), ("5", "five"), ("4", "four")) val P = sc.parallelize(A) // MyComparator is an exa
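One hedged workaround for Spark 0.9, where sortByKey relies on the key type's own ordering: map the keys into a type whose natural ordering is the one you want, sort, then map back. A sketch that sorts these string keys numerically:

    val A: Array[(String, String)] =
      Array(("1", "One"), ("9", "Nine"), ("3", "three"), ("5", "five"), ("4", "four"))
    val P = sc.parallelize(A)

    val sorted = P
      .map { case (k, v) => (k.toInt, (k, v)) } // Int already orders the way we want
      .sortByKey()
      .map { case (_, kv) => kv }               // restore the original tuples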

Connection Refused When Running SparkPi Locally

2014-02-28 Thread Benny Thompson
I'm trying to run a simple execution of the SparkPi example. I started the master and one worker, then executed the job on my local "cluster", but ended up getting a sequence of errors all ending with "Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection re

Re: Trying to connect to spark from within a web server

2014-02-28 Thread Nathan Kronenfeld
I do notice that scala 2.9.2 is being included because of net.liftweb. Also, I don't know if I just missed it before or it wasn't doing this before and my latest changes get it a little farther, but I'm now seeing the following in the spark logs: 14/02/28 20:13:29 INFO actor.ActorSystemImpl: Remo

Lazyoutput format in spark

2014-02-28 Thread Mohit Singh
Hi, Is there an equivalent of LazyOutputFormat in spark (pyspark)? http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/LazyOutputFormat.html Basically, something where I only save files which have some data in them rather than saving all the files as

Re: JVM error

2014-02-28 Thread Bryn Keller
Hi Mohit, Yes, in pyspark you only get one chance to initialize a spark context. If it goes wrong, you have to restart the process. Thanks, Bryn On Fri, Feb 28, 2014 at 4:55 PM, Mohit Singh wrote: > And I tried that but got the error: > > Traceback (most recent call last): > File "", line 1

Re: JVM error

2014-02-28 Thread Mohit Singh
And I tried that but got the error: Traceback (most recent call last): File "", line 1, in File "/home/hadoop/spark/python/pyspark/context.py", line 83, in __init__ SparkContext._ensure_initialized(self) File "/home/hadoop/spark/python/pyspark/context.py", line 165, in _ensure_initialize

java.net.SocketException on reduceByKey() in pyspark

2014-02-28 Thread nicholas.chammas
I've done a whole bunch of things to this RDD, and now when I try to sortByKey(), this is what I get: >>> flattened_po.flatMap(lambda x: map_to_database_types(x)).sortByKey() 14/02/28 23:18:41 INFO spark.SparkContext: Starting job: sortByKey at :1 14/02/28 23:18:41 INFO scheduler.DAGScheduler: Got j

Re: Spark streaming on ec2

2014-02-28 Thread Nicholas Chammas
Yeah, the Spark on EMR bootstrap scripts referenced here need some polishing. I had a lot of trouble just getting through that tutorial. And yes, the version of Spark they're using is 0.8.1. On Fri, Feb 28, 2014 at 2:39 PM, Aureliano Buendia wrote:

Re: Kryo Registration, class is not registered, but Log.TRACE() says otherwise

2014-02-28 Thread pondwater
Has no one ever registered generic classes in scala? Is it possible? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-Registration-class-is-not-registered-but-Log-TRACE-says-otherwise-tp2077p2182.html Sent from the Apache Spark User List mailing list arc
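On the generic-class question, a sketch for Spark 0.9's KryoRegistrator, with Container as a hypothetical generic class: because of JVM type erasure there is only one runtime class per generic type, so a single registration covers every instantiation:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    case class Container[T](value: T) // hypothetical generic class to register

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // Container[Int] and Container[String] erase to the same runtime class
        kryo.register(classOf[Container[_]])
      }
    }

    // wire it up before creating the SparkContext (Spark 0.9 style)
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    System.setProperty("spark.kryo.registrator", "MyRegistrator")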

Re: Messy GraphX merge/reduce functions

2014-02-28 Thread Dan Davies
Are these incremental reduction functions what you'd expect when a graph is partitioned using vertex cuts? You'd naturally want to consolidate the versions of a vertex's state inside partitions, then across partitions. -- View this message in context: http://apache-spark-user-list.1001560.n3.

Spark stream example SimpleZeroMQPublisher high cpu usage

2014-02-28 Thread Aureliano Buendia
Hi, Running: ./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher tcp://127.0.1.1:1234 foo causes over 100% cpu usage on os x. Given that it's just a simple zmq publisher, this is unexpected. Is there something wrong with that example?

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Egor Pahomov
Protobuf java code generated by protoc 2.4 does not compile with protobuf library 2.5 - that's true. What I meant: you serialized a message with a class generated by protobuf 2.4.1. Now you can read that message with a class generated by protobuf 2.5.0 from the same .proto. 2014-03-01 0:00 GMT+04:00 Eg
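In code terms, the claim is about the wire format rather than the generated sources. A sketch, where Event is a hypothetical message type compiled twice from the same .proto:

    // bytes produced by the protoc-2.4.1-generated Event class...
    val bytes: Array[Byte] = eventBuiltWith241.toByteArray
    // ...parse fine with the protoc-2.5.0-regenerated Event class,
    // because the wire format did not change between the two releases
    val event = Event.parseFrom(bytes)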

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Egor Pahomov
In that same pom, the yarn profile sets <hadoop.major.version>2</hadoop.major.version>, <hadoop.version>2.2.0</hadoop.version> and <protobuf.version>2.5.0</protobuf.version>. 2014-02-28 23:46 GMT+04:00 Aureliano Buendia : > > > > On Fri, Feb 28, 2014 at 7:17 PM, Egor Pahomov wrote: > >> Spark 0.9 uses protobuf 2.5.0 >> > > Spark 0.9 uses 2.4.1: >

Re: JVM error

2014-02-28 Thread Bryn Keller
Sorry, typo - that last line should be: sc = pyspark.Spark*Context*(conf = conf) On Fri, Feb 28, 2014 at 9:37 AM, Mohit Singh wrote: > Hi Bryn, > Thanks for the suggestion. > I tried that.. > conf = pyspark.SparkConf().set("spark.executor.memory","20G") > But.. got an error here: > > sc = py

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Aureliano Buendia
On Fri, Feb 28, 2014 at 7:17 PM, Egor Pahomov wrote: > Spark 0.9 uses protobuf 2.5.0 > Spark 0.9 uses 2.4.1: https://github.com/apache/incubator-spark/blob/4d880304867b55a4f2138617b30600b7fa013b14/pom.xml#L118 Is there another pom for when hadoop 2.2 is used? I don't see another branch for hado

Re: Spark streaming on ec2

2014-02-28 Thread Aureliano Buendia
Unfortunately, that script is not under active maintenance. Given that spark is getting accelerated release cycles, solutions like this get outdated quickly. On Fri, Feb 28, 2014 at 7:36 PM, Mayur Rustagi wrote: > Thr is a talk to install spark on Amazon ( not sure if its updated for > 0.9.0). >

Re: Spark streaming on ec2

2014-02-28 Thread Mayur Rustagi
There is a talk on installing spark on Amazon (not sure if it's updated for 0.9.0): http://www.youtube.com/watch?v=G0lSWUqyOhw In this case the bootstrap script will run on the new slave when it comes up. I am not sure how clean & production quality this is. He seems to be leveraging spot instances wher

Re: GraphX with UUID vertex IDs instead of Long

2014-02-28 Thread Deepak Nulu
I created an Improvement Issue for this: https://spark-project.atlassian.net/browse/SPARK-1153 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-with-UUID-vertex-IDs-instead-of-Long-tp1953p2173.html Sent from the Apache Spark User List mailing list arch

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Egor Pahomov
Spark 0.9 uses protobuf 2.5.0. Hadoop 2.2 uses protobuf 2.5.0. protobuf 2.5.0 can read messages serialized with protobuf 2.4.1. So there is no reason why you can't read messages from hadoop 2.2 with protobuf 2.5.0; probably you somehow have 2.4.1 in your class path. Of course it's very bad,

Re: Spark streaming on ec2

2014-02-28 Thread Aureliano Buendia
Also, in this talk http://www.youtube.com/watch?v=OhpjgaBVUtU on using spark streaming in production, the author seems to have missed the topic of how to manage cloud instances. On Fri, Feb 28, 2014 at 6:48 PM, Aureliano Buendia wrote: > What's the updated way of deploying spark streaming apps o

Re: Spark streaming on ec2

2014-02-28 Thread Aureliano Buendia
What's the updated way of deploying spark streaming apps on EMR? Using YARN? There are some out-of-date solutions like https://github.com/ianoc/SparkEMRBootstrap which set up mesos on EMR. I wonder if this can be simplified by spark 0.9. Spark-ec2 comes with a considerable amount of configuration,

Re: Use pyspark for following.

2014-02-28 Thread Andrew Ash
Roughly how many rows share the most common primary id? If that's small, you could group by primary id and assemble the resulting row from the group. Is it possible to have two rows with the same primary and secondary id? Like this: 1,alpha,20 1,alpha,25 If not, you could map these to expande

Use pyspark for following.

2014-02-28 Thread Chengi Liu
My use case: prim_id,secondary_id,value. There are a million primary ids, but only 5 secondary ids, and any secondary id is optional. For example, the secondary ids are, say, [alpha,beta,gamma,delta,kappa]: 1,alpha,20 1,beta,22 1,gamma,25 2,alpha,1 2,delta,15 3,kappa,90 What I want is to get the following outpu

Re: Spark streaming on ec2

2014-02-28 Thread Mayur Rustagi
I think what you are looking for is sort of a managed service ala EMR or Qubole. Spark-ec2 is just software to boot up machines & integrate them together using Whirr. I agree a managed service for Streaming would be really useful. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoi

Re: Build Spark Against CDH5

2014-02-28 Thread Brian Brunner
After successfully building the official 0.9.0 release I attempted to build off of the github code again and was successfully able to do so. Not really sure what happened, but it works now. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Build-Spark-Against-

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Aureliano Buendia
Doesn't hadoop 2.2 also depend on protobuf 2.4? On Fri, Feb 28, 2014 at 5:45 PM, Ognen Duzlevski < og...@plainvanillagames.com> wrote: > A stupid question, by the way, you did compile Spark with Hadoop 2.2.0 > support? > > Ognen > > On 2/28/14, 10:51 AM, Prasad wrote: > >> Hi >> I am getting the

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Ognen Duzlevski
A stupid question, by the way: you did compile Spark with Hadoop 2.2.0 support? Ognen On 2/28/14, 10:51 AM, Prasad wrote: Hi I am getting the protobuf error while reading an HDFS file using spark 0.9.0 -- I am running on hadoop 2.2.0. When I look through, I find that I have both 2.4.1 and 2.5 a

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Ognen Duzlevski
I run a 2.2.0 based HDFS cluster and I use Spark-0.9.0 without any problems to read the files. Ognen On 2/28/14, 10:51 AM, Prasad wrote: Hi I am getting the protobuf error while reading an HDFS file using spark 0.9.0 -- I am running on hadoop 2.2.0. When I look through, I find that I have both

Key Sort order on reduction

2014-02-28 Thread Usman Ghani
Hi All, In Spark associative operations like groupByKey and reduceByKey, is it guaranteed that the keys to each reducer will flow in sorted order like they do in Hadoop MR? Or do I have to call sortByKey first?
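For what it's worth, a sketch of the explicit approach, since Spark's reduceByKey only groups and does not sort keys the way Hadoop MR does:

    val counts = sc.parallelize(Seq(("b", 1), ("a", 2), ("b", 3)))
      .reduceByKey(_ + _) // grouping only; keys arrive in no particular order
      .sortByKey()        // sort explicitly when order matters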

Re: JVM error

2014-02-28 Thread Mohit Singh
Hi Bryn, Thanks for the suggestion. I tried that.. conf = pyspark.SparkConf().set("spark.executor.memory","20G") But.. got an error here: sc = pyspark.SparkConf(conf = conf) Traceback (most recent call last): File "", line 1, in TypeError: __init__() got an unexpected keyword argument 'conf'

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Aureliano Buendia
Using protobuf 2.5 can lead to some major issues with spark, see http://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3ccab89jjuy0sqkkokcidetglrzrj2zlat3phbvpjoxxcy9soq...@mail.gmail.com%3E Moving the protobuf 2.5 jar after the spark jar on the classpath can help with your error, but then you'll face the

Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Prasad
Hi I am getting the protobuf error while reading an HDFS file using spark 0.9.0 -- I am running on hadoop 2.2.0. When I look through, I find that I have both 2.4.1 and 2.5, and some blogs suggest that there are incompatibility issues between 2.4.1 and 2.5 hduser@prasadHdp1:~/spark-0.9.0-incubati
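For readers hitting the same mixed classpath: the usual cure is to keep exactly one protobuf-java on the classpath. A hedged build.sbt sketch (artifact coordinates assumed) that pins 2.5.0 and excludes the transitive copy:

    libraryDependencies ++= Seq(
      ("org.apache.spark" %% "spark-core" % "0.9.0-incubating")
        .exclude("com.google.protobuf", "protobuf-java"), // drop the transitive 2.4.1
      "org.apache.hadoop" % "hadoop-client" % "2.2.0",
      "com.google.protobuf" % "protobuf-java" % "2.5.0"   // the version hadoop 2.2 expects
    )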

Re: Spark streaming on ec2

2014-02-28 Thread Aureliano Buendia
Another subject that was not that important in spark, but could be crucial for 24/7 spark streaming, is reconstruction of lost nodes. By that, I do not mean lost data reconstruction by self healing, but bringing up new ec2 instances once they die for whatever reason. Is this also supported in s

Re: Having Spark read a JSON file

2014-02-28 Thread Paul Brown
Hi, Nick -- Not that it adds legitimacy, but there is even a MIME type for line-delimited JSON: application/x-ldjson (not to be confused with application/ld+json...). What I said about ser/de in inline blocks only applied to the Scala dialect of Spark when using Jackson; for example: val om: Objec
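The pattern being described looks roughly like the following sketch, assuming jackson-databind and jackson-module-scala are on the classpath (Record is a hypothetical schema). Building the ObjectMapper inside mapPartitions keeps it from being serialized from the driver:

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule

    case class Record(id: Long, name: String) // hypothetical record shape

    val parsed = sc.textFile("events.ldjson").mapPartitions { lines =>
      // one mapper per partition, created on the executor
      val om = new ObjectMapper()
      om.registerModule(DefaultScalaModule)
      lines.map(line => om.readValue(line, classOf[Record]))
    }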

RE: is RDD failure transparent to stream consumer

2014-02-28 Thread Adrian Mocanu
Thanks so much Matei! From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: February-28-14 10:59 AM To: user@spark.apache.org Subject: Re: is RDD failure transparent to stream consumer For output operators like this, the operator will run multiple times, so it need to be idempotent. However

Re: is RDD failure transparent to stream consumer

2014-02-28 Thread Matei Zaharia
For output operators like this, the operator will run multiple times, so it needs to be idempotent. However, the built-in save operators (e.g. saveAsTextFile) are automatically idempotent (they only create each output partition once). Matei On Feb 28, 2014, at 10:10 AM, Adrian Mocanu wrote: >
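A hedged illustration of the distinction, reusing the stream from the question below (the output path and batch layout are made up):

    // not idempotent: a retried task prints its tuples a second time
    myStream.foreachRDD(rdd => rdd.foreach(tuple => println(tuple)))

    // idempotent: each batch goes to a path derived from its batch time,
    // and the save operator creates each partition file exactly once
    myStream.foreachRDD((rdd, time) =>
      rdd.saveAsTextFile(s"hdfs:///out/words-${time.milliseconds}"))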

RE: is RDD failure transparent to stream consumer

2014-02-28 Thread Adrian Mocanu
Would really like an answer to this. A `yes` or `no` would suffice. I'm talking about RDD failure in this context: myStream.foreachRDD(rdd => rdd.foreach(tuple => println(tuple))) From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: February-27-14 12:19 PM To: u...@spark.incubator.apache.org S

Re: Rename filter() into keep(), remove() or take() ?

2014-02-28 Thread Bertrand Dechoux
Clojure made the same kind of choice too: 'filter()' and 'remove()'. So the behavior of filter is obvious when you know about the other one... Well, the function name makes sense if you are thinking in a 'logic paradigm'. Anyway, that's something I had to write about. I understand that the ROI i
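For anyone skimming: in Scala (and hence Spark) filter keeps the elements for which the predicate holds; plain Scala collections pair it with filterNot:

    List(1, 2, 3, 4).filter(_ % 2 == 0)    // List(2, 4) -- keeps matches
    List(1, 2, 3, 4).filterNot(_ % 2 == 0) // List(1, 3) -- removes matches
    // RDD.filter follows the same keep-the-matches convention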

Re: Implementing a custom Spark shell

2014-02-28 Thread Prashant Sharma
You can enable debug logging for the repl; thankfully it uses spark's logging framework. The trouble must be with the wrappers. Prashant Sharma On Fri, Feb 28, 2014 at 12:29 PM, Sampo Niskanen wrote: > Hi, > > Thanks for the pointers. I did get my code working within the normal > spark-shell. However, sin