Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Prasad
Hi,
Yes, I did:
SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly
Further, when I use the spark-shell, I can read the same file and it works
fine.
Thanks
Prasad.






RDDToTable

2014-02-28 Thread subacini Arunkumar
Hi,

I am able to create an RDD from a table using the code below in Shark:
val rdd = sc.sql2rdd("SELECT * FROM TABLEXYZ")

Could you please tell me how to create a table from an RDD using Shark 0.8.1's
RDDToTable?


Thanks in advance,
Subacini
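
For reference, Shark 0.8.1 ships an RDDTable helper for turning an RDD of tuples
into a Shark table. A rough, hedged sketch follows; the import path and the
saveAsTable signature are recalled from memory and should be treated as
assumptions to verify against the Shark 0.8.1 source:

import shark.api._

// pairs is an RDD of tuples; column names are passed separately.
// saveAsTable(tableName, fields) is an assumed signature -- check your Shark build.
val pairs = sc.parallelize(Seq((1, "one"), (2, "two")))
RDDTable(pairs).saveAsTable("table_from_rdd", Seq("id", "name"))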


Create a new object in pyspark map function

2014-02-28 Thread Kaden(Xiaozhe) Wang
Hi all,
I am trying to create a new object in the map function, but pyspark reports a
lot of error information. Is it legal to do so? Here is my code:

from pyspark import SparkContext

class Node(object):
    def __init__(self, A, B, C):
        self.A = A
        self.B = B
        self.C = C

def make_vertex(pair):
    A, (B, C) = pair
    return Node(A, B, C)

dictionary = {'PYTHONPATH': '/home/grad02/xss/opt/old'}
sc = SparkContext("spark://zzz:7077", "test job", environment = dictionary)
rdd = sc.parallelize([[1, (2, 3)]])
noMap = make_vertex([1, (2, 3)])
print(noMap.A)
myRdd = rdd.map(make_vertex)
result = myRdd.collect()

Could anybody tell me whether creating a new object in a map function in
pyspark is legal?


Thanks,

Kaden


Re: error in streaming word count API?

2014-02-28 Thread Aaron Kimball
As a post-script, when running the example in precompiled form:

/bin/run-example org.apache.spark.streaming.examples.NetworkWordCount
local[2] localhost 


... I don't need to send a ^D to the netcat stream. It does print the
batches to stdout in the manner I'd expect. So is this more repl weirdness
than spark weirdness?

- Aaron
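
One hedged explanation for the difference: the precompiled run above uses
local[2], while a plain spark-shell session may have only a single local
thread. In that case the socket receiver occupies the only slot, the batch
jobs queue up behind it, and their output appears all at once when the
receiver stops (e.g. when the netcat stream is closed with ^D). That is an
assumption about what is happening here, not a confirmed diagnosis. A minimal
sketch of forcing two local threads for the shell session:

// Launch the shell with at least two local threads, e.g.
//   MASTER=local[2] SPARK_JAVA_OPTS=-Dspark.cleaner.ttl=30 bin/spark-shell
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("127.0.0.1", 1234)
val wordCounts = lines.flatMap(_.split(" ")).countByValue()
wordCounts.print()   // should emit a batch every 2 seconds while netcat is open
ssc.start()
ssc.awaitTermination()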


On Fri, Feb 28, 2014 at 8:46 PM, Aaron Kimball  wrote:

> Hi folks,
>
> I was trying to work through the streaming word count example at
> http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html and
>  couldn't get the code as-written to run. In fairness, I was trying to
> do this inside the REPL rather than compiling a separate project; would the
> types be different?
>
> In any case, here's the code I ran:
>
> $ SPARK_JAVA_OPTS=-Dspark.cleaner.ttl=30 bin/spark-shell
>
> scala> import org.apache.spark.streaming._
> scala> val ssc = new StreamingContext(sc, Seconds(2))
> scala> val lines = ssc.socketTextStream("127.0.0.1", 1234)
> scala> val words = lines.flatMap(_.split(" "))
>
> // *** The following code from the html page doesn't work
> // because pairs has type DStream[(String, Int)] and
> // there is no reduceByKey method on this type.
>
> // Count each word in each batch
> scala> val pairs = words.map(word => (word, 1))
> scala> val wordCounts = pairs.reduceByKey(_ + _)  // <-- error here. no
> reduceByKey()
>
> // Print a few of the counts to the console
> scala> wordCount.print()   // ... and even if the above did work,
> 'wordCount' and 'wordCounts' are different symbols ;) This couldn't compile
> as written.
>
>
> Instead, I got the following to run instead:
> scala> val wordCounts = words.countByValue()
> scala> wordCounts.print()
> scala> ssc.start() // Start the computation
> scala> ssc.awaitTermination()
>
> This worked if I ran 'nc -lk 1234' in another terminal and typed some
> words into it.. but the 'wordCounts.print()' statement would only emit
> things to stdout if I sent a ^D into the netcat stream. It seems to print
> the output for all 2-second windows all-at-once after the ^D in the network
> stream. Is this an expected effect? I don't understand the semantics of
> ssc.start / awaitTermination well enough to know how it interacts with the
> print statement on wordCounts (which I think is a DStream of RDDs?)
>
> I set spark.cleaner.ttl to a relatively high value (I'm not sure what
> units those are.. seconds or millis) because a lower value caused stderr to
> spam everywhere and make my terminal unreadable. Is that part of my issue?
> the spark repl said I had to set it, so I just picked a number.
>
> I kind of expected wordCounts.print() to be constantly emitting (word, N)
> pairs to my spark terminal as I typed into the netcat side of things.
>
> I'm using Spark built from github source that I pulled from source earlier
> today.
>
> I am using the following as my 'origin':
>   Fetch URL: git://github.com/apache/incubator-spark.git
>
> ... and the most recent commit (master a.k.a. HEAD) is:
> commit 4d880304867b55a4f2138617b30600b7fa013b14
> Author: Bryn Keller 
> Date:   Mon Feb 24 17:35:22 2014 -0800
>
>
> In any case, I'm happy to help update the docs (or the code) if this is a
> bug. I realize this is getting long-winded. But in any case, I think my
> questions really boil down to:
>
> 1) should there be a reduceByKey() method on DStream? The documentation at
> http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html says
> so in the "Transformations" section, but the scaladoc at
> https://spark.incubator.apache.org/docs/latest/api/streaming/index.html#org.apache.spark.streaming.dstream.DStream doesn't
> list it.  DStream.scala also doesn't have a definition for such a
> method...
>
> (and based on reading the source of NetworkWordCount.scala, I can't
> spot-identify why this *does* work there (i.e., reduceByKey compiles) but
> it doesn't do so in the terminal)
>
> 2) Why do I have to wait for the stream to "terminate" with a ^D before
> seeing any stdout in the repl from the wordCounts.print() statement?
>  Doesn't this defeat the point of "streaming"?
> 2a) how does the print() statement interact with ssc.start() and
> ssc.awaitTermination() ?
>
> 3) is the cleaner TTL something that, as a user, I should be adjusting to
> change my observed effects? i.e., would adjusting this change the frequency
> of emissions to stdout of prior window data?  Or is this just a background
> property that happens to affect the spamminess of my stderr that is routed
> to the same console?
>
> 4) Should I update the documentation to match my example (i.e., no
> reduceByKey, but use words.countByValue() instead)?
>
> 5) Now that Spark is a TLP, are my references to the incubator-spark.git
> and the http://spark.incubator.apache.org docs woefully out of date,
> making this entire exercise a goof? :)
>
> Thanks for the help!
>
> Cheers,
> - Aaron
>
>


error in streaming word count API?

2014-02-28 Thread Aaron Kimball
Hi folks,

I was trying to work through the streaming word count example at
http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html and
couldn't get the code as-written to run. In fairness, I was trying to
do this inside the REPL rather than compiling a separate project; would the
types be different?

In any case, here's the code I ran:

$ SPARK_JAVA_OPTS=-Dspark.cleaner.ttl=30 bin/spark-shell

scala> import org.apache.spark.streaming._
scala> val ssc = new StreamingContext(sc, Seconds(2))
scala> val lines = ssc.socketTextStream("127.0.0.1", 1234)
scala> val words = lines.flatMap(_.split(" "))

// *** The following code from the html page doesn't work
// because pairs has type DStream[(String, Int)] and
// there is no reduceByKey method on this type.

// Count each word in each batch
scala> val pairs = words.map(word => (word, 1))
scala> val wordCounts = pairs.reduceByKey(_ + _)  // <-- error here. no
reduceByKey()

// Print a few of the counts to the console
scala> wordCount.print()   // ... and even if the above did work,
'wordCount' and 'wordCounts' are different symbols ;) This couldn't compile
as written.


Instead, I got the following to run instead:
scala> val wordCounts = words.countByValue()
scala> wordCounts.print()
scala> ssc.start() // Start the computation
scala> ssc.awaitTermination()

This worked if I ran 'nc -lk 1234' in another terminal and typed some words
into it.. but the 'wordCounts.print()' statement would only emit things to
stdout if I sent a ^D into the netcat stream. It seems to print the output
for all 2-second windows all-at-once after the ^D in the network stream. Is
this an expected effect? I don't understand the semantics of ssc.start /
awaitTermination well enough to know how it interacts with the print
statement on wordCounts (which I think is a DStream of RDDs?)

I set spark.cleaner.ttl to a relatively high value (I'm not sure what units
those are.. seconds or millis) because a lower value caused stderr to spam
everywhere and make my terminal unreadable. Is that part of my issue? the
spark repl said I had to set it, so I just picked a number.

I kind of expected wordCounts.print() to be constantly emitting (word, N)
pairs to my spark terminal as I typed into the netcat side of things.

I'm using Spark built from github source that I pulled from source earlier
today.

I am using the following as my 'origin':
  Fetch URL: git://github.com/apache/incubator-spark.git

... and the most recent commit (master a.k.a. HEAD) is:
commit 4d880304867b55a4f2138617b30600b7fa013b14
Author: Bryn Keller 
Date:   Mon Feb 24 17:35:22 2014 -0800


In any case, I'm happy to help update the docs (or the code) if this is a
bug. I realize this is getting long-winded. But in any case, I think my
questions really boil down to:

1) should there be a reduceByKey() method on DStream? The documentation at
http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html says
so in the "Transformations" section, but the scaladoc at
https://spark.incubator.apache.org/docs/latest/api/streaming/index.html#org.apache.spark.streaming.dstream.DStream doesn't
list it.  DStream.scala also doesn't have a definition for such a
method...

(and based on reading the source of NetworkWordCount.scala, I can't
spot-identify why this *does* work there (i.e., reduceByKey compiles) but
it doesn't do so in the terminal)

2) Why do I have to wait for the stream to "terminate" with a ^D before
seeing any stdout in the repl from the wordCounts.print() statement?
 Doesn't this defeat the point of "streaming"?
2a) how does the print() statement interact with ssc.start() and
ssc.awaitTermination() ?

3) is the cleaner TTL something that, as a user, I should be adjusting to
change my observed effects? i.e., would adjusting this change the frequency
of emissions to stdout of prior window data?  Or is this just a background
property that happens to affect the spamminess of my stderr that is routed
to the same console?

4) Should I update the documentation to match my example (i.e., no
reduceByKey, but use words.countByValue() instead)?

5) Now that Spark is a TLP, are my references to the incubator-spark.git
and the http://spark.incubator.apache.org docs woefully out of date, making
this entire exercise a goof? :)

Thanks for the help!

Cheers,
- Aaron
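
On question (1): as far as I can tell, reduceByKey and the other pair
operations are added to DStream[(K, V)] by an implicit conversion defined on
the StreamingContext companion object. NetworkWordCount.scala imports that
object explicitly, which is why it compiles there; importing
org.apache.spark.streaming._ alone does not bring the conversion into scope in
the repl. A minimal sketch, assuming Spark 0.9:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pulls in the pair-DStream implicits

val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("127.0.0.1", 1234)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)   // resolves once the implicit conversion is in scope
wordCounts.print()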


Re: java.net.SocketException on reduceByKey() in pyspark

2014-02-28 Thread Nicholas Chammas
Even a count() on the result of the flatMap() fails with the same error.
Somehow the formatting on the error output got messed up in my previous email,
so here's a relevant snippet of the output again.

14/03/01 04:39:01 INFO scheduler.DAGScheduler: Failed to run count at
:1
Traceback (most recent call last):
  File "", line 1, in 
  File "/root/spark/python/pyspark/rdd.py", line 542, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/root/spark/python/pyspark/rdd.py", line 533, in sum
return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/root/spark/python/pyspark/rdd.py", line 499, in reduce
vals = self.mapPartitions(func).collect()
  File "/root/spark/python/pyspark/rdd.py", line 463, in collect
bytesInJava = self._jrdd.collect().iterator()
  File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
line 537, in __call__
  File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o396.collect.
: org.apache.spark.SparkException: Job aborted: Task 29.0:0 failed 4 times
(most recent failure: Exception failure: java.net.SocketException:
Connection reset)
 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
 at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
 at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Any pointers to where I should look, or things to try?

Nick



On Fri, Feb 28, 2014 at 6:33 PM, nicholas.chammas <
nicholas.cham...@gmail.com> wrote:

> I've done a whole bunch of things to this RDD, and now when I try to
> sortByKey(), this is what I get:
>
> >>> flattened_po.flatMap(lambda x: 
> >>> map_to_database_types(x)).sortByKey()14/02/28
> 23:18:41 INFO spark.SparkContext: Starting job: sortByKey at :114/02/28
> 23:18:41 INFO scheduler.DAGScheduler: Got job 22 (sortByKey at :1)
> with 1 output partitions (allowLocal=false)14/02/28 23:18:41 INFO
> scheduler.DAGScheduler: Final stage: Stage 23 (sortByKey at :1)14/02/28
> 23:18:41 INFO scheduler.DAGScheduler: Parents of final stage: List()14/02/28
> 23:18:41 INFO scheduler.DAGScheduler: Missing parents: List()14/02/28
> 23:18:41 INFO scheduler.DAGScheduler: Submitting Stage 23 (PythonRDD[41] at
> sortByKey at :1), which has no missing parents14/02/28 23:18:41
> INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 23
> (PythonRDD[41] at sortByKey at :1)14/02/28 23:18:41 INFO
> scheduler.TaskSchedulerImpl: Adding task set 23.0 with 1 tasks14/02/28
> 23:18:41 INFO scheduler.TaskSetManager: Starting task 23.0:0 as TID 32 on
> executor 0: ip-.ec2.internal (PROCESS_LOCAL)14/02/28 23:18:41 INFO
> scheduler.TaskSetManager: Serialized task 23.0:0 as 4985 bytes in 1 ms14/02/28
> 23:18:41 WARN scheduler.TaskSetManager: Lost TID 32 (task 23.0:0)14/02/28
> 23:18:41 WARN scheduler.TaskSetManager: Loss was due to
> java.net.SocketExceptionjava.net.SocketException: Connection reset at
> java.net.SocketInputStream.read(SocketInputStream.java:196) at
> java.net.SocketInputStream.read(SocketInputStream.java:122) at
> java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at
> java.io.BufferedInputStream.read(BufferedInputStream.java:254) at
> java.io.DataInputStream.readInt(DataInputStream.java:387) at
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110) at
> org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:153) at
> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) at
> o

Using jeromq instead of akka wrapped zeromq for spark streaming

2014-02-28 Thread Aureliano Buendia
Hi,

It seems like a natural choice for spark streaming to go with the akka-wrapped
zeromq, when spark is already based on akka. However, akka-zeromq is not
the fastest choice for working with zeromq: akka does not support zeromq 3,
which has been out for a long time, and some people have reported akka-zeromq
being slow.

Why not use https://github.com/zeromq/jeromq, which is the official Java
support for zeromq?


How to provide a custom Comparator to sortByKey?

2014-02-28 Thread Tao Xiao
I am using Spark 0.9
I have an array of tuples, and I want to sort these tuples using the *sortByKey*
API as follows in the Spark shell:

val A:Array[(String, String)] = Array(("1", "One"), ("9", "Nine"), ("3",
"three"), ("5", "five"), ("4", "four"))
val P = sc.parallelize(A)

// MyComparator is an example, maybe I have more complex implementation
class MyComparator extends java.util.Comparator[String] {
def compare(s1:String, s2:String):Int = {
s1.compareTo(s2)
}
}

val comp = new MyComparator()
P.sortByKey(comp, true)


When I invoked P.sortByKey(comp, true), the Spark shell complained that there
was a type mismatch; to be specific, *sortByKey* expects a *Boolean* but I
provided a *Comparator*.

How should I provide my custom comparator to *sortByKey*?
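
A hedged sketch of one way around this in Spark 0.9: as far as I can tell,
sortByKey takes (ascending, numPartitions) and relies on the key type having
an ordering (via the OrderedRDDFunctions view bound) rather than accepting a
java.util.Comparator. Wrapping the key in a small class that extends Ordered
lets you plug in an arbitrary comparison:

// Example custom order (reverse lexicographic here) -- adapt compare() to the real logic.
case class MyKey(s: String) extends Ordered[MyKey] {
  def compare(that: MyKey): Int = that.s.compareTo(this.s)
}

val A: Array[(String, String)] =
  Array(("1", "One"), ("9", "Nine"), ("3", "three"), ("5", "five"), ("4", "four"))
val P = sc.parallelize(A)

// Wrap the keys, sort, then unwrap.
val sorted = P.map { case (k, v) => (MyKey(k), v) }
  .sortByKey(true)
  .map { case (MyKey(k), v) => (k, v) }

sorted.collect()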


Connection Refused When Running SparkPi Locally

2014-02-28 Thread Benny Thompson
I'm trying to run a simple execution of the SparkPi example.  I started the
master and one worker, then executed the job on my local "cluster", but ended
up getting a sequence of errors all ending with:

"Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: /127.0.0.1:39398"

I originally tried running my master and worker without configuration but
ended up with the same error. I tried changing to 127.0.0.1 to test whether it
was maybe just a firewall issue, since the server is locked down from the
outside world.

My conf/spark-conf.sh contains the following:
export SPARK_MASTER_IP=127.0.0.1

Here are the commands I run, in order:
1) "sbin/start-master.sh" (to start the master)
2) "bin/spark-class org.apache.spark.deploy.worker.Worker spark://
127.0.0.1:7077 --ip 127.0.0.1 --port " (in a different session on the
same machine to start the slave)
3) "bin/run-example org.apache.spark.examples.SparkPi spark://127.0.0.1:7077"
(in a different session on the same machine to start the job)

I find it hard to believe that I'm locked down enough that running locally
would cause problems.

Any help is greatly appreciated!

Thanks,
Benny


Re: Trying to connect to spark from within a web server

2014-02-28 Thread Nathan Kronenfeld
I do notice that scala 2.9.2 is being included because of net.liftweb.

Also, I don't know if I just missed it before or it wasn't doing this
before and my latest changes get it a little farther, but I'm now seeing
the following in the spark logs:

14/02/28 20:13:29 INFO actor.ActorSystemImpl: RemoteClientStarted
@akka://spark@hadoop-s1.oculus.local:35212
14/02/28 20:13:29 ERROR NettyRemoteTransport(null): dropping message
RegisterApplication(ApplicationDescription(Web Service Spark Instance)) for
non-local recipient akka://sparkMaster@192.168.0.46:7077/user/Master at
akka://sparkMaster@hadoop-s1.oculus.local:7077 local is
akka://sparkMaster@hadoop-s1.oculus.local:7077
14/02/28 20:13:49 ERROR NettyRemoteTransport(null): dropping message
RegisterApplication(ApplicationDescription(Web Service Spark Instance)) for
non-local recipient akka://sparkMaster@192.168.0.46:7077/user/Master at
akka://sparkMaster@hadoop-s1.oculus.local:7077 local is
akka://sparkMaster@hadoop-s1.oculus.local:7077
14/02/28 20:14:09 ERROR NettyRemoteTransport(null): dropping message
RegisterApplication(ApplicationDescription(Web Service Spark Instance)) for
non-local recipient akka://sparkMaster@192.168.0.46:7077/user/Master at
akka://sparkMaster@hadoop-s1.oculus.local:7077 local is
akka://sparkMaster@hadoop-s1.oculus.local:7077
14/02/28 20:14:32 INFO actor.ActorSystemImpl: RemoteClientShutdown
@akka://spark@hadoop-s1.oculus.local:35212
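
One thing the "dropping message ... for non-local recipient
akka://sparkMaster@192.168.0.46:7077 ... local is
akka://sparkMaster@hadoop-s1.oculus.local:7077" lines suggest (an assumption,
not a confirmed fix) is that the web app is addressing the master by IP while
the master registered itself under its hostname, and Akka refuses the
mismatch. A minimal sketch of constructing the context with the URL exactly as
the master web UI reports it (the paths are illustrative):

import org.apache.spark.SparkContext

// The master URL must match what the master binds to, character for character.
val master = "spark://hadoop-s1.oculus.local:7077"   // as shown on the master web UI
val sc = new SparkContext(master, "Web Service Spark Instance",
  "/path/to/spark",                      // spark home on the cluster (illustrative)
  Seq("/path/to/web-service-job.jar"))   // jar(s) shipped to the workers (illustrative)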



On Sat, Feb 22, 2014 at 1:58 PM, Soumya Simanta wrote:

> Mostly likely all your classes/jars that are required to connect to Spark
> and not being loaded or the incorrect versions are being loaded when you
> start to do this from inside the web container (Tomcat).
>
>
>
>
>
> On Sat, Feb 22, 2014 at 1:51 PM, Nathan Kronenfeld <
> nkronenf...@oculusinfo.com> wrote:
>
>> yes, but only when I try to connect from a web service running in Tomcat.
>>
>> When I try to connect using a stand-alone program, using the same
>> parameters, it works fine.
>>
>>
>> On Sat, Feb 22, 2014 at 12:15 PM, Mayur Rustagi 
>> wrote:
>>
>>> So Spark is running on that IP, web ui is loading on that IP showing
>>> workers & when you connect to that IP with javaAPI the cluster appears to
>>> be down to it?
>>>
>>> Mayur Rustagi
>>> Ph: +919632149971
>>> http://www.sigmoidanalytics.com
>>> https://twitter.com/mayur_rustagi
>>>
>>>
>>>
>>> On Fri, Feb 21, 2014 at 10:22 PM, Nathan Kronenfeld <
>>> nkronenf...@oculusinfo.com> wrote:
>>>
 Netstat gives exactly the expected IP address (not a 127, but a
 192...).
 I tried it anyway, though... exactly the same results, but with a
 number instead of a name.
 Oh, and I forgot to mention last time, in case it makes a difference -
 I'm running 0.8.1, not 0.9.0, at least for now



 On Sat, Feb 22, 2014 at 12:50 AM, Mayur Rustagi <
 mayur.rust...@gmail.com> wrote:

> most likely the master is binding to a unique address and you are
> connecting to some other internal address. Master can bind to random
> internal address 127.0... or even your machine IP at that time.
> Easiest is to check
> netstat -an |grep 7077
> This will give you which IP to bind to exactly when launching spark
> context.
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
>
> On Fri, Feb 21, 2014 at 9:36 PM, Nathan Kronenfeld <
> nkronenf...@oculusinfo.com> wrote:
>
>> Can anyone help me here?
>>
>> I've got a small spark cluster running on three machines - hadoop-s1,
>> hadoop-s2, and hadoop-s3 - with s1 acting master, and all three acting as
>> workers.  It works fine - I can connect with spark-shell, I can run 
>> jobs, I
>> can see the web ui.
>>
>> The web UI says:
>> Spark Master at spark://hadoop-s1.oculus.local:7077
>> URL: spark://hadoop-s1.oculus.local:7077
>>
>> I've connected to it fine using both a scala and a java SparkContext.
>>
>> But when I try connecting from within a Tomcat service, I get the
>> following messages:
>> [INFO] 22 Feb 2014 00:27:38 - org.apache.spark.Logging$class -
>> Connecting to master spark://hadoop-s1.oculus.local:7077...
>> [INFO] 22 Feb 2014 00:27:58 - org.apache.spark.Logging$class -
>> Connecting to master spark://hadoop-s1.oculus.local:7077...
>> [ERROR] 22 Feb 2014 00:28:18 - org.apache.spark.Logging$class - All
>> masters are unresponsive! Giving up.
>> [ERROR] 22 Feb 2014 00:28:18 - org.apache.spark.Logging$class - Spark
>> cluster looks dead, giving up.
>> [ERROR] 22 Feb 2014 00:28:18 - org.apache.spark.Logging$class -
>> Exiting due to error from cluster scheduler: Spark cluster looks down
>>
>> When I look on the spark server logs, there isn't even a sign of an
>> attempted connection.
>>
>> I'm tryi

Lazyoutput format in spark

2014-02-28 Thread Mohit Singh
Hi,
  Is there an equivalent of LazyOutputFormat in Spark
(pyspark)?
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/LazyOutputFormat.html
Basically, something where I only save files that have some data in them,
rather than saving all the files, since in some cases the majority of the
output files can be empty.
Thanks

-- 
Mohit

"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates


Re: JVM error

2014-02-28 Thread Bryn Keller
Hi Mohit,

Yes, in pyspark you only get one chance to initialize a spark context. If
it goes wrong, you have to restart the process.

Thanks,
Bryn


On Fri, Feb 28, 2014 at 4:55 PM, Mohit Singh  wrote:

> And I tried that but got the error:
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/hadoop/spark/python/pyspark/context.py", line 83, in __init__
> SparkContext._ensure_initialized(self)
>   File "/home/hadoop/spark/python/pyspark/context.py", line 165, in
> _ensure_initialized
> raise ValueError("Cannot run multiple SparkContexts at once")
> ValueError: Cannot run multiple SparkContexts at once
>
>
> On Fri, Feb 28, 2014 at 11:59 AM, Bryn Keller  wrote:
>
>> Sorry, typo - that last line should be:
>>
>> sc = pyspark.Spark*Context*(conf = conf)
>>
>>
>> On Fri, Feb 28, 2014 at 9:37 AM, Mohit Singh  wrote:
>>
>>> Hi Bryn,
>>>   Thanks for the suggestion.
>>> I tried that..
>>> conf = pyspark.SparkConf().set("spark.executor.memory","20G")
>>> But.. got an error here:
>>>
>>> sc = pyspark.SparkConf(conf = conf)
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>> TypeError: __init__() got an unexpected keyword argument 'conf'
>>>
>>> ??
>>> This is in pyspark shell.
>>>
>>>
>>> On Thu, Feb 27, 2014 at 5:00 AM, Evgeniy Shishkin 
>>> wrote:
>>>

 On 27 Feb 2014, at 07:22, Aaron Davidson  wrote:

 > Setting spark.executor.memory is indeed the correct way to do this.
 If you want to configure this in spark-env.sh, you can use
 > export SPARK_JAVA_OPTS=" -Dspark.executor.memory=20g"
 > (make sure to append the variable if you've been using
 SPARK_JAVA_OPTS previously)
 >
 >
 > On Wed, Feb 26, 2014 at 7:50 PM, Bryn Keller 
 wrote:
 > Hi Mohit,
 >
 > You can still set SPARK_MEM in spark-env.sh, but that is deprecated.
 This is from SparkContext.scala:
 >
 > if (!conf.contains("spark.executor.memory") &&
 sys.env.contains("SPARK_MEM")) {
 > logWarning("Using SPARK_MEM to set amount of memory to use per
 executor process is " +
 >   "deprecated, instead use spark.executor.memory")
 >   }
 >
 > Thanks,
 > Bryn
 >
 >
 > On Wed, Feb 26, 2014 at 6:28 PM, Mohit Singh 
 wrote:
 > Hi Bryn,
 >   Thanks for responding. Is there a way I can permanently configure
 this setting?
 > like SPARK_EXECUTOR_MEMORY or somethign like that?
 >
 >
 >
 > On Wed, Feb 26, 2014 at 2:56 PM, Bryn Keller 
 wrote:
 > Hi Mohit,
 >
 > Try increasing the executor memory instead of the worker memory - the
 most appropriate place to do this is actually when you're creating your
 SparkContext, something like:
 >
 > conf = pyspark.SparkConf()
 >.setMaster("spark://master:7077")
 >.setAppName("Example")
 >.setSparkHome("/your/path/to/spark")
 >.set("spark.executor.memory", "20G")
 >.set("spark.logConf", "true")
 > sc = pyspark.SparkConf(conf = conf)
 >
 > Hope that helps,
 > Bryn
 >
 >
 >
 > On Wed, Feb 26, 2014 at 2:39 PM, Mohit Singh 
 wrote:
 > Hi,
 >   I am experimenting with pyspark lately...
 > Every now and then, I see this error bieng streamed to pyspark shell
 .. and most of the times.. the computation/operation completes.. and
 sometimes, it just gets stuck...
 > My setup is 8 node cluster.. with loads of ram(256GB's) and space(
 TB's) per node.
 > This enviornment is shared by general hadoop and hadoopy stuff..with
 recent spark addition...
 >
 > java.lang.OutOfMemoryError: Java heap space
 > at
 com.ning.compress.BufferRecycler.allocEncodingBuffer(BufferRecycler.java:59)
 > at com.ning.compress.lzf.ChunkEncoder.(ChunkEncoder.java:93)
 > at
 com.ning.compress.lzf.impl.UnsafeChunkEncoder.(UnsafeChunkEncoder.java:40)
 > at
 com.ning.compress.lzf.impl.UnsafeChunkEncoderLE.(UnsafeChunkEncoderLE.java:13)
 > at
 com.ning.compress.lzf.impl.UnsafeChunkEncoders.createEncoder(UnsafeChunkEncoders.java:31)
 > at
 com.ning.compress.lzf.util.ChunkEncoderFactory.optimalInstance(ChunkEncoderFactory.java:44)
 > at
 com.ning.compress.lzf.LZFOutputStream.(LZFOutputStream.java:61)
 > at
 org.apache.spark.io.LZFCompressionCodec.compressedOutputStream(CompressionCodec.scala:60)
 > at
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:803)
 > at
 org.apache.spark.storage.BlockManager$$anonfun$5.apply(BlockManager.scala:471)
 > at
 org.apache.spark.storage.BlockManager$$anonfun$5.apply(BlockManager.scala:471)
 > at
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
 > at
 org.apache.spark.storage.DiskB

Re: JVM error

2014-02-28 Thread Mohit Singh
And I tried that but got the error:
Traceback (most recent call last):
  File "", line 1, in 
  File "/home/hadoop/spark/python/pyspark/context.py", line 83, in __init__
SparkContext._ensure_initialized(self)
  File "/home/hadoop/spark/python/pyspark/context.py", line 165, in
_ensure_initialized
raise ValueError("Cannot run multiple SparkContexts at once")
ValueError: Cannot run multiple SparkContexts at once


On Fri, Feb 28, 2014 at 11:59 AM, Bryn Keller  wrote:

> Sorry, typo - that last line should be:
>
> sc = pyspark.Spark*Context*(conf = conf)
>
>
> On Fri, Feb 28, 2014 at 9:37 AM, Mohit Singh  wrote:
>
>> Hi Bryn,
>>   Thanks for the suggestion.
>> I tried that..
>> conf = pyspark.SparkConf().set("spark.executor.memory","20G")
>> But.. got an error here:
>>
>> sc = pyspark.SparkConf(conf = conf)
>> Traceback (most recent call last):
>>   File "", line 1, in 
>> TypeError: __init__() got an unexpected keyword argument 'conf'
>>
>> ??
>> This is in pyspark shell.
>>
>>
>> On Thu, Feb 27, 2014 at 5:00 AM, Evgeniy Shishkin 
>> wrote:
>>
>>>
>>> On 27 Feb 2014, at 07:22, Aaron Davidson  wrote:
>>>
>>> > Setting spark.executor.memory is indeed the correct way to do this. If
>>> you want to configure this in spark-env.sh, you can use
>>> > export SPARK_JAVA_OPTS=" -Dspark.executor.memory=20g"
>>> > (make sure to append the variable if you've been using SPARK_JAVA_OPTS
>>> previously)
>>> >
>>> >
>>> > On Wed, Feb 26, 2014 at 7:50 PM, Bryn Keller 
>>> wrote:
>>> > Hi Mohit,
>>> >
>>> > You can still set SPARK_MEM in spark-env.sh, but that is deprecated.
>>> This is from SparkContext.scala:
>>> >
>>> > if (!conf.contains("spark.executor.memory") &&
>>> sys.env.contains("SPARK_MEM")) {
>>> > logWarning("Using SPARK_MEM to set amount of memory to use per
>>> executor process is " +
>>> >   "deprecated, instead use spark.executor.memory")
>>> >   }
>>> >
>>> > Thanks,
>>> > Bryn
>>> >
>>> >
>>> > On Wed, Feb 26, 2014 at 6:28 PM, Mohit Singh 
>>> wrote:
>>> > Hi Bryn,
>>> >   Thanks for responding. Is there a way I can permanently configure
>>> this setting?
>>> > like SPARK_EXECUTOR_MEMORY or somethign like that?
>>> >
>>> >
>>> >
>>> > On Wed, Feb 26, 2014 at 2:56 PM, Bryn Keller 
>>> wrote:
>>> > Hi Mohit,
>>> >
>>> > Try increasing the executor memory instead of the worker memory - the
>>> most appropriate place to do this is actually when you're creating your
>>> SparkContext, something like:
>>> >
>>> > conf = pyspark.SparkConf()
>>> >.setMaster("spark://master:7077")
>>> >.setAppName("Example")
>>> >.setSparkHome("/your/path/to/spark")
>>> >.set("spark.executor.memory", "20G")
>>> >.set("spark.logConf", "true")
>>> > sc = pyspark.SparkConf(conf = conf)
>>> >
>>> > Hope that helps,
>>> > Bryn
>>> >
>>> >
>>> >
>>> > On Wed, Feb 26, 2014 at 2:39 PM, Mohit Singh 
>>> wrote:
>>> > Hi,
>>> >   I am experimenting with pyspark lately...
>>> > Every now and then, I see this error bieng streamed to pyspark shell
>>> .. and most of the times.. the computation/operation completes.. and
>>> sometimes, it just gets stuck...
>>> > My setup is 8 node cluster.. with loads of ram(256GB's) and space(
>>> TB's) per node.
>>> > This enviornment is shared by general hadoop and hadoopy stuff..with
>>> recent spark addition...
>>> >
>>> > java.lang.OutOfMemoryError: Java heap space
>>> > at
>>> com.ning.compress.BufferRecycler.allocEncodingBuffer(BufferRecycler.java:59)
>>> > at com.ning.compress.lzf.ChunkEncoder.(ChunkEncoder.java:93)
>>> > at
>>> com.ning.compress.lzf.impl.UnsafeChunkEncoder.(UnsafeChunkEncoder.java:40)
>>> > at
>>> com.ning.compress.lzf.impl.UnsafeChunkEncoderLE.(UnsafeChunkEncoderLE.java:13)
>>> > at
>>> com.ning.compress.lzf.impl.UnsafeChunkEncoders.createEncoder(UnsafeChunkEncoders.java:31)
>>> > at
>>> com.ning.compress.lzf.util.ChunkEncoderFactory.optimalInstance(ChunkEncoderFactory.java:44)
>>> > at
>>> com.ning.compress.lzf.LZFOutputStream.(LZFOutputStream.java:61)
>>> > at
>>> org.apache.spark.io.LZFCompressionCodec.compressedOutputStream(CompressionCodec.scala:60)
>>> > at
>>> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:803)
>>> > at
>>> org.apache.spark.storage.BlockManager$$anonfun$5.apply(BlockManager.scala:471)
>>> > at
>>> org.apache.spark.storage.BlockManager$$anonfun$5.apply(BlockManager.scala:471)
>>> > at
>>> org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
>>> > at
>>> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174)
>>> > at
>>> org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:164)
>>> > at
>>> org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
>>> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> 

java.net.SocketException on reduceByKey() in pyspark

2014-02-28 Thread nicholas.chammas
I've done a whole bunch of things to this RDD, and now when I try to
sortByKey(), this is what I get:

>>> flattened_po.flatMap(lambda x: map_to_database_types(x)).sortByKey()14/02/28
23:18:41 INFO spark.SparkContext: Starting job: sortByKey at :114/02/28
23:18:41 INFO scheduler.DAGScheduler: Got job 22 (sortByKey at :1)
with 1 output partitions (allowLocal=false)14/02/28 23:18:41 INFO
scheduler.DAGScheduler: Final stage: Stage 23 (sortByKey at :1)14/02/28
23:18:41 INFO scheduler.DAGScheduler: Parents of final stage: List()14/02/28
23:18:41 INFO scheduler.DAGScheduler: Missing parents: List()14/02/28
23:18:41 INFO scheduler.DAGScheduler: Submitting Stage 23 (PythonRDD[41] at
sortByKey at :1), which has no missing parents14/02/28 23:18:41 INFO
scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 23
(PythonRDD[41] at sortByKey at :1)14/02/28 23:18:41 INFO
scheduler.TaskSchedulerImpl: Adding task set 23.0 with 1 tasks14/02/28
23:18:41 INFO scheduler.TaskSetManager: Starting task 23.0:0 as TID 32 on
executor 0: ip-.ec2.internal (PROCESS_LOCAL)14/02/28 23:18:41 INFO
scheduler.TaskSetManager: Serialized task 23.0:0 as 4985 bytes in 1 ms14/02/28
23:18:41 WARN scheduler.TaskSetManager: Lost TID 32 (task 23.0:0)14/02/28
23:18:41 WARN scheduler.TaskSetManager: Loss was due to
java.net.SocketExceptionjava.net.SocketException: Connection reset at
java.net.SocketInputStream.read(SocketInputStream.java:196) at
java.net.SocketInputStream.read(SocketInputStream.java:122) at
java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at
java.io.BufferedInputStream.read(BufferedInputStream.java:254) at
java.io.DataInputStream.readInt(DataInputStream.java:387) at
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110) at
org.apache.spark.api.python.PythonRDD$$anon$1.(PythonRDD.scala:153) at
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) at
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at
org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109) at
org.apache.spark.scheduler.Task.run(Task.scala:53) at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at
java.lang.Thread.run(Thread.java:744)14/02/28 23:18:41 INFO
scheduler.TaskSetManager: Starting task 23.0:0 as TID 33 on executor 0:
ip-.ec2.internal (PROCESS_LOCAL)14/02/28 23:18:41 INFO
scheduler.TaskSetManager: Serialized task 23.0:0 as 4985 bytes in 1 ms14/02/28
23:18:41 WARN scheduler.TaskSetManager: Lost TID 33 (task 23.0:0)14/02/28
23:18:41 INFO scheduler.TaskSetManager: Loss was due to
java.net.SocketException: Connection reset [duplicate 1]14/02/28 23:18:41
INFO scheduler.TaskSetManager: Starting task 23.0:0 as TID 34 on executor
0: ip-.ec2.internal (PROCESS_LOCAL)14/02/28 23:18:41 INFO
scheduler.TaskSetManager: Serialized task 23.0:0 as 4985 bytes in 1 ms14/02/28
23:18:41 WARN scheduler.TaskSetManager: Lost TID 34 (task 23.0:0)14/02/28
23:18:41 INFO scheduler.TaskSetManager: Loss was due to
java.net.SocketException: Connection reset [duplicate 2]14/02/28 23:18:41
INFO scheduler.TaskSetManager: Starting task 23.0:0 as TID 35 on executor
0: ip-.ec2.internal (PROCESS_LOCAL)14/02/28 23:18:41 INFO
scheduler.TaskSetManager: Serialized task 23.0:0 as 4985 bytes in 1 ms14/02/28
23:18:41 WARN scheduler.TaskSetManager: Lost TID 35 (task 23.0:0)14/02/28
23:18:41 INFO scheduler.TaskSetManager: Loss was due to
java.net.SocketException: Connection reset [duplicate 3]14/02/28 23:18:41
ERROR scheduler.TaskSetManager: Task 23.0:0 failed 4 times; aborting
job14/02/28
23:18:41 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 23.0 from
pool 14/02/28
23:18:41 INFO scheduler.DAGScheduler: Failed to run sortByKey at
:1Traceback
(most recent call last):  File "", line 1, in   File
"/root/spark/python/pyspark/rdd.py", line 361, in sortByKeyrddSize =
self.count()  File "/root/spark/python/pyspark/rdd.py", line 542, in count
  return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()  File
"/root/spark/python/pyspark/rdd.py", line 533, in sumreturn
self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)  File
"/root/spark/python/pyspark/rdd.py", line 499, in reducevals =
self.mapPartitions(func).collect()  File
"/root/spark/python/pyspark/rdd.py", line 463, in collectbytesInJava =
self._jrdd.collect().iterator()  File
"/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537,
in __call__  File
"/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in
get_return_valuepy4j.protocol.Py4JJavaError: An error occurred while
calling o332.collect.: org.apache.spark

Re: Spark streaming on ec2

2014-02-28 Thread Nicholas Chammas
Yeah, the Spark on EMR bootstrap scripts referenced
here need some
polishing. I had a lot of trouble just getting through that
tutorial. And yes, the version of Spark they're using is 0.8.1.


On Fri, Feb 28, 2014 at 2:39 PM, Aureliano Buendia wrote:

> Unfortunately, that script is not under active maintenance. Given that
> spark is getting accelerated release cycles, solutions like this get
> outdated quickly.
>
>
> On Fri, Feb 28, 2014 at 7:36 PM, Mayur Rustagi wrote:
>
>> Thr is a talk to install spark on Amazon ( not sure if its updated for
>> 0.9.0).
>> http://www.youtube.com/watch?v=G0lSWUqyOhw
>> In this case the bootstrap script will run on the new slave when it comes
>> up. I am not sure how clean & production quality this is. He seems to be
>> leveraging spot instances where this needs to be done properly.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi 
>>
>>
>>
>> On Fri, Feb 28, 2014 at 10:52 AM, Aureliano Buendia > > wrote:
>>
>>> Also, in this talk http://www.youtube.com/watch?v=OhpjgaBVUtU on using
>>> spark streaming in production, the author seems to have missed the topic of
>>> how to manage cloud instances.
>>>
>>>
>>> On Fri, Feb 28, 2014 at 6:48 PM, Aureliano Buendia >> > wrote:
>>>
 What's the updated way of deploying spark streaming apps on EMR? Using
 YARN?

 There are some out of date solutions like
 https://github.com/ianoc/SparkEMRBootstrap which setup mesos on EMR. I
 wonder if this can be simplified by spark 0.9.

 Spark-ec2 comes with a considerable amount of configuration, and some
 useful utilities like deploy to workers, porting it to a managed service
 such as EMR is not as trivial as it might seem to be.


 On Fri, Feb 28, 2014 at 6:19 PM, Mayur Rustagi >>> > wrote:

> I think what you are looking for is sort of a managed service ala EMR
> or Qubole. Spark-ec2 is just software to boot up machines & integrate them
> together using Whirr.
> I agree a managed service for Streaming would be really useful.
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Fri, Feb 28, 2014 at 8:50 AM, Aureliano Buendia <
> buendia...@gmail.com> wrote:
>
>> Another subject that was not that important in spark, but it could be
>> crucial for 24/7 spark streaming, is reconstruction of lost nodes. By 
>> that,
>> I do not mean lost data reconstruction by self healing, but bringing up 
>> new
>> ec2 instances once they die for whatever reasons. Is this also supported 
>> in
>> spark ec2?
>>
>>
>> On Fri, Feb 28, 2014 at 2:24 AM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Yes, the default spark EC2 cluster runs the standalone deploy mode.
>>> Since Spark 0.9, the standalone deploy mode allows you to launch the 
>>> driver
>>> app within the cluster itself and automatically restart it if it fails. 
>>> You
>>> can read about launching your app inside the cluster 
>>> here.
>>> Using this you can launch your streaming app as well.
>>>
>>> TD
>>>
>>>
>>> On Thu, Feb 27, 2014 at 5:35 PM, Aureliano Buendia <
>>> buendia...@gmail.com> wrote:
>>>
 How about spark stream app itself? Does the ec2 script also provide
 means for daemonizing and monitoring spark streaming apps which are
 supposed to run 24/7? If not, any suggestions for how to do this?


 On Thu, Feb 27, 2014 at 8:23 PM, Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

> Zookeeper is automatically set up in the cluster as Spark uses
> Zookeeper. However, you have to setup your own input source like 
> Kafka or
> Flume.
>
> TD
>
>
> On Thu, Feb 27, 2014 at 10:32 AM, Aureliano Buendia <
> buendia...@gmail.com> wrote:
>
>>
>>
>>
>> On Thu, Feb 27, 2014 at 6:17 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Yes! Spark streaming programs are just like any spark program
>>> and so any ec2 cluster setup using the spark-ec2 scripts can be 
>>> used to run
>>> spark streaming programs as well.
>>>
>>
>> Great. Does it come with any input source support as well? (Eg
>> kafka requires setting up zookeeper).
>>
>>
>>>
>>>
>>>
>>> On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia <
>>> buendia...@gmail.com> wrote:
>>>

Re: Kryo Registration, class is not registered, but Log.TRACE() says otherwise

2014-02-28 Thread pondwater
Has no one ever registered generic classes in scala? Is it possible?
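
It should be possible; because of JVM type erasure there is only one runtime
Class per generic type, so registering classOf[Container[_]] covers every
parameterisation. A minimal hedged sketch (Container is a stand-in for the
actual class, and the property-based wiring below is just one way to point
Spark at the registrator):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class Container[T](val items: Seq[T])      // illustrative generic class

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Container[_]])   // registers the single erased class
  }
}

// Wiring, Spark 0.9 style:
//   System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//   System.setProperty("spark.kryo.registrator", "mypackage.MyRegistrator")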





Re: Messy GraphX merge/reduce functions

2014-02-28 Thread Dan Davies
Are these incremental reduction functions what you'd expect when a graph is
partitioned using vertex cuts?  You'd naturally want to consolidate the
versions of a vertex's state inside partitions, then across partitions.






Spark stream example SimpleZeroMQPublisher high cpu usage

2014-02-28 Thread Aureliano Buendia
Hi,

Running:

./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher
tcp://127.0.1.1:1234 foo

causes over 100% CPU usage on OS X. Given that it's just a simple zmq
publisher, this shouldn't happen. Is there something wrong with that
example?


Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Egor Pahomov
Protobuf Java code generated by protoc 2.4 does not compile against the
protobuf 2.5 library - that's true. What I meant is: you serialized a message
with a class generated by protobuf 2.4.1. Now you can read that message with
a class generated by protobuf 2.5.0 from the same .proto.
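
On the earlier advice in this thread to use excludes to get rid of 2.4.1: a
hedged sbt sketch of pinning a single protobuf version (the coordinates are
illustrative -- adjust them to the artifacts your build actually pulls in):

// Exclude the protobuf 2.4.x copy dragged in transitively and force 2.5.0,
// so only one protobuf version ends up on the classpath.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "0.9.0-incubating"
    exclude("com.google.protobuf", "protobuf-java"),
  "org.apache.hadoop" % "hadoop-client" % "2.2.0",
  "com.google.protobuf" % "protobuf-java" % "2.5.0"
)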


2014-03-01 0:00 GMT+04:00 Egor Pahomov :

> In that same pom
>
> <profile>
>   <id>yarn</id>
>   <properties>
>     <hadoop.major.version>2</hadoop.major.version>
>     <hadoop.version>2.2.0</hadoop.version>
>     <protobuf.version>2.5.0</protobuf.version>
>   </properties>
>   <modules>
>     <module>yarn</module>
>   </modules>
> </profile>
>
>
>
> 2014-02-28 23:46 GMT+04:00 Aureliano Buendia :
>
>
>>
>>
>> On Fri, Feb 28, 2014 at 7:17 PM, Egor Pahomov wrote:
>>
>>> Spark 0.9 uses protobuf 2.5.0
>>>
>>
>> Spark 0.9 uses 2.4.1:
>>
>>
>> https://github.com/apache/incubator-spark/blob/4d880304867b55a4f2138617b30600b7fa013b14/pom.xml#L118
>>
>> Is there another pom for when hadoop 2.2 is used? I don't see another
>> branch for hadooop 2.2.
>>
>>
>>> Hadoop 2.2 uses protobuf 2.5.0
>>> protobuf 2.5.0 can read massages serialized with protobuf 2.4.1
>>>
>>
>> Protobuf java code generated by ptotoc 2.4 does not compile with protobuf
>> library 2.5. This is what the OP's error message is about.
>>
>>
>>> So there is not any reason why you can't read some messages from hadoop
>>> 2.2 with protobuf 2.5.0, probably you somehow have 2.4.1 in your class
>>> path. Of course it's very bad, that you have both 2.4.1 and 2.5.0 in your
>>> classpath. Use excludes or whatever to get rid of 2.4.1.
>>>
>>> Personally, I spend 3 days to move my project to protobuf 2.5.0 from
>>> 2.4.1. But it has to be done for the whole your project.
>>>
>>> 2014-02-28 21:49 GMT+04:00 Aureliano Buendia :
>>>
>>> Doesn't hadoop 2.2 also depend on protobuf 2.4?


 On Fri, Feb 28, 2014 at 5:45 PM, Ognen Duzlevski <
 og...@plainvanillagames.com> wrote:

> A stupid question, by the way, you did compile Spark with Hadoop 2.2.0
> support?
>
> Ognen
>
> On 2/28/14, 10:51 AM, Prasad wrote:
>
>> Hi
>> I am getting the protobuf error while reading HDFS file using
>> spark
>> 0.9.0 -- i am running on hadoop 2.2.0 .
>>
>> When i look thru, i find that i have both 2.4.1 and 2.5 and some blogs
>> suggest that there is some incompatability issues betwen 2.4.1 and 2.5
>>
>> hduser@prasadHdp1:~/spark-0.9.0-incubating$ find ~/ -name
>> protobuf-java*.jar
>> /home/hduser/.m2/repository/com/google/protobuf/protobuf-
>> java/2.4.1/protobuf-java-2.4.1.jar
>> /home/hduser/.m2/repository/org/spark-project/protobuf/
>> protobuf-java/2.4.1-shaded/protobuf-java-2.4.1-shaded.jar
>> /home/hduser/spark-0.9.0-incubating/lib_managed/
>> bundles/protobuf-java-2.5.0.jar
>> /home/hduser/spark-0.9.0-incubating/lib_managed/jars/
>> protobuf-java-2.4.1-shaded.jar
>> /home/hduser/.ivy2/cache/com.google.protobuf/protobuf-java/
>> bundles/protobuf-java-2.5.0.jar
>> /home/hduser/.ivy2/cache/org.spark-project.protobuf/
>> protobuf-java/jars/protobuf-java-2.4.1-shaded.jar
>>
>>
>> Can someone please let me know if you faced these issues and how u
>> fixed it.
>>
>> Thanks
>> Prasad.
>> Caused by: java.lang.VerifyError: class
>> org.apache.hadoop.security.proto.SecurityProtos$
>> GetDelegationTokenRequestProto
>> overrides final method
>> getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
>>  at java.lang.ClassLoader.defineClass1(Native Method)
>>  at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>>  at
>> java.security.SecureClassLoader.defineClass(
>> SecureClassLoader.java:142)
>>  at java.net.URLClassLoader.defineClass(URLClassLoader.
>> java:449)
>>  at java.net.URLClassLoader.access$100(URLClassLoader.
>> java:71)
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>  at java.security.AccessController.doPrivileged(Native
>> Method)
>>  at java.net.URLClassLoader.findClass(URLClassLoader.java:
>> 354)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>  at java.lang.Class.getDeclaredMethods0(Native Method)
>>  at java.lang.Class.privateGetDeclaredMethods(
>> Class.java:2531)
>>  at java.lang.Class.privateGetPublicMethods(Class.java:2651)
>>  at java.lang.Class.privateGetPublicMethods(Class.java:2661)
>>  at java.lang.Class.getMethods(Class.java:1467)
>>  at
>> sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
>>  at
>> sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
>>  at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:636)
>>  at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:722)
>>  at
>> org

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Egor Pahomov
In that same pom


<profile>
  <id>yarn</id>
  <properties>
    <hadoop.major.version>2</hadoop.major.version>
    <hadoop.version>2.2.0</hadoop.version>
    <protobuf.version>2.5.0</protobuf.version>
  </properties>
  <modules>
    <module>yarn</module>
  </modules>
</profile>




2014-02-28 23:46 GMT+04:00 Aureliano Buendia :

>
>
>
> On Fri, Feb 28, 2014 at 7:17 PM, Egor Pahomov wrote:
>
>> Spark 0.9 uses protobuf 2.5.0
>>
>
> Spark 0.9 uses 2.4.1:
>
>
> https://github.com/apache/incubator-spark/blob/4d880304867b55a4f2138617b30600b7fa013b14/pom.xml#L118
>
> Is there another pom for when hadoop 2.2 is used? I don't see another
> branch for hadooop 2.2.
>
>
>> Hadoop 2.2 uses protobuf 2.5.0
>> protobuf 2.5.0 can read massages serialized with protobuf 2.4.1
>>
>
> Protobuf java code generated by ptotoc 2.4 does not compile with protobuf
> library 2.5. This is what the OP's error message is about.
>
>
>> So there is not any reason why you can't read some messages from hadoop
>> 2.2 with protobuf 2.5.0, probably you somehow have 2.4.1 in your class
>> path. Of course it's very bad, that you have both 2.4.1 and 2.5.0 in your
>> classpath. Use excludes or whatever to get rid of 2.4.1.
>>
>> Personally, I spend 3 days to move my project to protobuf 2.5.0 from
>> 2.4.1. But it has to be done for the whole your project.
>>
>> 2014-02-28 21:49 GMT+04:00 Aureliano Buendia :
>>
>> Doesn't hadoop 2.2 also depend on protobuf 2.4?
>>>
>>>
>>> On Fri, Feb 28, 2014 at 5:45 PM, Ognen Duzlevski <
>>> og...@plainvanillagames.com> wrote:
>>>
 A stupid question, by the way, you did compile Spark with Hadoop 2.2.0
 support?

 Ognen

 On 2/28/14, 10:51 AM, Prasad wrote:

> Hi
> I am getting the protobuf error while reading HDFS file using spark
> 0.9.0 -- i am running on hadoop 2.2.0 .
>
> When i look thru, i find that i have both 2.4.1 and 2.5 and some blogs
> suggest that there is some incompatability issues betwen 2.4.1 and 2.5
>
> hduser@prasadHdp1:~/spark-0.9.0-incubating$ find ~/ -name
> protobuf-java*.jar
> /home/hduser/.m2/repository/com/google/protobuf/protobuf-
> java/2.4.1/protobuf-java-2.4.1.jar
> /home/hduser/.m2/repository/org/spark-project/protobuf/
> protobuf-java/2.4.1-shaded/protobuf-java-2.4.1-shaded.jar
> /home/hduser/spark-0.9.0-incubating/lib_managed/
> bundles/protobuf-java-2.5.0.jar
> /home/hduser/spark-0.9.0-incubating/lib_managed/jars/
> protobuf-java-2.4.1-shaded.jar
> /home/hduser/.ivy2/cache/com.google.protobuf/protobuf-java/
> bundles/protobuf-java-2.5.0.jar
> /home/hduser/.ivy2/cache/org.spark-project.protobuf/
> protobuf-java/jars/protobuf-java-2.4.1-shaded.jar
>
>
> Can someone please let me know if you faced these issues and how u
> fixed it.
>
> Thanks
> Prasad.
> Caused by: java.lang.VerifyError: class
> org.apache.hadoop.security.proto.SecurityProtos$
> GetDelegationTokenRequestProto
> overrides final method
> getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
>  at java.lang.ClassLoader.defineClass1(Native Method)
>  at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>  at
> java.security.SecureClassLoader.defineClass(
> SecureClassLoader.java:142)
>  at java.net.URLClassLoader.defineClass(URLClassLoader.
> java:449)
>  at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>  at java.lang.Class.getDeclaredMethods0(Native Method)
>  at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
>  at java.lang.Class.privateGetPublicMethods(Class.java:2651)
>  at java.lang.Class.privateGetPublicMethods(Class.java:2661)
>  at java.lang.Class.getMethods(Class.java:1467)
>  at
> sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
>  at
> sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
>  at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:636)
>  at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:722)
>  at
> org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(
> ProtobufRpcEngine.java:92)
>  at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)
>
>
> Caused by: java.lang.reflect.InvocationTargetException
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
>  at
> sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:57)
>  at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(

Re: JVM error

2014-02-28 Thread Bryn Keller
Sorry, typo - that last line should be:

sc = pyspark.Spark*Context*(conf = conf)


On Fri, Feb 28, 2014 at 9:37 AM, Mohit Singh  wrote:

> Hi Bryn,
>   Thanks for the suggestion.
> I tried that..
> conf = pyspark.SparkConf().set("spark.executor.memory","20G")
> But.. got an error here:
>
> sc = pyspark.SparkConf(conf = conf)
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: __init__() got an unexpected keyword argument 'conf'
>
> ??
> This is in pyspark shell.
>
>
> On Thu, Feb 27, 2014 at 5:00 AM, Evgeniy Shishkin wrote:
>
>>
>> On 27 Feb 2014, at 07:22, Aaron Davidson  wrote:
>>
>> > Setting spark.executor.memory is indeed the correct way to do this. If
>> you want to configure this in spark-env.sh, you can use
>> > export SPARK_JAVA_OPTS=" -Dspark.executor.memory=20g"
>> > (make sure to append the variable if you've been using SPARK_JAVA_OPTS
>> previously)
>> >
>> >
>> > On Wed, Feb 26, 2014 at 7:50 PM, Bryn Keller  wrote:
>> > Hi Mohit,
>> >
>> > You can still set SPARK_MEM in spark-env.sh, but that is deprecated.
>> This is from SparkContext.scala:
>> >
>> > if (!conf.contains("spark.executor.memory") &&
>> sys.env.contains("SPARK_MEM")) {
>> > logWarning("Using SPARK_MEM to set amount of memory to use per
>> executor process is " +
>> >   "deprecated, instead use spark.executor.memory")
>> >   }
>> >
>> > Thanks,
>> > Bryn
>> >
>> >
>> > On Wed, Feb 26, 2014 at 6:28 PM, Mohit Singh 
>> wrote:
>> > Hi Bryn,
>> >   Thanks for responding. Is there a way I can permanently configure
>> this setting?
>> > like SPARK_EXECUTOR_MEMORY or somethign like that?
>> >
>> >
>> >
>> > On Wed, Feb 26, 2014 at 2:56 PM, Bryn Keller  wrote:
>> > Hi Mohit,
>> >
>> > Try increasing the executor memory instead of the worker memory - the
>> most appropriate place to do this is actually when you're creating your
>> SparkContext, something like:
>> >
>> > conf = pyspark.SparkConf()
>> >.setMaster("spark://master:7077")
>> >.setAppName("Example")
>> >.setSparkHome("/your/path/to/spark")
>> >.set("spark.executor.memory", "20G")
>> >.set("spark.logConf", "true")
>> > sc = pyspark.SparkConf(conf = conf)
>> >
>> > Hope that helps,
>> > Bryn
>> >
>> >
>> >
>> > On Wed, Feb 26, 2014 at 2:39 PM, Mohit Singh 
>> wrote:
>> > Hi,
>> >   I am experimenting with pyspark lately...
>> > Every now and then, I see this error bieng streamed to pyspark shell ..
>> and most of the times.. the computation/operation completes.. and
>> sometimes, it just gets stuck...
>> > My setup is 8 node cluster.. with loads of ram(256GB's) and space(
>> TB's) per node.
>> > This enviornment is shared by general hadoop and hadoopy stuff..with
>> recent spark addition...
>> >
>> > java.lang.OutOfMemoryError: Java heap space
>> > at
>> com.ning.compress.BufferRecycler.allocEncodingBuffer(BufferRecycler.java:59)
>> > at com.ning.compress.lzf.ChunkEncoder.(ChunkEncoder.java:93)
>> > at
>> com.ning.compress.lzf.impl.UnsafeChunkEncoder.(UnsafeChunkEncoder.java:40)
>> > at
>> com.ning.compress.lzf.impl.UnsafeChunkEncoderLE.(UnsafeChunkEncoderLE.java:13)
>> > at
>> com.ning.compress.lzf.impl.UnsafeChunkEncoders.createEncoder(UnsafeChunkEncoders.java:31)
>> > at
>> com.ning.compress.lzf.util.ChunkEncoderFactory.optimalInstance(ChunkEncoderFactory.java:44)
>> > at
>> com.ning.compress.lzf.LZFOutputStream.(LZFOutputStream.java:61)
>> > at
>> org.apache.spark.io.LZFCompressionCodec.compressedOutputStream(CompressionCodec.scala:60)
>> > at
>> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:803)
>> > at
>> org.apache.spark.storage.BlockManager$$anonfun$5.apply(BlockManager.scala:471)
>> > at
>> org.apache.spark.storage.BlockManager$$anonfun$5.apply(BlockManager.scala:471)
>> > at
>> org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
>> > at
>> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174)
>> > at
>> org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:164)
>> > at
>> org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
>> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> > at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>> > at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>> > at org.apache.spark.scheduler.Task.run(Task.scala:53)
>> > at
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>> > at
>> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>> > at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>> > at
>> 

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Aureliano Buendia
On Fri, Feb 28, 2014 at 7:17 PM, Egor Pahomov wrote:

> Spark 0.9 uses protobuf 2.5.0
>

Spark 0.9 uses 2.4.1:

https://github.com/apache/incubator-spark/blob/4d880304867b55a4f2138617b30600b7fa013b14/pom.xml#L118

Is there another pom for when hadoop 2.2 is used? I don't see another
branch for hadoop 2.2.


> Hadoop 2.2 uses protobuf 2.5.0
> protobuf 2.5.0 can read messages serialized with protobuf 2.4.1
>

Protobuf Java code generated by protoc 2.4 does not compile against the protobuf
2.5 library. This is what the OP's error message is about.


> So there is not any reason why you can't read some messages from hadoop
> 2.2 with protobuf 2.5.0, probably you somehow have 2.4.1 in your class
> path. Of course it's very bad, that you have both 2.4.1 and 2.5.0 in your
> classpath. Use excludes or whatever to get rid of 2.4.1.
>
> Personally, I spend 3 days to move my project to protobuf 2.5.0 from
> 2.4.1. But it has to be done for the whole your project.
>
> 2014-02-28 21:49 GMT+04:00 Aureliano Buendia :
>
> Doesn't hadoop 2.2 also depend on protobuf 2.4?
>>
>>
>> On Fri, Feb 28, 2014 at 5:45 PM, Ognen Duzlevski <
>> og...@plainvanillagames.com> wrote:
>>
>>> A stupid question, by the way, you did compile Spark with Hadoop 2.2.0
>>> support?
>>>
>>> Ognen
>>>
>>> On 2/28/14, 10:51 AM, Prasad wrote:
>>>
 Hi
 I am getting the protobuf error while reading HDFS file using spark
 0.9.0 -- i am running on hadoop 2.2.0 .

 When i look thru, i find that i have both 2.4.1 and 2.5 and some blogs
 suggest that there is some incompatability issues betwen 2.4.1 and 2.5

 hduser@prasadHdp1:~/spark-0.9.0-incubating$ find ~/ -name
 protobuf-java*.jar
 /home/hduser/.m2/repository/com/google/protobuf/protobuf-
 java/2.4.1/protobuf-java-2.4.1.jar
 /home/hduser/.m2/repository/org/spark-project/protobuf/
 protobuf-java/2.4.1-shaded/protobuf-java-2.4.1-shaded.jar
 /home/hduser/spark-0.9.0-incubating/lib_managed/
 bundles/protobuf-java-2.5.0.jar
 /home/hduser/spark-0.9.0-incubating/lib_managed/jars/
 protobuf-java-2.4.1-shaded.jar
 /home/hduser/.ivy2/cache/com.google.protobuf/protobuf-java/
 bundles/protobuf-java-2.5.0.jar
 /home/hduser/.ivy2/cache/org.spark-project.protobuf/
 protobuf-java/jars/protobuf-java-2.4.1-shaded.jar


 Can someone please let me know if you faced these issues and how u
 fixed it.

 Thanks
 Prasad.
 Caused by: java.lang.VerifyError: class
 org.apache.hadoop.security.proto.SecurityProtos$
 GetDelegationTokenRequestProto
 overrides final method
 getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
  at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  at java.net.URLClassLoader.defineClass(URLClassLoader.
 java:449)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at java.lang.Class.getDeclaredMethods0(Native Method)
  at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
  at java.lang.Class.privateGetPublicMethods(Class.java:2651)
  at java.lang.Class.privateGetPublicMethods(Class.java:2661)
  at java.lang.Class.getMethods(Class.java:1467)
  at
 sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
  at
 sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
  at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:636)
  at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:722)
  at
 org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(
 ProtobufRpcEngine.java:92)
  at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)


 Caused by: java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 sun.reflect.NativeMethodAccessorImpl.invoke(
 NativeMethodAccessorImpl.java:57)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(
 DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)










 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-
 0-9-0-hado

Re: Spark streaming on ec2

2014-02-28 Thread Aureliano Buendia
Unfortunately, that script is not under active maintenance. Given that
spark is getting accelerated release cycles, solutions like this get
outdated quickly.


On Fri, Feb 28, 2014 at 7:36 PM, Mayur Rustagi wrote:

> Thr is a talk to install spark on Amazon ( not sure if its updated for
> 0.9.0).
> http://www.youtube.com/watch?v=G0lSWUqyOhw
> In this case the bootstrap script will run on the new slave when it comes
> up. I am not sure how clean & production quality this is. He seems to be
> leveraging spot instances where this needs to be done properly.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Fri, Feb 28, 2014 at 10:52 AM, Aureliano Buendia 
> wrote:
>
>> Also, in this talk http://www.youtube.com/watch?v=OhpjgaBVUtU on using
>> spark streaming in production, the author seems to have missed the topic of
>> how to manage cloud instances.
>>
>>
>> On Fri, Feb 28, 2014 at 6:48 PM, Aureliano Buendia 
>> wrote:
>>
>>> What's the updated way of deploying spark streaming apps on EMR? Using
>>> YARN?
>>>
>>> There are some out of date solutions like
>>> https://github.com/ianoc/SparkEMRBootstrap which setup mesos on EMR. I
>>> wonder if this can be simplified by spark 0.9.
>>>
>>> Spark-ec2 comes with a considerable amount of configuration, and some
>>> useful utilities like deploy to workers, porting it to a managed service
>>> such as EMR is not as trivial as it might seem to be.
>>>
>>>
>>> On Fri, Feb 28, 2014 at 6:19 PM, Mayur Rustagi 
>>> wrote:
>>>
 I think what you are looking for is sort of a managed service ala EMR
 or Qubole. Spark-ec2 is just software to boot up machines & integrate them
 together using Whirr.
 I agree a managed service for Streaming would be really useful.
 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi 



 On Fri, Feb 28, 2014 at 8:50 AM, Aureliano Buendia <
 buendia...@gmail.com> wrote:

> Another subject that was not that important in spark, but it could be
> crucial for 24/7 spark streaming, is reconstruction of lost nodes. By 
> that,
> I do not mean lost data reconstruction by self healing, but bringing up 
> new
> ec2 instances once they die for whatever reasons. Is this also supported 
> in
> spark ec2?
>
>
> On Fri, Feb 28, 2014 at 2:24 AM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Yes, the default spark EC2 cluster runs the standalone deploy mode.
>> Since Spark 0.9, the standalone deploy mode allows you to launch the 
>> driver
>> app within the cluster itself and automatically restart it if it fails. 
>> You
>> can read about launching your app inside the cluster 
>> here.
>> Using this you can launch your streaming app as well.
>>
>> TD
>>
>>
>> On Thu, Feb 27, 2014 at 5:35 PM, Aureliano Buendia <
>> buendia...@gmail.com> wrote:
>>
>>> How about spark stream app itself? Does the ec2 script also provide
>>> means for daemonizing and monitoring spark streaming apps which are
>>> supposed to run 24/7? If not, any suggestions for how to do this?
>>>
>>>
>>> On Thu, Feb 27, 2014 at 8:23 PM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Zookeeper is automatically set up in the cluster as Spark uses
 Zookeeper. However, you have to setup your own input source like Kafka 
 or
 Flume.

 TD


 On Thu, Feb 27, 2014 at 10:32 AM, Aureliano Buendia <
 buendia...@gmail.com> wrote:

>
>
>
> On Thu, Feb 27, 2014 at 6:17 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Yes! Spark streaming programs are just like any spark program and
>> so any ec2 cluster setup using the spark-ec2 scripts can be used to 
>> run
>> spark streaming programs as well.
>>
>
> Great. Does it come with any input source support as well? (Eg
> kafka requires setting up zookeeper).
>
>
>>
>>
>>
>> On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia <
>> buendia...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Does the ec2 support for spark 0.9 also include spark streaming?
>>> If not, is there an equivalent?
>>>
>>>
>>
>

>>>
>>
>

>>>
>>
>


Re: Spark streaming on ec2

2014-02-28 Thread Mayur Rustagi
There is a talk on installing Spark on Amazon (not sure if it's updated for
0.9.0):
http://www.youtube.com/watch?v=G0lSWUqyOhw
In this case the bootstrap script will run on the new slave when it comes
up. I am not sure how clean and production-quality this is. He seems to be
leveraging spot instances, where this needs to be done properly.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Fri, Feb 28, 2014 at 10:52 AM, Aureliano Buendia wrote:

> Also, in this talk http://www.youtube.com/watch?v=OhpjgaBVUtU on using
> spark streaming in production, the author seems to have missed the topic of
> how to manage cloud instances.
>
>
> On Fri, Feb 28, 2014 at 6:48 PM, Aureliano Buendia 
> wrote:
>
>> What's the updated way of deploying spark streaming apps on EMR? Using
>> YARN?
>>
>> There are some out of date solutions like
>> https://github.com/ianoc/SparkEMRBootstrap which setup mesos on EMR. I
>> wonder if this can be simplified by spark 0.9.
>>
>> Spark-ec2 comes with a considerable amount of configuration, and some
>> useful utilities like deploy to workers, porting it to a managed service
>> such as EMR is not as trivial as it might seem to be.
>>
>>
>> On Fri, Feb 28, 2014 at 6:19 PM, Mayur Rustagi 
>> wrote:
>>
>>> I think what you are looking for is sort of a managed service ala EMR or
>>> Qubole. Spark-ec2 is just software to boot up machines & integrate them
>>> together using Whirr.
>>> I agree a managed service for Streaming would be really useful.
>>> Regards
>>> Mayur
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi 
>>>
>>>
>>>
>>> On Fri, Feb 28, 2014 at 8:50 AM, Aureliano Buendia >> > wrote:
>>>
 Another subject that was not that important in spark, but it could be
 crucial for 24/7 spark streaming, is reconstruction of lost nodes. By that,
 I do not mean lost data reconstruction by self healing, but bringing up new
 ec2 instances once they die for whatever reasons. Is this also supported in
 spark ec2?


 On Fri, Feb 28, 2014 at 2:24 AM, Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

> Yes, the default spark EC2 cluster runs the standalone deploy mode.
> Since Spark 0.9, the standalone deploy mode allows you to launch the 
> driver
> app within the cluster itself and automatically restart it if it fails. 
> You
> can read about launching your app inside the cluster 
> here.
> Using this you can launch your streaming app as well.
>
> TD
>
>
> On Thu, Feb 27, 2014 at 5:35 PM, Aureliano Buendia <
> buendia...@gmail.com> wrote:
>
>> How about spark stream app itself? Does the ec2 script also provide
>> means for daemonizing and monitoring spark streaming apps which are
>> supposed to run 24/7? If not, any suggestions for how to do this?
>>
>>
>> On Thu, Feb 27, 2014 at 8:23 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Zookeeper is automatically set up in the cluster as Spark uses
>>> Zookeeper. However, you have to setup your own input source like Kafka 
>>> or
>>> Flume.
>>>
>>> TD
>>>
>>>
>>> On Thu, Feb 27, 2014 at 10:32 AM, Aureliano Buendia <
>>> buendia...@gmail.com> wrote:
>>>



 On Thu, Feb 27, 2014 at 6:17 PM, Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

> Yes! Spark streaming programs are just like any spark program and
> so any ec2 cluster setup using the spark-ec2 scripts can be used to 
> run
> spark streaming programs as well.
>

 Great. Does it come with any input source support as well? (Eg
 kafka requires setting up zookeeper).


>
>
>
> On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia <
> buendia...@gmail.com> wrote:
>
>> Hi,
>>
>> Does the ec2 support for spark 0.9 also include spark streaming?
>> If not, is there an equivalent?
>>
>>
>

>>>
>>
>

>>>
>>
>


Re: GraphX with UUID vertex IDs instead of Long

2014-02-28 Thread Deepak Nulu
I created an Improvement Issue for this:
https://spark-project.atlassian.net/browse/SPARK-1153



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-with-UUID-vertex-IDs-instead-of-Long-tp1953p2173.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Egor Pahomov
Spark 0.9 uses protobuf 2.5.0
Hadoop 2.2 uses protobuf 2.5.0
protobuf 2.5.0 can read messages serialized with protobuf 2.4.1
So there is no reason why you can't read messages from Hadoop 2.2 with
protobuf 2.5.0; you probably have 2.4.1 somewhere in your classpath. Of
course it's very bad that you have both 2.4.1 and 2.5.0 in your classpath.
Use excludes or whatever to get rid of 2.4.1.

Personally, I spent 3 days moving my project from protobuf 2.4.1 to 2.5.0.
But it has to be done for your whole project.

2014-02-28 21:49 GMT+04:00 Aureliano Buendia :

> Doesn't hadoop 2.2 also depend on protobuf 2.4?
>
>
> On Fri, Feb 28, 2014 at 5:45 PM, Ognen Duzlevski <
> og...@plainvanillagames.com> wrote:
>
>> A stupid question, by the way, you did compile Spark with Hadoop 2.2.0
>> support?
>>
>> Ognen
>>
>> On 2/28/14, 10:51 AM, Prasad wrote:
>>
>>> Hi
>>> I am getting the protobuf error while reading HDFS file using spark
>>> 0.9.0 -- i am running on hadoop 2.2.0 .
>>>
>>> When i look thru, i find that i have both 2.4.1 and 2.5 and some blogs
>>> suggest that there is some incompatability issues betwen 2.4.1 and 2.5
>>>
>>> hduser@prasadHdp1:~/spark-0.9.0-incubating$ find ~/ -name
>>> protobuf-java*.jar
>>> /home/hduser/.m2/repository/com/google/protobuf/protobuf-
>>> java/2.4.1/protobuf-java-2.4.1.jar
>>> /home/hduser/.m2/repository/org/spark-project/protobuf/
>>> protobuf-java/2.4.1-shaded/protobuf-java-2.4.1-shaded.jar
>>> /home/hduser/spark-0.9.0-incubating/lib_managed/
>>> bundles/protobuf-java-2.5.0.jar
>>> /home/hduser/spark-0.9.0-incubating/lib_managed/jars/
>>> protobuf-java-2.4.1-shaded.jar
>>> /home/hduser/.ivy2/cache/com.google.protobuf/protobuf-java/
>>> bundles/protobuf-java-2.5.0.jar
>>> /home/hduser/.ivy2/cache/org.spark-project.protobuf/
>>> protobuf-java/jars/protobuf-java-2.4.1-shaded.jar
>>>
>>>
>>> Can someone please let me know if you faced these issues and how u fixed
>>> it.
>>>
>>> Thanks
>>> Prasad.
>>> Caused by: java.lang.VerifyError: class
>>> org.apache.hadoop.security.proto.SecurityProtos$
>>> GetDelegationTokenRequestProto
>>> overrides final method
>>> getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
>>>  at java.lang.ClassLoader.defineClass1(Native Method)
>>>  at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>>>  at
>>> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>>  at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>>>  at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>  at java.security.AccessController.doPrivileged(Native Method)
>>>  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>>  at java.lang.Class.getDeclaredMethods0(Native Method)
>>>  at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
>>>  at java.lang.Class.privateGetPublicMethods(Class.java:2651)
>>>  at java.lang.Class.privateGetPublicMethods(Class.java:2661)
>>>  at java.lang.Class.getMethods(Class.java:1467)
>>>  at
>>> sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
>>>  at
>>> sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
>>>  at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:636)
>>>  at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:722)
>>>  at
>>> org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(
>>> ProtobufRpcEngine.java:92)
>>>  at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)
>>>
>>>
>>> Caused by: java.lang.reflect.InvocationTargetException
>>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>  at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(
>>> NativeMethodAccessorImpl.java:57)
>>>  at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(
>>> DelegatingMethodAccessorImpl.java:43)
>>>  at java.lang.reflect.Method.invoke(Method.java:606)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-
>>> 0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>> --
>> Some people, when confronted with a problem, think "I know, I'll use
>> regular expressions." Now they have two problems.
>> -- Jamie Zawinski
>>
>>
>


-- 



*Sincerely yours*
*Egor Pakhomov*
*Scala Developer, Yandex*


Re: Spark streaming on ec2

2014-02-28 Thread Aureliano Buendia
Also, in this talk http://www.youtube.com/watch?v=OhpjgaBVUtU on using
spark streaming in production, the author seems to have missed the topic of
how to manage cloud instances.


On Fri, Feb 28, 2014 at 6:48 PM, Aureliano Buendia wrote:

> What's the updated way of deploying spark streaming apps on EMR? Using
> YARN?
>
> There are some out of date solutions like
> https://github.com/ianoc/SparkEMRBootstrap which setup mesos on EMR. I
> wonder if this can be simplified by spark 0.9.
>
> Spark-ec2 comes with a considerable amount of configuration, and some
> useful utilities like deploy to workers, porting it to a managed service
> such as EMR is not as trivial as it might seem to be.
>
>
> On Fri, Feb 28, 2014 at 6:19 PM, Mayur Rustagi wrote:
>
>> I think what you are looking for is sort of a managed service ala EMR or
>> Qubole. Spark-ec2 is just software to boot up machines & integrate them
>> together using Whirr.
>> I agree a managed service for Streaming would be really useful.
>> Regards
>> Mayur
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi 
>>
>>
>>
>> On Fri, Feb 28, 2014 at 8:50 AM, Aureliano Buendia 
>> wrote:
>>
>>> Another subject that was not that important in spark, but it could be
>>> crucial for 24/7 spark streaming, is reconstruction of lost nodes. By that,
>>> I do not mean lost data reconstruction by self healing, but bringing up new
>>> ec2 instances once they die for whatever reasons. Is this also supported in
>>> spark ec2?
>>>
>>>
>>> On Fri, Feb 28, 2014 at 2:24 AM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Yes, the default spark EC2 cluster runs the standalone deploy mode.
 Since Spark 0.9, the standalone deploy mode allows you to launch the driver
 app within the cluster itself and automatically restart it if it fails. You
 can read about launching your app inside the cluster 
 here.
 Using this you can launch your streaming app as well.

 TD


 On Thu, Feb 27, 2014 at 5:35 PM, Aureliano Buendia <
 buendia...@gmail.com> wrote:

> How about spark stream app itself? Does the ec2 script also provide
> means for daemonizing and monitoring spark streaming apps which are
> supposed to run 24/7? If not, any suggestions for how to do this?
>
>
> On Thu, Feb 27, 2014 at 8:23 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Zookeeper is automatically set up in the cluster as Spark uses
>> Zookeeper. However, you have to setup your own input source like Kafka or
>> Flume.
>>
>> TD
>>
>>
>> On Thu, Feb 27, 2014 at 10:32 AM, Aureliano Buendia <
>> buendia...@gmail.com> wrote:
>>
>>>
>>>
>>>
>>> On Thu, Feb 27, 2014 at 6:17 PM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Yes! Spark streaming programs are just like any spark program and
 so any ec2 cluster setup using the spark-ec2 scripts can be used to run
 spark streaming programs as well.

>>>
>>> Great. Does it come with any input source support as well? (Eg kafka
>>> requires setting up zookeeper).
>>>
>>>



 On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia <
 buendia...@gmail.com> wrote:

> Hi,
>
> Does the ec2 support for spark 0.9 also include spark streaming?
> If not, is there an equivalent?
>
>

>>>
>>
>

>>>
>>
>


Re: Spark streaming on ec2

2014-02-28 Thread Aureliano Buendia
What's the updated way of deploying Spark Streaming apps on EMR? Using YARN?

There are some out-of-date solutions like
https://github.com/ianoc/SparkEMRBootstrap, which sets up Mesos on EMR. I
wonder if this can be simplified by Spark 0.9.

Spark-ec2 comes with a considerable amount of configuration and some
useful utilities, such as deploying to workers; porting it to a managed
service such as EMR is not as trivial as it might seem.


On Fri, Feb 28, 2014 at 6:19 PM, Mayur Rustagi wrote:

> I think what you are looking for is sort of a managed service ala EMR or
> Qubole. Spark-ec2 is just software to boot up machines & integrate them
> together using Whirr.
> I agree a managed service for Streaming would be really useful.
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Fri, Feb 28, 2014 at 8:50 AM, Aureliano Buendia 
> wrote:
>
>> Another subject that was not that important in spark, but it could be
>> crucial for 24/7 spark streaming, is reconstruction of lost nodes. By that,
>> I do not mean lost data reconstruction by self healing, but bringing up new
>> ec2 instances once they die for whatever reasons. Is this also supported in
>> spark ec2?
>>
>>
>> On Fri, Feb 28, 2014 at 2:24 AM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Yes, the default spark EC2 cluster runs the standalone deploy mode.
>>> Since Spark 0.9, the standalone deploy mode allows you to launch the driver
>>> app within the cluster itself and automatically restart it if it fails. You
>>> can read about launching your app inside the cluster 
>>> here.
>>> Using this you can launch your streaming app as well.
>>>
>>> TD
>>>
>>>
>>> On Thu, Feb 27, 2014 at 5:35 PM, Aureliano Buendia >> > wrote:
>>>
 How about spark stream app itself? Does the ec2 script also provide
 means for daemonizing and monitoring spark streaming apps which are
 supposed to run 24/7? If not, any suggestions for how to do this?


 On Thu, Feb 27, 2014 at 8:23 PM, Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

> Zookeeper is automatically set up in the cluster as Spark uses
> Zookeeper. However, you have to setup your own input source like Kafka or
> Flume.
>
> TD
>
>
> On Thu, Feb 27, 2014 at 10:32 AM, Aureliano Buendia <
> buendia...@gmail.com> wrote:
>
>>
>>
>>
>> On Thu, Feb 27, 2014 at 6:17 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Yes! Spark streaming programs are just like any spark program and so
>>> any ec2 cluster setup using the spark-ec2 scripts can be used to run 
>>> spark
>>> streaming programs as well.
>>>
>>
>> Great. Does it come with any input source support as well? (Eg kafka
>> requires setting up zookeeper).
>>
>>
>>>
>>>
>>>
>>> On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia <
>>> buendia...@gmail.com> wrote:
>>>
 Hi,

 Does the ec2 support for spark 0.9 also include spark streaming? If
 not, is there an equivalent?


>>>
>>
>

>>>
>>
>


Re: Use pyspark for following.

2014-02-28 Thread Andrew Ash
Roughly how many rows are in the most-common primary id?  If that's small,
you could group by primary id and assemble the resulting row from the group.

Is it possible to have two rows with the same primary and secondary id?
 Like this:

1,alpha,20
1,alpha,25

If not, you could map these to expanded-out rows and reduce by key to get
the result.

1,alpha,20
1,beta,22
1,gamma,25



1,(20,0,0,0)
1,(0,22,0,0)
1,(0,0,25,0)



1,(20,22,25,0)


Andrew



On Fri, Feb 28, 2014 at 10:31 AM, Chengi Liu wrote:

> My use case:
>
> prim_id,secondary_id,value
>
> There are million ids.. but 5 secondary ids.. But any secondary id is
> optional.
> For example:
> So.. secondary ids are say [alpha,beta,gamma,delta,kappa]
> 1,alpha,20
> 1,beta,22
> 1,gamma,25
> 2,alpha,1
> 2,delta,15
> 3,kappa,90
>
> What I want is to get the following output
>
> 1,20,22,25,0,0 # since kappa and delta are not present
> 2,1,0,0,15,0
> 3,0,0,0,0,90
>
> So basically flatten it out?
> How do i do this in pyspark.
> Thanks
>
>
>


Use pyspark for following.

2014-02-28 Thread Chengi Liu
My use case:

prim_id,secondary_id,value

There are a million primary ids but only 5 secondary ids, and any secondary id
is optional.
For example:
So.. secondary ids are say [alpha,beta,gamma,delta,kappa]
1,alpha,20
1,beta,22
1,gamma,25
2,alpha,1
2,delta,15
3,kappa,90

What I want is to get the following output

1,20,22,25,0,0 # since kappa and delta are not present
2,1,0,0,15,0
3,0,0,0,0,90

So basically I want to flatten it out.
How do I do this in pyspark?
Thanks


Re: Spark streaming on ec2

2014-02-28 Thread Mayur Rustagi
I think what you are looking for is sort of a managed service ala EMR or
Qubole. Spark-ec2 is just software to boot up machines & integrate them
together using Whirr.
I agree a managed service for Streaming would be really useful.
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Fri, Feb 28, 2014 at 8:50 AM, Aureliano Buendia wrote:

> Another subject that was not that important in spark, but it could be
> crucial for 24/7 spark streaming, is reconstruction of lost nodes. By that,
> I do not mean lost data reconstruction by self healing, but bringing up new
> ec2 instances once they die for whatever reasons. Is this also supported in
> spark ec2?
>
>
> On Fri, Feb 28, 2014 at 2:24 AM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Yes, the default spark EC2 cluster runs the standalone deploy mode. Since
>> Spark 0.9, the standalone deploy mode allows you to launch the driver app
>> within the cluster itself and automatically restart it if it fails. You can
>> read about launching your app inside the cluster 
>> here.
>> Using this you can launch your streaming app as well.
>>
>> TD
>>
>>
>> On Thu, Feb 27, 2014 at 5:35 PM, Aureliano Buendia 
>> wrote:
>>
>>> How about spark stream app itself? Does the ec2 script also provide
>>> means for daemonizing and monitoring spark streaming apps which are
>>> supposed to run 24/7? If not, any suggestions for how to do this?
>>>
>>>
>>> On Thu, Feb 27, 2014 at 8:23 PM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
 Zookeeper is automatically set up in the cluster as Spark uses
 Zookeeper. However, you have to setup your own input source like Kafka or
 Flume.

 TD


 On Thu, Feb 27, 2014 at 10:32 AM, Aureliano Buendia <
 buendia...@gmail.com> wrote:

>
>
>
> On Thu, Feb 27, 2014 at 6:17 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Yes! Spark streaming programs are just like any spark program and so
>> any ec2 cluster setup using the spark-ec2 scripts can be used to run 
>> spark
>> streaming programs as well.
>>
>
> Great. Does it come with any input source support as well? (Eg kafka
> requires setting up zookeeper).
>
>
>>
>>
>>
>> On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia <
>> buendia...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Does the ec2 support for spark 0.9 also include spark streaming? If
>>> not, is there an equivalent?
>>>
>>>
>>
>

>>>
>>
>


Re: Build Spark Against CDH5

2014-02-28 Thread Brian Brunner
After successfully building the official 0.9.0 release, I attempted to build
from the GitHub code again and this time it succeeded. Not really sure what
happened, but it works now.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Build-Spark-Against-CDH5-tp2129p2165.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Aureliano Buendia
Doesn't hadoop 2.2 also depend on protobuf 2.4?


On Fri, Feb 28, 2014 at 5:45 PM, Ognen Duzlevski <
og...@plainvanillagames.com> wrote:

> A stupid question, by the way, you did compile Spark with Hadoop 2.2.0
> support?
>
> Ognen
>
> On 2/28/14, 10:51 AM, Prasad wrote:
>
>> Hi
>> I am getting the protobuf error while reading HDFS file using spark
>> 0.9.0 -- i am running on hadoop 2.2.0 .
>>
>> When i look thru, i find that i have both 2.4.1 and 2.5 and some blogs
>> suggest that there is some incompatability issues betwen 2.4.1 and 2.5
>>
>> hduser@prasadHdp1:~/spark-0.9.0-incubating$ find ~/ -name
>> protobuf-java*.jar
>> /home/hduser/.m2/repository/com/google/protobuf/protobuf-
>> java/2.4.1/protobuf-java-2.4.1.jar
>> /home/hduser/.m2/repository/org/spark-project/protobuf/
>> protobuf-java/2.4.1-shaded/protobuf-java-2.4.1-shaded.jar
>> /home/hduser/spark-0.9.0-incubating/lib_managed/
>> bundles/protobuf-java-2.5.0.jar
>> /home/hduser/spark-0.9.0-incubating/lib_managed/jars/
>> protobuf-java-2.4.1-shaded.jar
>> /home/hduser/.ivy2/cache/com.google.protobuf/protobuf-java/
>> bundles/protobuf-java-2.5.0.jar
>> /home/hduser/.ivy2/cache/org.spark-project.protobuf/
>> protobuf-java/jars/protobuf-java-2.4.1-shaded.jar
>>
>>
>> Can someone please let me know if you faced these issues and how u fixed
>> it.
>>
>> Thanks
>> Prasad.
>> Caused by: java.lang.VerifyError: class
>> org.apache.hadoop.security.proto.SecurityProtos$
>> GetDelegationTokenRequestProto
>> overrides final method
>> getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
>>  at java.lang.ClassLoader.defineClass1(Native Method)
>>  at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>>  at
>> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>  at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>>  at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>  at java.security.AccessController.doPrivileged(Native Method)
>>  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>  at java.lang.Class.getDeclaredMethods0(Native Method)
>>  at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
>>  at java.lang.Class.privateGetPublicMethods(Class.java:2651)
>>  at java.lang.Class.privateGetPublicMethods(Class.java:2661)
>>  at java.lang.Class.getMethods(Class.java:1467)
>>  at
>> sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
>>  at
>> sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
>>  at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:636)
>>  at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:722)
>>  at
>> org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(
>> ProtobufRpcEngine.java:92)
>>  at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)
>>
>>
>> Caused by: java.lang.reflect.InvocationTargetException
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at
>> sun.reflect.NativeMethodAccessorImpl.invoke(
>> NativeMethodAccessorImpl.java:57)
>>  at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(
>> DelegatingMethodAccessorImpl.java:43)
>>  at java.lang.reflect.Method.invoke(Method.java:606)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-
>> 0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
> --
> Some people, when confronted with a problem, think "I know, I'll use
> regular expressions." Now they have two problems.
> -- Jamie Zawinski
>
>


Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Ognen Duzlevski
A stupid question, by the way, you did compile Spark with Hadoop 2.2.0 
support?

Ognen

On 2/28/14, 10:51 AM, Prasad wrote:

Hi
I am getting the protobuf error while reading HDFS file using spark
0.9.0 -- i am running on hadoop 2.2.0 .

When i look thru, i find that i have both 2.4.1 and 2.5 and some blogs
suggest that there is some incompatability issues betwen 2.4.1 and 2.5

hduser@prasadHdp1:~/spark-0.9.0-incubating$ find ~/ -name protobuf-java*.jar
/home/hduser/.m2/repository/com/google/protobuf/protobuf-java/2.4.1/protobuf-java-2.4.1.jar
/home/hduser/.m2/repository/org/spark-project/protobuf/protobuf-java/2.4.1-shaded/protobuf-java-2.4.1-shaded.jar
/home/hduser/spark-0.9.0-incubating/lib_managed/bundles/protobuf-java-2.5.0.jar
/home/hduser/spark-0.9.0-incubating/lib_managed/jars/protobuf-java-2.4.1-shaded.jar
/home/hduser/.ivy2/cache/com.google.protobuf/protobuf-java/bundles/protobuf-java-2.5.0.jar
/home/hduser/.ivy2/cache/org.spark-project.protobuf/protobuf-java/jars/protobuf-java-2.4.1-shaded.jar


Can someone please let me know if you faced these issues and how u fixed it.

Thanks
Prasad.
Caused by: java.lang.VerifyError: class
org.apache.hadoop.security.proto.SecurityProtos$GetDelegationTokenRequestProto
overrides final method
getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
 at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.getDeclaredMethods0(Native Method)
 at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
 at java.lang.Class.privateGetPublicMethods(Class.java:2651)
 at java.lang.Class.privateGetPublicMethods(Class.java:2661)
 at java.lang.Class.getMethods(Class.java:1467)
 at
sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
 at
sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
 at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:636)
 at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:722)
 at
org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(ProtobufRpcEngine.java:92)
 at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)


Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)










--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


--
Some people, when confronted with a problem, think "I know, I'll use regular 
expressions." Now they have two problems.
-- Jamie Zawinski



Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Ognen Duzlevski
I run a 2.2.0 based HDFS cluster and I use Spark-0.9.0 without any 
problems to read the files.

Ognen

On 2/28/14, 10:51 AM, Prasad wrote:

Hi
I am getting the protobuf error while reading HDFS file using spark
0.9.0 -- i am running on hadoop 2.2.0 .

When i look thru, i find that i have both 2.4.1 and 2.5 and some blogs
suggest that there is some incompatability issues betwen 2.4.1 and 2.5

hduser@prasadHdp1:~/spark-0.9.0-incubating$ find ~/ -name protobuf-java*.jar
/home/hduser/.m2/repository/com/google/protobuf/protobuf-java/2.4.1/protobuf-java-2.4.1.jar
/home/hduser/.m2/repository/org/spark-project/protobuf/protobuf-java/2.4.1-shaded/protobuf-java-2.4.1-shaded.jar
/home/hduser/spark-0.9.0-incubating/lib_managed/bundles/protobuf-java-2.5.0.jar
/home/hduser/spark-0.9.0-incubating/lib_managed/jars/protobuf-java-2.4.1-shaded.jar
/home/hduser/.ivy2/cache/com.google.protobuf/protobuf-java/bundles/protobuf-java-2.5.0.jar
/home/hduser/.ivy2/cache/org.spark-project.protobuf/protobuf-java/jars/protobuf-java-2.4.1-shaded.jar


Can someone please let me know if you faced these issues and how u fixed it.

Thanks
Prasad.
Caused by: java.lang.VerifyError: class
org.apache.hadoop.security.proto.SecurityProtos$GetDelegationTokenRequestProto
overrides final method
getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
 at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.getDeclaredMethods0(Native Method)
 at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
 at java.lang.Class.privateGetPublicMethods(Class.java:2651)
 at java.lang.Class.privateGetPublicMethods(Class.java:2661)
 at java.lang.Class.getMethods(Class.java:1467)
 at
sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
 at
sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
 at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:636)
 at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:722)
 at
org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(ProtobufRpcEngine.java:92)
 at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)


Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)










--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Key Sort order on reduction

2014-02-28 Thread Usman Ghani
Hi All,
In Spark associative operations like groupByKey and reduceByKey, is it
guaranteed that the keys to each reducer will flow in sorted order like
they do in Hadoop MR? Or do I have to call sortByKey first?
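
As far as I understand it, reduceByKey and groupByKey themselves make no
guarantee about the order of keys, so an explicit sortByKey is needed whenever
downstream code relies on sorted keys. A minimal pyspark sketch of chaining the
two (the sample data is made up):

pairs = sc.parallelize([("b", 2), ("a", 1), ("b", 3), ("a", 4)])

# reduceByKey alone: the key order of the result is not guaranteed.
summed = pairs.reduceByKey(lambda x, y: x + y)

# Chain sortByKey when the consumer needs keys in order.
summed_sorted = summed.sortByKey()
print(summed_sorted.collect())   # [('a', 5), ('b', 5)]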


Re: JVM error

2014-02-28 Thread Mohit Singh
Hi Bryn,
  Thanks for the suggestion.
I tried that..
conf = pyspark.SparkConf().set("spark.executor.memory","20G")
But.. got an error here:
sc = pyspark.SparkConf(conf = conf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument 'conf'

??
This is in pyspark shell.


On Thu, Feb 27, 2014 at 5:00 AM, Evgeniy Shishkin wrote:

>
> On 27 Feb 2014, at 07:22, Aaron Davidson  wrote:
>
> > Setting spark.executor.memory is indeed the correct way to do this. If
> you want to configure this in spark-env.sh, you can use
> > export SPARK_JAVA_OPTS=" -Dspark.executor.memory=20g"
> > (make sure to append the variable if you've been using SPARK_JAVA_OPTS
> previously)
> >
> >
> > On Wed, Feb 26, 2014 at 7:50 PM, Bryn Keller  wrote:
> > Hi Mohit,
> >
> > You can still set SPARK_MEM in spark-env.sh, but that is deprecated.
> This is from SparkContext.scala:
> >
> > if (!conf.contains("spark.executor.memory") &&
> sys.env.contains("SPARK_MEM")) {
> > logWarning("Using SPARK_MEM to set amount of memory to use per
> executor process is " +
> >   "deprecated, instead use spark.executor.memory")
> >   }
> >
> > Thanks,
> > Bryn
> >
> >
> > On Wed, Feb 26, 2014 at 6:28 PM, Mohit Singh 
> wrote:
> > Hi Bryn,
> >   Thanks for responding. Is there a way I can permanently configure this
> setting?
> > like SPARK_EXECUTOR_MEMORY or somethign like that?
> >
> >
> >
> > On Wed, Feb 26, 2014 at 2:56 PM, Bryn Keller  wrote:
> > Hi Mohit,
> >
> > Try increasing the executor memory instead of the worker memory - the
> most appropriate place to do this is actually when you're creating your
> SparkContext, something like:
> >
> > conf = pyspark.SparkConf()
> >.setMaster("spark://master:7077")
> >.setAppName("Example")
> >.setSparkHome("/your/path/to/spark")
> >.set("spark.executor.memory", "20G")
> >.set("spark.logConf", "true")
> > sc = pyspark.SparkConf(conf = conf)
> >
> > Hope that helps,
> > Bryn
> >
> >
> >
> > On Wed, Feb 26, 2014 at 2:39 PM, Mohit Singh 
> wrote:
> > Hi,
> >   I am experimenting with pyspark lately...
> > Every now and then, I see this error bieng streamed to pyspark shell ..
> and most of the times.. the computation/operation completes.. and
> sometimes, it just gets stuck...
> > My setup is 8 node cluster.. with loads of ram(256GB's) and space( TB's)
> per node.
> > This enviornment is shared by general hadoop and hadoopy stuff..with
> recent spark addition...
> >
> > java.lang.OutOfMemoryError: Java heap space
> > at
> com.ning.compress.BufferRecycler.allocEncodingBuffer(BufferRecycler.java:59)
> > at com.ning.compress.lzf.ChunkEncoder.<init>(ChunkEncoder.java:93)
> > at
> com.ning.compress.lzf.impl.UnsafeChunkEncoder.<init>(UnsafeChunkEncoder.java:40)
> > at
> com.ning.compress.lzf.impl.UnsafeChunkEncoderLE.<init>(UnsafeChunkEncoderLE.java:13)
> > at
> com.ning.compress.lzf.impl.UnsafeChunkEncoders.createEncoder(UnsafeChunkEncoders.java:31)
> > at
> com.ning.compress.lzf.util.ChunkEncoderFactory.optimalInstance(ChunkEncoderFactory.java:44)
> > at
> com.ning.compress.lzf.LZFOutputStream.<init>(LZFOutputStream.java:61)
> > at
> org.apache.spark.io.LZFCompressionCodec.compressedOutputStream(CompressionCodec.scala:60)
> > at
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:803)
> > at
> org.apache.spark.storage.BlockManager$$anonfun$5.apply(BlockManager.scala:471)
> > at
> org.apache.spark.storage.BlockManager$$anonfun$5.apply(BlockManager.scala:471)
> > at
> org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
> > at
> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:174)
> > at
> org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:164)
> > at
> org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> > at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
> > at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
> > at org.apache.spark.scheduler.Task.run(Task.scala:53)
> > at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
> > at
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
> > at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> > at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > at java.lang.Thread.run(Thread.java:744)
> >
> >
> >
> > Most of the settings in spark are default..

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Aureliano Buendia
Using protobuf 2.5 can lead to some major issues with spark, see

http://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3ccab89jjuy0sqkkokcidetglrzrj2zlat3phbvpjoxxcy9soq...@mail.gmail.com%3E

Moving protobuf 2.5 jar after the spark jar can help with your error, but
then you'll face the

WARN ClusterScheduler: Initial job has not accepted any resources;...

error which is still an unresolved issue in spark.

I had to downgrade protobuf in my app to 2.4.1 to get it to work on Spark.
This is not ideal as protobuf 2.5 comes with better performance.


On Fri, Feb 28, 2014 at 4:51 PM, Prasad wrote:

> Hi
> I am getting the protobuf error while reading HDFS file using spark
> 0.9.0 -- i am running on hadoop 2.2.0 .
>
> When i look thru, i find that i have both 2.4.1 and 2.5 and some blogs
> suggest that there is some incompatability issues betwen 2.4.1 and 2.5
>
> hduser@prasadHdp1:~/spark-0.9.0-incubating$ find ~/ -name
> protobuf-java*.jar
>
> /home/hduser/.m2/repository/com/google/protobuf/protobuf-java/2.4.1/protobuf-java-2.4.1.jar
>
> /home/hduser/.m2/repository/org/spark-project/protobuf/protobuf-java/2.4.1-shaded/protobuf-java-2.4.1-shaded.jar
>
> /home/hduser/spark-0.9.0-incubating/lib_managed/bundles/protobuf-java-2.5.0.jar
>
> /home/hduser/spark-0.9.0-incubating/lib_managed/jars/protobuf-java-2.4.1-shaded.jar
>
> /home/hduser/.ivy2/cache/com.google.protobuf/protobuf-java/bundles/protobuf-java-2.5.0.jar
>
> /home/hduser/.ivy2/cache/org.spark-project.protobuf/protobuf-java/jars/protobuf-java-2.4.1-shaded.jar
>
>
> Can someone please let me know if you faced these issues and how u fixed
> it.
>
> Thanks
> Prasad.
> Caused by: java.lang.VerifyError: class
>
> org.apache.hadoop.security.proto.SecurityProtos$GetDelegationTokenRequestProto
> overrides final method
> getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
> at
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
> at java.lang.Class.privateGetPublicMethods(Class.java:2651)
> at java.lang.Class.privateGetPublicMethods(Class.java:2661)
> at java.lang.Class.getMethods(Class.java:1467)
> at
> sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
> at
> sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
> at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:636)
> at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:722)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(ProtobufRpcEngine.java:92)
> at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)
>
>
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
>
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-02-28 Thread Prasad
Hi
I am getting the protobuf error while reading an HDFS file using Spark
0.9.0 -- I am running on Hadoop 2.2.0.

When I look through, I find that I have both 2.4.1 and 2.5, and some blogs
suggest that there are incompatibility issues between 2.4.1 and 2.5

hduser@prasadHdp1:~/spark-0.9.0-incubating$ find ~/ -name protobuf-java*.jar
/home/hduser/.m2/repository/com/google/protobuf/protobuf-java/2.4.1/protobuf-java-2.4.1.jar
/home/hduser/.m2/repository/org/spark-project/protobuf/protobuf-java/2.4.1-shaded/protobuf-java-2.4.1-shaded.jar
/home/hduser/spark-0.9.0-incubating/lib_managed/bundles/protobuf-java-2.5.0.jar
/home/hduser/spark-0.9.0-incubating/lib_managed/jars/protobuf-java-2.4.1-shaded.jar
/home/hduser/.ivy2/cache/com.google.protobuf/protobuf-java/bundles/protobuf-java-2.5.0.jar
/home/hduser/.ivy2/cache/org.spark-project.protobuf/protobuf-java/jars/protobuf-java-2.4.1-shaded.jar


Can someone please let me know if you faced these issues and how you fixed them.

Thanks
Prasad.
Caused by: java.lang.VerifyError: class
org.apache.hadoop.security.proto.SecurityProtos$GetDelegationTokenRequestProto
overrides final method
getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
at java.lang.Class.privateGetPublicMethods(Class.java:2651)
at java.lang.Class.privateGetPublicMethods(Class.java:2661)
at java.lang.Class.getMethods(Class.java:1467)
at
sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:426)
at
sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:323)
at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:636)
at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:722)
at
org.apache.hadoop.ipc.ProtobufRpcEngine.getProxy(ProtobufRpcEngine.java:92)
at org.apache.hadoop.ipc.RPC.getProtocolProxy(RPC.java:537)


Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)










--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark streaming on ec2

2014-02-28 Thread Aureliano Buendia
Another subject that was not that important in Spark, but could be
crucial for 24/7 Spark Streaming, is the reconstruction of lost nodes. By
that, I do not mean lost-data reconstruction by self-healing, but bringing
up new EC2 instances once they die for whatever reason. Is this also
supported in spark-ec2?


On Fri, Feb 28, 2014 at 2:24 AM, Tathagata Das
wrote:

> Yes, the default spark EC2 cluster runs the standalone deploy mode. Since
> Spark 0.9, the standalone deploy mode allows you to launch the driver app
> within the cluster itself and automatically restart it if it fails. You can
> read about launching your app inside the cluster 
> here.
> Using this you can launch your streaming app as well.
>
> TD
>
>
> On Thu, Feb 27, 2014 at 5:35 PM, Aureliano Buendia 
> wrote:
>
>> How about spark stream app itself? Does the ec2 script also provide means
>> for daemonizing and monitoring spark streaming apps which are supposed to
>> run 24/7? If not, any suggestions for how to do this?
>>
>>
>> On Thu, Feb 27, 2014 at 8:23 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Zookeeper is automatically set up in the cluster as Spark uses
>>> Zookeeper. However, you have to setup your own input source like Kafka or
>>> Flume.
>>>
>>> TD
>>>
>>>
>>> On Thu, Feb 27, 2014 at 10:32 AM, Aureliano Buendia <
>>> buendia...@gmail.com> wrote:
>>>



 On Thu, Feb 27, 2014 at 6:17 PM, Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

> Yes! Spark streaming programs are just like any spark program and so
> any ec2 cluster setup using the spark-ec2 scripts can be used to run spark
> streaming programs as well.
>

 Great. Does it come with any input source support as well? (Eg kafka
 requires setting up zookeeper).


>
>
>
> On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia <
> buendia...@gmail.com> wrote:
>
>> Hi,
>>
>> Does the ec2 support for spark 0.9 also include spark streaming? If
>> not, is there an equivalent?
>>
>>
>

>>>
>>
>


Re: Having Spark read a JSON file

2014-02-28 Thread Paul Brown
Hi, Nick --

Not that it adds legitimacy, but there is even a MIME type for line-delimited
JSON: application/x-ldjson (not to be confused with application/ld+json...)

 What I said about ser/de in inline blocks only applied in the Scala
dialect of Spark when using Jackson; for example:

  val om: ObjectMapper with ScalaObjectMapper = new ObjectMapper() with ScalaObjectMapper
  om.setPropertyNamingStrategy(PropertyNamingStrategy.CAMEL_CASE_TO_LOWER_CASE_WITH_UNDERSCORES)
  om.registerModule(DefaultScalaModule)
  om.registerModule(new JodaModule)
  val events: RDD[Event] = sc.textFile("foo.ldj").map(om.readValue[Event](_))

That would attempt to send the ObjectMapper instance over the wire, and as
configured, the instance isn't serializable.  Instead, you can wrap the
functionality in an object that exists on the worker side:

object Foo {
  // initialize ObjectMapper here
  def mapStuff:...
}

and then in the job driver:

  val events: RDD[Event] = sc.textFile("foo.ldj").map(Foo.mapStuff)

There are probably analogs in the Python flavor as well, but IMHO things
like this are a nice object lesson (ho ho ho) about where code and data
live in a Spark system.  (/me waves hands about mobile code.)
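
For the Python flavor Nick mentions, a minimal sketch of the same one-object-per-line
pattern might look like this (assuming one JSON object per line, as he describes; the
file name is a placeholder):

import json

# json.loads is a plain module-level function, so there is nothing stateful to
# ship to the workers -- no analogue of the ObjectMapper concern above.
events = sc.textFile("foo.ldj").map(lambda line: json.loads(line))

first = events.first()   # an ordinary dict/list parsed from the first line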



—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


On Thu, Feb 27, 2014 at 8:52 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Thanks for the direction, Paul and Deb.
>
> I'm currently reading in the data using sc.textFile() and Python's
> json.loads(). It turns out that the big JSON data sources I care most about
> happen to be structured so that there is one object per line, even though
> the objects are correctly strung together in a JSON list.
>
> Deb,
>
> I get the files as JSON text, but they don't have to stay that way. Would
> you recommend I somehow convert the files into another format, say Avro,
> before handling them with Spark?
>
> Paul,
>
> When you say not to write your ser/de as inline blocks, could you provide
> a simple example (even in pseudocode) to illustrate?
>
> Nick
>
>
> On Mon, Feb 24, 2014 at 2:41 AM, Paul Brown  wrote:
>
>>
>> JSON handling works great, although you have to be a little bit careful
>> with just what is loaded/used where.  One approach that works is:
>>
>> - Jackson Scala 2.3.1 (or your favorite JSON lib) shipped as a JAR for
>> the job.
>> - Read data as RDD[String].
>> - Implement your per-line JSON binding in a method on an object, e.g.,
>> apply(...) for a companion object for a case class that models your line
>> items.  For the Jackson case, this would mean an ObjectMapper as a val in
>> the companion object (only need one ObjectMapper instance).
>> - .map(YourObject.apply) to get RDD[YourObject]
>>
>> And there you go.  Something similar works for writing out JSON.
>>
>> Probably obvious if you're a seasoned Spark user, but DO NOT write your
>> JSON serialization/deserialization as inline blocks, else you'll be
>> transporting your ObjectMapper instances around the cluster when you don't
>> need to (and depending on your specific configuration, it may not work).
>>  That is a facility that should (IMHO) be encapsulated with the pieces of
>> the system that directly touch the data, i.e., on the worker.
>>
>>
>> —
>> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>>
>>
>> On Sun, Feb 23, 2014 at 9:10 PM, nicholas.chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I'm new to this field, but it seems like most "Big Data" examples --
>>> Spark's included -- begin with reading in flat lines of text from a file.
>>>
>>> How would I go about having Spark turn a large JSON file into an RDD?
>>>
>>> So the file would just be a text file that looks like this:
>>>
>>> [{...}, {...}, ...]
>>>
>>>
>>>  where the individual JSON objects are arbitrarily complex (i.e. not
>>> necessarily flat) and may or may not be on separate lines.
>>>
>>> Basically, I'm guessing Spark would need to parse the JSON since it
>>> cannot rely on newlines as a delimiter. That sounds like a costly thing.
>>>
>>> Is JSON a "bad" format to have to deal with, or can Spark efficiently
>>> ingest and work with data in this format? If it can, can I get a pointer as
>>> to how I would do that?
>>>
>>>  Nick
>>>
>>> --
>>> View this message in context: Having Spark read a JSON file
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>>
>


RE: is RDD failure transparent to stream consumer

2014-02-28 Thread Adrian Mocanu
Thanks so much Matei!

From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: February-28-14 10:59 AM
To: user@spark.apache.org
Subject: Re: is RDD failure transparent to stream consumer

For output operators like this, the operator will run multiple times, so it
needs to be idempotent. However, the built-in save operators (e.g.
saveAsTextFile) are automatically idempotent (they only create each output
partition once).

Matei

On Feb 28, 2014, at 10:10 AM, Adrian Mocanu <amoc...@verticalscope.com> wrote:


Would really like an answer to this. A `yes` or `no` would suffice.

I'm talking about RDD failure in this context:
myStream.foreachRDD(rdd=>rdd.foreach(tuple => println(tuple)))

From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: February-27-14 12:19 PM
To: u...@spark.incubator.apache.org
Subject: is RDD failure transparent to stream consumer

Is RDD failure transparent to a Spark stream consumer, except for the slowdown
needed to recreate the RDD?
After reading the papers on RDDs and DStreams from the Spark homepage I believe
it is, but I'd like a confirmation.

Thanks
-Adrian



Re: is RDD failure transparent to stream consumer

2014-02-28 Thread Matei Zaharia
For output operators like this, the operator will run multiple times, so it
needs to be idempotent. However, the built-in save operators (e.g.
saveAsTextFile) are automatically idempotent (they only create each output
partition once).
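
A minimal sketch of what that means for a custom output, assuming a DStream
of key/value tuples and a hypothetical upsert-capable Store client (the point
is that an upsert keyed on the record is safe to re-run, while a bare println
or blind append is not):

  myStream.foreachRDD { rdd =>
    rdd.foreachPartition { tuples =>
      val store = Store.connect()   // hypothetical sink client, one per partition
      tuples.foreach { case (k, v) => store.upsert(k, v) }   // re-runs overwrite rather than duplicate
      store.close()
    }
  }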

Matei

On Feb 28, 2014, at 10:10 AM, Adrian Mocanu  wrote:

> Would really like an answer to this. A `yes` or `no` would suffice.
>  
> I’m talking about RDD failure in this context:
> myStream.foreachRDD(rdd=>rdd.foreach(tuple => println(tuple)))
>  
> From: Adrian Mocanu [mailto:amoc...@verticalscope.com] 
> Sent: February-27-14 12:19 PM
> To: u...@spark.incubator.apache.org
> Subject: is RDD failure transparent to stream consumer
>  
> Is RDD failure transparent to a Spark stream consumer, except for the slowdown 
> needed to recreate the RDD?
> After reading the papers on RDDs and DStreams from the Spark homepage I believe 
> it is, but I’d like a confirmation.
>  
> Thanks
> -Adrian



RE: is RDD failure transparent to stream consumer

2014-02-28 Thread Adrian Mocanu
Would really like an answer to this. A `yes` or `no` would suffice.

I'm talking about RDD failure in this context:
myStream.foreachRDD(rdd=>rdd.foreach(tuple => println(tuple)))

From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: February-27-14 12:19 PM
To: u...@spark.incubator.apache.org
Subject: is RDD failure transparent to stream consumer

Is RDD failure transparent to a Spark stream consumer, except for the slowdown
needed to recreate the RDD?
After reading the papers on RDDs and DStreams from the Spark homepage I believe
it is, but I'd like a confirmation.

Thanks
-Adrian



Re: Rename filter() into keep(), remove() or take() ?

2014-02-28 Thread Bertrand Dechoux
Clojure made the same kind of choice too: 'filter()' and 'remove()'. So the
behavior of filter is obvious when you know about the other one... Well, the
function name makes sense if you are thinking in a 'logic paradigm'.
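
For anyone skimming the thread, the behavior in question in one line each (a
quick sketch; sc is a SparkContext):

  (1 to 6).filter(_ % 2 == 0)      // keeps 2, 4, 6: elements where the predicate is true
  (1 to 6).filterNot(_ % 2 == 0)   // keeps 1, 3, 5: the complement
  sc.parallelize(1 to 6).filter(_ % 2 == 0).collect()   // Array(2, 4, 6); RDDs only offer filter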

Anyway, that's something I had to write about. I understand that the ROI is
most likely not worth it.

Thanks for the feedback

Bertrand


On Thu, Feb 27, 2014 at 3:38 PM, Nick Pentreath wrote:

> Agree that filter is perhaps unintuitive. Though the Scala collections API
> has "filter" and "filterNot" which together provide context that makes it
> more intuitive.
>
> And yes the change could be via added methods that don't break existing
> API.
>
> Still, overall I would be -1 on this unless a significant proportion of
> users would find that it adds value.
>
> Actually, adding "filterNot", while not strictly necessary, would make more
> sense in my view.
>
>
> --
> Sent from Mailbox  for iPhone
>
>
> On Thu, Feb 27, 2014 at 3:56 PM, Bertrand Dechoux wrote:
>
>> I understand the explanation but I had to try. However, the change could
>> be made without breaking anything but that's another story.
>>
>> Regards
>>
>> Bertrand
>>
>> Bertrand Dechoux
>>
>>
>> On Thu, Feb 27, 2014 at 2:05 PM, Nick Pentreath > > wrote:
>>
>>> filter comes from the Scala collection method "filter". I'd say it's
>>> best to keep in line with the Scala collections API, as Spark has done with
>>> RDDs generally (map, flatMap, take, etc.), so that it is easier and natural
>>> for developers to apply the same thinking for Scala (parallel) collections
>>> to Spark RDDs.
>>>
>>> Plus, such an API change would be a major breaking one and IMO not a
>>> good idea at this stage.
>>>
>>>   def filter(p: (A) => Boolean): Seq[A]
>>>
>>> Selects all elements of this sequence which satisfy a predicate.
>>>
>>> p: the predicate used to test elements.
>>>
>>> returns: a new sequence consisting of all elements of this sequence that
>>> satisfy the given predicate p. The order of the elements is preserved.
>>>
>>>
>>> On Thu, Feb 27, 2014 at 2:36 PM, Bertrand Dechoux wrote:
>>>
 Hi,

 It might seem like a trivial issue, but even though it is somewhat of a
 standard name, filter() is not really explicit about which way it works.
 Sure, it makes sense to provide a filter function, but what happens when it
 returns true? Is the current element removed or kept? It is not really
 obvious.

 Has another name already been discussed? It could be keep() or remove().
 But take() could also be reused: instead of providing a number, the filter
 function could be requested.

  Regards

 Bertrand

>>>
>>>
>>
>


Re: Implementing a custom Spark shell

2014-02-28 Thread Prashant Sharma
You can enable debug logging for the repl; thankfully it uses Spark's logging
framework. The trouble must be with the wrappers.
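
A minimal sketch, assuming the stock log4j setup that ships with Spark (copy
conf/log4j.properties.template to conf/log4j.properties and add):

  # raise only the repl package to DEBUG, leave everything else at INFO
  log4j.logger.org.apache.spark.repl=DEBUG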

Prashant Sharma


On Fri, Feb 28, 2014 at 12:29 PM, Sampo Niskanen
wrote:

> Hi,
>
> Thanks for the pointers.  I did get my code working within the normal
> spark-shell.  However, since I'm building a separate analysis service which
> pulls in the Spark libraries using SBT, I'd much rather have the custom
> shell incorporated in that, instead of having to use the default
> downloadable distribution.
>
>
> I figured out how to create a custom Scala REPL using the instructions at
> http://stackoverflow.com/questions/18628516/embedded-scala-repl-interpreter-example-for-2-10
>  (The latter answer is my helper class that I use.)
>
> I injected the SparkContext and my RDDs, and for example rdd.count works
> fine.  However, when I try to perform a filter operation, I get a
> ClassNotFoundException [1].  My guess is that the inline function I define
> is created only within the REPL and does not get shipped to the worker
> processes (even though I'm using a local cluster).
>
> I found out that there's a separate spark-repl library, which contains the
> SparkILoop class.  When I replace the ILoop with SparkILoop, I get the
> Spark logo + version number, a NullPointerException [2] and then the Scala
> prompt.  Still, I get exactly the same ClassNotFoundException when trying
> to perform a filter operation.
>
> Can anyone give any pointers on how to get this working?
>
>
> Best regards,
>Sampo N.
>
>
>
> ClassNotFoundException [1]:
>
> scala> data.profile.filter(p => p.email == "sampo.niska...@mwsoy.com
> ").count
> 14/02/28 08:49:16 ERROR Executor: Exception in task ID 1
> java.lang.ClassNotFoundException: $line9.$read$$iw$$iw$$anonfun$1
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>  at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>  at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
>  at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>  at
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
> at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>  at
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>  at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
> at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>  at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:195)
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> 14/02/28 08:49:16 ERROR TaskSetManager: Task 1.0:0 failed 1 times;
> aborting job
> org.apache.spark.SparkException: Job aborted: Task 1.0:0 failed 1 times
> (most recent failure: Exception failure: java.lang.ClassNotFoundException:
> $anonfun$1)
>  at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
>  at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at org.apache.spark.scheduler.DAGScheduler.org