Re: Kafka + Spark streaming

2014-12-31 Thread Samya Maiti
Thanks TD. On Wed, Dec 31, 2014 at 7:19 AM, Tathagata Das tathagata.das1...@gmail.com wrote: 1. Of course, a single block / partition has many Kafka messages, possibly from different Kafka topics interleaved together. The message count is not related to the block count. Any message received within
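For context, a rough sketch of the relationship TD describes: each receiver cuts one block per block interval regardless of message count, so partitions per batch come from the two intervals. The milliseconds-style value below is an assumption based on Spark 1.x behaviour:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Each receiver cuts one block per block interval, no matter how many Kafka
  // messages (from however many topics) arrived in that window.
  val conf = new SparkConf()
    .setAppName("kafka-blocks")
    .set("spark.streaming.blockInterval", "200")     // milliseconds in Spark 1.x
  val ssc = new StreamingContext(conf, Seconds(2))   // roughly 2000 / 200 = 10 blocks (partitions) per batch per receiver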

FlatMapValues

2014-12-31 Thread Sanjay Subramanian
hey guys  My dataset is like this  025126,Chills,8.10,Injection site oedema,8.10,Injection site reaction,8.10,Malaise,8.10,Myalgia,8.10 Intended output is ==025126,Chills 025126,Injection site oedema 025126,Injection site reaction 025126,Malaise 025126,Myalgia My code is as
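One possible way to produce that output, assuming the comma-separated layout above (the file path is hypothetical and the reaction names are taken to sit at the odd indices):

  val lines = sc.textFile("aers_reactions.csv")        // hypothetical input path
  val out = lines.map(_.split(',')).flatMap { f =>
    // f(0) is the case id; indices 1, 3, 5, ... hold the reaction names
    (1 until f.length by 2).map(i => f(0) + "," + f(i))
  }
  out.saveAsTextFile("reactions_out")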

pyspark.daemon not found

2014-12-31 Thread Naveen Kumar Pokala
Error from python worker: python: module pyspark.daemon not found PYTHONPATH was: /home/npokala/data/spark-install/spark-master/python: Can somebody please help me with how to resolve this issue? -Naveen

Re: How to set local property in beeline connect to the spark thrift server

2014-12-31 Thread 田毅
Hi, Xiaoyu! In the Spark thrift server you can use `spark.sql.thriftserver.scheduler.pool` instead of `spark.scheduler.pool`. On Wed, Dec 31, 2014 at 3:55 PM, Xiaoyu Wang wangxy...@gmail.com wrote: Hi all! I use Spark SQL 1.2 to start the thrift server on YARN. I want to use the fair scheduler
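For example, from the beeline session (assuming a pool named poolA is defined in the fair scheduler configuration):

  SET spark.sql.thriftserver.scheduler.pool=poolA;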

spark stream + cassandra (execution on event)

2014-12-31 Thread Oleg Ruchovets
Hi. I want to use Spark Streaming to read data from Cassandra, but in my case I need to process data based on an event (not by retrieving the data from Cassandra constantly). Question: what is the way to trigger the processing with Spark Streaming only from time to time? Thanks, Oleg.

Re: FlatMapValues

2014-12-31 Thread Raghavendra Pandey
Why don't you push \n instead of \t in your first transformation [ (fields(0),(fields(1)+\t+fields(3)+\t+fields(5)+\t+fields(7)+\t +fields(9)))] and then do saveAsTextFile? -Raghavendra On Wed Dec 31 2014 at 1:42:55 PM Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid wrote: hey guys My

Exception in thread main org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

2014-12-31 Thread Hafiz Mujadid
I am accessing HDFS with Spark's .textFile method, and I receive the error: Exception in thread main org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4. Here are my dependencies

Re: Exception in thread main org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

2014-12-31 Thread Sean Owen
This generally means you have packaged Hadoop 1.x classes into your app accidentally. The most common cause is not marking Hadoop and Spark classes as provided dependencies. Your app doesn't need to ship its own copy of these classes when you use spark-submit. On Wed, Dec 31, 2014 at 10:47 AM,
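A minimal build.sbt sketch of what Sean describes (artifact and version are illustrative):

  // Spark (and the Hadoop client it pulls in) is supplied by spark-submit at runtime,
  // so mark it "provided" rather than bundling it into the application jar.
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"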

Re: pyspark.daemon not found

2014-12-31 Thread Naveen Kumar Pokala
Hi, I am receiving the following error when I try to connect to the Spark cluster (which is on Unix) from my Windows machine using the pyspark interactive shell: pyspark -master (spark cluster url). Then I executed the following commands: lines = sc.textFile("hdfs://master/data/spark/SINGLE.TXT")

NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2014-12-31 Thread Christophe Billiard
Hi all, I am currently trying to combine datastax's spark-cassandra-connector and typesafe's akka-http-experimental on Spark 1.1.1 (the spark-cassandra-connector for Spark 1.2.0 is not out yet) and Scala 2.10.4. I am using the Hadoop 2.4 pre-built package (build.sbt file at the end). To solve the
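That particular NoSuchMethodError is often an older com.typesafe:config (1.0.x, which predates getDuration) winning on the classpath; assuming that is the clash here, one sbt-level workaround is:

  // force a config version that has Config.getDuration (added in config 1.2.0)
  dependencyOverrides += "com.typesafe" % "config" % "1.2.1"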

Re: building spark1.2 meet error

2014-12-31 Thread xhudik
Hi J_soft, for me it is working; I didn't pass the -Dscala-2.10 or -X parameters. I got only one warning: since I don't have hadoop 2.5, it didn't activate this profile: larix@kovral:~/sources/spark-1.2.0 mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0 -DskipTests clean package Found 0 infos

RE: FlatMapValues

2014-12-31 Thread Kapil Malik
Hi Sanjay, I tried running your code on spark shell piece by piece – // Setup val line1 = “025126,Chills,8.10,Injection site oedema,8.10,Injection site reaction,8.10,Malaise,8.10,Myalgia,8.10” val line2 = “025127,Chills,8.10,Injection site oedema,8.10,Injection site

Re: FlatMapValues

2014-12-31 Thread Fernando O.
Hi Sanjay, Doing an if inside a map sounds like a bad idea; it seems like you actually want to filter and then apply map. On Wed, Dec 31, 2014 at 9:54 AM, Kapil Malik kma...@adobe.com wrote: Hi Sanjay, I tried running your code on spark shell piece by piece – // Setup val line1 =
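A sketch of that filter-then-map shape against the sample data from the original post (lines is the raw input RDD; the field indices are assumptions from the sample line):

  val out = lines.map(_.split(','))
    .filter(_.length >= 10)                                    // drop short / malformed rows up front
    .flatMap(f => Seq(1, 3, 5, 7, 9).map(i => f(0) + "," + f(i)))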

Fwd: Sample Spark Program Error

2014-12-31 Thread Naveen Madhire
Hi All, I am trying to run a sample Spark program using Scala SBT. Below is the program: def main(args: Array[String]) { val logFile = "E:/ApacheSpark/usb/usb/spark/bin/README.md" // Should be some file on your system val sc = new SparkContext("local", "Simple App",
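For reference, a self-contained version of that quick-start style program (with the sc.stop() that the replies below suggest):

  import org.apache.spark.SparkContext

  object SimpleApp {
    def main(args: Array[String]): Unit = {
      val logFile = "E:/ApacheSpark/usb/usb/spark/bin/README.md" // should be some file on your system
      val sc = new SparkContext("local", "Simple App")
      val logData = sc.textFile(logFile).cache()
      val numAs = logData.filter(_.contains("a")).count()
      val numBs = logData.filter(_.contains("b")).count()
      println(s"Lines with a: $numAs, Lines with b: $numBs")
      sc.stop() // shut the context down cleanly so cleanup doesn't race with JVM exit
    }
  }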

Re: Fwd: Sample Spark Program Error

2014-12-31 Thread RK
If you look at your program output closely, you can see the following output.  Lines with a: 24, Lines with b: 15 The exception seems to be happening with Spark cleanup after executing your code. Try adding sc.stop() at the end of your program to see if the exception goes away. On

Re: FlatMapValues

2014-12-31 Thread Sanjay Subramanian
Hey Kapil, Fernando, Thanks for your mail. [1] Fernando, if I don't use if logic inside the map, then lines of input data that have fewer fields than I am expecting give me an ArrayIndexOutOfBounds exception, so the if is there to safeguard against that. [2] Kapil, I am sorry I did not clarify. Yes

Re: Fwd: Sample Spark Program Error

2014-12-31 Thread Naveen Madhire
Yes. The exception is gone now after adding stop() at the end. Can you please tell me what this stop() does at the end? Does it disable the spark context? On Wed, Dec 31, 2014 at 10:09 AM, RK prk...@yahoo.com wrote: If you look at your program output closely, you can see the following output.

Re: Long-running job cleanup

2014-12-31 Thread Ganelin, Ilya
The previously submitted code doesn’t actually demonstrate the problem I was trying to show, since the issue only becomes clear between subsequent steps. Within a single step it appears things are cleared up properly. Memory usage becomes evident pretty quickly. def showMemoryUsage(sc:

RE: FlatMapValues

2014-12-31 Thread Kapil Malik
Hi Sanjay, Oh yes .. flatMapValues is defined in PairRDDFunctions, and you need to import org.apache.spark.SparkContext._ to use it (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions ) @Sean, yes indeed flatMap / flatMapValues both can
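A tiny example of the import plus flatMapValues on Spark 1.x (the pair values are made up):

  import org.apache.spark.SparkContext._   // brings the implicit conversion to PairRDDFunctions into scope (Spark 1.x)

  val pairs = sc.parallelize(Seq(("025126", "Chills\tMalaise"), ("025127", "Myalgia")))
  val flat  = pairs.flatMapValues(_.split('\t'))
  // => (025126,Chills), (025126,Malaise), (025127,Myalgia)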

RE: Fwd: Sample Spark Program Error

2014-12-31 Thread Kapil Malik
Hi Naveen, Quoting http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext SparkContext is the main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables

Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Sean Owen
-dev, +user A decent guess: Does your 'save' function entail collecting data back to the driver? and are you running this from a machine that's not in your Spark cluster? Then in client mode you're shipping data back to a less-nearby machine, compared to with cluster mode. That could explain the

Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
Also the job was deployed from the master machine in the cluster. On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji eshi...@gmail.com wrote: Oh sorry, that was an edit mistake. The code is essentially: val msgStream = kafkaStream .map { case (k, v) => v}

Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
Oh sorry, that was an edit mistake. The code is essentially: val msgStream = kafkaStream .map { case (k, v) => v} .map(DatatypeConverter.printBase64Binary) .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec]) I.e. there is essentially no original code (I was calling

Re: building spark1.2 meet error

2014-12-31 Thread Jacek Laskowski
Hi, Where does the following path that appears in the logs below come from? /opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar Did you somehow point at the local maven repository that's H:\Soft\Maven? Jacek 31 gru 2014 01:48 j_soft

Re: Why the major.minor version of the new hive-exec is 51.0?

2014-12-31 Thread Michael Armbrust
We actually do publish our own version of this jar, because the version that the hive team publishes is an uber jar and this breaks all kinds of things. As a result I'd file the JIRA against Spark. On Wed, Dec 31, 2014 at 12:55 PM, Ted Yu yuzhih...@gmail.com wrote: Michael:

Re: Why the major.minor version of the new hive-exec is 51.0?

2014-12-31 Thread Ted Yu
I see. I logged SPARK-5041 which references this thread. Thanks On Wed, Dec 31, 2014 at 12:57 PM, Michael Armbrust mich...@databricks.com wrote: We actually do publish our own version of this jar, because the version that the hive team publishes is an uber jar and this breaks all kinds of

Re: UpdateStateByKey persist to Tachyon

2014-12-31 Thread amkcom
bumping this thread up

Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Tathagata Das
What are your spark-submit commands in both cases? Is it Spark Standalone or YARN (both support client and cluster)? Accordingly, what is the number of executors/cores requested? TD On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji eshi...@gmail.com wrote: Also the job was deployed from the master

limit vs sample for indexing a small amount of data quickly?

2014-12-31 Thread Kevin Burton
Is there a limit function which just returns the first N records? Sample is nice but I’m trying to do this so it’s super fast and just to test the functionality of an algorithm. With sample I’d have to compute the % that would yield 1000 results first… Kevin -- Founder/CEO Spinn3r.com

Re: limit vs sample for indexing a small amount of data quickly?

2014-12-31 Thread Fernando O.
There's a take method that might do what you need: def take(num: Int): Array[T]. Take the first num elements of the RDD. On Jan 1, 2015 12:02 AM, Kevin Burton bur...@spinn3r.com wrote: Is there a limit function which just returns the first N records? Sample is nice but I’m trying to
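In code the two options look like this (rdd and the numbers are placeholders):

  val firstThousand = rdd.take(1000)   // Array[T]; scans only as many partitions as needed
  val roughSample   = rdd.sample(withReplacement = false, fraction = 0.01, seed = 42L)   // still an RDD, and the fraction has to be estimated first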

NullPointerException

2014-12-31 Thread rapelly kartheek
Hi, I get the following exception when I submit a Spark application that calculates the frequency of characters in a file. In particular, when I increase the size of the data, I face this problem. Exception in thread Thread-47 org.apache.spark.SparkException: Job aborted due to stage failure: Task

Re: NullPointerException

2014-12-31 Thread Josh Rosen
Which version of Spark are you using? On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I get this following Exception when I submit spark application that calculates the frequency of characters in a file. Especially, when I increase the size of data, I

spark ignoring all memory settings and defaulting to 512MB?

2014-12-31 Thread Kevin Burton
This is really weird and I’m surprised no one has found this issue yet. I’ve spent about an hour or more trying to debug this :-( My spark install is ignoring ALL my memory settings. And of course my job is running out of memory. The default is 512MB so pretty darn small. The worker and

Re: NullPointerException

2014-12-31 Thread rapelly kartheek
spark-1.0.0 On Thu, Jan 1, 2015 at 12:04 PM, Josh Rosen rosenvi...@gmail.com wrote: Which version of Spark are you using? On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I get this following Exception when I submit spark application that calculates

Fwd: NullPointerException

2014-12-31 Thread rapelly kartheek
-- Forwarded message -- From: rapelly kartheek kartheek.m...@gmail.com Date: Thu, Jan 1, 2015 at 12:05 PM Subject: Re: NullPointerException To: Josh Rosen rosenvi...@gmail.com, user@spark.apache.org spark-1.0.0 On Thu, Jan 1, 2015 at 12:04 PM, Josh Rosen rosenvi...@gmail.com

Re: spark ignoring all memory settings and defaulting to 512MB?

2014-12-31 Thread Kevin Burton
Wow. Just figured it out: conf.set("spark.executor.memory", "2g"); I have to set it in the Job… that’s really counterintuitive, especially because the documentation in spark-env.sh says the exact opposite. What’s the resolution here? This seems like a mess. I’d propose a solution to

Re: NullPointerException

2014-12-31 Thread Josh Rosen
It looks like 'null' might be selected as a block replication peer? https://github.com/apache/spark/blob/v1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L786 I know that we fixed some replication bugs in newer versions of Spark (such as

Re: NullPointerException

2014-12-31 Thread rapelly kartheek
Ok. Let me try out on a newer version. Thank you!! On Thu, Jan 1, 2015 at 12:17 PM, Josh Rosen rosenvi...@gmail.com wrote: It looks like 'null' might be selected as a block replication peer?

Re: spark ignoring all memory settings and defaulting to 512MB?

2014-12-31 Thread Ilya Ganelin
Welcome to Spark. What's more fun is that this setting controls memory on the executors, but if you want to set a memory limit on the driver you need to configure it as a parameter of the spark-submit script. You also set num-executors and executor-cores on the spark-submit call. See both the Spark
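Pulling the two halves together, a sketch (the values are illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  // Executor memory can be set in code, but only takes effect if set before the context is created:
  val conf = new SparkConf()
    .setAppName("my-app")
    .set("spark.executor.memory", "2g")
  val sc = new SparkContext(conf)

  // Driver memory, cores and executor count go on the spark-submit call instead, e.g.:
  //   spark-submit --driver-memory 4g --executor-cores 2 --num-executors 10 --class com.example.MyApp my-app.jar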

Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
Hi Tathagata, It's a standalone cluster. The submit commands are: == CLIENT spark-submit --class com.fake.Test \ --deploy-mode client --master spark://fake.com:7077 \ fake.jar arguments == CLUSTER spark-submit --class com.fake.Test \ --deploy-mode cluster --master spark://fake.com:7077 \