Re: Error when Spark streaming consumes from Kafka

2015-02-02 Thread Greg Temchenko
Hi, This doesn't seem to be fixed yet. I filed an issue in JIRA: https://issues.apache.org/jira/browse/SPARK-5505 Greg

Re: how to send JavaDStream RDD using foreachRDD using Java

2015-02-02 Thread Tathagata Das
Hello Sachin, While Akhil's solution is correct, it is not sufficient for your use case. RDD.foreach (which Akhil is using) will run on the workers, but you are creating the Producer object on the driver. This will not work: a Producer created on the driver cannot be used from the worker/executor.
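
A minimal Java sketch of the pattern Tathagata describes, assuming the Kafka 0.8.2 producer API; the broker address, topic, and class name are placeholders. The producer is constructed inside foreachPartition, i.e. on the executor, so it never has to be serialized from the driver:

    import java.util.Iterator;
    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.VoidFunction;
    import org.apache.spark.streaming.api.java.JavaDStream;

    public class DStreamToKafka {
      // Send every record of the stream to Kafka, creating the producer on
      // the executors (inside foreachPartition), never on the driver.
      public static void sendToKafka(JavaDStream<String> lines, final String topic) {
        lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
          @Override
          public Void call(JavaRDD<String> rdd) {
            rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
              @Override
              public void call(Iterator<String> records) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
                props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
                // Created here, on the worker, so it is never serialized.
                KafkaProducer<String, String> producer =
                    new KafkaProducer<String, String>(props);
                while (records.hasNext()) {
                  producer.send(new ProducerRecord<String, String>(topic, records.next()));
                }
                producer.close();
              }
            });
            return null;
          }
        });
      }
    }

One producer per partition (rather than per record) also amortizes the connection setup across all records in that partition.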

Re: Can't find spark-parent when using snapshot build

2015-02-02 Thread Sean Owen
Snapshot builds are not published. Unless you build and install snapshots locally (like with mvn install) they won't be found. On Feb 2, 2015 10:58 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I'm trying to use the master version of spark. I build and install it with $ mvn clean

Re: Error when Spark streaming consumes from Kafka

2015-02-02 Thread Tathagata Das
This is an issue that is hard to resolve without rearchitecting the whole Kafka Receiver. There are some workarounds worth looking into. http://mail-archives.apache.org/mod_mbox/kafka-users/201312.mbox/%3CCAFbh0Q38qQ0aAg_cj=jzk-kbi8xwf+1m6xlj+fzf6eetj9z...@mail.gmail.com%3E On Mon, Feb 2, 2015

Re: Error when Spark streaming consumes from Kafka

2015-02-02 Thread Dibyendu Bhattacharya
Or you can use this Low Level Kafka Consumer for Spark: https://github.com/dibbhatt/kafka-spark-consumer This is now part of http://spark-packages.org/ and has been running successfully for the past few months in Pearson's production environment. Being a low-level consumer, it does not have this re-balancing

Spark impersonation

2015-02-02 Thread Jim Green
Hi Team, Does Spark support impersonation? For example, when Spark runs on YARN/Hive/HBase, etc., which user is used by default? The user who starts the Spark job? Any suggestions related to impersonation? -- Thanks, www.openkb.info (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)

Re: Can't find spark-parent when using snapshot build

2015-02-02 Thread Jaonary Rabarisoa
That's what I did. On Mon, Feb 2, 2015 at 11:28 PM, Sean Owen so...@cloudera.com wrote: Snapshot builds are not published. Unless you build and install snapshots locally (like with mvn install) they wont be found. On Feb 2, 2015 10:58 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all,

how to specify hive connection options for HiveContext

2015-02-02 Thread guxiaobo1982
Hi, I know two options, one for spark-submit and the other for spark-shell, but how do I set them for programs running inside Eclipse? Regards,
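
A minimal sketch of one way to do this programmatically, against the Spark 1.2 Java API: set the Hive options directly on the HiveContext. The metastore URI is a placeholder, and putting a hive-site.xml on the Eclipse project's classpath is an alternative:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.hive.HiveContext;

    public class HiveFromIde {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("hive-from-ide")
            .setMaster("local[*]"); // run directly inside the IDE
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hive = new HiveContext(sc.sc());
        // Point at a remote metastore; host and port are hypothetical.
        hive.setConf("hive.metastore.uris", "thrift://metastore-host:9083");
        System.out.println(hive.sql("SHOW TABLES").collect().length);
        sc.stop();
      }
    }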

Java Kafka Word Count Issue

2015-02-02 Thread Jadhav Shweta
Hi All, I am trying to run the Kafka Word Count program; please find the link below: https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java I have set the Spark master to setMaster("local[*]") and I have

How to define a file filter for file name patterns in Apache Spark Streaming in Java?

2015-02-02 Thread Emre Sevinc
Hello, I'm using Apache Spark Streaming 1.2.0 and trying to define a file filter for file names when creating an InputDStream https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/streaming/dstream/InputDStream.html by invoking the fileStream

Re: Cheapest way to materialize an RDD?

2015-02-02 Thread Raghavendra Pandey
You can also do something like rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => { while (iter.hasNext) iter.next() }) On Sat, Jan 31, 2015 at 5:24 AM, Sean Owen so...@cloudera.com wrote: Yeah, from an unscientific test, it looks like the time to cache the blocks still dominates. Saving
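
The same idea rendered as a Java sketch (class and method names are hypothetical): drain each partition's iterator on the executors without collecting anything to the driver:

    import java.util.Iterator;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.VoidFunction;

    public final class Materialize {
      // Force evaluation of every partition; elements are consumed and discarded.
      public static <T> void materialize(JavaRDD<T> rdd) {
        rdd.foreachPartition(new VoidFunction<Iterator<T>>() {
          @Override
          public void call(Iterator<T> it) {
            while (it.hasNext()) {
              it.next();
            }
          }
        });
      }
    }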

Re: is there a master for spark cluster in ec2

2015-02-02 Thread Robin East
There is a file $SPARK_HOME/conf/spark-env.sh which comes readily configured with the MASTER variable. So if you start pyspark or spark-shell from the ec2 login machine you will connect to the Spark master. On 29 Jan 2015, at 01:11, Mohit Singh mohit1...@gmail.com wrote: Hi, Probably a

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-02-02 Thread Sonal Goyal
That may be the cause of your issue. Take a look at the tuning guide[1] and maybe also profile your application. See if you can reuse your objects. 1. http://spark.apache.org/docs/latest/tuning.html Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co

Re: Java Kafka Word Count Issue

2015-02-02 Thread Jadhav Shweta
Hi Sean, The Kafka producer is working fine; this is related to Spark. How can I configure Spark so that it remembers the count from the beginning? If my log.text file has "spark apache kafka spark", my Spark program gives the correct output of "spark 2, apache 1, kafka 1", but when I

Re: Java Kafka Word Count Issue

2015-02-02 Thread Sean Owen
First I would check your code to see how you are pushing records into the topic. Is it reading the whole file each time and resending all of it? Then see if you are using the same consumer.id on the Spark side. Otherwise you are not reading from the same offset when restarting Spark but instead
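
For reference, a hedged sketch of the consumer side against the Spark 1.2 Java API (ZooKeeper address, group id, and topic are placeholders). With the Kafka high-level consumer, a stable group id is what lets a restarted job resume from the offsets committed by the previous run; auto.offset.reset is only consulted when no committed offset exists:

    import java.util.HashMap;
    import java.util.Map;

    import kafka.serializer.StringDecoder;

    import org.apache.spark.storage.StorageLevel;
    import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class KafkaInput {
      public static JavaPairReceiverInputDStream<String, String> create(
          JavaStreamingContext jssc) {
        Map<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("zookeeper.connect", "zk-host:2181"); // hypothetical quorum
        // Keep the group id stable across restarts to resume from committed offsets.
        kafkaParams.put("group.id", "wordcount-group");
        // Only used when the group has no committed offset yet.
        kafkaParams.put("auto.offset.reset", "smallest");

        Map<String, Integer> topics = new HashMap<String, Integer>();
        topics.put("log-topic", 1); // topic -> receiver thread count

        return KafkaUtils.createStream(jssc, String.class, String.class,
            StringDecoder.class, StringDecoder.class,
            kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER_2());
      }
    }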

Re: java.lang.IllegalStateException: unread block data

2015-02-02 Thread Peng Cheng
I got the same problem; maybe the Java serializer is unstable.

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-02-02 Thread Yifan LI
Thanks, Sonal. But it seems to be an error that happened when “cleaning broadcast”? BTW, what is the timeout of “[30 seconds]”? Can I increase it? Best, Yifan LI On 02 Feb 2015, at 11:12, Sonal Goyal sonalgoy...@gmail.com wrote: That may be the cause of your issue. Take a look at the

Re: Java Kafka Word Count Issue

2015-02-02 Thread VISHNU SUBRAMANIAN
You can use updateStateByKey() to perform the above operation. On Mon, Feb 2, 2015 at 4:29 PM, Jadhav Shweta jadhav.shw...@tcs.com wrote: Hi Sean, Kafka Producer is working fine. This is related to Spark. How can i configure spark so that it will make sure to remember count from the
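
A minimal Java sketch of updateStateByKey() for a running word count (class and method names are hypothetical). Note that stateful operations also require a checkpoint directory, e.g. jssc.checkpoint("/tmp/checkpoint"):

    import java.util.List;

    import com.google.common.base.Optional;

    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    public class RunningCount {
      // Fold each batch's counts into the per-word state kept by Spark.
      public static JavaPairDStream<String, Integer> runningCounts(
          JavaPairDStream<String, Integer> batchCounts) {
        return batchCounts.updateStateByKey(
            new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
              @Override
              public Optional<Integer> call(List<Integer> newValues,
                  Optional<Integer> state) {
                int sum = state.or(0); // previous total, or 0 for a new word
                for (Integer v : newValues) {
                  sum += v;
                }
                return Optional.of(sum);
              }
            });
      }
    }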

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-02-02 Thread Yifan LI
I think this broadcast cleaning (memory block removal?) timeout exception was caused by: 15/02/02 11:48:49 ERROR TaskSchedulerImpl: Lost executor 13 on small18-tap1.common.lip6.fr: remote Akka client disassociated 15/02/02 11:48:49 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent

Re: Spark impersonation

2015-02-02 Thread Koert Kuipers
Yes, jobs run as the user that launched them. If you want to run jobs on a secure cluster, then use YARN; Spark standalone does not support secure Hadoop. On Mon, Feb 2, 2015 at 5:37 PM, Jim Green openkbi...@gmail.com wrote: Hi Team, Does spark support impersonation? For example, when spark

Re: Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-02-02 Thread Aniket Bhatnagar
Alright, I found the issue: I wasn't setting the fs.s3.buffer.dir configuration. Here is the final Spark conf snippet that works: spark.hadoop.fs.s3n.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem, spark.hadoop.fs.s3.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem, spark.hadoop.fs.s3bfs.impl:
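
A sketch of the same settings expressed as a SparkConf; the fs.*.impl values come from the message above, while the buffer directory path is hypothetical:

    import org.apache.spark.SparkConf;

    public class EmrS3Conf {
      public static SparkConf emrS3Conf() {
        return new SparkConf()
            .set("spark.hadoop.fs.s3n.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")
            .set("spark.hadoop.fs.s3.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")
            // Local scratch space for S3 transfers; the path is hypothetical.
            .set("spark.hadoop.fs.s3.buffer.dir", "/mnt/s3-buffer");
      }
    }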

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-02-02 Thread NicolasC
On 01/29/2015 08:31 PM, Ankur Dave wrote: Thanks for the reminder. I just created a PR: https://github.com/apache/spark/pull/4273 Ankur Hello, Thanks for the patch. I applied it to Pregel.scala (in the Spark 1.2.0 sources) and rebuilt Spark. During execution, at the 25th iteration of Pregel,

2GB limit for partitions?

2015-02-02 Thread Michael Albert
Greetings! SPARK-1476 says that there is a 2G limit for blocks. Is this the same as a 2G limit for partitions (or approximately so)? What I had been attempting to do is the following: 1) Start with a moderately large data set (currently about 100GB, but growing). 2) Create about 1,000 files

Is there a way to disable the Spark UI?

2015-02-02 Thread Nirmal Fernando
Hi All, Is there a way to disable the Spark UI? What I really need is to stop the startup of the Jetty server. -- Thanks & regards, Nirmal Senior Software Engineer - Platform Technologies Team, WSO2 Inc. Mobile: +94715779733 Blog: http://nirmalfdo.blogspot.com/

Re: Spark impersonation

2015-02-02 Thread Zhan Zhang
I think you can configure Hadoop/Hive to do impersonation. There is no difference between secure and insecure Hadoop clusters when using kinit. Thanks. Zhan Zhang On Feb 2, 2015, at 9:32 PM, Koert Kuipers ko...@tresata.com wrote: yes jobs run as the user that launched
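
For reference, Hadoop-level impersonation is typically done with a proxy user: the kinit'ed service user runs actions on behalf of an end user, provided core-site.xml whitelists it via the hadoop.proxyuser.* properties. A hedged sketch using the Hadoop UserGroupInformation API (user names are hypothetical):

    import java.security.PrivilegedExceptionAction;

    import org.apache.hadoop.security.UserGroupInformation;

    public class ProxyUserExample {
      public static void main(String[] args) throws Exception {
        // "alice" is a hypothetical end user; the login user is the kinit'ed
        // service account that must be whitelisted for impersonation.
        UserGroupInformation proxy = UserGroupInformation.createProxyUser(
            "alice", UserGroupInformation.getLoginUser());
        proxy.doAs(new PrivilegedExceptionAction<Void>() {
          @Override
          public Void run() throws Exception {
            // HDFS/Hive calls made here execute as "alice".
            return null;
          }
        });
      }
    }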

Scala on Spark functions examples cheatsheet.

2015-02-02 Thread Jim Green
Hi Team, I spent some time over the past two weeks on Scala and tried all the Scala-on-Spark functions in the Spark Programming Guide http://spark.apache.org/docs/1.2.0/programming-guide.html. If you need example code for Scala-on-Spark functions, I created this cheat sheet

Re: Error when Spark streaming consumes from Kafka

2015-02-02 Thread Neelesh
We're planning to use this as well (Dibyendu's https://github.com/dibbhatt/kafka-spark-consumer ). Dibyendu, thanks for the efforts. So far it's working nicely. I think there is merit in making it the default Kafka receiver for Spark streaming. -neelesh On Mon, Feb 2, 2015 at 5:25 PM, Dibyendu

Re: 2GB limit for partitions?

2015-02-02 Thread Sean Owen
The limit is on blocks, not partitions. Partitions have many blocks. It sounds like you are creating very large values in memory, but I'm not sure given your description. You will run into problems if a single object is more than 2GB, of course. More of the stack trace might show what is mapping
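
A minimal illustration of the usual workaround: raise the partition count so each partition, and hence each block, stays well under 2GB. The paths and partition count below are hypothetical:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RepartitionDemo {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "repartition-demo");
        JavaRDD<String> input = sc.textFile("/data/large-input");
        // More, smaller partitions -> smaller blocks per partition.
        input.repartition(1000).saveAsTextFile("/data/output");
        sc.stop();
      }
    }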

Re: Error when Spark streaming consumes from Kafka

2015-02-02 Thread Dibyendu Bhattacharya
Thanks Neelesh. Glad to know this Low Level Consumer is working for you. Dibyendu On Tue, Feb 3, 2015 at 8:06 AM, Neelesh neele...@gmail.com wrote: We're planning to use this as well (Dibyendu's https://github.com/dibbhatt/kafka-spark-consumer ). Dibyendu, thanks for the efforts. So far its

Re: Window comparison matching using the sliding window functionality: feasibility

2015-02-02 Thread nitinkak001
Mine was not really a moving-average problem. It was more like partitioning on some keys, sorting (on different keys), and then running a sliding window through the partition. I reverted back to map-reduce for that (I needed secondary sort, which is not very mature in Spark right now). But, as
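
For what it's worth, Spark 1.2 did add repartitionAndSortWithinPartitions, the usual building block for the partition-then-sort pattern described above: use a composite key, a partitioner that routes by the grouping component only, and a comparator over the full key. A hedged sketch (key types and names are hypothetical):

    import java.io.Serializable;
    import java.util.Comparator;

    import org.apache.spark.Partitioner;
    import org.apache.spark.api.java.JavaPairRDD;

    import scala.Tuple2;

    public class SecondarySort {

      // Route records by the grouping component only, so every record for a
      // group lands in the same partition regardless of the sort component.
      static class GroupPartitioner extends Partitioner {
        private final int partitions;
        GroupPartitioner(int partitions) { this.partitions = partitions; }
        @Override public int numPartitions() { return partitions; }
        @Override public int getPartition(Object key) {
          @SuppressWarnings("unchecked")
          Tuple2<String, Long> k = (Tuple2<String, Long>) key;
          return ((k._1().hashCode() % partitions) + partitions) % partitions;
        }
      }

      // Order by group first, then by the secondary (sort) component.
      static class CompositeComparator
          implements Comparator<Tuple2<String, Long>>, Serializable {
        @Override public int compare(Tuple2<String, Long> a, Tuple2<String, Long> b) {
          int c = a._1().compareTo(b._1());
          return c != 0 ? c : a._2().compareTo(b._2());
        }
      }

      public static JavaPairRDD<Tuple2<String, Long>, String> partitionAndSort(
          JavaPairRDD<Tuple2<String, Long>, String> records) {
        return records.repartitionAndSortWithinPartitions(
            new GroupPartitioner(16), new CompositeComparator());
      }
    }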

Re: Spark streaming - tracking/deleting processed files

2015-02-02 Thread Emre Sevinc
You can utilize the following method: https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream%28java.lang.String,%20scala.Function1,%20boolean,%20scala.reflect.ClassTag,%20scala.reflect.ClassTag,%20scala.reflect.ClassTag%29 It has a parameter:

Re: [hive context] Unable to query array once saved as parquet

2015-02-02 Thread Ayoub
Hi, given the currently open issue https://issues.apache.org/jira/browse/SPARK-5508, I cannot use HiveQL to insert SchemaRDD data into a table if one of the columns is an Array of Struct. Using the Spark API, is it possible to insert a SchemaRDD into an existing and *partitioned* table? The method

Re: Is there a way to disable the Spark UI?

2015-02-02 Thread Nirmal Fernando
Thanks Zhan! Was this introduced in Spark 1.2, or is it available in Spark 1.1? On Tue, Feb 3, 2015 at 11:52 AM, Zhan Zhang zzh...@hortonworks.com wrote: You can set spark.ui.enabled to false to disable the UI. Thanks. Zhan Zhang On Feb 2, 2015, at 8:06 PM, Nirmal Fernando
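
A minimal sketch of the setting Zhan mentions, applied on the SparkConf (the app name is hypothetical; whether the property is honored in 1.1 is worth verifying against that release):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class NoUiApp {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("no-ui-app")
            .setMaster("local[*]")
            .set("spark.ui.enabled", "false"); // skip starting the Jetty web UI
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic ...
        sc.stop();
      }
    }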

Re: Java Kafka Word Count Issue

2015-02-02 Thread Jadhav Shweta
Hi, I added a checkpoint directory and am now using updateStateByKey(): import com.google.common.base.Optional; Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction = new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() { @Override public Optional<Integer>

Loading status

2015-02-02 Thread akhandeshi
I am not sure what the LOADING status means, followed by RUNNING. In the application UI, I see: Executor Summary ExecutorID Worker Cores Memory State Logs 1 worker-20150202144112-hadoop-w-1.c.fi-mdd-poc.internal-3887416 83971 LOADING stdout stderr 0

Re: Loading status

2015-02-02 Thread Ami Khandeshi
Yes. On Monday, February 2, 2015, Mark Hamstra m...@clearstorydata.com wrote: LOADING is just the state in which new Executors are created but before they have everything they need and are fully registered to transition to state RUNNING and begin doing actual work:

Re: Loading status

2015-02-02 Thread Mark Hamstra
LOADING is just the state in which new Executors are created but before they have everything they need and are fully registered to transition to state RUNNING and begin doing actual work:

Can't find spark-parent when using snapshot build

2015-02-02 Thread Jaonary Rabarisoa
Hi all, I'm trying to use the master version of Spark. I build and install it with $ mvn clean install I manage to use it with the following configuration in my build.sbt: libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.3.0-SNAPSHOT" % "provided", "org.apache.spark" %%

Re: Loading status

2015-02-02 Thread Ami Khandeshi
It seems like some sort of Listener/UI error! I say this because I see the status in the executor web UI as LOADING, but in the application UI the status for the same executor is RUNNING! I have also seen the reverse behavior, where the application UI indicates a particular executor as LOADING, but the

Re: how to send JavaDStream RDD using foreachRDD using Java

2015-02-02 Thread Akhil Das
Here you go: JavaDStream<String> textStream = ssc.textFileStream("/home/akhld/sigmoid/"); textStream.foreachRDD(new Function<JavaRDD<String>, Void>() { @Override public Void call(JavaRDD<String> rdd) throws Exception { rdd.foreach(new

Re: Loading status

2015-02-02 Thread Mark Hamstra
Yes, if the Master is unable to register the Executor and transition it to RUNNING, then the Executor will stay in LOADING state, so this can be caused by problems in the Master or the Master-Executor communication. On Mon, Feb 2, 2015 at 9:24 AM, Tushar Sharma tushars...@gmail.com wrote: Yes

Re: Loading status

2015-02-02 Thread Mark Hamstra
Curious. I guess the first question is whether we've got some sort of Listener/UI error, so that the UI is not accurately reflecting the Executor's actual state, or whether the LOADING Executor really is spending a considerable length of time in this "I'm in the process of being created, but not

Re: How to define a file filter for file name patterns in Apache Spark Streaming in Java?

2015-02-02 Thread Akhil Das
Hi Emre, This is how you do that in Scala: val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/akhld/sigmoid", (t: Path) => true, true) In Java you can do something like: jssc.ssc().<LongWritable, Text, SequenceFileInputFormat>fileStream("/home/akhld/sigmoid", new

Re: Is pair rdd join more efficient than regular rdd

2015-02-02 Thread Akhil Das
Yes, it would. You can create a key and then partition by it (say, with a HashPartitioner); joining would then be faster, as all the matching keys go to the same partition. Thanks Best Regards On Sun, Feb 1, 2015 at 5:13 PM, Sunita Arvind sunitarv...@gmail.com wrote: Hi All We are joining large tables
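
A minimal Java sketch of the co-partitioned join Akhil describes (key/value types and the partition count are hypothetical): both sides get the same HashPartitioner and are cached, so the join can match keys partition-locally instead of reshuffling both inputs:

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;

    import scala.Tuple2;

    public class CoPartitionedJoin {
      public static JavaPairRDD<String, Tuple2<Integer, String>> join(
          JavaPairRDD<String, Integer> left, JavaPairRDD<String, String> right) {
        HashPartitioner partitioner = new HashPartitioner(64);
        // Identical partitioners mean matching keys already share a partition.
        JavaPairRDD<String, Integer> l = left.partitionBy(partitioner).cache();
        JavaPairRDD<String, String> r = right.partitionBy(partitioner).cache();
        return l.join(r);
      }
    }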

Questions about Spark standalone resource scheduler

2015-02-02 Thread Shao, Saisai
Hi all, I have some questions about the future development of Spark's standalone resource scheduler. We've heard that some users require multi-tenant support in standalone mode, like multi-user management, resource management and isolation, and whitelisting of users. It seems the current

Re: Questions about Spark standalone resource scheduler

2015-02-02 Thread Patrick Wendell
Hey Jerry, I think standalone mode will still add more features over time, but the goal isn't really for it to become equivalent to what Mesos/YARN are today. Or at least, I doubt Spark Standalone will ever attempt to manage _other_ frameworks outside of Spark and become a general purpose

RE: Questions about Spark standalone resource scheduler

2015-02-02 Thread Shao, Saisai
Hi Patrick, Thanks a lot for your detailed explanation. For now we have these requirements: whitelisting the application submitter, per-user resource (CPU, memory) quotas, and resource allocation in Spark Standalone mode. These are quite specific requirements for production use; generally these problem