Re: Problem with StreamingContext - getting SPARK-2243

2014-12-28 Thread Akhil Das
In the shell you could do: val ssc = new StreamingContext(sc, Seconds(1)) as sc is the SparkContext, which is already instantiated. Thanks Best Regards On Sun, Dec 28, 2014 at 6:55 AM, Thomas Frisk tfris...@gmail.com wrote: Yes you are right - thanks for that :) On 27 December 2014 at
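For reference, a minimal sketch of reusing the shell's SparkContext this way (the socket source and port are only illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Reuse the shell's existing SparkContext (sc); constructing a second SparkContext
    // is what triggers the SPARK-2243 error.
    val ssc = new StreamingContext(sc, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical input source
    lines.print()

    ssc.start()
    ssc.awaitTermination()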

Re: init / shutdown for complex map job?

2014-12-28 Thread Akhil Das
Something like? val a = myRDD.mapPartitions(p => { //Do the init //Perform some operations //Shut it down? }) Thanks Best Regards On Sun, Dec 28, 2014 at 1:53 AM, Kevin Burton bur...@spinn3r.com wrote: I have a job where I want to map over

Re: Compile error from Spark 1.2.0

2014-12-28 Thread Akhil Das
Hi Zigen, Looks like they missed it. Thanks Best Regards On Sat, Dec 27, 2014 at 12:43 PM, Zigen Zigen dbviewer.zi...@gmail.com wrote: Hello, I am zigen. I am using Spark SQL 1.1.0 and want to move to Spark SQL 1.2.0, but my Spark application gets a compile error. Spark 1.1.0 had a

Re: unable to check whether an item is present in RDD

2014-12-28 Thread Amit Behera
Hi Nicholas, The RDD contains only one Iterable[Int]. Pankaj, I used collect and I am getting items: Array[Iterable[Int]]. Then I did: val check = items.take(1).contains(item) and I am getting check: Boolean = false, but the item is present. Thanks Amit On Sun, Dec 28, 2014

Re: unable to check whether an item is present in RDD

2014-12-28 Thread Amit Behera
Hi Sean, I have an RDD like theItems: org.apache.spark.rdd.RDD[Iterable[Int]]. I did val items = theItems.collect // to get it as an array, items: Array[Iterable[Int]], then val check = items.contains(item). Thanks Amit On Sun, Dec 28, 2014 at 1:58 PM, Amit Behera amit.bd...@gmail.com wrote:

Re: unable to check whether an item is present in RDD

2014-12-28 Thread Amit Behera
Hi Nicholas, I am getting error: value contains is not a member of Iterable[Int] On Sun, Dec 28, 2014 at 2:06 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: take(1) will just give you a single item from the RDD. RDDs are not ideal for point lookups like you are doing, but you can

Re: init / shutdown for complex map job?

2014-12-28 Thread Sean Owen
You can't quite do cleanup in mapPartitions in that way. Here is a bit more explanation (farther down): http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/ On Dec 28, 2014 8:18 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Something like? val a =
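A minimal sketch of the pattern that blog post describes, with a hypothetical per-partition resource; because the returned iterator is lazy, the shutdown step has to be chained onto iterator exhaustion rather than written after the map body:

    import org.apache.spark.rdd.RDD

    // Hypothetical stand-ins for a real resource and the per-record work.
    class Conn { def lookup(s: String): String = s.toUpperCase; def close(): Unit = () }
    def openConn(): Conn = new Conn

    def mapWithResource(rdd: RDD[String]): RDD[String] =
      rdd.mapPartitions { iter =>
        val conn = openConn()                      // init: runs once per partition
        iter.map(record => conn.lookup(record)) ++ {
          conn.close()                             // shutdown: runs once the iterator is exhausted
          Iterator.empty
        }
      }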

Re: unable to check whether an item is present in RDD

2014-12-28 Thread Sean Owen
Try instead i.exists(_ == target) On Dec 28, 2014 8:46 AM, Amit Behera amit.bd...@gmail.com wrote: Hi Nicholas, I am getting error: value contains is not a member of Iterable[Int] On Sun, Dec 28, 2014 at 2:06 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: take(1) will just give

Re: unable to check whether an item is present in RDD

2014-12-28 Thread Amit Behera
Hi Sean and Nicholas, Thank you very much, the exists method works here :) On Sun, Dec 28, 2014 at 2:27 PM, Sean Owen so...@cloudera.com wrote: Try instead i.exists(_ == target) On Dec 28, 2014 8:46 AM, Amit Behera amit.bd...@gmail.com wrote: Hi Nicholas, I am getting error: value contains
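For reference, a minimal sketch of the distinction the thread settles on, with illustrative values:

    val items: Array[Iterable[Int]] = Array(Iterable(1, 2, 3))
    val item = 2

    // contains on the Array compares whole elements (Iterable[Int]), so an Int never matches:
    val wrong = items.contains(item)               // false

    // exists on the inner Iterable compares the Ints themselves:
    val check = items.exists(_.exists(_ == item))  // true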

Strange results of running Spark GenSort.scala

2014-12-28 Thread Sam Liu
Hi Experts, I am confused about the input parameters of GenSort.scala and have encountered strange issues. It requires 3 parameters: [num-parts] [records-per-part] [output-path]. Like Hadoop, I think the size of any one row (or record) of the sort file is 100 bytes. So if I want to
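Assuming the 100-bytes-per-record convention does hold, the parameter arithmetic would look roughly like this (figures chosen purely for illustration):

    val recordSize     = 100L                        // bytes per record, as in Hadoop terasort
    val targetBytes    = 1L * 1024 * 1024 * 1024     // e.g. 1 GB of sort input
    val numParts       = 10                          // [num-parts]
    val recordsPerPart = targetBytes / recordSize / numParts
    // recordsPerPart == 1073741                     // [records-per-part]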

Re: init / shutdown for complex map job?

2014-12-28 Thread Ray Melton
A follow-up to the blog cited below was hinted at, per "But Wait, There's More ... To keep this post brief, the remainder will be left to a follow-up post." Is this follow-up pending? Is it sort of pending? Did the follow-up happen, but I just couldn't find it on the web? Regards, Ray. On Sun,

Re: Spark core maven error

2014-12-28 Thread Lalit Agarwal
Hi Please find the complete error - |Downloading: org/apache/spark/spark-core_2.10/1.2.0/spark-core_2.10-1.2.0.pom Error | Resolve error

Re: Playing along at home: recommendations as to system requirements?

2014-12-28 Thread Amy Brown
Thanks Cody. I reported the core dump as an issue on the Spark JIRA and a developer diagnosed it as an openJDK issue. So I switched over to Oracle Java 8 and... no more core dumps on the examples. I reported the openJDK issue at the icedtea Bugzilla. Looks like I'm off and running with Spark on

Re: Long-running job cleanup

2014-12-28 Thread Patrick Wendell
What do you mean when you say the overhead of Spark shuffles starts to accumulate? Could you elaborate more? In newer versions of Spark shuffle data is cleaned up automatically when an RDD goes out of scope. It is safe to remove shuffle data at this point because the RDD can no longer be
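A minimal sketch of what "goes out of scope" means in practice for a long-running driver; the names are illustrative, and the point is only that once no reference to the shuffled RDD remains, its shuffle files become eligible for automatic cleanup:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

    def runBatch(sc: SparkContext, batch: Seq[(String, Int)]): Long = {
      val counts = sc.parallelize(batch).reduceByKey(_ + _)  // produces shuffle data
      counts.count()                                          // materialize a result
    }   // counts goes out of scope here, so its shuffle files can be cleaned up later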

Anaconda iPython notebook working with CDH Spark

2014-12-28 Thread Bin Wang
Hi there, I have a cluster with CDH 5.1 running on top of Red Hat 6.5, where the default Python version is 2.6. I am trying to set up a proper IPython Notebook environment to develop Spark applications using PySpark. Here

Re: init / shutdown for complex map job?

2014-12-28 Thread Sean Owen
(Still pending, but believe it's in progress and being written by a colleague here.) On Sun, Dec 28, 2014 at 2:41 PM, Ray Melton rtmel...@gmail.com wrote: A follow-up to the blog cited below was hinted at, per But Wait, There's More ... To keep this post brief, the remainder will be left to a

Re: Spark core maven error

2014-12-28 Thread Sean Owen
http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-core_2.10%7C1.2.0%7Cjar That checksum is correct for this file, and is what Maven Central reports. I wonder if your repo is corrupted. Try deleting everything under ~/.m2/repository that's related to Spark and let it download

Re: Long-running job cleanup

2014-12-28 Thread Ilya Ganelin
Hi Patrick - is that cleanup present in 1.1? The overhead I am talking about is with regard to what I believe is shuffle-related metadata. If I watch the execution log I see small broadcast variables created for every stage of execution, a few KB at a time, and a certain number of MB remaining

Re: unable to do group by with 1st column

2014-12-28 Thread Michael Albert
Greetings! Thanks for the comment. I have tried several variants of this, as indicated. The code works on small sets, but fails on larger sets. However, I don't get memory errors. I see java.nio.channels.CancelledKeyException and things about lost task and then things like Resubmitting state 1, and

Re: unable to do group by with 1st column

2014-12-28 Thread Sean Owen
One value is at least 12 + 4 + 4 + 12 + 4 = 36 bytes if you factor in object overhead, if my math is right. 60M of them is about 2.1GB for a single key. I could imagine that blowing up an executor that's trying to have one in memory and deserialize another. You won't want to use groupByKey if the
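A minimal sketch of the usual alternative, assuming the downstream computation can be expressed as an incremental combine (an illustrative per-key sum and count here) instead of needing all 60M values of a key in memory at once:

    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x
    import org.apache.spark.rdd.RDD

    // Unlike groupByKey, aggregateByKey keeps only a small running state per key
    // and combines values map-side before the shuffle.
    def sumAndCount(pairs: RDD[(String, Int)]): RDD[(String, (Long, Long))] =
      pairs.aggregateByKey((0L, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1L),   // fold a value into the partition-local state
        (a, b)   => (a._1 + b._1, a._2 + b._2)   // merge states across partitions
      )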

Re: MLlib(Logistic Regression) + Spark Streaming.

2014-12-28 Thread Jeremy Freeman
Along with Xiangrui’s suggestion, we will soon be adding an implementation of Streaming Logistic Regression, which will be similar to the current version of Streaming Linear Regression, and continually update the model as new data arrive (JIRA). Hopefully this will be in v1.3. — Jeremy
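For comparison, a minimal sketch of how the existing streaming linear regression is driven; the streaming logistic regression mentioned above is expected to follow the same trainOn/predictOn shape, and the input streams here are assumptions:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

    // trainingStream and testStream are assumed DStream[LabeledPoint]s built elsewhere,
    // e.g. ssc.textFileStream(dir).map(LabeledPoint.parse)
    val numFeatures = 10   // dimensionality of the feature vectors (illustrative)
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))

    model.trainOn(trainingStream)   // update the model as each new batch arrives
    model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()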

Re: MLlib + Streaming

2014-12-28 Thread Jeremy Freeman
Hi Fernando, There’s currently no streaming ALS in Spark. I’m exploring a streaming singular value decomposition (JIRA) based on this paper (http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf), which might be one way to think about it. There has also been some cool recent work explicitly on

Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-12-28 Thread Harihar Nahak
Yes, I have tried that too. I took the pre-built Spark 1.1 release. If there are upcoming changes to the GraphX library, just let me know, or I can try it on Spark 1.2. --Harihar -- View this message in context:

Re: action progress in ipython notebook?

2014-12-28 Thread Patrick Wendell
Hey Eric, I'm just curious - which specific features in 1.2 do you find most help with usability? This is a theme we're focusing on for 1.3 as well, so it's helpful to hear what makes a difference. - Patrick On Sun, Dec 28, 2014 at 1:36 AM, Eric Friedman eric.d.fried...@gmail.com wrote: Hi

sample is not a member of org.apache.spark.streaming.dstream.DStream

2014-12-28 Thread Josh J
Hi, I'm trying to use sampling with Spark Streaming. I imported the following: import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.SparkContext._ I then call sample: val streamtoread = KafkaUtils.createStream(ssc, zkQuorum, group,

Re: sample is not a member of org.apache.spark.streaming.dstream.DStream

2014-12-28 Thread Sean Owen
The method you're referring to is a method of RDD, not DStream. If you want to do something with a sample of each RDD in the DStream, then call streamtoread.foreachRDD { rdd => val sampled = rdd.sample(...) ... } On Sun, Dec 28, 2014 at 10:44 PM, Josh J joshjd...@gmail.com wrote: Hi, I'm
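If a sampled DStream is wanted rather than a per-RDD side effect, transform does the same thing and returns a DStream; a minimal sketch, with the sampling parameters chosen only for illustration:

    import org.apache.spark.streaming.dstream.DStream

    // sample() lives on RDD, so apply it inside transform to get a sampled DStream back.
    def sampleStream(in: DStream[String]): DStream[String] =
      in.transform(rdd => rdd.sample(withReplacement = false, fraction = 0.1, seed = 42L))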

Re: Using TF-IDF from MLlib

2014-12-28 Thread Yao
I found the TF-IDF feature extraction, and all the MLlib code that works with pure Vector RDDs, very difficult to work with due to the inability to associate a vector back to the original data. Why can't Spark MLlib support LabeledPoint? -- View this message in context:

Re: TF-IDF in Spark 1.1.0

2014-12-28 Thread Yao
Can you show how to do IDF transform on tfWithId? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/TF-IDF-in-Spark-1-1-0-tp16389p20877.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
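A minimal sketch of one way to do it, assuming tfWithId is an RDD of (id, term-frequency vector) pairs: split off the keys, fit and apply IDF to the vectors, then zip the ids back on (IDFModel.transform is a one-to-one map, so the zip lines up):

    import org.apache.spark.mllib.feature.IDF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    def tfidfWithId(tfWithId: RDD[(Long, Vector)]): RDD[(Long, Vector)] = {
      val tf = tfWithId.map(_._2).cache()   // just the TF vectors
      val idfModel = new IDF().fit(tf)      // compute document frequencies
      val tfidf = idfModel.transform(tf)    // rescale each vector by IDF
      tfWithId.map(_._1).zip(tfidf)         // reattach ids; ordering is preserved
    }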

recent join/iterator fix

2014-12-28 Thread Stephen Haberman
Hey, I saw this commit go by, and find it fairly fascinating: https://github.com/apache/spark/commit/c233ab3d8d75a33495298964fe73dbf7dd8fe305 For two reasons: 1) we have a report that is bogging down exactly in a .join with lots of elements, so, glad to see the fix, but, more interesting I

Re: action progress in ipython notebook?

2014-12-28 Thread Eric Friedman
Hi Patrick, I don't mean to be glib, but the fact that it works at all on my cluster (600 nodes) and my data is a novel experience. This is the first release that I haven't had to struggle with and then give up on entirely. I can, for example, finally use HiveContext from PySpark on CDH, at least

Setting up Simple Kafka Consumer via Spark Java app

2014-12-28 Thread suhshekar52
Hello Everyone, Thank you for the time and the help :). My goal here is to get this program working: https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java The only lines I do not have from the example are lines

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-28 Thread Akhil Das
Make sure you verify the following: - Scala version: I think the correct version would be 2.10.x - Spark master URL: be sure that you copied the one displayed in the top left corner of the web UI (running on port 8080) Thanks Best Regards On Mon, Dec 29, 2014 at 12:26 PM, suhshekar52

Spark 1.2.0 Yarn not published

2014-12-28 Thread Aniket Bhatnagar
Hi all I just realized that spark-yarn artifact hasn't been published for 1.2.0 release. Any particular reason for that? I was using it in my yet another spark-job-server project to submit jobs to a YARN cluster through convenient REST APIs (with some success). The job server was creating

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-28 Thread Suhas Shekar
1) Could you please clarify what you mean by checking that the Scala version is correct? In my pom.xml file it is 2.10.4 (which is the same as when I start spark-shell). 2) The Spark master URL is definitely correct as I have run other apps with the same script that use Spark (like a word count

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-28 Thread Akhil Das
Just looked at the pom file that you are using, why are you having different versions in it? <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-streaming-kafka_2.10</artifactId> <version>1.1.1</version> </dependency> <dependency> <groupId>org.apache.spark</groupId>

Re: Spark 1.2.0 Yarn not published

2014-12-28 Thread Ted Yu
See this thread: http://search-hadoop.com/m/JW1q5vd61V1/Spark-yarn+1.2.0subj=Re+spark+yarn_2+10+1+2+0+artifacts Cheers On Dec 28, 2014, at 11:13 PM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi all I just realized that spark-yarn artifact hasn't been published for 1.2.0

Re: Setting up Simple Kafka Consumer via Spark Java app

2014-12-28 Thread Suhas Shekar
I made both versions 1.1.1 and I got the same error. I then tried making both 1.1.0 as that is the version of my Spark Core, but I got the same error. I noticed my Kafka dependency is for Scala 2.9.2, while my spark-streaming-kafka dependency is 2.10.x... I will try changing that next, but don't