Re: Updating Values Inside Foreach Rdd loop

2016-05-06 Thread Ted Yu
Using RDDs requires some 'low level' optimization techniques, while using DataFrames / Spark SQL allows you to leverage existing code. If you can share some more of your use case, that would help other people provide suggestions. Thanks > On May 6, 2016, at 6:57 PM, HARSH TAKKAR

Re: Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-06 Thread nguyen duc tuan
Try to use DataFrames instead of RDDs. Here's an introduction to DataFrames: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html 2016-05-06 21:52 GMT+07:00 pratik gawande : > Thanks Shao for quick reply. I will look
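
A minimal sketch of the DataFrame route described in the linked post; the file path and column names are made up for illustration, and a Spark 1.6 SQLContext built on an existing SparkContext sc is assumed:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Read JSON straight into a DataFrame and let the Catalyst optimizer plan the query
    val df = sqlContext.read.json("people.json")

    df.filter(df("age") > 21)   // declarative filter instead of a hand-written RDD filter
      .groupBy("city")
      .count()
      .show()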

Re: Updating Values Inside Foreach Rdd loop

2016-05-06 Thread HARSH TAKKAR
Hi Ted, I am aware that RDDs are immutable, but in my use case I need to update the same data set after each iteration. Following are the points I was exploring: 1. Generating a new RDD in each iteration (it might use a lot of memory). 2. Using Hive tables and updating the same table after each
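
A sketch of option 1 (a new RDD per iteration), with illustrative names and values; periodic checkpointing truncates the lineage so the DAG and memory footprint don't grow without bound:

    // assumes an existing SparkContext `sc`
    sc.setCheckpointDir("/tmp/checkpoints")            // hypothetical path

    var current = sc.parallelize(1 to 1000000).map(_.toDouble)
    for (i <- 1 to 10) {
      current = current.map(v => v * 0.9 + 1.0)        // hypothetical per-iteration update
      if (i % 5 == 0) {
        current.checkpoint()
        current.count()                                // action to materialize the checkpoint
      }
    }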

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
Thanks for the clarification Michael and good luck with Spark 2.0. It really looks promising. I am especially interested in the adhoc queries aspect. Probably that is what is being referred to as Continuous SQL in the slides. What is the timeframe for availability of this functionality? regards Sunita

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
Agreed. Just sharing what I saw, http://www.slideshare.net/databricks/realtime-spark-from-interactive-queries-to-streaming http://www.slideshare.net/rxin/the-future-of-realtime-in-spark?next_slideshow=3 It claims to support kafka, files and databases. However, continuous SQL will be available in

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Michael Armbrust
That is a forward looking design doc and not all of it has been implemented yet. With Spark 2.0 the main sources and sinks will be file based, though we hope to quickly expand that now that a lot of infrastructure is in place. On Fri, May 6, 2016 at 2:11 PM, Ted Yu wrote:

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Ted Yu
I was reading StructuredStreamingProgrammingAbstractionSemanticsandAPIs-ApacheJIRA.pdf attached to SPARK-8360. On page 12, there was a mention of .format(“kafka”), but I searched the codebase and didn't find any occurrence. FYI On Fri, May 6, 2016 at 1:06 PM, Michael Malak <
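
Purely as an illustration of what the design doc's .format("kafka") could look like in the DataFrameReader-style API — it is not in the codebase at this point, so the option names below are assumptions, and a SparkSession named spark is assumed:

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")   // hypothetical broker list
      .option("subscribe", "my-topic")                   // hypothetical topic
      .load()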

Working out min() and max() values in Spark streaming sliding interval

2016-05-06 Thread Mich Talebzadeh
Hi, I have this code that filters out those prices that are over 99.8 within the sliding window. The code works OK as shown below. Now I need to work out min(price), max(price) and avg(price) in the sliding window. What I need is to have a counter and a method of getting these values. Any
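
One way to get all of these in a single pass over each windowed RDD is aggregate. This is only a sketch, assuming a DStream[Double] named prices and hypothetical window/slide durations (both must be multiples of the batch interval):

    import org.apache.spark.streaming.Seconds

    val windowed = prices.window(Seconds(300), Seconds(10))

    windowed.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // accumulate (count, sum, min, max) in one pass over the window
        val (cnt, sum, mn, mx) = rdd.aggregate((0L, 0.0, Double.MaxValue, Double.MinValue))(
          (acc, p) => (acc._1 + 1, acc._2 + p, math.min(acc._3, p), math.max(acc._4, p)),
          (a, b)   => (a._1 + b._1, a._2 + b._2, math.min(a._3, b._3), math.max(a._4, b._4))
        )
        println(s"count=$cnt min=$mn max=$mx avg=${sum / cnt}")
      }
    }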

Re: killing spark job which is submitted using SparkSubmit

2016-05-06 Thread satish saley
Thank you Anthony. I am clearer on yarn-cluster and yarn-client now. On Fri, May 6, 2016 at 1:05 PM, Anthony May wrote: > Making the master yarn-cluster means that the driver is then running on > YARN not just the executor nodes. It's then independent of your application >

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Michael Malak
At first glance, it looks like the only streaming data sources available out of the box from the github master branch are  https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala  and 

Re: killing spark job which is submitted using SparkSubmit

2016-05-06 Thread Anthony May
Making the master yarn-cluster means that the driver is then running on YARN not just the executor nodes. It's then independent of your application and can only be killed via YARN commands, or if it's batch and completes. The simplest way to tie the driver to your app is to pass in yarn-client as

Re: killing spark job which is submitted using SparkSubmit

2016-05-06 Thread satish saley
Hi Anthony, I am passing --master yarn-cluster --name pysparkexample --executor-memory 1G --driver-memory 1G --conf

Re: killing spark job which is submitted using SparkSubmit

2016-05-06 Thread Anthony May
Greetings Satish, What are the arguments you're passing in? On Fri, 6 May 2016 at 12:50 satish saley wrote: > Hello, > > I am submitting a spark job using SparkSubmit. When I kill my application, > it does not kill the corresponding spark job. How would I kill the >

killing spark job which is submitted using SparkSubmit

2016-05-06 Thread satish saley
Hello, I am submitting a spark job using SparkSubmit. When I kill my application, it does not kill the corresponding spark job. How would I kill the corresponding spark job? I know one way is to use SparkSubmit again with appropriate options. Is there any way through which I can tell SparkSubmit
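
If the job is launched programmatically, one option (a sketch, with hypothetical paths and class names) is Spark 1.6's SparkLauncher handle, which lets the parent application stop or kill the child job directly instead of shelling out again:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")    // hypothetical
      .setMainClass("com.example.MyApp")        // hypothetical
      .setMaster("yarn-client")
      .startApplication()                       // returns a handle instead of a bare Process

    // later, when the parent application shuts down:
    handle.stop()    // ask the job to stop gracefully
    // handle.kill() // or force-kill the child process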

Re: getting NullPointerException while doing left outer join

2016-05-06 Thread Adam Westerman
For anyone interested, the problem ended up being that in some rare cases, the value from the pair RDD on the right side of the left outer join was Java's null. The Spark optionToOptional method attempted to apply Some() to null, which caused the NPE to be thrown. The lesson is to filter out any
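
A small sketch of the fix described above (names and sample data are made up): drop null values from the right-hand pair RDD before joining, so Spark never has to wrap a null in Some():

    import org.apache.spark.rdd.RDD

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right: RDD[(String, String)] = sc.parallelize(Seq(("a", "x"), ("b", null)))

    // remove entries whose value is null before the join
    val cleanedRight = right.filter { case (_, v) => v != null }

    val joined = left.leftOuterJoin(cleanedRight)  // ("a", (1, Some("x"))), ("b", (2, None))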

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Anthony May
Yeah, there isn't even an RC yet and no documentation, but you can work off the code base and test suites: https://github.com/apache/spark And this might help: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/DataFrameReaderWriterSuite.scala On Fri,

CreateProcess error=5, Access is denied when trying SparkLauncher example in Win10

2016-05-06 Thread Augusto Uehara
Hi folks, I'm getting the following exception: Exception in thread "main" java.io.IOException: Cannot run program "E:\Software\spark-1.6.1\bin\spark-submit.cmd": CreateProcess error=5, Access is denied at java.lang.ProcessBuilder.start(Unknown Source) at

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Deepak Sharma
Spark 2.0 is yet to come out for public release. I am waiting to get my hands on it as well. Please do let me know if I can download the source and build Spark 2.0 from GitHub. Thanks Deepak On Fri, May 6, 2016 at 9:51 PM, Sunita Arvind wrote: > Hi All, > > We are evaluating a

Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Sunita Arvind
Hi All, We are evaluating a few real-time streaming query engines and Spark is my personal choice. The addition of adhoc queries is what is getting me further excited about it; however, the talks I have heard so far only mention it but do not provide details. I need to build a prototype to

Re: getting NullPointerException while doing left outer join

2016-05-06 Thread Adam Westerman
Hi Ted, I am working on replicating the problem on a smaller scale. I saw that Spark 2.0 is moving to Java 8 Optional instead of Guava Optional, but in the meantime I'm stuck with 1.6.1. -Adam On Fri, May 6, 2016 at 9:40 AM, Ted Yu wrote: > Is it possible to write a

Re: createDirectStream with offsets

2016-05-06 Thread Cody Koeninger
Look carefully at the error message, the types you're passing in don't match. For instance, you're passing in a message handler that returns a tuple, but the rdd return type you're specifying (the 5th type argument) is just String. On Fri, May 6, 2016 at 9:49 AM, Eric Friedman
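
A sketch of a consistent set of type arguments for the offsets variant of createDirectStream in Spark 1.6 (ssc, kafkaParams and fromOffsets are assumed to be defined already). The fifth type parameter has to match the message handler's return type, a (key, value) tuple here:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc,
      kafkaParams,   // Map[String, String]
      fromOffsets,   // Map[TopicAndPartition, Long]
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
    )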

Re: Sliding Average over Window in Spark Streaming

2016-05-06 Thread Mich Talebzadeh
Hi Matthias, Say you have the following. The "batch interval" is the basic interval at which the system will receive the data in batches: val ssc = new StreamingContext(sparkConf, Seconds(n)) // window length - the duration of the window below, which must be a multiple of the batch interval n
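
A short sketch of how those durations relate (the values are hypothetical, and a DStream named someDStream is assumed to exist):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val n = 2                                          // batch interval in seconds
    val ssc = new StreamingContext(sparkConf, Seconds(n))

    // window length and slide interval must both be multiples of the batch interval
    val windowLength  = Seconds(10 * n)                // how much history each window covers
    val slideInterval = Seconds(2 * n)                 // how often a new window is produced

    val windowed = someDStream.window(windowLength, slideInterval)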

Re: Spark 1.5.2 Shuffle Blocks - running out of memory

2016-05-06 Thread Nirav Patel
Is this currently a limit of Spark shuffle blocks? On Tue, May 3, 2016 at 11:18 AM, Nirav Patel wrote: > Hi, > > My spark application is getting killed abruptly during a groupBy operation > where shuffle happens. All shuffle happens with PROCESS_LOCAL locality. I > see

Sliding Average over Window in Spark Streaming

2016-05-06 Thread Matthias Niehoff
Hi, If I want a sliding average over the last 10 minutes for some keys, I can do something like groupBy(window(…),“my-key“).avg(“some-values“) in Spark 2.0. I am trying to implement this sliding average using Spark 1.6.x: I tried reduceByKeyAndWindow but did not find a solution. IMO I have to
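
One way to express a sliding average per key on Spark 1.6: keep a (sum, count) pair per key with reduceByKeyAndWindow and divide at the end, since averages cannot be combined directly across windows. A sketch, assuming a DStream[(String, Double)] named pairs and hypothetical durations:

    import org.apache.spark.streaming.{Minutes, Seconds}

    val sumAndCount = pairs
      .mapValues(v => (v, 1L))
      .reduceByKeyAndWindow(
        (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2),
        Minutes(10),   // window length
        Seconds(30)    // slide interval
      )

    val avgPerKey = sumAndCount.mapValues { case (sum, count) => sum / count }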

Re: Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-06 Thread pratik gawande
Thanks Shao for the quick reply. I will look into how pyspark jobs are executed. Any suggestions or references to docs on how to tune pyspark jobs? On Thu, May 5, 2016 at 10:12 PM -0700, "Saisai Shao" > wrote: Writing RDD based application

Re: createDirectStream with offsets

2016-05-06 Thread Eric Friedman
My build dependencies: compile 'org.scala-lang:scala-library:2.10.4' compile 'org.apache.spark:spark-core_2.10:1.6.1' compile 'org.apache.spark:spark-sql_2.10:1.6.1' compile 'org.apache.spark:spark-hive_2.10:1.6.1' compile

createDirectStream with offsets

2016-05-06 Thread Eric Friedman
Hello, I've been using createDirectStream with Kafka and now need to switch to the version of that API that lets me supply offsets for my topics. I'm unable to get this to compile for some reason, even if I lift the very same usage from the Spark test suite. I'm calling it like this: val

Reading Shuffle Data from highly loaded nodes

2016-05-06 Thread Alvaro Brandon
Hello everyone: I'm running an experiment in a Spark cluster where some of the machines are highly loaded with CPU-, memory- and network-consuming processes (let's call them straggler machines). Obviously the tasks on these machines take longer to execute than on other nodes of the cluster.

Re: getting NullPointerException while doing left outer join

2016-05-06 Thread Ted Yu
Is it possible to write a short test which exhibits this problem ? For Spark 2.0, this part of code has changed: [SPARK-4819] Remove Guava's "Optional" from public API FYI On Fri, May 6, 2016 at 6:57 AM, Adam Westerman wrote: > Hi, > > I’m attempting to do a left outer

Re: Error Kafka/Spark. Ran out of messages before reaching ending offset

2016-05-06 Thread Cody Koeninger
Yeah, so that means the driver talked to kafka and kafka told it the highest available offset was 2723431. Then when the executor tried to consume messages, it stopped getting messages before reaching that offset. That almost certainly means something's wrong with Kafka, have you looked at your

Spark Web UI issue

2016-05-06 Thread Pietro Gentile
Hi all, I have a spark application running to which I submit jobs continuously. These jobs use different instances of sqlContext, so the web UI of the application fills up more and more with these instances. Is there any way to prevent this? I don't want to see the created SQL contexts in the web

getting NullPointerException while doing left outer join

2016-05-06 Thread Adam Westerman
Hi, I’m attempting to do a left outer join in Spark, and I’m getting an NPE that appears to be due to some Spark Java API bug. (I’m running Spark 1.6.0 in local mode on a Mac). For a little background, the left outer join returns all keys from the left side of the join regardless of whether or

Re: Updating Values Inside Foreach Rdd loop

2016-05-06 Thread Ted Yu
Please see the doc at the beginning of the RDD class: A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such

Reading text file from Amazon S3

2016-05-06 Thread Jinan Alhajjaj
Hello All, I am trying to read text files from Amazon S3. Any solution for this error ASAP: Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: Could not find creator property with name 'id' (in class org.apache.spark.rdd.RDDOperationScope) at [Source:
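
For reference, a minimal sketch of reading from S3 via the s3a connector (bucket, path and credential placeholders are hypothetical, and the hadoop-aws / AWS SDK jars are assumed to be on the classpath); the JsonMappingException above usually points to a Jackson version conflict on the classpath rather than to the read call itself:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3-read"))
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

    val lines = sc.textFile("s3a://my-bucket/path/to/data.txt")
    println(lines.count())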

Updating Values Inside Foreach Rdd loop

2016-05-06 Thread HARSH TAKKAR
Hi, Is there a way I can modify an RDD in a for-each loop? Basically, I have a use case in which I need to perform multiple iterations over the data and modify a few values in each iteration. Please help.

TaskEnd Metrics

2016-05-06 Thread Manivannan Selvadurai
Hi All, I tried to run a simple spark program to find out the metrics collected while executing the program. What I observed is that I'm able to get TaskMetrics.inputMetrics data like records read, bytes read etc., but I do not get any metrics about the output. I ran the below code in
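
A sketch of pulling both input and output metrics from a listener, assuming the Spark 1.6 API (where both metrics are Options); note that output metrics are generally only populated when the task actually writes through a Hadoop output format:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          metrics.inputMetrics.foreach(im =>
            println(s"task ${taskEnd.taskInfo.taskId}: read ${im.recordsRead} records, ${im.bytesRead} bytes"))
          metrics.outputMetrics.foreach(om =>
            println(s"task ${taskEnd.taskInfo.taskId}: wrote ${om.recordsWritten} records, ${om.bytesWritten} bytes"))
        }
      }
    })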

Spark UI only shows lines belonging to py4j lib

2016-05-06 Thread cmbendre
Hi, I am using Spark 1.6.1 for my streaming jobs with the Kinesis connector. Without any extra configuration, the jobs run fine but they only show one line in the UI. I used to see the actual line number in the Python script in the previous version. Please see the screenshot to understand what I mean.

Lost names of struct fields in UDF

2016-05-06 Thread Alexander Chermenin
Hi everybody! This code: DataFrame df = sqlContext.read().json(FILE_NAME); DataFrame profiles = df.select( column("_id"), struct( column("name.first").as("first_name"), column("name.last").as("last_name"), column("friends")

Found Data Quality check package for Spark

2016-05-06 Thread Divya Gehlot
Hi, I just stumbled upon a data quality check package for Spark: https://github.com/FRosner/drunken-data-quality Has anybody used it? Would really appreciate the feedback. Thanks, Divya

Re: Spark structured streaming is Micro batch?

2016-05-06 Thread Sachin Aggarwal
Hi Madhukara, What I understood from the code is that whenever runBatch returns, constructBatch is triggered, so whatever the processing time for a batch is will be your batch time if you don't specify a trigger. One flaw I see in this is that if your processing time keeps increasing with the amount of data

Re: Error Kafka/Spark. Ran out of messages before reaching ending offset

2016-05-06 Thread Guillermo Ortiz
This is the complete error. 2016-05-06 11:18:05,424 [task-result-getter-0] INFO org.apache.spark.scheduler.TaskSetManager - Finished task 5.0 in stage 13.0 (TID 60) in 11692 ms on xx (6/8) 2016-05-06 11:18:08,978 [task-result-getter-1] WARN org.apache.spark.scheduler.TaskSetManager - Lost

Re: Spark structured streaming is Micro batch?

2016-05-06 Thread Deepak Sharma
With Structured Streaming, Spark would provide APIs over the Spark SQL engine. It's like once you have the structured stream and a DataFrame created out of it, you can do ad-hoc querying on the DF, which means you are actually querying the stream without having to store or transform it. I have not used

Re: Error Kafka/Spark. Ran out of messages before reaching ending offset

2016-05-06 Thread Guillermo Ortiz
I think it's a Kafka error, but I'm starting to wonder if it could be something about Elasticsearch, since I have seen more people with the same error using Elasticsearch. I have no idea. 2016-05-06 11:05 GMT+02:00 Guillermo Ortiz : > I'm trying to read data from Spark and

Error Kafka/Spark. Ran out of messages before reaching ending offset

2016-05-06 Thread Guillermo Ortiz
I'm trying to read data from Spark and index it to ES with its library (es-hadoop version 2.2.1). It was working fine for a while but now this error has started to happen. I have deleted the checkpoint and even the Kafka topic and restarted all the machines with Kafka and ZooKeeper, but it didn't

Spark structured streaming is Micro batch?

2016-05-06 Thread madhu phatak
Hi, As I was playing with the new structured streaming API, I noticed that Spark starts processing as and when the data appears. It no longer seems like micro-batch processing. Will Spark structured streaming be event-based processing? -- Regards, Madhukara Phatak http://datamantra.io/

Re: Missing data in Kafka Consumer

2016-05-06 Thread Nirav Shah
I had the same issue while using Storm. Then I found that the number of Storm spout instances should not be greater than the number of partitions; if you increase it beyond that, the numbers did not match. Maybe you can check something similar for Spark. Regards, Nirav On May 5, 2016 9:48 PM, "Jerry"