Re: Performance issue when running Spark-1.6.1 in yarn-client mode with Hadoop 2.6.0

2017-06-06 Thread Jörn Franke
What does your Spark job do? Have you tried the standard configurations and then changing them gradually? Have you checked in the logfiles/UI which tasks take long? 17 million records does not sound like much, but it depends on what you do with them. I do not think that for such a small "cluster" it makes sense to

StructuredStreaming : org.apache.spark.sql.streaming.StreamingQueryException

2017-06-06 Thread aravias
Hi, I have one read stream to consume data from a Kafka topic, and based on an attribute value in each of the incoming messages, I have to write data to either of 2 different locations in S3 (if value1, write to location1; otherwise to location2). On a high level, below is what I have for
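A minimal sketch of one common way to do this (the topic name, S3 paths, checkpoint locations, and the attribute test are illustrative assumptions, not the poster's actual code): read the Kafka source once and start one filtered writeStream per destination.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RouteByAttribute {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("route-by-attribute").getOrCreate()

    // Single Kafka source, shared by both sinks.
    val source = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // One writeStream per destination; each applies its own filter.
    source.filter(col("json").contains("value1")).writeStream
      .format("parquet")
      .option("path", "s3a://bucket/location1")
      .option("checkpointLocation", "s3a://bucket/checkpoints/loc1")
      .start()

    source.filter(!col("json").contains("value1")).writeStream
      .format("parquet")
      .option("path", "s3a://bucket/location2")
      .option("checkpointLocation", "s3a://bucket/checkpoints/loc2")
      .start()

    spark.streams.awaitAnyTermination()
  }
}
```

Each query needs its own checkpoint location, since the two sinks track their Kafka offsets independently.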

Re: Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-06 Thread Tathagata Das
Yes, and in general any mutable data structure. You have to use immutable data structures whose hashCode and equals are consistent enough for them to be put in a set. On Jun 6, 2017 4:50 PM, "swetha kasireddy" wrote: > Are you suggesting against the usage of HashSet? > > On

Re: How to perform clean-up after stateful streaming processes an RDD?

2017-06-06 Thread David Rosenstrauch
Thanks much for the suggestion. Reading over your mail though, I realize that I may not have made something clear: I don't just have a single external service object; the idea is that I have one per executor, so that the code running on each executor accesses the external service independently.

Re: a stage can belong to more than one job please?

2017-06-06 Thread 萝卜丝炒饭
Hi Mark, Thanks. ---Original--- From: "Mark Hamstra" Date: 2017/6/6 23:27:43 To: "dev"; Cc: "user"; Subject: Re: a stage can belong to more than one job please? Yes, a Stage can be part of more than one Job. The

Re: Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-06 Thread swetha kasireddy
Are you suggesting against the usage of HashSet? On Tue, Jun 6, 2017 at 3:36 PM, Tathagata Das wrote: > This may be because HashSet is a mutable data structure, and it seems > you are actually mutating it in "set1 ++ set2". I suggest creating a new > HashMap in

Re: How to perform clean-up after stateful streaming processes an RDD?

2017-06-06 Thread Gerard Maas
It looks like the clean-up should go into the foreachRDD function: stateUpdateStream.foreachRDD { rdd => /* do stuff with the rdd */ stateUpdater.cleanupExternalService() /* should work in this position */ }. Code within foreachRDD(*) executes on the driver, so you can keep the state of
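Gerard's point can be sketched as follows (StateUpdater and its clean-up method are hypothetical names taken from the thread, not a real Spark API): code directly inside foreachRDD runs on the driver once per batch, while per-record work belongs inside foreachPartition, which runs on the executors.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical hook standing in for the poster's external-service wrapper.
trait StateUpdater extends Serializable {
  def cleanupExternalService(): Unit
}

def attachCleanup[T](stream: DStream[T], stateUpdater: StateUpdater): Unit =
  stream.foreachRDD { rdd =>
    // This level runs on the driver, once per batch.
    rdd.foreachPartition { records =>
      // This level runs on the executors; per-record work with the
      // executor-local external service object goes here.
      records.foreach(_ => ())
    }
    // Driver-side clean-up, after the batch's actions have been submitted.
    stateUpdater.cleanupExternalService()
  }
```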

How to perform clean-up after stateful streaming processes an RDD?

2017-06-06 Thread David Rosenstrauch
We have some code we've written using stateful streaming (mapWithState) which works well for the most part. The stateful streaming runs, processes the RDD of input data, calls the state spec function for each input record, and does all proper adding and removing from the state cache. However, I
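As a reference point, a minimal mapWithState pipeline of the kind described might look like this (the key/value types and the counting logic are illustrative, not the poster's code):

```scala
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// The state spec function is invoked once per input record; state.update /
// state.remove perform the adding and removing from the state cache.
def trackCounts(input: DStream[(String, Long)]): DStream[(String, Long)] = {
  val spec = StateSpec.function(
    (key: String, value: Option[Long], state: State[Long]) => {
      val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0L)
      state.update(newCount)   // add/update this key in the state cache
      (key, newCount)          // one output per input record
    })
  input.mapWithState(spec)
}
```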

Re: Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-06 Thread Tathagata Das
This may be because HashSet is a mutable data structure, and it seems you are actually mutating it in "set1 ++set2". I suggest creating a new HashMap in the function (and adding both maps into it), rather than mutating one of them. On Tue, Jun 6, 2017 at 11:30 AM, SRK
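TD's suggestion, as a standalone illustration (the element type is arbitrary): with immutable collections, `++` builds a new set instead of mutating either input, so hashCode and equals stay stable for any set the result is later stored in.

```scala
import scala.collection.immutable.HashSet

// Build a new set from both inputs instead of mutating either one.
// Neither set1 nor set2 is modified by this call.
def mergeSets[A](set1: HashSet[A], set2: HashSet[A]): HashSet[A] =
  set1 ++ set2
```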

Re: Edge Node in Spark

2017-06-06 Thread ayan guha
They are all the same thing. Essentially it means a machine which is not part of the cluster but has all the clients. On Wed, 7 Jun 2017 at 5:48 am, Irving Duran wrote: > Where in the documentation did you find "edge node"? Spark would call it > worker or executor, but not

Re: Adding header to an rdd before saving to text file

2017-06-06 Thread Irving Duran
Not the best option, but I've done this before. If you know the column structure, you could manually write the header to the file before exporting. On Tue, Jun 6, 2017 at 12:39 AM 颜发才(Yan Facai) wrote: > Hi, upendra. > It will be easier to use DataFrame to read/save a csv file with
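Yan's DataFrame suggestion can be sketched like this (the file paths are examples): the csv writer emits the header row for you, so no manual prepending is needed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("csv-header")
  .master("local[*]")
  .getOrCreate()

// Read a csv that has a header row, and write it back out with a header.
val df = spark.read.option("header", "true").csv("input.csv")
df.write.option("header", "true").csv("output_dir")
```

Note that each part file in `output_dir` gets its own header line; coalesce to one partition first if a single headed file is required.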

Re: Edge Node in Spark

2017-06-06 Thread Irving Duran
Where in the documentation did you find "edge node"? Spark would call it worker or executor, but not "edge node". Here is some info about yarn logs -> https://spark.apache.org/docs/latest/running-on-yarn.html. Thank You, Irving Duran On Tue, Jun 6, 2017 at 11:48 AM, Ashok Kumar

Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-06 Thread SRK
Hi, I see the following error when I use ReduceByKeyAndWindow in my Spark Streaming app. I use reduce, invReduce and filterFunction as shown below. Any idea as to why I get the error? java.lang.Exception: Neither previous window has value for key, nor new values found. Are you sure your key
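For context, the inverse-reduce variant being discussed looks roughly like this (the window and slide lengths are illustrative); supplying a filterFunc that drops keys whose count has fallen to zero is the usual way to keep stale empty keys out of the window state:

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def windowedCounts(pairs: DStream[(String, Long)]): DStream[(String, Long)] =
  pairs.reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,   // reduce: values entering the window
    (a: Long, b: Long) => a - b,   // invReduce: values leaving the window
    Seconds(60),                   // window length
    Seconds(10),                   // slide interval
    filterFunc = (kv: (String, Long)) => kv._2 > 0 // drop keys that emptied out
  )
```

The inverse-function variant requires checkpointing to be enabled on the StreamingContext.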

Re: Edge Node in Spark

2017-06-06 Thread Ashok Kumar
Just straight Spark please. Also, if I run a Spark job using Python or Scala on YARN, where are the log files kept on the edge node? Are these under the logs directory for YARN? thanks On Tuesday, 6 June 2017, 14:11, Irving Duran wrote: Ashok, Are you working with

Performance issue when running Spark-1.6.1 in yarn-client mode with Hadoop 2.6.0

2017-06-06 Thread satishjohn
The time taken to complete a Spark job on YARN is 4x slower than in Spark standalone mode. However, in standalone mode jobs often fail with executor-lost issues. Hardware configuration: 32GB RAM, 8 cores (16 threads), 1 TB HDD, 3 nodes (1 master and 2 workers). Spark

Re: a stage can belong to more than one job please?

2017-06-06 Thread Mark Hamstra
Yes, a Stage can be part of more than one Job. The jobIds field of Stage is used repeatedly in the DAGScheduler. On Tue, Jun 6, 2017 at 5:04 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > I read same code of spark about stage. > > The constructor of stage keep the first job ID the stage was

Re: Spark Streaming Job Stuck

2017-06-06 Thread Richard Moorhead
Set your master to local[10]; you are only allocating one core currently. Richard Moorhead, Software Engineer, richard.moorh...@c2fo.com, C2FO: The World's Market for Working Capital®
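The fix, sketched (the app name and batch interval are illustrative): a receiver pins one core for itself, so a local master needs at least two threads, and local[10] leaves plenty for processing.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[10] = run locally with 10 worker threads; with a custom receiver,
// one thread is consumed by the receiver and the rest process batches.
val conf = new SparkConf()
  .setMaster("local[10]")
  .setAppName("streaming-app")
val ssc = new StreamingContext(conf, Seconds(5))
```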

Re: Spark Streaming Job Stuck

2017-06-06 Thread Jain, Nishit
That helped, thanks TD! :D From: Tathagata Das Date: Tuesday, June 6, 2017 at 3:26 AM To: "Jain, Nishit" Cc: "user@spark.apache.org"

problem initiating spark context with pyspark

2017-06-06 Thread Curtis Burkhalter
Hello all, I'm new to Spark and I'm trying to interact with it using Pyspark. I'm using the prebuilt version of spark v. 2.1.1 and when I go to the command line and use the command 'bin\pyspark' I have initialization problems and get the following message: C:\spark\spark-2.1.1-bin-hadoop2.7>

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread Tathagata Das
In either case, an end-to-end exactly-once guarantee can be ensured only if the output sink is updated transactionally. The engine has to re-execute data on failure. An exactly-once guarantee means that the external storage is updated as if each data record was computed exactly once. That's why you
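One common way to make a sink effectively transactional is an idempotent upsert keyed by batch time, sketched below (IdempotentStore is a hypothetical sink, not a Spark API): a re-executed batch then overwrites the same keys instead of appending duplicates.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical external store; upsert must be atomic per key for this to hold.
trait IdempotentStore extends Serializable {
  def upsert(key: String, value: Long): Unit
}

def writeEffectivelyOnce(stream: DStream[(String, Long)], store: IdempotentStore): Unit =
  stream.foreachRDD { (rdd, batchTime) =>
    rdd.foreachPartition { records =>
      records.foreach { case (k, v) =>
        // Keying by batch time makes re-execution overwrite, not duplicate.
        store.upsert(s"${batchTime.milliseconds}:$k", v)
      }
    }
  }
```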

Re: [Spark Structured Streaming] Exception while using watermark with type of timestamp

2017-06-06 Thread Tathagata Das
Cast the timestamp column to a timestamp type, e.g. "CAST(timestamp AS TIMESTAMP)". A watermark can be defined only on columns that are of type timestamp. On Jun 6, 2017 3:06 AM, "Biplob Biswas" wrote: > Hi, > > I am playing around with Spark structured streaming and we have
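TD's fix, sketched (the column name "timestamp" follows the thread; the delay threshold is illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def withEventTimeWatermark(df: DataFrame): DataFrame =
  df.withColumn("timestamp", col("timestamp").cast("timestamp")) // e.g. string -> timestamp
    .withWatermark("timestamp", "10 minutes") // legal once the column is a timestamp
```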

Re: Edge Node in Spark

2017-06-06 Thread Irving Duran
Ashok, Are you working with straight Spark or referring to GraphX? Thank You, Irving Duran On Mon, Jun 5, 2017 at 3:45 PM, Ashok Kumar wrote: > Hi, > > I am a bit confused between edge node, edge server and gateway node in > Spark. > > Do these mean the same

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread ALunar Beach
Thanks TD. In pre-structured streaming, an exactly-once guarantee on input is not provided, is it? On Tue, Jun 6, 2017 at 4:30 AM, Tathagata Das wrote: > This is the expected behavior. There are some confusing corner cases. > If you are starting to play with Spark

a stage can belong to more than one job please?

2017-06-06 Thread 萝卜丝炒饭
Hi all, I read some code of Spark about stages. The constructor of Stage keeps the first job ID the stage was part of. Does that mean a stage can belong to more than one job? And I find that the member jobIds is never used. It looks strange. Thanks in advance

Re: Java SPI jar reload in Spark

2017-06-06 Thread Jonnas Li(Contractor)
I used java.util.ServiceLoader, as the javadoc says it supports reloading. Please point it out if I'm misunderstanding. Regards Jonnas Li From: Alonso Isidoro Roman Date:

Re: Java SPI jar reload in Spark

2017-06-06 Thread Alonso Isidoro Roman
Hi, a quick search on google: https://github.com/spark-jobserver/spark-jobserver/issues/130 Alonso Isidoro Roman, https://about.me/alonso.isidoro.roman 2017-06-06 12:14

Re: Java SPI jar reload in Spark

2017-06-06 Thread Jonnas Li(Contractor)
Thanks for your quick response. These jars are used to define some customized business logic, and they can be treated as plug-ins in our business scenario. The jars are developed/maintained by some third-party partners, which means there will be some version updates. My expectation is to update the

[Spark Structured Streaming] Exception while using watermark with type of timestamp

2017-06-06 Thread Biplob Biswas
Hi, I am playing around with Spark structured streaming and we have a use case to use this as a CEP engine. I am reading from 3 different kafka topics together. I want to perform windowing on this structured stream and then run some queries on this block on a sliding scale. Also, all of this

Re: Java SPI jar reload in Spark

2017-06-06 Thread Jörn Franke
Why do you need jar reloading? What functionality is executed during jar reloading? Maybe there is another way to achieve the same without jar reloading. In fact, it might be dangerous from a functional point of view: functionality in the jar changed and all your computation is wrong. > On 6. Jun

Java SPI jar reload in Spark

2017-06-06 Thread Jonnas Li(Contractor)
I have a Spark Streaming application which dynamically calls a jar (Java SPI), and the jar is called in a mapWithState() function; it was working fine for a long time. Recently, I got a requirement to reload the jar during runtime. But when the reloading is completed, the spark
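For context, ServiceLoader itself never re-reads a jar; "reloading" generally means constructing a fresh ClassLoader over the new jar and loading the services again, roughly as below (BusinessPlugin and the jar path are hypothetical stand-ins for the poster's SPI interface):

```scala
import java.io.File
import java.net.URLClassLoader
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Hypothetical SPI interface; the real one lives on the application classpath.
trait BusinessPlugin { def apply(record: String): String }

def loadPlugins(jarPath: String): Seq[BusinessPlugin] = {
  // A fresh loader over the (possibly updated) jar; the old loader keeps
  // serving any classes still referenced by running closures or state.
  val loader = new URLClassLoader(
    Array(new File(jarPath).toURI.toURL),
    getClass.getClassLoader)
  ServiceLoader.load(classOf[BusinessPlugin], loader).iterator().asScala.toSeq
}
```

A caveat relevant to the reported hang: closures captured at job start (e.g. inside mapWithState) keep referencing classes from the old loader, so plugins typically need to be looked up per batch rather than captured once.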

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread Tathagata Das
This is the expected behavior. There are some confusing corner cases. If you are starting to play with Spark Streaming, I highly recommend learning Structured Streaming instead. On Mon, Jun 5, 2017 at 11:16 AM,

Re: Spark Streaming Job Stuck

2017-06-06 Thread Tathagata Das
http://spark.apache.org/docs/latest/streaming-programming-guide.html#points-to-remember-1 Hope this helps. On Mon, Jun 5, 2017 at 2:51 PM, Jain, Nishit wrote: > I have a very simple spark streaming job running locally in standalone > mode. There is a customer receiver