Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread dev loper
Have you tried rdd.distinct?

On Sun, Nov 13, 2016 at 8:28 AM, Cody Koeninger <c...@koeninger.org> wrote:
> Can you come up with a minimal reproducible example?
>
> Probably unrelated, but why are you doing a union of 3 streams?
>
> On Sat …
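For context, a minimal sketch of what that suggestion could look like inside the streaming job. The stream name unionedStream and the (String, Long) record type are assumptions, not taken from the original code:

    // Deduplicate exact (key, value) duplicates within each 5-second batch,
    // then reduce, assuming a DStream[(String, Long)] named unionedStream.
    unionedStream.foreachRDD { rdd =>
      val deduped = rdd.distinct()             // drops identical pairs
      val reduced = deduped.reduceByKey(_ + _) // one value per key per batch
      reduced.foreach { case (k, v) =>
        println(s"$k -> $v")                   // note: prints on the executors
      }
    }

Note that distinct only removes records that are byte-for-byte identical as pairs; if the "duplicates" differ in any field, they will survive it.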

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread dev loper
…led tasks or other errors? Output actions like foreach aren't exactly once and will be retried on failures.

On Nov 12, 2016 06:36, "dev loper" <spark...@gmail.com> wrote:
> Dear fellow Spark Users,
>
> My Spark Streaming application (Spark 2.0 …
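Cody's point is that foreach runs with at-least-once semantics: a failed-and-retried task re-executes the whole output action, which can look like duplicate results downstream. A common workaround is to make the write idempotent, for example by keying it on something deterministic. A rough sketch under that assumption — the store client and its upsert method are hypothetical placeholders, not a real API:

    // A retried task may re-run this block, so the write must tolerate
    // being executed twice for the same data.
    reduced.foreachRDD { (rdd, time) =>
      rdd.foreachPartition { partition =>
        val store = HypotheticalStore.connect()  // hypothetical client
        partition.foreach { case (key, value) =>
          // Deterministic id: re-writing the same (batch, key) overwrites
          // the earlier attempt instead of adding a duplicate row.
          store.upsert(s"${time.milliseconds}-$key", value)
        }
        store.close()
      }
    }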

Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread dev loper
Dear fellow Spark Users,

My Spark Streaming application (Spark 2.0, on an AWS EMR YARN cluster) listens to campaigns based on live stock feeds, and the batch duration is 5 seconds. The application uses Kafka DirectStream, and there is one stream per feed source, three in total. As given in the code …
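The code itself is cut off in the snippet, but a minimal sketch of the setup as described — three Kafka direct streams unioned, 5-second batches, using the spark-streaming-kafka-0-8 direct stream API that shipped alongside Spark 2.0 — might look like the following. The broker address, topic names, and the per-key count are assumptions:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("CampaignStream")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second batches

    val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // assumption

    // One direct stream per feed source (topic names are assumptions)
    val streams = Seq("feedA", "feedB", "feedC").map { topic =>
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set(topic))
    }

    // Union the three streams, then collapse to one record per key per batch
    val perKey = ssc.union(streams)
      .map { case (key, _) => (key, 1L) }
      .reduceByKey(_ + _)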

Re: (YARN CLUSTER MODE) Where to find logs within Spark RDD processing function ?

2016-04-29 Thread dev loper
… request your help to identify the issue.

On Fri, Apr 29, 2016 at 7:32 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Please use the following syntax:
>
> --conf
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///local/file/log4j.properties"
>
> FYI

(YARN CLUSTER MODE) Where to find logs within Spark RDD processing function ?

2016-04-29 Thread dev loper
Hi Spark Team,

I have asked the same question on Stack Overflow, with no luck yet:
http://stackoverflow.com/questions/36923949/where-to-find-logs-within-spark-rdd-processing-function-yarn-cluster-mode?noredirect=1#comment61419406_36923949

I am running my Spark application on a YARN cluster. No …
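One pattern that usually answers this question is to instantiate the logger inside the closure, so it is created on the executor rather than serialized from the driver. A sketch, assuming log4j 1.x as bundled with Spark at the time; the logger name and the rdd variable are illustrative:

    import org.apache.log4j.Logger

    rdd.foreachPartition { partition =>
      // Created on the executor, so the output lands in that
      // container's stderr/stdout log, not on the driver.
      val log = Logger.getLogger("MyRddProcessor")  // name is illustrative
      partition.foreach { record =>
        log.info(s"processing $record")
      }
    }

In YARN cluster mode those container logs can be fetched after the application finishes with yarn logs -applicationId <application_id>, or browsed live through the YARN ResourceManager UI.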