[structured-streaming] foreachPartition alternative in structured streaming.

2018-05-17 Thread karthikjay
I create a connection object inside foreachPartition. How do I do this in Structured Streaming? I tried the connection pooling approach (where I create a pool of connections on the master node and pass it to worker nodes) here <https://stackoverflow.com/questions/50205650/spark-connection-
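
The usual answer here is Structured Streaming's ForeachWriter, which gives the same per-partition lifecycle that foreachPartition provided: open() runs once per partition on the executor, so the connection never has to be shipped from the driver. A minimal sketch, assuming a streaming DataFrame streamingDF and hypothetical createConnection()/insert() helpers:

    import org.apache.spark.sql.{ForeachWriter, Row}

    val writer = new ForeachWriter[Row] {
      var conn: java.sql.Connection = _

      // Runs once per partition on the executor; returning false skips the partition.
      def open(partitionId: Long, version: Long): Boolean = {
        conn = createConnection() // hypothetical helper returning a java.sql.Connection
        true
      }

      // Runs for every row in the partition.
      def process(row: Row): Unit = insert(conn, row) // hypothetical insert helper

      // Runs when the partition is finished; errorOrNull is null on success.
      def close(errorOrNull: Throwable): Unit = if (conn != null) conn.close()
    }

    streamingDF.writeStream.foreach(writer).start()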

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Liana Napalkova
Spark Dataset / DataFrame has foreachPartition() as well. Its implementation is much more efficient than the RDD's. There are plenty of code snippets, e.g. https://github.com/hdinsight/spark-streaming-data-persi

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Cody Koeninger
Stream to Kafka?

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Liana Napalkova
If there is no other way, then I will follow this recommendation.

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Silvio Fiorito
I need to first read from the Kafka queue into a DataFrame. Then I should perform some transformations with the data. Finally, f

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Timur Shenkao
Spark Dataset / DataFrame has foreachPartition() as well. Its implementation is much more efficient than the RDD's. There are plenty of code snippets, e.g. https://github.com/hdinsight/spark-streaming-data-persistence-examples/blob/master/src/main/scala/com/microsoft/spark/streaming/examples/common

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Liana Napalkova
Why don’t you just use the Kafka sink for Spark 2.2?
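
For reference, a minimal sketch of that suggestion with the Structured Streaming Kafka sink that shipped in Spark 2.2 (the broker address, topic name, and checkpoint path are made up):

    // The Kafka sink expects string/binary "key" and "value" columns.
    val query = df
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092") // made-up broker
      .option("topic", "output-topic")                 // made-up topic
      .option("checkpointLocation", "/tmp/checkpoint") // made-up path
      .start()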

Re: How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Silvio Fiorito
Hi, I wonder how to properly execute `foreachPartition` in Spark 2.2. Below I explain the problem in detail. I appreciate any help. In Spark 1.6 I was d

How to properly execute `foreachPartition` in Spark 2.2

2017-12-18 Thread Liana Napalkova
Hi, I wonder how to properly execute `foreachPartition` in Spark 2.2. Below I explain the problem in detail. I appreciate any help. In Spark 1.6 I was doing something similar to this: DstreamFromKafka.foreachRDD(session => { session.foreachPartition { partitionOfReco
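
The snippet is cut off; the full shape of this Spark 1.6 pattern, per the streaming guide's foreachRDD design patterns, is roughly the sketch below (createNewConnection() and connection.send() are placeholders, not Spark API):

    DstreamFromKafka.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // The connection is created inside the closure, i.e. on the executor,
        // so a non-serializable client never crosses the driver/worker boundary.
        val connection = createNewConnection() // placeholder helper
        partitionOfRecords.foreach(record => connection.send(record)) // placeholder send
        connection.close()
      }
    }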

Re: foreachPartition in Spark Java API

2017-05-30 Thread Anton Kravchenko
<kravchenko.anto...@gmail.com> wrote: Ok, there are at least two ways to do it: Dataset df = spark.read.csv("file:///C:/input_data/*.csv") df.foreachPartition(new ForEachPartFunction()); df.toJavaRDD().foreachPartition(new Void_java_func()); where ForEachPartFu

Re: foreachPartition in Spark Java API

2017-05-30 Thread Anton Kravchenko
Ok, there are at least two ways to do it: Dataset df = spark.read.csv("file:///C:/input_data/*.csv") df.foreachPartition(new ForEachPartFunction()); df.toJavaRDD().foreachPartition(new Void_java_func()); where ForEachPartFunction and Void_java_func are defined below: // ForEachPartFun

foreachPartition in Spark Java API

2017-05-30 Thread Anton Kravchenko
What would be a Java equivalent of the Scala code below? def void_function_in_scala(ipartition: Iterator[Row]): Unit = { var df_rows = ArrayBuffer[String]() for (irow <- ipartition) { df_rows += irow.toString } val df = spark.read.csv("file:///C:/input_data/*.csv")

Re: Foreachpartition in spark streaming

2017-03-20 Thread Ryan
foreachPartition is an action but runs on each worker, which means you won't see anything on the driver. mapPartitions is a transformation which is lazy and won't do anything until an action. Which one is better depends on the specific use case. To output something (like a print on a single machine) you could
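
The reply is truncated; one common way the thought is usually finished (an assumption, not the original text) is to bring results back to the driver:

    // println inside foreachPartition writes to each executor's stdout, not the driver's.
    // To see output on the driver, collect first (only safe for small results):
    rdd.collect().foreach(println)

    // Or keep the heavy work distributed and only return small per-partition values:
    rdd.mapPartitions(iter => Iterator(iter.size)).collect().foreach(println)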

Foreachpartition in spark streaming

2017-03-20 Thread Diwakar Dhanuskodi
Just wanted to clarify! Is foreachPartition in Spark an output operation? Which one is better to use: mapPartitions or foreachPartition? Regards Diwakar

Re: Why does the foreachPartition function make duplicate invocations to the map function for every message? (Spark 2.0.2)

2016-12-16 Thread Cody Koeninger
, which returns the string value from tuple2._2() for JavaDStream as in return tuple2._2(); The returned JavaDStream is then processed by foreachPartition, which is wrapped by foreachRDD. foreachPartition's call function does Iterator on the RDD as in i

Why does the foreachPartition function make duplicate invocations to the map function for every message? (Spark 2.0.2)

2016-12-15 Thread Michael Nguyen
I have the following sequence of Spark Java API calls (Spark 2.0.2): 1. Kafka stream that is processed via a map function, which returns the string value from tuple2._2() for JavaDStream as in return tuple2._2(); 2. The returned JavaDStream is then processed by foreachPartition

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
) partKeyFileRDD = keyFileRDD.repartition(16) Looking again at the UI, this file has 16 partitions now (all on the same executor). When the forEachPartition runs, this then uses these 16 partitions (all on the same executor). I think this is really the problem. I'm not sure why the repartition didn't spread

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
ileRDD = partKeyFileRDD.mapToPair(new SimpleStorageServiceAsset()); The worker then has the following. The issue I believe is that the following log.info statements only appear in the log file for one of my executors (and not both). In other words, when executing the forEachPartition above, Spark a

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Jacek Laskowski
Hi, Could you share the code with foreachPartition? Jacek 11.03.2016 7:33 PM "Darin McBeath" <ddmcbe...@yahoo.com> wrote:

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
I can verify this by looking at the log file for the workers. Since I output logging statements in the object called by the foreachPartition, I can see the statements being logged. Oddly, these output statements only occur in one executor (and not the other). It occurs 16 times

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Jacek Laskowski
Hi, How do you check which executor is used? Can you include a screenshot of the master's webUI with workers? Jacek 11.03.2016 6:57 PM "Darin McBeath" <ddmcbe...@yahoo.com.invalid> wrote:

spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
I've run into a situation where it would appear that foreachPartition is only running on one of my executors. I have a small cluster (2 executors with 8 cores each). When I run a job with a small file (with 16 partitions) I can see that the 16 partitions are initialized but they all appear

Re: How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-18 Thread swetha kasireddy
Looks like I can use mapPartitions, but can it be done using forEachPartition? On Tue, Nov 17, 2015 at 11:51 PM, swetha <swethakasire...@gmail.com> wrote:

Re: How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-18 Thread Sathish Kumaran Vairavelu
I think you can use mapPartitions, which returns pair RDDs, followed by forEachPartition to save them. On Wed, Nov 18, 2015 at 9:31 AM swetha kasireddy <swethakasire...@gmail.com> wrote:
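
A minimal sketch of that suggestion (the keying logic and the save() helper are made up):

    // mapPartitions is a transformation, so it can produce a pair RDD...
    val pairs = rdd.mapPartitions { iter =>
      iter.map(record => (record.hashCode, record)) // made-up keying logic
    }

    // ...and foreachPartition is an action, used purely for the side-effecting save.
    pairs.foreachPartition { iter =>
      iter.foreach { case (k, v) => save(k, v) } // save() is a placeholder
    }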

Re: [SPARK STREAMING] Questions regarding foreachPartition

2015-11-17 Thread Nipun Arora
ion basis, not global ordering. You typically want to acquire resources inside the foreachPartition closure, just before handling the iterator. http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Mon, Nov

How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-17 Thread swetha
Hi, How do I return an RDD of key/value pairs from an RDD that has foreachPartition applied? I have my code something like the following. It looks like foreachPartition on an RDD can only have Unit as its return type. How do I apply foreachPartition to do a save and at the same time return a pair

[SPARK STREAMING] Questions regarding foreachPartition

2015-11-16 Thread Nipun Arora
Hi, I wanted to understand the forEachPartition logic. In the code below, I am assuming the iterator is executing in a distributed fashion. 1. Assuming I have a stream which has timestamp data which is sorted, will the string iterator in foreachPartition process each line in order? 2. Assuming I have

Re: [SPARK STREAMING] Questions regarding foreachPartition

2015-11-16 Thread Cody Koeninger
Ordering would be on a per-partition basis, not global ordering. You typically want to acquire resources inside the foreachPartition closure, just before handling the iterator. http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Mon, Nov

Re: foreachPartition

2015-10-30 Thread Mark Hamstra
The closure is sent to and executed on an Executor, so you need to be looking at the stdout of the Executors, not of the Driver. On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky <alex.nastet...@vervemobile.com> wrote:

foreachPartition

2015-10-30 Thread Alex Nastetsky
I'm just trying to do some operation inside foreachPartition, but I can't even get a simple println to work. Nothing gets printed. scala> val a = sc.parallelize(List(1,2,3)) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:21 scala> a.foreachPartition(p =>

Re: foreachPartition

2015-10-30 Thread Alex Nastetsky
On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky <alex.nastet...@vervemobile.com> wrote: I'm just trying to do some operation inside foreachPartition, but I can't even get a simple println to work. Nothing gets printed. scala> val a = sc.pa

Re: Performance issue with Spark's foreachPartition method

2015-07-27 Thread diplomatic Guru
, which is holding us back from going live. We have a job that carries out computation on log files and writes the results into an Oracle DB. The reducer 'reduceByKey' has been set to a parallelism of 4 as we don't want to establish too many DB connections. We are then calling the foreachPartition

Re: Performance issue with Spark's foreachPartition method

2015-07-24 Thread Bagavath
DB connections. We are then calling the foreachPartition on the RDD pairs that were reduced by the key. Within this foreachPartition method we establish a DB connection, then iterate the results, prepare the Oracle statement for batch insertion, then commit the batch and close the connection

Performance issue with Spark's foreachPartition method

2015-07-22 Thread diplomatic Guru
many DB connections. We are then calling the foreachPartition on the RDD pairs that were reduced by the key. Within this foreachPartition method we establish a DB connection, then iterate the results, prepare the Oracle statement for batch insertion, then commit the batch and close the connection
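
A rough sketch of the pattern being described, assuming plain JDBC and a string key/value pair RDD (the URL and SQL are placeholders):

    import java.sql.DriverManager

    reducedRDD.foreachPartition { rows =>
      // One connection and one prepared statement per partition.
      val conn = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/svc") // placeholder URL
      conn.setAutoCommit(false)
      val stmt = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")  // placeholder SQL
      try {
        rows.foreach { case (k, v) =>
          stmt.setString(1, k)
          stmt.setString(2, v)
          stmt.addBatch()
        }
        stmt.executeBatch() // one round trip for the whole partition
        conn.commit()
      } finally {
        stmt.close()
        conn.close()
      }
    }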

Re: Performance issue with Spark's foreachPartition method

2015-07-22 Thread Robin East
don't want to establish too many DB connections. We are then calling the foreachPartition on the RDD pairs that were reduced by the key. Within this foreachPartition method we establish a DB connection, then iterate the results, prepare the Oracle statement for batch insertion, then we commit

Re: Performance issue with Spark's foreachPartition method

2015-07-22 Thread diplomatic Guru
DB. The reducer 'reduceByKey' has been set to a parallelism of 4 as we don't want to establish too many DB connections. We are then calling the foreachPartition on the RDD pairs that were reduced by the key. Within this foreachPartition method we establish a DB connection, then iterate

RE: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Evo Eftimov
Considering the memory footprint of param as mentioned by Dmitry, option b seems better. Cheers On Wed, Jun 3, 2015 at 6:27 AM, Evo Eftimov evo.efti...@isecc.com wrote: Hmmm a spark streaming app code

RE: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Evo Eftimov
I'm looking at https://spark.apache.org/docs/latest/tuning.html. Basically the takeaway is that all objects passed into the code processing RDD's

Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread dgoldenberg
need to think twice about the costs of serializing such objects, it would seem. In the code below, does the Spark serialization happen before calling foreachRDD or before calling foreachPartition? Param param = new Param(); param.initialize(); messageBodies.foreachRDD(new Function<JavaRDD<
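
The question's code is truncated, but the general rule can be sketched: the body of foreachRDD runs on the driver, while anything referenced inside a foreachPartition closure is serialized with each task. Two common ways around per-task serialization, shown here with the poster's Param class used hypothetically:

    // Option A: construct Param inside the closure, so it is built on the
    // executor rather than serialized from the driver with every task.
    rdd.foreachPartition { iter =>
      val param = new Param() // hypothetical class from the question
      param.initialize()
      iter.foreach(msg => param.process(msg)) // hypothetical per-record work
      param.deinitialize()                    // hypothetical cleanup
    }

    // Option B: a lazy singleton, initialized at most once per executor JVM.
    object ParamHolder {
      lazy val param: Param = { val p = new Param(); p.initialize(); p }
    }
    rdd.foreachPartition(_.foreach(msg => ParamHolder.param.process(msg)))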

Re: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Ted Yu
I'm looking at https://spark.apache.org/docs/latest/tuning.html. Basically the takeaway is that all objects passed into the code processing RDD's must be serializable. So if I've got a few objects that I'd rather initialize once and deinitialize once

Re: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Dmitry Goldenberg
Considering the memory footprint of param as mentioned by Dmitry, option b seems better. Cheers On Wed, Jun 3, 2015 at 6:27 AM, Evo Eftimov evo.efti...@isecc.com

Re: mapPartitions vs foreachPartition

2015-04-20 Thread Archit Thakur
The same difference as between map and foreach: mapPartitions takes an iterator and returns an iterator, while foreachPartition takes an iterator and returns Unit. On Mon, Apr 20, 2015 at 4:05 PM, Arun Patel arunp.bigd...@gmail.com wrote:
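
A small sketch of the distinction:

    val rdd = sc.parallelize(1 to 8, numSlices = 2)

    // mapPartitions: Iterator => Iterator, a lazy transformation producing a new RDD.
    val partitionSums = rdd.mapPartitions(iter => Iterator(iter.sum))
    partitionSums.collect() // Array(10, 26)

    // foreachPartition: Iterator => Unit, an eager action used for side effects.
    rdd.foreachPartition(iter => println(s"partition sum: ${iter.sum}")) // prints on executors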

Re: mapPartitions vs foreachPartition

2015-04-20 Thread Archit Thakur
True. On Mon, Apr 20, 2015 at 4:14 PM, Arun Patel arunp.bigd...@gmail.com wrote: mapPartitions is a transformation and foreachPartition is an action? Thanks Arun

mapPartitions vs foreachPartition

2015-04-20 Thread Arun Patel
What is the difference between mapPartitions and foreachPartition? When should each be used? Thanks, Arun

Re: mapPartitions vs foreachPartition

2015-04-20 Thread Arun Patel
mapPartitions is a transformation and foreachPartition is an action? Thanks Arun On Mon, Apr 20, 2015 at 4:38 AM, Archit Thakur archit279tha...@gmail.com wrote: The same difference as between map and foreach: map takes an iterator and returns an iterator; foreach takes an iterator and returns Unit. On Mon, Apr

Creating RDDs from within foreachPartition() [Spark-Streaming]

2015-02-18 Thread t1ny
anywhere within foreachPartition(). The code above throws a NullPointerException, but if I change it to ssc.sc.makeRDD (where ssc is the StreamingContext object created in the main function, outside of foreachPartition) then I get a NotSerializableException. What is the correct way to do

Re: Creating RDDs from within foreachPartition() [Spark-Streaming]

2015-02-18 Thread Sean Owen
the SparkContext from anywhere within foreachPartition(). The code above throws a NullPointerException, but if I change it to ssc.sc.makeRDD (where ssc is the StreamingContext object created in the main function, outside of foreachPartition) then I get a NotSerializableException. What
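
The underlying rule the truncated answer points at: SparkContext and StreamingContext exist only on the driver, so RDDs cannot be created inside foreachPartition at all; the computation has to be restructured so the context is used driver-side. A sketch, assuming the per-partition data is small enough to collect:

    // Wrong: this closure runs on executors, where no SparkContext exists,
    // so sc is null (NPE) or fails to serialize (NotSerializableException).
    // rdd.foreachPartition(iter => sc.makeRDD(iter.toSeq))

    // Restructured: move the data to the driver first, then build the RDD there.
    val localData = rdd.collect()       // driver side; assumes the data is small
    val derived = sc.makeRDD(localData) // legal: sc is only touched on the driver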

Semantics of foreachPartition()

2014-12-18 Thread Tobias Pfeiffer
(i.e. from the foreach() call). (The result is the same also if I exchange the order.) What exactly is the meaning of foreachPartition and how would I use it correctly? Thanks Tobias

Re: Semantics of foreachPartition()

2014-12-18 Thread Tobias Pfeiffer
Hi again, On Thu, Dec 18, 2014 at 6:43 PM, Tobias Pfeiffer t...@preferred.jp wrote: tmpRdd.foreachPartition(iter => { iter.map(item => { println("xyz: " + item) }) }) Uh, with iter.foreach(...) it works... the reason being apparently that
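
That follow-up is the whole answer: Iterator.map is lazy and only builds a wrapper, so nothing runs until the result is consumed. A quick illustration:

    // Lazy: the mapped iterator is never consumed, so nothing is printed.
    rdd.foreachPartition(iter => iter.map(item => println("xyz: " + item)))

    // Eager: foreach consumes the iterator, so every element prints (to executor stdout).
    rdd.foreachPartition(iter => iter.foreach(item => println("xyz: " + item)))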

NullPointerException on cluster mode when using foreachPartition

2014-12-16 Thread richiesgr
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-on-cluster-mode-when-using-foreachPartition-tp20719.html

Re: NullPointerException on cluster mode when using foreachPartition

2014-12-16 Thread Shixiong Zhu
= dbActorUpdater ! updateDBMessage(r))) There is no problem. I think something is misconfigured. Thanks for help

foreachPartition and task status

2014-10-14 Thread Salman Haq
Hi, In my application, I am successfully using foreachPartition to write large amounts of data into a Cassandra database. What is the recommended practice if the application wants to know that the tasks have completed for all partitions? Thanks, Salman
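
Worth noting (not from the thread itself): because foreachPartition is an action, the call already acts as a completion barrier on the driver. A sketch with a placeholder writeToCassandra helper:

    // Blocks on the driver until every partition's task has finished;
    // a task failure surfaces here as a thrown SparkException instead.
    rdd.foreachPartition(iter => writeToCassandra(iter)) // placeholder helper

    println("all partitions written") // reached only after all tasks completed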

Re: foreachPartition and task status

2014-10-14 Thread Sean McNamara
Are you using spark streaming? On Oct 14, 2014, at 10:35 AM, Salman Haq sal...@revmetrix.com wrote:

Re: foreachPartition and task status

2014-10-14 Thread Salman Haq
On Tue, Oct 14, 2014 at 12:42 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Are you using spark streaming? No, not at this time.

Re: foreachPartition: write to multiple files

2014-10-08 Thread david
Hi, I finally found a solution after reading the post: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-split-RDD-by-key-and-save-to-different-path-td11887.html#a11983