Re: DB Config data update across multiple Spark Streaming Jobs

2021-03-15 Thread forece85
Any suggestions on this? How can we update the configuration data on all executors
without downtime?






DB Config data update across multiple Spark Streaming Jobs

2021-03-13 Thread forece85
Hi,

We have multiple Spark jobs running on a single EMR cluster. All jobs use the
same business-related configuration, which is stored in Postgres. How can we
update this configuration data on all executors dynamically whenever the
Postgres data changes, without restarting the Spark jobs?

We are using Kinesis for streaming. We tried creating a new Kinesis stream
called "cache", pushing a dummy event to it, and processing that event in every
Spark job to refresh the configuration data on all executors, but it is not
working well. Is there a better approach to this problem, or how should this be
implemented correctly?
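
To illustrate what we are after: roughly, an executor-side cache that lazily
re-reads the configuration from Postgres once it is older than some TTL, so
every executor picks up changes without a restart. A minimal sketch (the class,
table and column names and the JDBC URL are only illustrative, not our actual
code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical executor-side config holder: each executor JVM lazily loads the
// configuration from Postgres and re-reads it after a TTL, so changes
// propagate without restarting the streaming jobs.
public final class ConfigCache {
    private static final long TTL_MS = 5 * 60 * 1000L; // refresh at most every 5 minutes
    private static volatile Map<String, String> config;
    private static volatile long loadedAt = 0L;

    private ConfigCache() {}

    public static Map<String, String> get() {
        long now = System.currentTimeMillis();
        if (config == null || now - loadedAt > TTL_MS) {
            synchronized (ConfigCache.class) {
                if (config == null || now - loadedAt > TTL_MS) {
                    config = loadFromPostgres();
                    loadedAt = now;
                }
            }
        }
        return config;
    }

    private static Map<String, String> loadFromPostgres() {
        Map<String, String> fresh = new HashMap<>();
        String sql = "SELECT config_key, config_value FROM business_config"; // illustrative schema
        try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://host:5432/db", "user", "pass");      // placeholder credentials
             PreparedStatement ps = c.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                fresh.put(rs.getString(1), rs.getString(2));
            }
        } catch (SQLException e) {
            throw new RuntimeException("Failed to reload config from Postgres", e);
        }
        return fresh;
    }
}

Executor-side code (for example inside foreachPartition) would call
ConfigCache.get(), so the data is at most TTL_MS stale and no dummy Kinesis
event is needed. Is something like this the recommended direction, or is there
a better approach?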

Thanks in Advance.






Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread forece85
Not sure if Kinesis has that flexibility. What other possibilities are there at
the transformation level?
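
For example, would explicitly keying the stream and repartitioning each batch
with a HashPartitioner be a reasonable transformation-level option? A rough
sketch (the Event type, getEventId() and the partition count are placeholders,
not our actual code):

import org.apache.spark.HashPartitioner;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// 'events' is assumed to be a JavaDStream<Event> produced from the Kinesis stream.
// Key each record by eventId, then hash-partition each batch so that all
// records with the same eventId end up in the same partition (and hence are
// processed by a single task for that batch).
int numPartitions = 24; // e.g. roughly the total number of executor cores

JavaPairDStream<String, Event> keyed =
        events.mapToPair(e -> new Tuple2<>(e.getEventId(), e));

JavaPairDStream<String, Event> partitioned =
        keyed.transformToPair(rdd -> rdd.partitionBy(new HashPartitioner(numPartitions)));

partitioned.foreachRDD(rdd -> rdd.foreachPartition(it -> {
    // every (eventId, event) pair in this partition shares a hash bucket of eventIds
}));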






Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread forece85
Any example for this, please?






Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread forece85
We are doing batch processing using Spark Streaming with Kinesis, with a batch
interval of 5 minutes. We want to send all events with the same eventId to the
same executor within a batch, so that we can perform grouping operations over
multiple events by eventId. No previous or future batch data is involved; we
only need keyed operations within the current batch.
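
To make the requirement concrete, this is roughly the shape of what we want per
batch (Event and getEventId() are placeholders for our actual record type, and
'events' is the deserialized JavaDStream<Event> from Kinesis):

import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// Per batch: key every record by eventId and group, so that all events with
// the same eventId are handled together by one task.
JavaPairDStream<String, Event> byId =
        events.mapToPair(e -> new Tuple2<>(e.getEventId(), e));

byId.groupByKey().foreachRDD(rdd ->
        rdd.foreach(group -> {
            String eventId = group._1();
            Iterable<Event> eventsForId = group._2();
            // multi-event, eventId-based processing for the current batch only
        }));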

Please advise on how we can achieve this.

Thanks.






Spark Streaming - DStreams - Removing RDD

2020-07-27 Thread forece85
We are using Spark Streaming (DStreams) with Kinesis and a batch interval of 10
seconds. For random batches, the processing time becomes very long. While
checking the logs, we found the log lines below whenever we see a spike in
processing time:

 

Processing Time:

 

We don't have any code that manually removes RDDs. How can we get rid of these?

Thanks in Advance.






Re: Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread forece85
Thanks for the reply. Please find pseudocode below. It is a DStream reading
from a Kinesis stream every 10 seconds and, after transformations, pushing into
HBase. Once we have the DStream, we use the code below to repartition and
process it:

dStream = dStream.repartition(javaSparkContext.defaultMinPartitions() * 3);
dStream.foreachRDD(javaRDD -> javaRDD.foreachPartition(partitionOfRecords -> {
    // One HBase connection per partition
    Connection hbaseConnection = ConnectionUtil.getHbaseConnection();
    List listOfRecords = new ArrayList<>();
    while (partitionOfRecords.hasNext()) {
        listOfRecords.add(partitionOfRecords.next());

        // Keep buffering until 10 records are collected (or the partition ends)
        if (listOfRecords.size() < 10 && partitionOfRecords.hasNext())
            continue;

        // Batch-process and bulk-insert the buffered records into HBase
        doJob(listOfRecords, hbaseConnection);
        listOfRecords = new ArrayList<>();
    }
}));


We batch every 10 records and pass them to the doJob method, where we
batch-process and bulk-insert into HBase; a rough sketch of doJob follows below.
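
For context, doJob is essentially a bulk HBase insert along these lines (a
simplified sketch; the Record type, row key and column layout are illustrative,
not our real schema):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Simplified sketch of doJob: turn each buffered record into a Put and write
// the whole 10-record batch to HBase in a single call.
private static void doJob(List<Record> records, Connection hbaseConnection) throws IOException {
    List<Put> puts = new ArrayList<>(records.size());
    for (Record record : records) {
        Put put = new Put(Bytes.toBytes(record.getId()));        // row key (illustrative)
        put.addColumn(Bytes.toBytes("cf"),                       // column family (illustrative)
                      Bytes.toBytes("payload"),                  // qualifier (illustrative)
                      Bytes.toBytes(record.getPayload()));       // value
        puts.add(put);
    }
    try (Table table = hbaseConnection.getTable(TableName.valueOf("events"))) {
        table.put(puts);                                         // one bulk insert per batch
    }
}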

With the code above, is it possible to tell what is happening in Job 1 (the 6
tasks)? And how can we replace the repartition() call efficiently?

Thanks in Advance






Re: Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread forece85
Thanks for the reply. Please find pseudocode below. We are fetching DStreams
from a Kinesis stream every 10 seconds, performing transformations, and finally
persisting to HBase tables using batch insertions.

dStream = dStream.repartition(jssc.defaultMinPartitions() * 3);
dStream.foreachRDD(javaRDD -> javaRDD.foreachPartition(partitionOfRecords -> {
    // One HBase connection per partition
    Connection hbaseConnection = ConnectionUtil.getHbaseConnection();
    List listOfRecords = new ArrayList<>();
    while (partitionOfRecords.hasNext()) {
        try {
            listOfRecords.add(partitionOfRecords.next());

            // Keep buffering until 10 records are collected (or the partition ends)
            if (listOfRecords.size() < 10 && partitionOfRecords.hasNext())
                continue;

            // Batch-transform and bulk-insert the buffered records into HBase
            // (primaryConnection / lookupsConnection setup not shown in this pseudocode)
            doJob(listOfRecords, primaryConnection, lookupsConnection);
            listOfRecords = new ArrayList<>();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}));

We batch every 10 records and send them to the doJob method, where the actual
transformations happen and each batch is bulk-inserted into the HBase table.

With the code above, can we work out what is happening in Job 1 (the 6 tasks)
and how to reduce that time? Mainly, how can we set parallelism effectively
while avoiding the repartition() call? (One idea we had is sketched below.)
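
One direction we are considering (purely a guess on our side, and the values
below are illustrative, not recommendations): since this is a receiver-based
stream, the number of input partitions per batch is driven by
spark.streaming.blockInterval (roughly batch interval / block interval per
receiver), so tuning that together with spark.default.parallelism might remove
the need for an explicit repartition(). For example:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// With a 10s batch and the default 200ms block interval, each receiver yields
// ~50 blocks (= 50 input partitions) per batch. A larger block interval means
// fewer, larger partitions without calling repartition().
SparkConf conf = new SparkConf()
        .setAppName("kinesis-to-hbase")                  // illustrative app name
        .set("spark.streaming.blockInterval", "500ms")   // fewer, larger blocks per batch
        .set("spark.default.parallelism", "75");         // parallelism for any remaining shuffles

JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

Would that be the right knob here, or is repartition() still unavoidable for
spreading the Kinesis data across executors?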

Thanks in Advance.






Spark Streaming - Set Parallelism and Optimize driver

2020-07-20 Thread forece85
I am new to Spark Streaming and trying to understand the Spark UI and make
optimizations.

1. Processing at the executors took less time than at the driver. How can we
optimize to make the driver-side work faster?
2. We are using dstream.repartition(defaultParallelism * 3) to increase
parallelism, which is causing heavy shuffles. Is there any option to avoid
manual repartitioning and reduce data shuffles?
3. We are also trying to understand how the 6 tasks in stage 1 and the 199
tasks in stage 2 were created.

*Hardware configuration:* executor-cores: 3; driver-cores: 3;
dynamicAllocation: true; initial/min/max executors: 25

Stack Overflow link for screenshots:
https://stackoverflow.com/questions/62993030/spark-dstream-help-needed-to-understand-ui-and-how-to-set-parallelism-or-defau

Thanks in Advance


