Re: Happy Diwali everyone!!!

2018-11-07 Thread Dilip Biswal
Thank you Sean. Happy Diwali !!
 
-- Dilip
----- Original message -----
From: Xiao Li
To: "user@spark.apache.org", user
Cc:
Subject: Happy Diwali everyone!!!
Date: Wed, Nov 7, 2018 3:10 PM
Happy Diwali everyone!!!
 
Xiao Li
 





Happy Diwali everyone!!!

2018-11-07 Thread Xiao Li
Happy Diwali everyone!!!

Xiao Li


subscribe

2018-11-07 Thread Vein Kong
subscribe

Re: How to increase the parallelism of Spark Streaming application?

2018-11-07 Thread Shahbaz
Hi,

   - Do you have adequate CPU cores allocated to handle the increased
   partition count? Generally, if the number of Kafka partitions is >=
   (greater than or equal to) the total CPU cores (number of executor
   instances * cores per executor), the reader phase runs at full task
   parallelism.
   - However, if you have too many partitions but not enough cores, it
   will eventually slow down the reader (e.g. 100 partitions and only
   20 total cores).
   - Additionally, the next set of transformations will have their own
   partitioning; if a shuffle is involved, spark.sql.shuffle.partitions
   defines the next level of parallelism. If you don't have any data
   skew, you should get good performance. (A sizing sketch follows
   after this list.)
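
A minimal Scala sketch of the sizing rule above (executor counts and
partition numbers are illustrative assumptions, not recommendations):

// Assumed allocation: 10 executors with 2 cores each
// (e.g. --num-executors 10 --executor-cores 2).
val executorInstances = 10
val coresPerExecutor  = 2
val totalCores        = executorInstances * coresPerExecutor   // 20

// The 100-partition scenario from the second bullet: 100 read tasks on
// 20 cores means about 5 sequential "waves" in the reader phase.
val kafkaPartitions = 100
val waves = math.ceil(kafkaPartitions.toDouble / totalCores).toInt // 5

// Shuffle parallelism for the next stage of a DataFrame/SQL job comes
// from spark.sql.shuffle.partitions (assuming a `spark` session):
// spark.conf.set("spark.sql.shuffle.partitions", (totalCores * 2).toString)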


Regards,
Shahbaz

On Wed, Nov 7, 2018 at 12:58 PM JF Chen  wrote:

> I have a Spark Streaming application which reads data from Kafka and
> saves the transformation result to HDFS.
> My Kafka topic originally has 8 partitions, and I repartition the data
> to 100 to increase the parallelism of the Spark job.
> Now I am wondering: if I increase the Kafka partition number to 100
> instead of repartitioning to 100, will performance be enhanced? (I know
> the repartition action costs a lot of CPU.)
> If I set the Kafka partition number to 100, does it have any negative
> effects?
> I just have one production environment, so it's not convenient for me
> to do the test.
>
> Thanks!
>
> Regards,
> Junfeng Chen
>


How does shuffle operation work in Spark?

2018-11-07 Thread Joe

Hello,
I'm looking for a detailed description of the shuffle operation in
Spark, something that explains the criteria for assigning blocks to
nodes, how many go where, what happens when there are memory
constraints, etc.
If anyone knows of such a document I'd appreciate a link (or a detailed
answer).

Thanks a lot,

Joe
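
Not the internals document being asked for, but a minimal illustration
(assuming spark-shell's built-in `sc`) of where a shuffle boundary
appears and how the partitioner decides which node gets which blocks:

// reduceByKey repartitions rows by key: each reducer partition fetches
// its blocks from every map task's output.
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), 4)

// The partitioner decides which reducer gets which keys:
// partition index = key.hashCode mod 8 (made non-negative).
val reduced = pairs.reduceByKey(new HashPartitioner(8), _ + _)

// The lineage shows the ShuffledRDD boundary between the two stages.
println(reduced.toDebugString)

When shuffle blocks do not fit in execution memory they spill to disk;
the split between execution and storage memory is governed by
spark.memory.fraction.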





Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-07 Thread bsikander
Actually, our job runs fine for 17-18 hours, and this behavior just
suddenly starts happening after that.

We found the following ticket, which describes exactly what is happening
in our Kafka cluster as well:
WARN Failed to send SSL Close message
(org.apache.kafka.common.network.SslTransportLayer)

You also replied to that ticket with a problem very similar to ours.

What fix did you apply to avoid these SSL Close exceptions and the long
delays in the Spark job?
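
A speculative sketch (not a confirmed fix from this thread): if these
warnings come from idle connections being reaped, raising the client-side
idle timeout may reduce them. connections.max.idle.ms is a standard Kafka
client setting (default 540000 ms); the broker address and the value below
are assumptions.

// Kafka consumer params with a longer idle timeout.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"       -> "broker:9092",  // placeholder
  "connections.max.idle.ms" -> "3600000"       // assumed: 1 hour
)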






DB2 Sequence - Error while invoking

2018-11-07 Thread ☼ R Nair
Hi all,

We are trying to call a DB2 sequence through Spark and assign that value
to one of the columns (the PK) in a table. We are getting the issue below:

SEQ: CITI_VENDOR_UNITED_LIST_TARGET_SEQ
Table: CITI_VENDOR_UNITED_LIST_TARGET
DB: CITIVENDORS
Host: CIT_XX
Port: 42194
Schema: MINE

DB2 SQL ERROR: SQLCODE = 348, SQLSTATE= 428F9 SQLERRMC=NEXTVAL FOR
MINE.CITI_VENDOR_UNITED_LIST_TARGET_SEQ

Is this error because a SEQUENCE is not a deterministic column? I can
query it using plain SQL, but not from my Spark code. The Spark code is
the usual kind, with connection properties, using spark.read.jdbc.
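
One plausible cause, offered as a hedged guess: Spark's JDBC source wraps
whatever is passed as dbtable in a subquery, and DB2 rejects NEXT VALUE FOR
inside nested subselects, which is roughly what SQLCODE -348 complains
about. A sketch of one workaround, reading the table without the sequence
and generating a surrogate key in Spark (the URL pieces come from the
details above; user, password, and the column name are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder().appName("db2-seq-workaround").getOrCreate()

// Read the table without touching the sequence.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://CIT_XX:42194/CITIVENDORS")
  .option("dbtable", "MINE.CITI_VENDOR_UNITED_LIST_TARGET")
  .option("user", "<user>")          // placeholder
  .option("password", "<password>")  // placeholder
  .load()

// Surrogate PK generated in Spark: unique, but not consecutive the way
// a DB2 sequence would be.
val withPk = df.withColumn("pk", monotonically_increasing_id())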

Thanks and regards,
Ravion


Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-07 Thread Biplob Biswas
Hi,

This has to do with your batch duration and processing time: as a rule,
the processing time must stay below the batch duration. As I can see
from your screenshots, your batch duration is 10 seconds but your
processing time is mostly more than a minute; this adds up and you end
up with a lot of scheduling delay.

Maybe look into why it takes 1 minute to process 100 records and fix the
logic. Also, I see some batches with a higher number of events that take
a lower amount of processing time. Fix the code logic and this should be
resolved. (A minimal sketch of the batch-interval rule follows below.)
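
A minimal sketch of that rule (the batch interval and the stand-in source
are illustrative assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Every micro-batch must finish, on average, within the batch interval
// (10 s here), or scheduling delay grows without bound.
val conf = new SparkConf()
  .setAppName("scheduling-delay-demo")
  // Optional safety valve while the job logic is being fixed: let Spark
  // throttle ingest to the rate recent batches could sustain.
  .set("spark.streaming.backpressure.enabled", "true")

val ssc = new StreamingContext(conf, Seconds(10)) // 10 s batch interval

// Stand-in source so the sketch runs; replace with the Kafka stream.
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()

ssc.start()
ssc.awaitTermination()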

Thanks & Regards
Biplob Biswas


On Wed, Nov 7, 2018 at 11:08 AM bsikander  wrote:

> We are facing an issue with very long scheduling delays in Spark (up to
> 1+ hours).
> We are using Spark-standalone. The data is being pulled from Kafka.
>
> Any help would be much appreciated.
>
> I have attached the screenshots.
> http://apache-spark-user-list.1001560.n3.nabble.com/file/t8018/1-stats.png


[Spark-Core] Long scheduling delays (1+ hour)

2018-11-07 Thread bsikander
We are facing an issue with very long scheduling delays in Spark (up to
1+ hours).
We are using Spark-standalone. The data is being pulled from Kafka.

Any help would be much appreciated.

I have attached the screenshots.
http://apache-spark-user-list.1001560.n3.nabble.com/file/t8018/1-stats.png











Re: How to increase the parallelism of Spark Streaming application?

2018-11-07 Thread vincent gromakowski
On the other hand, increasing parallelism via Kafka partitions avoids the
shuffle that Spark's repartition would incur (see the sketch below).
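
A hedged sketch of that trade-off with the 0.10 direct stream, where RDD
partitions map one-to-one onto Kafka topic partitions (broker address,
group id, and topic name are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val ssc = new StreamingContext(
  new SparkConf().setAppName("direct-stream-parallelism"), Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group"
)

// With a 100-partition topic, each batch RDD already has 100 partitions,
// so no .repartition(100), and hence no shuffle, is needed before the
// transformation.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topic100"), kafkaParams))

stream.foreachRDD { rdd =>
  // rdd.getNumPartitions == number of Kafka partitions read in this batch
  println(s"partitions: ${rdd.getNumPartitions}")
}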

On Wed, Nov 7, 2018 at 9:51 AM, Michael Shtelma wrote:

> If you configure too many Kafka partitions, you can run into memory
> issues. This will increase the memory requirements of the Spark job a lot.
>
> Best,
> Michael
>
>
> On Wed, Nov 7, 2018 at 8:28 AM JF Chen  wrote:
>
>> I have a Spark Streaming application which reads data from Kafka and
>> saves the transformation result to HDFS.
>> My Kafka topic originally has 8 partitions, and I repartition the data
>> to 100 to increase the parallelism of the Spark job.
>> Now I am wondering: if I increase the Kafka partition number to 100
>> instead of repartitioning to 100, will performance be enhanced? (I know
>> the repartition action costs a lot of CPU.)
>> If I set the Kafka partition number to 100, does it have any negative
>> effects?
>> I just have one production environment, so it's not convenient for me
>> to do the test.
>>
>> Thanks!
>>
>> Regards,
>> Junfeng Chen
>>
>


Re: How to increase the parallelism of Spark Streaming application?

2018-11-07 Thread Michael Shtelma
If you configure too many Kafka partitions, you can run into memory
issues: this will increase the memory requirements of the Spark job a lot
(a note on one concrete source of that overhead follows below).
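
One concrete source of that overhead, as a hedged note: the 0.10 direct
stream caches one KafkaConsumer per topic-partition per executor, so a
100-partition topic can mean up to 100 cached consumers (plus their fetch
buffers) spread across the executors. The cache is tunable; the values
below are illustrative, not recommendations:

import org.apache.spark.SparkConf

// Consumer-cache sizing for spark-streaming-kafka-0-10; the defaults are
// 16 (initial) and 64 (max). Raising the max avoids eviction churn with
// many partitions, at the cost of more heap per executor.
val conf = new SparkConf()
  .set("spark.streaming.kafka.consumer.cache.initialCapacity", "16")
  .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")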

Best,
Michael


On Wed, Nov 7, 2018 at 8:28 AM JF Chen  wrote:

> I have a Spark Streaming application which reads data from Kafka and
> saves the transformation result to HDFS.
> My Kafka topic originally has 8 partitions, and I repartition the data
> to 100 to increase the parallelism of the Spark job.
> Now I am wondering: if I increase the Kafka partition number to 100
> instead of repartitioning to 100, will performance be enhanced? (I know
> the repartition action costs a lot of CPU.)
> If I set the Kafka partition number to 100, does it have any negative
> effects?
> I just have one production environment, so it's not convenient for me
> to do the test.
>
> Thanks!
>
> Regards,
> Junfeng Chen
>