Query regarding kafka version

2022-01-04 Thread Renu Yadav
Hi Team,

I am using Spark 2.2, so can I use Kafka version 2.5 in my Spark Streaming
application?

Thanks & Regards,
Renu Yadav


Re: Updating spark-env.sh per application

2021-05-08 Thread Renu Yadav
Hi Mich,

In spark-env.sh, SPARK_DIST_CLASSPATH is set. I want to override this
variable at runtime, as I want to exclude one library's classes from it.
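
For reference, a hedged Scala sketch of one per-application alternative, assuming the goal is to control the classpath of a single job rather than to edit spark-env.sh globally; the jar paths are placeholders, and these settings prepend entries rather than removing anything from SPARK_DIST_CLASSPATH itself.

// Hedged sketch: per-application classpath via Spark confs (paths are placeholders).
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object PerAppClasspath {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // Executors start after this point, so their classpath can be set programmatically.
      .set("spark.executor.extraClassPath", "/opt/app-libs/*")
      // The driver JVM is already running here in client mode, so the driver-side
      // entry is usually passed on spark-submit instead:
      //   --conf spark.driver.extraClassPath=/opt/app-libs/*

    val spark = SparkSession.builder().config(conf).appName("per-app-classpath").getOrCreate()
    spark.stop()
  }
}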



On Fri, 7 May, 2021, 6:51 pm Mich Talebzadeh, 
wrote:

> Hi,
>
> Environment variables are read in when spark-submit kicks off. What exactly
> do you need to refresh at the application level?
>
> HTH
>
> On Fri, 7 May 2021 at 11:34, Renu Yadav  wrote:
>
>>   Hi Team,
>>
>> Is it possible to override a variable from spark-env.sh at the application
>> level?
>>
>> Thanks & Regards,
>> Renu Yadav
>>
>>
>> On Fri, May 7, 2021 at 12:16 PM Renu Yadav  wrote:
>>
>>> Hi Team,
>>>
>>> Is it possible to override a variable from spark-env.sh at the application
>>> level?
>>>
>>> Thanks & Regards,
>>> Renu Yadav
>>>
>>> --
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Updating spark-env.sh per application

2021-05-07 Thread Renu Yadav
  Hi Team,

Is it possible to override a variable from spark-env.sh at the application
level?

Thanks & Regards,
Renu Yadav


On Fri, May 7, 2021 at 12:16 PM Renu Yadav  wrote:

> Hi Team,
>
> Is it possible to override a variable from spark-env.sh at the application
> level?
>
> Thanks & Regards,
> Renu Yadav
>
>


Spark streaming giving error for version 2.4

2021-03-15 Thread Renu Yadav
Hi Team,


I have upgraded my Spark Streaming application from 2.2 to 2.4 but am getting the error below:


spark-streaming-kafka-0-10_2.11, version 2.4.0

Scala 2.11


Any Idea?



Exception in thread "main" java.lang.AbstractMethodError

at org.apache.spark.util.ListenerBus$class.$init$(ListenerBus.scala:34)
at org.apache.spark.streaming.scheduler.StreamingListenerBus.<init>(StreamingListenerBus.scala:30)
at org.apache.spark.streaming.scheduler.JobScheduler.<init>(JobScheduler.scala:57)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:184)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:85)
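
For what it's worth, an AbstractMethodError in ListenerBus typically points at binary-incompatible Spark jars on the classpath. Below is a small build.sbt sketch, assuming sbt is used, that keeps the streaming-kafka artifact in lockstep with the Spark and Scala versions; the exact version numbers are illustrative.

// build.sbt sketch (assumes sbt): keep every Spark artifact on the same version.
val sparkVersion = "2.4.0"

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                 % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming"            % sparkVersion % "provided",
  // Must match the Spark/Scala versions above; classes compiled against an
  // older ListenerBus can otherwise fail with AbstractMethodError at runtime.
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
)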


Thanks & Regards,

Renu Yadav


Re: How to upgrade kafka client in spark_streaming_kafka 2.2

2021-03-12 Thread Renu Yadav
Ok, thanks for the clarification.
I will try to migrate my project to Structured Streaming.

Regards,
Renu
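
For context, a minimal Scala sketch of what such a migration to Structured Streaming might look like; the broker, topic, output path, and checkpoint location are placeholders, and the spark-sql-kafka-0-10 dependency is assumed to be on the classpath.

// Hedged sketch of a Structured Streaming Kafka source (placeholders throughout).
import org.apache.spark.sql.SparkSession

object KafkaStructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-structured-streaming").getOrCreate()

    // Read from Kafka as an unbounded DataFrame instead of a DStream.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "my_topic")
      .option("startingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Write out continuously; the checkpoint location tracks Kafka offsets.
    val query = stream.writeStream
      .format("parquet")
      .option("path", "/data/out")
      .option("checkpointLocation", "/data/checkpoints/kafka-job")
      .start()

    query.awaitTermination()
  }
}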


On Fri, Mar 12, 2021 at 7:38 PM Gabor Somogyi 
wrote:

> Mainly bugfixes and no breaking changes, AFAIK.
>
> As a side note, there were intentions to close down DStreams and discontinue
> them as-is.
> It hasn't happened yet, but it's on the road, so I strongly recommend
> migrating to Structured Streaming...
> We simply can't support 2 streaming engines for a huge amount of time.
>
> G
>
>
> On Fri, Mar 12, 2021 at 3:02 PM Renu Yadav  wrote:
>
>> Hi Gabor,
>>
>> It seems like it is better to upgrade my Spark version.
>>
>> Are there major changes in terms of streaming from Spark 2.2 to Spark 2.4?
>>
>> PS: I am using the KafkaUtils API to create the stream.
>>
>> Thanks & Regards,
>> Renu yadav
>>
>> On Fri, Mar 12, 2021 at 7:25 PM Renu Yadav  wrote:
>>
>>> Thanks Gabor,
>>> This is  very useful.
>>>
>>> Regards,
>>> Renu Yadav
>>>
>>> On Fri, Mar 12, 2021 at 5:36 PM Gabor Somogyi 
>>> wrote:
>>>
>>>> A Kafka client upgrade is not a trivial change, and it may or may not work,
>>>> since new versions can contain incompatible API and/or behavior changes.
>>>> I've collected how Spark evolved in terms of the Kafka client and gathered
>>>> the breaking changes there to make our lives easier.
>>>> Have a look, and based on that you can make your choice:
>>>> https://gist.github.com/gaborgsomogyi/3476c32d69ff2087ed5d7d031653c7a9
>>>>
>>>> As a general suggestion, it would be best to upgrade Spark itself, because
>>>> we've added many fixes for issues one can otherwise face...
>>>>
>>>> Hope this helps!
>>>>
>>>> G
>>>>
>>>>
>>>> On Fri, Mar 12, 2021 at 9:45 AM Renu Yadav  wrote:
>>>>
>>>>> Hi Team,
>>>>> I am using Spark 2.2 and spark-streaming-kafka 2.2, which points to
>>>>> kafka-clients 0.10. How can I upgrade the Kafka client to Kafka 2.2.0?
>>>>>
>>>>> Thanks & Regards,
>>>>> Renu Yadav
>>>>>
>>>>


Re: How to upgrade kafka client in spark_streaming_kafka 2.2

2021-03-12 Thread Renu Yadav
Hi Gabor,

It seems like it is better to upgrade my Spark version.

Are there major changes in terms of streaming from Spark 2.2 to Spark 2.4?

PS: I am using the KafkaUtils API to create the stream.

Thanks & Regards,
Renu yadav

On Fri, Mar 12, 2021 at 7:25 PM Renu Yadav  wrote:

> Thanks Gabor,
> This is  very useful.
>
> Regards,
> Renu Yadav
>
> On Fri, Mar 12, 2021 at 5:36 PM Gabor Somogyi 
> wrote:
>
>> A Kafka client upgrade is not a trivial change, and it may or may not work,
>> since new versions can contain incompatible API and/or behavior changes.
>> I've collected how Spark evolved in terms of the Kafka client and gathered
>> the breaking changes there to make our lives easier.
>> Have a look, and based on that you can make your choice:
>> https://gist.github.com/gaborgsomogyi/3476c32d69ff2087ed5d7d031653c7a9
>>
>> As a general suggestion, it would be best to upgrade Spark itself, because
>> we've added many fixes for issues one can otherwise face...
>>
>> Hope this helps!
>>
>> G
>>
>>
>> On Fri, Mar 12, 2021 at 9:45 AM Renu Yadav  wrote:
>>
>>> Hi Team,
>>> I am using Spark 2.2 and spark-streaming-kafka 2.2, which points to
>>> kafka-clients 0.10. How can I upgrade the Kafka client to Kafka 2.2.0?
>>>
>>> Thanks & Regards,
>>> Renu Yadav
>>>
>>


Re: How to upgrade kafka client in spark_streaming_kafka 2.2

2021-03-12 Thread Renu Yadav
Thanks Gabor,
This is  very useful.

Regards,
Renu Yadav

On Fri, Mar 12, 2021 at 5:36 PM Gabor Somogyi 
wrote:

> A Kafka client upgrade is not a trivial change, and it may or may not work,
> since new versions can contain incompatible API and/or behavior changes.
> I've collected how Spark evolved in terms of the Kafka client and gathered
> the breaking changes there to make our lives easier.
> Have a look, and based on that you can make your choice:
> https://gist.github.com/gaborgsomogyi/3476c32d69ff2087ed5d7d031653c7a9
>
> As a general suggestion, it would be best to upgrade Spark itself, because
> we've added many fixes for issues one can otherwise face...
>
> Hope this helps!
>
> G
>
>
> On Fri, Mar 12, 2021 at 9:45 AM Renu Yadav  wrote:
>
>> Hi Team,
>> I am using Spark 2.2 and spark-streaming-kafka 2.2, which points to
>> kafka-clients 0.10. How can I upgrade the Kafka client to Kafka 2.2.0?
>>
>> Thanks & Regards,
>> Renu Yadav
>>
>


How to upgrade kafka client in spark_streaming_kafka 2.2

2021-03-12 Thread Renu Yadav
Hi Team,
I am using Spark 2.2 and spark-streaming-kafka 2.2, which points to
kafka-clients 0.10. How can I upgrade the Kafka client to Kafka 2.2.0?
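
For reference, a hedged build.sbt sketch (assuming sbt) of what pinning a newer kafka-clients on top of spark-streaming-kafka-0-10 2.2.x could look like; as Gabor's replies above note, there is no guarantee the newer client is compatible with the 2.2 integration, and the version numbers below are illustrative.

// build.sbt sketch (assumes sbt); version numbers are illustrative.
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0"

// Force a newer Kafka client than the 0.10.x one the 2.2 integration pulls in.
// Compatibility is not guaranteed; test carefully before relying on it.
dependencyOverrides += "org.apache.kafka" % "kafka-clients" % "2.2.0"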

Thanks & Regards,
Renu Yadav


Hbase in spark

2016-02-26 Thread Renu Yadav
Has anybody implemented bulk load into HBase using Spark?

I need help optimizing its performance.

Please help.
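
For what it's worth, a hedged Scala sketch of the usual HFile-based bulk load pattern, assuming HBase 1.x-era APIs; the table name, column family, qualifier, and staging directory are placeholders, and the input RDD would come from the real data source.

// Hedged sketch of HFile-based bulk load (HBase 1.x APIs assumed; names are placeholders).
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object HBaseBulkLoadSketch {
  def main(args: Array[String]): Unit = {
    val sc    = new SparkContext(new SparkConf().setAppName("hbase-bulk-load"))
    val conf  = HBaseConfiguration.create()
    val table = TableName.valueOf("my_table")            // placeholder table

    // (rowKey, value) pairs; HFiles must be written in sorted row-key order,
    // so sort before converting to KeyValue.
    val rows = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2"))).sortByKey()

    val kvs = rows.map { case (k, v) =>
      val key = Bytes.toBytes(k)
      (new ImmutableBytesWritable(key),
       new KeyValue(key, Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(v)))
    }

    val conn = ConnectionFactory.createConnection(conf)
    val job  = Job.getInstance(conf)
    HFileOutputFormat2.configureIncrementalLoad(job, conn.getTable(table), conn.getRegionLocator(table))

    // Write HFiles to a staging directory, then hand them over to the region servers.
    val staging = "/tmp/hfiles-staging"                   // placeholder path
    kvs.saveAsNewAPIHadoopFile(staging,
      classOf[ImmutableBytesWritable], classOf[KeyValue],
      classOf[HFileOutputFormat2], job.getConfiguration)

    new LoadIncrementalHFiles(conf).doBulkLoad(new Path(staging),
      conn.getAdmin, conn.getTable(table), conn.getRegionLocator(table))
    conn.close()
  }
}

The detail that usually dominates performance here is that the KeyValues must reach HFileOutputFormat2 already sorted by row key, which is why the sort happens before the conversion.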


Thanks & Regards,
Renu Yadav


Re: spark task scheduling delay

2016-01-20 Thread Renu Yadav
Any suggestions?

On Wed, Jan 20, 2016 at 6:50 PM, Renu Yadav <yren...@gmail.com> wrote:

> Hi ,
>
> I am facing a task scheduling delay issue in Spark 1.4.
>
> Suppose I have 1600 tasks running: 1550 tasks run fine, but for the
> remaining 50 I see a scheduling delay, even though the input size of these
> tasks is the same as that of the other 1550 tasks.
>
> Please suggest some solution.
>
> Thanks & Regards
> Renu Yadav
>


Scheduler delay in spark 1.4

2015-12-09 Thread Renu Yadav
Hi,
I am working on Spark 1.4.
I am running a Spark job on a YARN cluster. When the number of other jobs is
low, my Spark job completes very smoothly, but when more small jobs run on
the cluster, my Spark job starts showing scheduler delay at the end of each
stage.

PS: I am running my Spark job in a high-priority queue.

Please suggest a solution.

Thanks & Regards,
Renu Yadav


load multiple directories using dataframe load

2015-11-23 Thread Renu Yadav
Hi ,

I am using DataFrames and want to load ORC files from multiple directories,
like this:
hiveContext.read.format("orc").load("mypath/3660,myPath/3661")

but it is not working.

Please suggest how to achieve this.
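
For reference, a hedged sketch under the assumption that the comma is being treated as part of a single path: on a Spark 1.4-era hiveContext, one option is to load each directory separately and union the results (newer Spark versions also accept several paths directly). The paths below are placeholders, and hiveContext is the same one used above.

// Hedged sketch: load each directory separately and union (paths are placeholders).
val paths = Seq("mypath/3660", "mypath/3661")

val df = paths
  .map(p => hiveContext.read.format("orc").load(p))
  .reduce(_ unionAll _)

// Newer Spark versions also accept multiple paths in one call, e.g.:
// spark.read.orc("mypath/3660", "mypath/3661")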

Thanks & Regards,
Renu Yadav


orc read issue in spark

2015-11-18 Thread Renu Yadav
Hi,
I am using Spark 1.4.1 and saving an ORC file using
df.write.format("orc").save("outputlocation")

The output location size is 440 GB,

and while reading, sqlContext.read.format("orc").load("outputlocation").count


has 2618 partitions.
The count operation runs fine up until about 2500 but starts showing
scheduling delay after that, which results in slow performance.

*If anyone has any idea on this, please do reply, as I need this urgently.*

Thanks in advance


Regards,
Renu Yadav


Data Locality Issue

2015-11-15 Thread Renu Yadav
Hi,

I am working on Spark 1.4, reading an ORC table using a DataFrame and
converting that DataFrame to an RDD.

In the Spark UI I observe that about 50% of the tasks are running at locality
level ANY and very few at a LOCAL level.

What would be the possible reason for this?

Please help. I have even changed the locality settings.


Thanks & Regards,
Renu Yadav


Re: Data Locality Issue

2015-11-15 Thread Renu Yadav
What are the parameters on which locality depends?
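
For reference, a hedged Scala sketch of the usual locality-related knobs; the values shown are illustrative rather than recommendations, and locality also depends on where the HDFS blocks live relative to the executors YARN happened to allocate.

// Hedged sketch of locality-related settings (values are illustrative).
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "3s")          // base wait before relaxing locality
  .set("spark.locality.wait.process", "3s")  // PROCESS_LOCAL -> NODE_LOCAL
  .set("spark.locality.wait.node", "3s")     // NODE_LOCAL -> RACK_LOCAL
  .set("spark.locality.wait.rack", "3s")     // RACK_LOCAL -> ANY
// Older Spark versions may expect plain millisecond values (e.g. "3000").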

On Sun, Nov 15, 2015 at 5:54 PM, Renu Yadav <yren...@gmail.com> wrote:

> Hi,
>
> I am working on Spark 1.4, reading an ORC table using a DataFrame and
> converting that DataFrame to an RDD.
>
> In the Spark UI I observe that about 50% of the tasks are running at locality
> level ANY and very few at a LOCAL level.
>
> What would be the possible reason for this?
>
> Please help. I have even changed the locality settings.
>
>
> Thanks & Regards,
> Renu Yadav
>


Re: spark 1.4 GC issue

2015-11-14 Thread Renu Yadav
I have tried with G1 GC. Please, if anyone can provide their GC settings.
At the code level I am:
1. reading an ORC table using a DataFrame
2. mapping the DF to an RDD of my case class
3. converting that RDD to a paired RDD
4. applying combineByKey
5. saving the result to an ORC file

Please suggest.
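
For what it's worth, a hedged Scala sketch of a G1 starting point; the flag values are illustrative and would need tuning against this particular job, not settings confirmed by anyone in this thread.

// Hedged sketch of G1 GC options (illustrative values, not tuned for this job).
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=4 " +
      "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
// Driver-side JVM options usually have to be supplied at submit time instead, e.g.
// --conf spark.driver.extraJavaOptions=-XX:+UseG1GC on the spark-submit command line.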

Regards,
Renu Yadav

On Fri, Nov 13, 2015 at 8:01 PM, Renu Yadav <yren...@gmail.com> wrote:

> I am using Spark 1.4 and my application is spending much of its time in GC,
> around 60-70% of the time for each task.
>
> I am using the parallel GC.
> Please, somebody help as soon as possible.
>
> Thanks,
> Renu
>


spark 1.4 GC issue

2015-11-13 Thread Renu Yadav
I am using Spark 1.4 and my application is spending much of its time in GC,
around 60-70% of the time for each task.

I am using the parallel GC.
Please, somebody help as soon as possible.

Thanks,
Renu


Rdd Partitions issue

2015-10-15 Thread Renu Yadav
I am reading Parquet files from a directory which has 400 files of at most
180 MB each, so while reading, the partition count should be 400, as the
split size is 256 MB in my case.

But it is creating 787 partitions. Why is that?

Please help.

Thanks,
Renu


Change Orc split size

2015-09-29 Thread Renu Yadav
Hi,

I am reading data from a Hive ORC table using Spark SQL, which is taking
256 MB as the split size.

How can I change this size?
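
For reference, a hedged Scala sketch of one common knob, assuming the ORC input goes through the Hadoop input-split machinery; whether a given Spark/Hive version honors these settings can vary, so treat it as something to try rather than a guaranteed fix. The target size and the sc variable (the active SparkContext) are assumptions.

// Hedged sketch: request smaller input splits via the Hadoop configuration
// (whether the ORC reader honors this depends on the Spark/Hive version).
val targetSplitBytes = (128L * 1024 * 1024).toString   // 128 MB, illustrative

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", targetSplitBytes)
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.minsize", targetSplitBytes)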

Thanks,
Renu


How is driver memory utilized

2015-09-15 Thread Renu Yadav
Hi

I have a query regarding driver memory.

What are the tasks for which driver memory is used?

Please help.


Fwd: Spark job failed

2015-09-14 Thread Renu Yadav
-- Forwarded message --
From: Renu Yadav <yren...@gmail.com>
Date: Mon, Sep 14, 2015 at 4:51 PM
Subject: Spark job failed
To: d...@spark.apache.org


I am getting the error below while running a Spark job:

storage.DiskBlockObjectWriter: Uncaught exception while reverting partial
writes to file
/data/vol5/hadoop/yarn/local/usercache/renu_yadav/appcache/application_1438196554863_31545/spark-4686a622-82be-418e-a8b0-1653458bc8cb/22/temp_shuffle_8c437ba7-55d2-4520-80ec-adcfe932b3bd
java.io.FileNotFoundException:
/data/vol5/hadoop/yarn/local/usercache/renu_yadav/appcache/application_1438196554863_31545/spark-4686a622-82be-418e-a8b0-1653458bc8cb/22/temp_shuffle_8c437ba7-55d2-4520-80ec-adcfe932b3bd
(No such file or directory)



I am running 1.3 TB of data.
The transformations are as follows:

read from Hadoop -> map(key/value).coalesce(2000).groupByKey,
then sort each group by server_ts and select the most recent record,

then save the data as Parquet.
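
For context, a hedged Scala sketch of the pipeline as described; the field names, separator, and paths are placeholders, and reduceByKey is used for the "most recent record per key" step because it avoids materializing whole groups the way groupByKey does (an assumption about the intent, not a statement of what the original job did).

// Hedged sketch of the described pipeline (placeholders throughout).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Event(key: String, serverTs: Long, payload: String)

object LatestPerKeySketch {
  // Hypothetical parser for a tab-separated line: key, server_ts, payload.
  def parseEvent(line: String): Event = {
    val Array(k, ts, payload) = line.split('\t')
    Event(k, ts.toLong, payload)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("latest-per-key"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val latestPerKey = sc.textFile("hdfs:///input/path")       // placeholder input
      .map(parseEvent)
      .map(e => (e.key, e))
      // Keep only the most recent record per key; unlike groupByKey this never
      // buffers an entire group in memory during the shuffle.
      .reduceByKey((a, b) => if (a.serverTs >= b.serverTs) a else b)
      .values

    latestPerKey.toDF().write.parquet("hdfs:///output/path")   // placeholder output
  }
}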


Following is the command
spark-submit --class com.test.Myapp --master yarn-cluster \
  --driver-memory 16g --executor-memory 20g --executor-cores 5 --num-executors 150 \
  --files /home/renu_yadav/fmyapp/hive-site.xml \
  --conf spark.yarn.preserve.staging.files=true \
  --conf spark.shuffle.memoryFraction=0.6 \
  --conf spark.storage.memoryFraction=0.1 \
  --conf SPARK_SUBMIT_OPTS="-XX:MaxPermSize=768m" \
  --conf SPARK_SUBMIT_OPTS="-XX:MaxPermSize=768m" \
  --conf spark.akka.timeout=40 \
  --conf spark.locality.wait=10 \
  --conf spark.yarn.executor.memoryOverhead=8000 \
  --conf SPARK_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf spark.reducer.maxMbInFlight=96 \
  --conf spark.shuffle.file.buffer.kb=64 \
  --conf spark.core.connection.ack.wait.timeout=120 \
  --jars /usr/hdp/2.2.6.0-2800/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/2.2.6.0-2800/hive/lib/datanucleus-core-3.2.10.jar,/usr/hdp/2.2.6.0-2800/hive/lib/datanucleus-rdbms-3.2.9.jar \
  myapp_2.10-1.0.jar







Cluster configuration

20 Nodes
32 cores per node
125 GB ram per node


Please Help.

Thanks & Regards,
Renu Yadav