Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
> microbatch id guarantees that the data produced in the batch is always > the same no matter any recomputations (assuming all processing logic is > deterministic). So you can commit the batch id + batch data together. And > then async commit the batch id + offsets. > On Wed
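A minimal sketch of the pattern described above, using foreachBatch in Structured Streaming (the externalStore API, broker address, topic, and paths are hypothetical): write the batch data keyed by batchId, and record the batchId so a replayed batch can be skipped.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder.appName("offset-commit-sketch").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      .load()

    stream.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Idempotent write: a recomputed batch carries the same id and data,
        // so a batch that was already committed can be skipped safely.
        if (!externalStore.isCommitted(batchId)) {      // hypothetical store API
          batch.write.format("parquet").save(s"s3://bucket/out/batch=$batchId")
          externalStore.commit(batchId)                 // commit id together with data
        }
      }
      .option("checkpointLocation", "s3://bucket/ckpt") // hypothetical path
      .start()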

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
>> *Disclaimer:* The information provided is correct to the best of my >> knowledge but of course cannot be guaranteed. It is essential to note >> that, as with any advice, quote "one test result is worth one-thousand >> expert opinions (Werner

Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
Hello, We are on Spark 3.x using Spark DStream + Kafka and are planning to move to Structured Streaming + Kafka. Is there an equivalent of the DStream HasOffsetRanges in Structured Streaming to get the microbatch end offsets to checkpoint in our external checkpoint store? Thanks in advance. Regards
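A sketch of one common approach (saveToExternalStore is a hypothetical helper): a StreamingQueryListener receives per-batch progress, and each source's endOffset string carries the Kafka end offsets of that microbatch.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    val spark = SparkSession.builder.getOrCreate()

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        p.sources.foreach { s =>
          // For the Kafka source, endOffset is a JSON string of
          // topic -> partition -> offset at the end of the microbatch.
          saveToExternalStore(p.batchId, s.endOffset) // hypothetical helper
        }
      }
    })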

[Spark-SQL] Dataframe write saveAsTable failed

2023-06-26 Thread Anil Dasari
Hi, We have upgraded Spark from 2.4.x to 3.3.1 recently, and creating a managed table while writing a dataframe via saveAsTable failed with the below error: Can not create the managed table(``). The associated location('hdfs:') already exists. At a high level, our code does the below before writing the dataframe as
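One workaround sketch for this class of error (table name and location below are hypothetical): drop the stale table and delete the leftover location before calling saveAsTable, so the managed-table creation starts from a clean path.

    import org.apache.hadoop.fs.{FileSystem, Path}

    val table = "db.users"                                   // hypothetical table
    val location = new Path("hdfs:///warehouse/db.db/users") // hypothetical stale dir

    spark.sql(s"DROP TABLE IF EXISTS $table")
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    if (fs.exists(location)) fs.delete(location, true)       // recursive delete

    df.write.mode("overwrite").saveAsTable(table)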

[Spark streaming]: Microbatch id in logs

2023-06-25 Thread Anil Dasari
Hi, I am using the Spark 3.3.1 distribution and Spark Streaming in my application. Is there a way to add a microbatch id to all logs generated by Spark and Spark applications? Thanks.
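There may not be a built-in way to stamp every Spark-internal log line, but driver-side application logs emitted per batch can be tagged via the log4j2 ThreadContext (MDC). A sketch, assuming df is a streaming DataFrame, a foreachBatch sink, and a %X{batchId} token added to the log4j2 pattern layout (paths are hypothetical):

    import org.apache.logging.log4j.{LogManager, ThreadContext}
    import org.apache.spark.sql.DataFrame

    val logger = LogManager.getLogger("app")

    df.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        ThreadContext.put("batchId", batchId.toString) // surfaces as %X{batchId}
        try {
          logger.info("processing microbatch")         // driver-side application log
          batch.write.mode("append").parquet("s3://bucket/out") // hypothetical sink
        } finally ThreadContext.remove("batchId")
      }
      .option("checkpointLocation", "s3://bucket/ckpt") // hypothetical path
      .start()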

Re: Spark streaming pending microbatches queue max length

2022-07-13 Thread Anil Dasari
Retry. From: Anil Dasari Date: Tuesday, July 12, 2022 at 3:42 PM To: user@spark.apache.org Subject: Spark streaming pending microbatches queue max length Hello, Spark is adding an entry to the pending microbatches queue at each batch interval. Is there a config to set the max size for pending

Spark streaming pending microbatches queue max length

2022-07-12 Thread Anil Dasari
Hello, Spark is adding an entry to the pending microbatches queue at each batch interval. Is there a config to set the max size for the pending microbatches queue? Thanks
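As far as I know there is no direct config that caps the length of the pending-batches queue in DStreams; the usual mitigation is to keep the queue from growing by rate-limiting intake. A sketch of the relevant settings (values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Let Spark adapt the receive rate to actual processing speed.
      .set("spark.streaming.backpressure.enabled", "true")
      // Hard cap on records/second/partition for the direct Kafka stream.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
      // Undocumented; number of queued batch jobs run concurrently (default 1).
      .set("spark.streaming.concurrentJobs", "1")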

Re: {EXT} Re: Spark sql slowness in Spark 3.0.1

2022-04-15 Thread Anil Dasari
Anil Dasari wrote: > We are upgrading Spark from 2.4.7 to 3.0.1. We use Spark SQL (Hive) to > checkpoint data frames (intermediate data). DF write is very slow in > 3.0.1 compared to 2.4.7.

Spark sql slowness in Spark 3.0.1

2022-04-14 Thread Anil Dasari
Hello, We are upgrading Spark from 2.4.7 to 3.0.1. We use Spark SQL (Hive) to checkpoint data frames (intermediate data). DF write is very slow in 3.0.1 compared to 2.4.7. I have read the release notes and there were no major changes except managed tables and adaptive scheduling. We are not
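A hedged way to narrow this down: toggle the 3.x behavior changes called out above one at a time and time the same write under each setting (table names below are hypothetical; note spark.sql.adaptive.enabled is off by default in 3.0.1).

    // Compare the same write with AQE off and on.
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    spark.time { df.write.mode("overwrite").saveAsTable("db.ckpt_run_a") }

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.time { df.write.mode("overwrite").saveAsTable("db.ckpt_run_b") }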

Re: {EXT} Re: Spark Parquet write OOM

2022-03-05 Thread Anil Dasari
2022 at 1:59 AM To: Anil Dasari Cc: Yang,Jie(INF), user@spark.apache.org Subject: Re: {EXT} Re: Spark Parquet write OOM Hi Anil, any chance you tried setting the limit on the number of records to be written out at a time? Regards, Gourav On Thu, Mar 3, 2022 at 3:12 PM Anil Dasari mailto:adas

Re: {EXT} Re: Spark Parquet write OOM

2022-03-03 Thread Anil Dasari
Hi Gourav, Tried increasing the number of shuffle partitions and using higher executor memory. Neither worked. Regards From: Gourav Sengupta Date: Thursday, March 3, 2022 at 2:24 AM To: Anil Dasari Cc: Yang,Jie(INF), user@spark.apache.org Subject: Re: {EXT} Re: Spark Parquet write OOM Hi, I do

Re: {EXT} Re: Spark Parquet write OOM

2022-03-03 Thread Anil Dasari
Answers inline. Thanks. From: Gourav Sengupta Date: Thursday, March 3, 2022 at 12:13 AM To: Anil Dasari Cc: Yang,Jie(INF), user@spark.apache.org Subject: Re: {EXT} Re: Spark Parquet write OOM Hi Anil, I was trying to work out things for a while yesterday, but may need your kind

Re: {EXT} Re: Spark Parquet write OOM

2022-03-02 Thread Anil Dasari
2nd attempt. Any suggestions to troubleshoot and fix the problem? Thanks in advance. Regards, Anil From: Anil Dasari Date: Wednesday, March 2, 2022 at 7:00 AM To: Gourav Sengupta, Yang,Jie(INF) Cc: user@spark.apache.org Subject: Re: {EXT} Re: Spark Parquet write OOM Hi Gourav and Yang

Re: {EXT} Re: Spark Parquet write OOM

2022-03-02 Thread Anil Dasari
are running? [AD] No explicit transformations other than the basic map transformations required to create a dataframe from the Avro record RDD. Please let me know if you have any questions. Regards, Anil From: Gourav Sengupta Date: Wednesday, March 2, 2022 at 1:07 AM To: Yang,Jie(INF) Cc: Anil Dasari, user

Spark Parquet write OOM

2022-03-01 Thread Anil Dasari
Hello everyone, We are writing a Spark dataframe to S3 in Parquet and it is failing with the below exception. I wanted to try the following to avoid OOM: 1. increase the default SQL shuffle partitions to reduce the load on Parquet writer tasks and 2. increase user memory (reduce memory
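A sketch of both mitigations together (partition count, memory fraction, and output path are illustrative): more shuffle partitions mean fewer rows, and thus fewer buffered Parquet column writers, per task; lowering spark.memory.fraction (default 0.6) leaves more of the heap as user memory.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .config("spark.sql.shuffle.partitions", "800") // default 200; smaller tasks
      .config("spark.memory.fraction", "0.5")        // default 0.6; more user memory
      .getOrCreate()

    // Repartitioning before the write also bounds the per-task Parquet buffer footprint.
    df.repartition(800)
      .write.mode("overwrite")
      .parquet("s3://bucket/out") // hypothetical path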

Spark 3.0 plugins

2021-12-19 Thread Anil Dasari
Hello everyone, I was going through the Apache Spark Performance Monitoring in Spark 3.0 talk and wanted to collect IO metrics for my Spark application. Couldn’t find Spark 3.0 built-in plugins for IO metrics like https://github.com/cerndb/SparkPlugins
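If a ready-made plugin doesn't fit, the Spark 3.0 plugin API allows registering custom executor metrics; a minimal sketch (class and metric names are hypothetical), enabled with --conf spark.plugins=com.example.IOMetricsPlugin:

    package com.example

    import java.util.{Map => JMap}
    import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

    class IOMetricsPlugin extends SparkPlugin {
      override def driverPlugin(): DriverPlugin = null // no driver-side component

      override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
        override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
          // Metrics registered here are reported through the executor metrics system.
          ctx.metricRegistry().counter("exampleBytesRead") // hypothetical metric
        }
      }
    }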

Re: Spark Pair RDD write to Hive

2021-09-06 Thread Anil Dasari
2nd try. From: Anil Dasari Date: Sunday, September 5, 2021 at 10:42 AM To: "user@spark.apache.org" Subject: Spark Pair RDD write to Hive Hello, I have a use case where users of a group id are persisted to a Hive table. // pseudo code looks like below usersRDD = sc.parallelize(..) us

Spark Pair RDD write to Hive

2021-09-05 Thread Anil Dasari
Hello, I have a use case where users of a group id are persisted to a Hive table. // pseudo code looks like below usersRDD = sc.parallelize(..) usersPairRDD = usersRDD.map(u => (u.groupId, u)) groupedUsers = usersPairRDD.groupByKey() Can I save the groupedUsers RDD into Hive tables where the table name is
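A sketch of one way to do this (the case class, sample data, and table-name scheme are assumptions): convert to a DataFrame and write each group to its own Hive table; a single table partitioned by groupId is often the simpler alternative.

    import org.apache.spark.sql.SparkSession

    case class User(groupId: String, name: String)

    val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    import spark.implicits._

    val usersDF = spark.sparkContext
      .parallelize(Seq(User("g1", "alice"), User("g2", "bob")))
      .toDF()

    // One Hive table per group id; distinct ids are collected on the driver.
    usersDF.select("groupId").distinct().as[String].collect().foreach { gid =>
      usersDF.filter($"groupId" === gid)
        .write.mode("overwrite")
        .saveAsTable(s"users_$gid") // table name derived from the group id
    }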

Shutdown Spark application with failed state

2021-07-26 Thread Anil Dasari
Hello all, I am using Spark 2.x streaming with Kafka. I noticed that Spark streaming keeps processing subsequent micro-batches in case of a failure, as it takes a while to notify the driver about the error and interrupt the streaming-executor thread. This is creating a problem as we are checkpointing
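A hedged sketch of a fail-fast shutdown (assuming ssc is the application's StreamingContext): let the batch error surface through awaitTermination, stop non-gracefully so already-queued micro-batches are not drained, and exit non-zero so the application is recorded as failed.

    try {
      ssc.start()
      ssc.awaitTermination() // rethrows the error that terminated the context
    } catch {
      case t: Throwable =>
        // stopGracefully = false avoids processing the queued micro-batches.
        ssc.stop(stopSparkContext = true, stopGracefully = false)
        throw t // non-zero exit => YARN/K8s marks the application FAILED
    }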
