Re: Increase batch interval in case of delay

2021-07-01 Thread Mich Talebzadeh
Just looking at this, what is your frequency interval when ingesting ~1000 records per sec? As a rule of thumb, your capacity planning should account for twice the normal ingestion rate. Regarding your point: "... Hence, ideally I'd like to increase the number of batches/records that are being
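A rough illustration of that 2x rule of thumb at the stated ~1000 records/sec; the 60-second batch interval is an assumption made only for the arithmetic, not a figure from the thread:

# back-of-the-envelope capacity planning, assuming a 60s batch interval
ingest_rate = 1000                                    # records per second (from the thread)
batch_interval = 60                                   # seconds, assumed for illustration
expected_per_batch = ingest_rate * batch_interval     # ~60,000 records per batch
planned_capacity = 2 * expected_per_batch             # ~120,000 records per batch
print(expected_per_batch, planned_capacity)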

[Spark conf setting] spark.sql.parquet.cacheMetadata = true still invalidates cache in memory.

2021-07-01 Thread Parag Mohanty
Hi Team, I am trying to read a Parquet file, cache it, then do a transformation and overwrite the Parquet file within the same session. But the first count action doesn't cache the dataframe; it only gets cached while caching the transformed dataframe. Even with spark.sql.parquet.cacheMetadata = true, still the
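A minimal PySpark sketch of the sequence being described (read, cache, count, transform, overwrite), with a made-up path and column; whether the first count really materialises the cache is exactly what the thread is questioning:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-cache-demo").getOrCreate()

path = "/tmp/input_parquet"                        # hypothetical path
df = spark.read.parquet(path)
df.cache()                                         # marks the dataframe for caching (lazy)
df.count()                                         # first action, intended to materialise the cache

transformed = df.withColumn("flag", F.lit(1))      # hypothetical transformation
transformed.cache()
transformed.count()                                # per the thread, caching only happens here
transformed.write.mode("overwrite").parquet(path)  # overwrite the source path in the same session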

Re: Increase batch interval in case of delay

2021-07-01 Thread András Kolbert
After a 10-minute delay, processing a 10-minute batch will not take 10 times longer than a 1-minute batch. That's mainly because of the I/O write operations to HDFS, and also because certain users will be active in every 1-minute batch, so processing such a customer only once (instead of in each of 10 batches) will

Re: Increase batch interval in case of delay

2021-07-01 Thread Sean Owen
Wouldn't this happen naturally? The larger batches would just take longer to complete already. On Thu, Jul 1, 2021 at 6:32 AM András Kolbert wrote: > Hi, > > I have a spark streaming application which is generally able to process the > data within the given time frame. However, in certain

Re: OutOfMemoryError

2021-07-01 Thread Sean Owen
You need to set driver memory before the driver starts, on the CLI or however you run your app, not in the app itself. By the time the driver starts to run your app, its heap is already set. On Thu, Jul 1, 2021 at 12:10 AM javaguy Java wrote: > Hi, > > I'm getting Java OOM errors even though
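In other words, the memory has to be given to the launcher, not set from inside the program. A minimal sketch; the 4g value and app name are illustrative:

# Launch with:  spark-submit --driver-memory 4g my_app.py
from pyspark.sql import SparkSession

# By the time this code runs, the driver JVM heap is already fixed, so the
# config call below is too late to change it (in client mode it is simply ignored).
spark = (SparkSession.builder
         .appName("oom-demo")
         .config("spark.driver.memory", "4g")   # shown only to illustrate the point above
         .getOrCreate())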

Re: Structuring a PySpark Application

2021-07-01 Thread Kartik Ohri
Hi Mich! The shell script indeed looks more robust now :D Yes, the current setup works fine. I am wondering whether it is the right way to set things up. That is, should I run the program which accepts requests from the queue independently and have it invoke the spark-submit CLI, or something else?
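A minimal sketch of the pattern being asked about, a long-running consumer that shells out to spark-submit per request; the script name, master and queue hookup are assumptions, not details from the thread:

import subprocess

def handle_request(request_id):
    # Hypothetical: launch a separate Spark job for each request taken off the queue.
    subprocess.run(
        ["spark-submit", "--master", "yarn", "process_request.py", str(request_id)],
        check=True,
    )

# The loop that blocks on the actual message queue and calls handle_request()
# is left out, since the thread does not describe which queue is being used.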

Increase batch interval in case of delay

2021-07-01 Thread András Kolbert
Hi, I have a Spark Streaming application which is generally able to process the data within the given time frame. However, in certain hours the processing time starts increasing, which causes a delay. In my scenario, the number of input records does not linearly increase the processing time. Hence, ideally I'd like to
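For context, the batch interval of a DStream application is fixed when the StreamingContext is created, so it cannot be raised on the fly. A minimal sketch, with an assumed 60-second interval, a socket source used purely for illustration, and the backpressure setting that is often used to cope with uneven load:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("batch-interval-demo")
        # lets Spark adapt the ingestion rate when batches start falling behind
        .set("spark.streaming.backpressure.enabled", "true"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=60)      # batch interval is fixed here, at start-up

lines = ssc.socketTextStream("localhost", 9999)   # illustrative source
lines.count().pprint()

ssc.start()
ssc.awaitTermination()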

Re: Structuring a PySpark Application

2021-07-01 Thread Mich Talebzadeh
Hi Kartik, I parameterized your shell script and tested it on a stub Python file; it looks OK and makes the shell script more robust:
#!/bin/bash
set -e
#cd "$(dirname "${BASH_SOURCE[0]}")/../"
pyspark_venv="pyspark_venv"
source_zip_file="DSBQ.zip"
[ -d ${pyspark_venv} ] && rm -r -d

Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Mich Talebzadeh
Hi Pralabh, You need to check the latest compatibility for the Spark version that can successfully work as the Hive execution engine. This is my old file alluding to spark-1.3.1 as the execution engine: set spark.home=/data6/hduser/spark-1.3.1-bin-hadoop2.6; --set

Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Pralabh Kumar
Hi Mich, Thanks for replying; your answer really helps. The comparison was done in 2016; I would like to know the latest comparison with Spark 3.0. Also, what you are suggesting is to migrate the queries to Spark, which is HiveContext or Hive on Spark, which is what Facebook also did. Is that understanding

Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Mich Talebzadeh
Hi Pralabh, This question has been asked before :) A few years ago (late 2016), I made a presentation for Hortonworks on running Hive queries on the Spark execution engine. https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations The issue you will

Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Pralabh Kumar
Hi Dev, I have thousands of legacy Hive queries. As part of a plan to move to Spark, we are planning to migrate the Hive queries to Spark. Now there are two approaches: 1. One is Hive on Spark, which is similar to changing the execution engine in Hive queries, like TEZ. 2. Another one is
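For the second approach (Spark on Hive, i.e. the HiveContext route named in the subject), a minimal PySpark sketch; the database and table names are illustrative:

from pyspark.sql import SparkSession

# Spark SQL with Hive support enabled (the modern replacement for HiveContext);
# it reads table definitions from the existing Hive metastore.
spark = (SparkSession.builder
         .appName("spark-on-hive")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SELECT count(*) FROM legacy_db.some_table").show()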

Re: Structuring a PySpark Application

2021-07-01 Thread Kartik Ohri
Hi Gourav, Thanks for the suggestion, I'll check it out. Regards, Kartik On Thu, Jul 1, 2021 at 5:38 AM Gourav Sengupta wrote: > Hi, > > I think that reading Matei Zaharia's book "Spark: The Definitive Guide" > will be a good starting point. > > Regards, > Gourav Sengupta > > On Wed,