Just looking at this: what is your batch interval when ingesting ~1000
records per second? As a rule of thumb, your capacity planning should
account for twice the normal ingestion rate.
Regarding your point:
"... Hence, ideally I'd like to increase the number of batches/records
that are being
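A minimal sketch of that rule of thumb, using Spark Streaming's rate
controls to cap ingestion near the 2x figure (the partition count and the
use of the Kafka direct stream are assumptions, not from this thread):

# Hedged sketch: plan for 2x the observed steady-state rate (~1000 rec/s here)
# and bound the Kafka direct stream accordingly. Partition count is hypothetical.
from pyspark.sql import SparkSession

normal_rate = 1000                      # records/sec observed in steady state
planned_capacity = 2 * normal_rate      # the 2x rule of thumb
num_partitions = 10                     # assumed Kafka partition count

spark = (
    SparkSession.builder
    .appName("capacity-planning-sketch")
    # let Spark adapt batch sizes when processing starts lagging
    .config("spark.streaming.backpressure.enabled", "true")
    # upper bound per partition so a backlog cannot flood a single batch
    .config("spark.streaming.kafka.maxRatePerPartition",
            str(planned_capacity // num_partitions))
    .getOrCreate()
)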
Hi Team
I am trying to read a Parquet file, cache it, and then do a transformation
and overwrite the Parquet file within the same session.
But the first count action doesn't cache the dataframe;
it only gets cached while caching the transformed dataframe.
Even if spark.sql.parquet.cacheMetadata = true, still the
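For reference, a minimal sketch of the pattern being described (the path
and the transformation are hypothetical); the count after cache() is the
action meant to materialize the cache before the overwrite:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-then-overwrite").getOrCreate()

path = "/data/input"                    # hypothetical path
df = spark.read.parquet(path)
df.cache()
df.count()                              # action intended to populate the cache

transformed = df.withColumn("flag", F.lit(1))   # stand-in transformation
transformed.cache()
transformed.count()                     # per the report, caching only "sticks" here
transformed.write.mode("overwrite").parquet(path)

Note that overwriting the same path that was read relies on the cache being
fully materialized; if any partition is missing from the cache, Spark will
try to re-read files that the overwrite has already deleted.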
After a 10-minute delay, a 10-minute batch will not take 10 times
longer than a 1-minute batch.
That's mainly because of the I/O write operations to HDFS, and also because
certain users will be active in every 1-minute batch; processing such a
customer only once (instead of in each of the 10 batches) will
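A hedged sketch of that effect (the column names and source path are
assumptions): in one 10-minute batch, repeated activity for the same user
collapses into a single aggregated row, and there is one HDFS write instead
of ten:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-size-sketch").getOrCreate()

# hypothetical 10 minutes of events; ten 1-minute batches would each
# touch the same active users, one 10-minute batch touches them once
events = spark.read.parquet("/data/events_10min")
per_user = events.groupBy("user_id").agg(F.count("*").alias("n_events"))
per_user.write.mode("append").parquet("/data/per_user")  # one write, not ten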
Wouldn't this happen naturally? The large batches would just take a longer
time to complete anyway.
On Thu, Jul 1, 2021 at 6:32 AM András Kolbert
wrote:
> Hi,
>
> I have a Spark streaming application which is generally able to process
> the data within the given time frame. However, in certain
You need to set driver memory before the driver starts, on the CLI or
however you run your app, not in the app itself. By the time the driver
starts to run your app, its heap is already set.
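A minimal sketch of the distinction (the values are illustrative):

# Too late: under spark-submit, the driver JVM heap is fixed by the time
# this runs, so this setting has no effect on the driver's memory.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.memory", "8g")   # ignored for an already-running driver
    .getOrCreate()
)

# Works: set it at launch time instead, e.g.
#   spark-submit --driver-memory 8g my_app.py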
On Thu, Jul 1, 2021 at 12:10 AM javaguy Java wrote:
> Hi,
>
> I'm getting Java OOM errors even though
Hi Mich!
The shell script indeed looks more robust now :D
Yes, the current setup works fine. I am wondering whether it is the right
way to set things up, though: should I run the program that accepts
requests from the queue independently and have it invoke the spark-submit
CLI, or do something else?
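For what it's worth, a hedged sketch of the setup being asked about (the
script path, master, and arguments are hypothetical): an independent worker
drains the queue and launches one spark-submit per request:

import subprocess

def handle_request(job_args):
    """Launch one spark-submit per queued request; blocks until the job exits."""
    subprocess.run(
        ["spark-submit", "--master", "yarn", "my_job.py", *job_args],
        check=True,   # surface a non-zero exit code as an exception
    )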
Hi,
I have a Spark streaming application which is generally able to process the
data within the given time frame. However, in certain hours the processing
time starts increasing, which causes a delay.
In my scenario, the number of input records does not linearly increase the
processing time. Hence, ideally I'd like to
Hi Kartik,
I parameterized your shell script and tested it on a stub Python file; it
looks OK, and the shell script is now more robust.
#!/bin/bash
set -e
#cd "$(dirname "${BASH_SOURCE[0]}")/../"
pyspark_venv="pyspark_venv"
source_zip_file="DSBQ.zip"
[ -d "${pyspark_venv}" ] && rm -rf "${pyspark_venv}"  # remove any stale venv before rebuilding
Hi Pralabh,
You need to check the latest compatibility between the Spark versions that
can successfully work as the Hive execution engine.
This is my old file alluding to spark-1.3.1 as the execution engine
set spark.home=/data6/hduser/spark-1.3.1-bin-hadoop2.6;
--set
Hi Mich,
Thanks for replying; your answer really helps. The comparison was done in
2016, and I would like to know the latest comparison, with Spark 3.0.
Also, what you are suggesting is to migrate the queries to Spark, which is
HiveContext or Hive on Spark, which is what Facebook also did.
Is that understanding correct?
Hi Pralabh,
This question has been asked before :)
A few years ago (late 2016), I made a presentation on running Hive queries
on the Spark execution engine for Hortonworks.
https://www.slideshare.net/MichTalebzadeh1/query-engines-for-hive-mr-spark-tez-with-llap-considerations
The issue you will
Hi Dev,
I have thousands of legacy Hive queries. As part of a plan to move to
Spark, we are planning to migrate the Hive queries to Spark. There are two
approaches:
1. One is Hive on Spark, which is similar to changing the execution
engine in Hive queries, as with TEZ.
2. Another one is to migrate the queries to Spark SQL itself.
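A minimal sketch of the second route (the database, table, and query are
hypothetical): legacy HiveQL can often run unchanged through Spark SQL once
Hive metastore support is enabled:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("legacy-hive-on-spark-sql")
    .enableHiveSupport()      # connect to the existing Hive metastore
    .getOrCreate()
)

# an unchanged legacy HiveQL statement, executed by the Spark engine
df = spark.sql("SELECT customer_id, COUNT(*) FROM sales.orders GROUP BY customer_id")
df.show()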
Hi Gourav,
Thanks for the suggestion, I'll check it out.
Regards,
Kartik
On Thu, Jul 1, 2021 at 5:38 AM Gourav Sengupta
wrote:
> Hi,
>
> I think that reading Matei Zaharia's book "Spark: The Definitive Guide"
> will be a good starting point.
>
> Regards,
> Gourav Sengupta
>
> On Wed,