Yes, you can certainly use spark streaming, but reading from the original
source table may still be time consuming and resource intensive.
Having some context on the RDBMS platform, the data size/volumes involved, and the
tolerable lag (between a change being created and it being processed by Spark)
would help in suggesting an approach.
Just curious - is this HttpSink your own custom sink or a Dropwizard
configuration?
If it is your own custom code, I would suggest looking at/trying out Dropwizard.
See
http://spark.apache.org/docs/latest/monitoring.html#metrics
https://metrics.dropwizard.io/4.0.0/
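For reference, here is a minimal sketch of wiring Spark's Dropwizard-based metrics to one of the built-in sinks via conf/metrics.properties (the sink class is from the Spark monitoring docs; the period/unit values are illustrative):

```properties
# conf/metrics.properties (sketch)
# Report all component metrics to the built-in console sink every 10 seconds.
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
```

Point Spark at it with --conf spark.metrics.conf=conf/metrics.properties, or drop the file into $SPARK_HOME/conf.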
Also, from what I know, the metrics
See if https://spark.apache.org/docs/latest/monitoring.html helps.
Essentially, whether you run an app as spark-shell or via spark-submit
(local, Spark standalone, YARN, Kubernetes, Mesos), the driver will provide a UI
on port 4040.
You can monitor via the UI and via a REST API
E.g. running
are still excessive.
From: Vitaliy Pisarev
Date: Thursday, November 15, 2018 at 1:58 PM
To: "Thakrar, Jayesh"
Cc: Shahbaz , user , David
Markovitz
Subject: Re: How to address seemingly low core utilization on a spark workload?
Small update, my initial estimate was incorrect. I have on
save.
From: Vitaliy Pisarev
Date: Thursday, November 15, 2018 at 1:03 PM
To: Shahbaz
Cc: "Thakrar, Jayesh" , user
, "dudu.markov...@microsoft.com"
Subject: Re: How to address seemingly low core utilization on a spark workload?
Agree, and I will try it. One clarificatio
ittle work.
Question is what can I do about it.
On Thu, Nov 15, 2018 at 5:29 PM Thakrar, Jayesh
<jthak...@conversantmedia.com> wrote:
Can you shed more light on what kind of processing you are doing?
One common pattern that I have seen for active core/executor utilization
dropping to zero is while reading ORC data and the driver seems (I think) to be
doing schema validation.
In my case I would have hundreds of thousands of
Not sure I get what you mean….
I ran the query that you had – and don’t get the same hash as you.
From: Gokula Krishnan D
Date: Friday, September 28, 2018 at 10:40 AM
To: "Thakrar, Jayesh"
Cc: user
Subject: Re: [Spark SQL] why spark sql hash() are returns the same hash value
thoug
Cannot reproduce your situation.
Can you share Spark version?
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.
Disclaimer - I use Spark with Scala and not Python.
But I am guessing that Jörn's reference to modularization is to ensure that you
do the processing inside methods/functions and call those methods sequentially.
I believe that as long as an RDD/dataset variable is in scope, its memory may
not be released.
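That modularization idea can be sketched in plain Scala as follows (stageOne/stageTwo are hypothetical; the point is that stageOne's intermediate data is only reachable from the method's local scope, so it becomes collectible once the method returns):

```scala
// Each stage lives in its own method, so intermediate data is referenced
// only from local scope and can be reclaimed after the method returns.
def stageOne(): Seq[Int] = (1 to 5).map(_ * 2)

// Only the small final result escapes to the caller.
def stageTwo(input: Seq[Int]): Int = input.sum

val result = stageTwo(stageOne())
```

The same shape applies to RDD/dataset pipelines: keep each intermediate dataset local to a method and return only what the next stage needs.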
Date: Monday, May 21, 2018 at 10:20 PM
To: ayan guha <guha.a...@gmail.com>
Cc: "Thakrar, Jayesh" <jthak...@conversantmedia.com>, user
<user@spark.apache.org>
Subject: Re: How to skip nonexistent file when read files with spark?
Thanks ayan,
Also I have tried
Probably you can do some preprocessing/checking of the paths before you attempt
to read it via Spark.
Whether it is local or hdfs filesystem, you can try to check for existence and
other details by using the "FileSystem.globStatus" method from the Hadoop API.
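As a local-filesystem analogue of that pre-check (for HDFS you would go through Hadoop's FileSystem.globStatus instead; the candidate paths here are hypothetical):

```scala
import java.nio.file.{Files, Paths}

// Hypothetical input paths; some of them may not exist.
val candidates = Seq("/tmp", "/no/such/dir")

// Keep only the paths that actually exist before handing them to
// spark.read, so one missing file does not fail the whole job.
val existing = candidates.filter(p => Files.exists(Paths.get(p)))
```

The filtered list can then be passed to the reader, e.g. spark.read.parquet(existing: _*).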
From: JF Chen
And here's some more info on Spark Metrics
https://www.slideshare.net/JayeshThakrar/apache-bigdata2017sparkprofiling
From: Maximiliano Felice
Date: Monday, January 8, 2018 at 8:14 AM
To: Irtiza Ali
Cc:
Subject: Re: Spark
You can also get the metrics from the Spark application events log file.
See https://www.slideshare.net/JayeshThakrar/apache-bigdata2017sparkprofiling
From: "Qiao, Richard"
Date: Monday, December 4, 2017 at 6:09 PM
To: Nick Dimiduk ,
What you have is sequential and hence sequential processing.
Also Spark/Scala are not parallel programming languages.
But even if they were, statements are executed sequentially unless you exploit
the parallel/concurrent execution features.
Anyway, see if this works:
val (RDD1, RDD2) =
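One way to make the two creations overlap is to wrap each in a Future — a minimal pure-Scala sketch (loadPartA/loadPartB are hypothetical stand-ins for the two independent, expensive calls):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-ins for the two independent computations.
def loadPartA(): Seq[Int] = (1 to 3)
def loadPartB(): Seq[Int] = (4 to 6)

// Start both computations concurrently instead of one after the other.
val fa = Future(loadPartA())
val fb = Future(loadPartB())

// Block only after both are already in flight.
val a = Await.result(fa, 30.seconds)
val b = Await.result(fb, 30.seconds)
```

In a Spark driver the same pattern works for kicking off two independent actions concurrently; the jobs are then scheduled in parallel subject to available executors.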
Could this be due to https://issues.apache.org/jira/browse/HIVE-6 ?
From: Patrik Medvedev
Date: Monday, June 12, 2017 at 2:31 AM
To: Jörn Franke , vaquar khan
Cc: Jean Georges Perrin , User
Roy - can you check if you have HADOOP_CONF_DIR and YARN_CONF_DIR set to the
directory containing the HDFS and YARN configuration files?
From: Sandeep Nemuri
Date: Monday, March 27, 2017 at 9:44 AM
To: Saisai Shao
Cc: Yong Zhang
management are unified. All memory fractions used in the old model are now
deprecated and no longer read. If you wish to use the old memory management,
you may explicitly enable `spark.memory.useLegacyMode` (not recommended).
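If you do need the old behavior, a sketch of opting back in (not recommended; spark.storage.memoryFraction is one of the deprecated fractions that only takes effect in legacy mode):

```shell
# Re-enable the pre-1.6 static memory model (values illustrative)
spark-shell \
  --conf spark.memory.useLegacyMode=true \
  --conf spark.storage.memoryFraction=0.6
```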
On Mon, Feb 13, 2017 at 11:23 PM, Thakrar, Jayesh
<jtha
Nancy,
As your log output indicated, your executor exceeded its 11 GB memory limit.
While you might want to address the root cause/data volume as suggested by Jon,
you can do an immediate test by changing your command as follows
spark-shell --master yarn --deploy-mode client --driver-memory 16G
Ben,
Also look at Phoenix (Apache project) which provides a better (one of the best)
SQL/JDBC layer on top of HBase.
http://phoenix.apache.org/
Cheers,
Jayesh
From: vincent gromakowski
Date: Monday, October 17, 2016 at 1:53 PM
To: Benjamin Kim
Yes, iterating over a dataframe and making changes is not uncommon.
Of course RDDs, dataframes and datasets are immutable, but there is some
optimization in the optimizer that can potentially help to dampen the
effect/impact of creating a new RDD, DF or DS.
Also, the use-case you cited is