Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Femi Anthony
If you're using just Spark, you could try turning on the history server
and glean statistics from there, but there is no single location or log file
that stores them all.
Databricks, which is a managed Spark solution, provides such features in an
enterprise setting.
I am unsure whether AWS EMR or Google Dataproc does the same.
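
As a rough sketch of what "turning on the history server" involves (the app
name and log directory below are placeholders, not something from this
thread), you enable event logging in the application and then run the
history server against the same directory:

import org.apache.spark.sql.SparkSession

// Write event logs so the history server can replay the application's metrics later.
val spark = SparkSession.builder()
  .appName("jdbc-read-with-history")                     // hypothetical app name
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///spark-events")  // placeholder path
  .getOrCreate()

// The history server is started separately (./sbin/start-history-server.sh)
// with spark.history.fs.logDirectory pointing at the same location.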

Femi



On Mon, Apr 8, 2024 at 5:34 AM casel.chen  wrote:

> Hello, I have a Spark application with a JDBC source that does some
> calculation.
> To monitor application health, I need db-related metrics per database,
> like the number of connections, SQL execution time, and SQL fired-time
> distribution, etc.
> Does anybody know how to get them? Thanks!
>
>

-- 
http://dataphantik.com

"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi,

I believe this is the relevant source file:

https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala

And the code

import scala.collection.mutable

case class FilePartition(index: Int, files: Array[PartitionedFile])
  extends Partition with InputPartition {
  override def preferredLocations(): Array[String] = {
    // Computes total number of bytes that can be retrieved from each host.
    val hostToNumBytes = mutable.HashMap.empty[String, Long]
    files.foreach { file =>
      file.locations.filter(_ != "localhost").foreach { host =>
        hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) + file.length
      }
    }

    // Selects the first 3 hosts with the most data to be retrieved.
    hostToNumBytes.toSeq.sortBy {
      case (host, numBytes) => numBytes
    }.reverse.take(3).map {
      case (host, numBytes) => host
    }.toArray
  }
}
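
For what it's worth, from the user side the size of those FilePartitions is
mainly driven by a couple of session configs; a minimal sketch (the input
path is a placeholder and the values shown are simply the usual defaults):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-split-demo").getOrCreate()

// Files are bin-packed into partitions of up to ~maxPartitionBytes,
// with openCostInBytes charged per file to discourage tiny partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", "128m")
spark.conf.set("spark.sql.files.openCostInBytes", "4m")

val df = spark.read.parquet("/path/to/data")   // placeholder path
println(df.rdd.getNumPartitions)               // number of FilePartitions produced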

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Werner Von Braun).


On Mon, 8 Apr 2024 at 20:31, Ashley McManamon <
ashley.mcmana...@quantcast.com> wrote:

> Hi All,
>
> I've been diving into the source code to get a better understanding of how
> file splitting works from a user perspective. I've hit a dead end at
> `PartitionedFile`, for which I cannot seem to find a definition. It appears
> it should be found in
> org.apache.spark.sql.execution.datasources, but I find no definition in the
> entire source code. Am I missing something?
>
> I appreciate there may be an obvious answer here, apologies if I'm being
> naive.
>
> Thanks,
> Ashley McManamon
>
>


Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Mich Talebzadeh
Well, you can do a fair bit with the available tools.

The Spark UI, particularly the Stages and Executors tabs, provides some
valuable insight related to database health metrics for applications using
a JDBC source.

Stage Overview:

This section provides a summary of all the stages executed during the
application's lifetime. It includes details such as the stage ID,
description, submission time, duration, and the number of tasks.
Each stage represents a set of tasks that perform the same computation,
typically applied to a partition of the input data. The Stages tab offers
insights into how these stages are executed and their associated metrics.
This tab may include a directed acyclic graph (DAG) visualization,
illustrating the logical and physical execution plan of the Spark
application.

Executors Tab:

The Executors tab provides detailed information about the executors running
in the Spark application. Executors are responsible for executing tasks on
behalf of the application, and this tab offers insights into the current
state and resource usage of each executor.
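
Incidentally, the same stage and executor data shown in these tabs can be
pulled programmatically from the monitoring REST API on the driver (port
4040 by default). A rough sketch, where the host and application id are
placeholders:

import scala.io.Source

val appId = "app-20240408103000-0001"   // hypothetical application id
val base  = s"http://driver-host:4040/api/v1/applications/$appId"

// JSON payloads matching what the Stages and Executors tabs display.
val stagesJson    = Source.fromURL(s"$base/stages").mkString
val executorsJson = Source.fromURL(s"$base/executors").mkString

println(stagesJson.take(500))   // inspect, or feed into your monitoring pipeline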

In addition, the underlying database will have some instrumentation to
assist you with your work. Say with Oracle, as an example, you can utilise
tools like OEM, Statspack, SQL*Plus scripts, etc., or third-party monitoring
tools to collect detailed database health metrics directly from the Oracle
database server.
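
If you want the Spark-side numbers without clicking through the UI, a
SparkListener can log per-task timing for the stages that perform the JDBC
read. A minimal sketch (the class name and println logging are just
illustrative); it will not see inside the database itself:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class JdbcReadTimingListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"runTimeMs=${m.executorRunTime} recordsRead=${m.inputMetrics.recordsRead}")
    }
  }
}

// Register it once, e.g. spark.sparkContext.addSparkListener(new JdbcReadTimingListener())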

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Werner Von Braun).


On Mon, 8 Apr 2024 at 19:35, casel.chen  wrote:

> Hello, I have a Spark application with a JDBC source that does some
> calculation.
> To monitor application health, I need db-related metrics per database,
> like the number of connections, SQL execution time, and SQL fired-time
> distribution, etc.
> Does anybody know how to get them? Thanks!
>
>


[Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Ashley McManamon
Hi All,

I've been diving into the source code to get a better understanding of how
file splitting works from a user perspective. I've hit a dead end at
`PartitionedFile`, for which I cannot seem to find a definition. It appears
it should be found in
org.apache.spark.sql.execution.datasources, but I find no definition in the
entire source code. Am I missing something?

I appreciate there may be an obvious answer here, apologies if I'm being
naive.

Thanks,
Ashley McManamon


Re: External Spark shuffle service for k8s

2024-04-08 Thread Mich Talebzadeh
Hi,

First, thanks to everyone for their contributions.

I was going to reply to @Enrico Minack but noticed the additional info. As I
understand it, Apache Uniffle, for example, is an incubating project aimed at
providing a pluggable shuffle service for Spark. So basically, what all these
"external shuffle services" have in common is that they offload shuffle data
management to external services, thus reducing the memory and CPU overhead on
Spark executors. That is great. While Uniffle and others enhance shuffle
performance and scalability, it would be great to integrate them with the
Spark UI. This may require additional development effort. I suppose the
interest would be to have these external metrics incorporated into Spark with
one look and feel. This may require customizing the UI to fetch and display
metrics or statistics from the external shuffle services. Has any project
done this?
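
For context, plugging such a service into a job is usually just a matter of
pointing Spark at the service's ShuffleManager implementation. A hedged
sketch along the lines of the Uniffle client documentation (the class name,
property names and coordinator addresses are assumptions to verify against
the current Uniffle release, not something confirmed in this thread):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("uniffle-shuffle-demo")
  // Assumed Uniffle client settings; check the project docs before use.
  .config("spark.shuffle.manager", "org.apache.spark.shuffle.RssShuffleManager")
  .config("spark.rss.coordinator.quorum", "coordinator-1:19999,coordinator-2:19999")
  .getOrCreate()

The Spark UI integration discussed above would still be separate work; this
only switches the shuffle path itself.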

Thanks

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Werner Von Braun).


On Mon, 8 Apr 2024 at 14:19, Vakaris Baškirov 
wrote:

> I see that both Uniffle and Celeborn support S3/HDFS backends, which is
> great.
> In case someone is using S3/HDFS, I wonder what would be the
> advantages of using Celeborn or Uniffle vs the IBM shuffle service plugin
> or the Cloud Shuffle Storage Plugin from AWS?
>
> These plugins do not require deploying a separate service. Are there any
> advantages to using Uniffle/Celeborn in the case of an S3 backend, which
> would require deploying a separate service?
>
> Thanks
> Vakaris
>
> On Mon, Apr 8, 2024 at 10:03 AM roryqi  wrote:
>
>> Apache Uniffle (incubating) may be another solution.
>> You can see
>> https://github.com/apache/incubator-uniffle
>>
>> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>>
>> Mich Talebzadeh  wrote on Monday, 8 April 2024 at 07:15:
>>
>>> Splendid
>>>
>>> The configurations below can be used with k8s deployments of Spark.
>>> Spark applications running on k8s can utilize these configurations to
>>> seamlessly access data stored in Google Cloud Storage (GCS) and Amazon S3.
>>>
>>> For Google GCS we may have
>>>
>>> spark_config_gcs = {
>>>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>>>     "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>>>     "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>>>     "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
>>> }
>>>
>>> For Amazon S3 similar
>>>
>>> spark_config_s3 = {
>>>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>>>     "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>>>     "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>>>     "spark.hadoop.fs.s3a.secret.key": "secret_key",
>>> }
>>>
>>>
>>> To implement these configurations and enable Spark applications to
>>> interact with GCS and S3, I guess we can approach it this way
>>>
>>> 1) Spark Repository Integration: These configurations need to be added
>>> to the Spark repository as part of the supported configuration options for
>>> k8s deployments.
>>>
>>> 2) Configuration Settings: Users need to specify these configurations
>>> when submitting Spark applications to a Kubernetes cluster. They can
>>> include these configurations in the Spark application code or pass them as
>>> command-line arguments or environment variables during application
>>> submission.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>>
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice, "one test result is worth one-thousand
>>> expert opinions" (Werner Von Braun).
>>>
>>>
>>> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov <
>>> vakaris.bashki...@gmail.com> wrote:
>>>
 There is an IBM shuffle 

Re: External Spark shuffle service for k8s

2024-04-08 Thread Vakaris Baškirov
I see that both Uniffle and Celeborn support S3/HDFS backends, which is
great.
In case someone is using S3/HDFS, I wonder what would be the advantages
of using Celeborn or Uniffle vs the IBM shuffle service plugin or the
Cloud Shuffle Storage Plugin from AWS?

These plugins do not require deploying a separate service. Are there any
advantages to using Uniffle/Celeborn in the case of an S3 backend, which
would require deploying a separate service?

Thanks
Vakaris

On Mon, Apr 8, 2024 at 10:03 AM roryqi  wrote:

> Apache Uniffle (incubating) may be another solution.
> You can see
> https://github.com/apache/incubator-uniffle
>
> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>
> Mich Talebzadeh  wrote on Monday, 8 April 2024 at 07:15:
>
>> Splendid
>>
>> The configurations below can be used with k8s deployments of Spark. Spark
>> applications running on k8s can utilize these configurations to seamlessly
>> access data stored in Google Cloud Storage (GCS) and Amazon S3.
>>
>> For Google GCS we may have
>>
>> spark_config_gcs = {
>>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>>     "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>>     "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>>     "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
>> }
>>
>> For Amazon S3 similar
>>
>> spark_config_s3 = {
>>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>>     "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>>     "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>>     "spark.hadoop.fs.s3a.secret.key": "secret_key",
>> }
>>
>>
>> To implement these configurations and enable Spark applications to
>> interact with GCS and S3, I guess we can approach it this way
>>
>> 1) Spark Repository Integration: These configurations need to be added to
>> the Spark repository as part of the supported configuration options for k8s
>> deployments.
>>
>> 2) Configuration Settings: Users need to specify these configurations
>> when submitting Spark applications to a Kubernetes cluster. They can
>> include these configurations in the Spark application code or pass them as
>> command-line arguments or environment variables during application
>> submission.
>>
>> HTH
>>
>> Mich Talebzadeh,
>>
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, "one test result is worth one-thousand
>> expert opinions" (Werner Von Braun).
>>
>>
>> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov <
>> vakaris.bashki...@gmail.com> wrote:
>>
>>> There is an IBM shuffle service plugin that supports S3
>>> https://github.com/IBM/spark-s3-shuffle
>>>
>>> Though I would think a feature like this could be a part of the main
>>> Spark repo. Trino already has out-of-box support for s3 exchange (shuffle)
>>> and it's very useful.
>>>
>>> Vakaris
>>>
>>> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>

 Thanks for your suggestion, which I take as a workaround. Whilst this
 workaround can potentially address storage allocation issues, I was more
 interested in exploring solutions that offer a more seamless integration
 with large distributed file systems like HDFS, GCS, or S3. This would
 ensure better performance and scalability for handling larger datasets
 efficiently.


 Mich Talebzadeh,
 Technologist | Solutions Architect | Data Engineer  | Generative AI
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* The information provided is correct to the best of my
 knowledge but of course cannot be guaranteed. It is essential to note
 that, as with any advice, "one test result is worth one-thousand
 expert opinions" (Werner Von Braun).


 On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen 
 wrote:

> You can make a PVC on K8s and call it 300gb.
>
> Make a folder in your Dockerfile
> 

How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread casel.chen
Hello, I have a Spark application with a JDBC source that does some calculation.
To monitor application health, I need db-related metrics per database, like the
number of connections, SQL execution time, and SQL fired-time distribution, etc.
Does anybody know how to get them? Thanks!



Re: External Spark shuffle service for k8s

2024-04-08 Thread roryqi
Apache Uniffle (incubating) may be another solution.
You can see
https://github.com/apache/incubator-uniffle
https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era

Mich Talebzadeh  于2024年4月8日周一 07:15写道:

> Splendid
>
> The configurations below can be used with k8s deployments of Spark. Spark
> applications running on k8s can utilize these configurations to seamlessly
> access data stored in Google Cloud Storage (GCS) and Amazon S3.
>
> For Google GCS we may have
>
> spark_config_gcs = {
>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>     "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>     "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>     "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
> }
>
> For Amazon S3 similar
>
> spark_config_s3 = {
>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>     "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>     "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>     "spark.hadoop.fs.s3a.secret.key": "secret_key",
> }
>
>
> To implement these configurations and enable Spark applications to
> interact with GCS and S3, I guess we can approach it this way
>
> 1) Spark Repository Integration: These configurations need to be added to
> the Spark repository as part of the supported configuration options for k8s
> deployments.
>
> 2) Configuration Settings: Users need to specify these configurations when
> submitting Spark applications to a Kubernetes cluster. They can include
> these configurations in the Spark application code or pass them as
> command-line arguments or environment variables during application
> submission.
>
> HTH
>
> Mich Talebzadeh,
>
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand
> expert opinions" (Werner Von Braun).
>
>
> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov 
> wrote:
>
>> There is an IBM shuffle service plugin that supports S3
>> https://github.com/IBM/spark-s3-shuffle
>>
>> Though I would think a feature like this could be a part of the main
>> Spark repo. Trino already has out-of-box support for s3 exchange (shuffle)
>> and it's very useful.
>>
>> Vakaris
>>
>> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>> Thanks for your suggestion, which I take as a workaround. Whilst this
>>> workaround can potentially address storage allocation issues, I was more
>>> interested in exploring solutions that offer a more seamless integration
>>> with large distributed file systems like HDFS, GCS, or S3. This would
>>> ensure better performance and scalability for handling larger datasets
>>> efficiently.
>>>
>>>
>>> Mich Talebzadeh,
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice, "one test result is worth one-thousand
>>> expert opinions" (Werner Von Braun).
>>>
>>>
>>> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen 
>>> wrote:
>>>
 You can make a PVC on K8s and call it 300gb.

 Make a folder in your Dockerfile
 WORKDIR /opt/spark/work-dir
 RUN chmod g+w /opt/spark/work-dir

 Start Spark adding this:

 .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
 .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
 .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
 .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
 .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
 .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly",