RE: Spark on Yarn with Java 17

2023-12-09 Thread Luca Canali
Jason, in case you need a pointer on how to run Spark with a Java version 
different from the one used by the Hadoop processes, as Dongjoon indicated, 
here is an example of what we do on our Hadoop clusters: 
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_Set_Java_Home_Howto.md
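
In short, the idea is to point the Spark processes at a different JAVA_HOME via 
configuration. A minimal sketch of the kind of settings involved (the JDK path 
below is just a placeholder, and the JDK needs to be available on, or 
distributed to, the worker nodes):

```
bin/spark-submit \
  --master yarn \
  --conf spark.yarn.appMasterEnv.JAVA_HOME=/path/to/jdk-17 \
  --conf spark.executorEnv.JAVA_HOME=/path/to/jdk-17 \
  ...
```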

Best,
Luca

From: Dongjoon Hyun 
Sent: Saturday, December 9, 2023 09:39
To: Jason Xu 
Cc: dev@spark.apache.org
Subject: Re: Spark on Yarn with Java 17

Please simply try Apache Spark 3.3+ (SPARK-33772) with Java 17 on your cluster, 
Jason.

I believe you can set up your Spark 3.3+ jobs to run with Java 17 while 
your cluster (DataNode/NameNode/ResourceManager/NodeManager) is still sitting on 
Java 8.

Dongjoon.

On Fri, Dec 8, 2023 at 11:12 PM Jason Xu 
<jasonxu.sp...@gmail.com> wrote:
Dongjoon, thank you for the fast response!

Apache Spark 4.0.0 depends only on the Apache Hadoop client library.

To better understand your answer, does that mean a Spark application built with 
Java 17 can successfully run on a Hadoop 3.3 cluster with a Java 8 runtime?

On Fri, Dec 8, 2023 at 4:33 PM Dongjoon Hyun 
<dongj...@apache.org> wrote:
Hi, Jason.

Apache Spark 4.0.0 depends only on the Apache Hadoop client library.

You can track all `Apache Spark 4` activities, including the Hadoop dependency, here:

https://issues.apache.org/jira/browse/SPARK-44111
(Prepare Apache Spark 4.0.0)

According to the release history, the originally suggested timeline was June 
2024.
- Spark 1: 2014.05 (1.0.0) ~ 2016.11 (1.6.3)
- Spark 2: 2016.07 (2.0.0) ~ 2021.05 (2.4.8)
- Spark 3: 2020.06 (3.0.0) ~ 2026.xx (3.5.x)
- Spark 4: 2024.06 (4.0.0, NEW)

Thanks,
Dongjoon.

On 2023/12/08 23:50:15 Jason Xu wrote:
> Hi Spark devs,
>
> According to the Spark 3.5 release notes, Spark 4 will no longer support
> Java 8 and 11 (link).
>
> My company is using Spark on Yarn with Java 8 now. When considering a
> future upgrade to Spark 4, one issue we face is that the latest version of
> Hadoop (3.3) does not yet support Java 17. There is an open ticket
> (HADOOP-17177) for this issue, which has been open for over two years.
>
> My question is: Does the release of Spark 4 depend on the availability of
> Java 17 support in Hadoop? Additionally, do we have a rough estimate for
> the release of Spark 4? Thanks!
>
>
> Cheers,
>
> Jason Xu
>



RE: On adding applyInArrow to groupBy and cogroup

2023-11-03 Thread Luca Canali
Hi Enrico,

+1 on supporting Arrow on par with Pandas. Besides the frameworks and libraries 
that you mentioned, I would add Awkward Array, a library used in High Energy Physics. 
For those interested, more details on how we tested Awkward Array with Spark, 
back when mapInArrow was introduced, can be found at 
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md

Cheers,
Luca

From: Enrico Minack 
Sent: Thursday, October 26, 2023 15:33
To: dev 
Subject: On adding applyInArrow to groupBy and cogroup


Hi devs,

PySpark allows transforming a DataFrame via the Pandas and Arrow APIs:

df.mapInArrow(map_arrow, schema="...")
df.mapInPandas(map_pandas, schema="...")

For df.groupBy(...) and df.groupBy(...).cogroup(...), there is only a Pandas 
interface, no Arrow interface:

df.groupBy("id").applyInPandas(apply_pandas, schema="...")
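
For context, a minimal sketch of the existing Pandas-based grouped apply that an 
Arrow-based counterpart would mirror (the data and column names are just 
illustrative):

```
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data; "id" and "value" are illustrative column names.
df = spark.createDataFrame([(1, 10.0), (1, 20.0), (2, 30.0)], "id long, value double")

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group (all rows sharing one id) arrives as a pandas DataFrame;
    # the function returns a DataFrame matching the declared schema.
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

df.groupBy("id").applyInPandas(subtract_mean, schema="id long, value double").show()
```

The idea is that an applyInArrow counterpart would hand each group to user code 
as Arrow data (e.g. a pyarrow.Table or record batches) instead of a pandas DataFrame.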

Providing a pure Arrow interface allows user code to use any Arrow-based data 
framework, not only Pandas, e.g. Polars. Adding Arrow interfaces reduces the 
need to add more framework-specific support.

We need your thoughts on whether PySpark should support Arrow on a par with 
Pandas, or not: https://github.com/apache/spark/pull/38624
Cheers,
Enrico


RE: Executor metrics are missing on Prometheus sink

2023-02-10 Thread Luca Canali
Hi Qian,

Indeed, the metrics available with the Prometheus servlet sink (which is still 
marked as experimental) are limited compared to the full instrumentation, and 
this is due to the way it is implemented with a servlet; it cannot be easily 
extended, from what I can see.
You can use another supported metrics sink (see 
https://spark.apache.org/docs/latest/monitoring.html#metrics) if you want to 
collect all the metrics exposed by Spark executors.
For example, I use the Graphite sink and then collect the metrics into an InfluxDB 
instance (see https://github.com/cerndb/spark-dashboard).
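
As a rough sketch, a Graphite sink configuration in the metrics properties file 
looks like the following (host and prefix are placeholders to adapt to your setup):

```
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=<graphite_endpoint>
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=<optional_prefix>
```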
An additional comment is that there is room for having more sinks available for 
Apache Spark metrics, notably for InfluxDB and for Prometheus (gateway), if 
someone is interested in working on that.

Best,
Luca


From: Qian Sun 
Sent: Friday, February 10, 2023 05:05
To: dev ; user.spark 
Subject: Executor metrics are missing on prometheus sink


I am setting up the Prometheus sink in this way:

-c spark.ui.prometheus.enabled=true

-c spark.executor.processTreeMetrics.enabled=true

-c spark.metrics.conf=/spark/conf/metric.properties

metric.properties:

*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet

*.sink.prometheusServlet.path=/metrics/prometheus

Result:

Both of these endpoints expose some metrics:

:4040/metrics/prometheus

:4040/metrics/executors/prometheus

But the executor one is missing the metrics under the executor namespace described here:
https://spark.apache.org/docs/latest/monitoring.html#component-instance--executor

How can I expose the executor metrics on the Spark executor pods?

Any help will be appreciated.
--
Regards,
Qian Sun


RE: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.

2021-12-20 Thread Luca Canali
Hi Senthil,

 

I have just run a couple of quick tests for TPCDS Q4, using the TPCDS schema 
created at scale 1500 that I have on a Hadoop/YARN cluster, and was not able to 
reproduce the difference in execution time between Spark 2 and Spark 3 that you 
report in your mail.  

This is the Spark config I used:   

bin/spark-shell --master yarn --driver-memory 8g --executor-cores 10 
--executor-memory 50g --conf spark.dynamicAllocation.enabled=false 
--num-executors 20   

 

This is how I ran the tests:

 

```

val path = "/project/spark/TPCDS/tpcds_1500_parquet_1.10.1/"

val tables = List("catalog_returns", "catalog_sales", "inventory", "store_returns",
  "store_sales", "web_returns", "web_sales", "call_center", "catalog_page", "customer",
  "customer_address", "customer_demographics", "date_dim", "household_demographics",
  "income_band", "item", "promotion", "reason", "ship_mode", "store", "time_dim",
  "warehouse", "web_page", "web_site")

for (t <- tables) {
  println(s"Creating temporary view $t")
  spark.read.parquet(path + t).createOrReplaceTempView(t)
}

val q4 = """…"""
// SQL from
// https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q4.sql

spark.time(sql(q4).collect) // note: the q4 result set is only 100 rows

```

 

Spark 2.4.5:

Time taken: 256812 ms

Time taken: 226571 ms

Time taken: 305508 ms

Spark 3.1.2:

Time taken: 235356 ms

Time taken: 236284 ms

 

Best, 

Luca

 

From: Senthil Kumar  
Sent: Monday, December 20, 2021 10:20
To: Rao, Abhishek (Nokia - IN/Bangalore) 
Cc: dev 
Subject: Re: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.

 

Also, we checked that we have already backported the SPARK-33557 jira 
(https://issues.apache.org/jira/browse/SPARK-33557).

 

On Mon, Dec 20, 2021 at 11:08 AM Senthil Kumar <sen...@gmail.com> wrote:

@Abhishek: We use Spark 3.1*.

 

On Mon, 20 Dec 2021, 09:50 Rao, Abhishek (Nokia - IN/Bangalore), 
<abhishek@nokia.com> wrote:

Hi Senthil,

 

Which version of Spark 3 are you using? We had this kind of observation with 
Spark 3.0.2 and 3.1.x, but then we figured out that we had configured a big value 
for spark.network.timeout, and this value was not taking effect in releases 
prior to 3.0.2.

That was fixed as part of https://issues.apache.org/jira/browse/SPARK-33557. 
Because we had configured a big value for spark.network.timeout, TPCDS queries 
were taking a long time when tried with Spark 3.0.2 and 3.1.x. Once we corrected 
it, we observed that the queries executed much faster.

 

Thanks and Regards,

Abhishek

 

From: Senthil Kumar <sen...@gmail.com>
Sent: Sunday, December 19, 2021 11:58 PM
To: dev <dev@spark.apache.org>
Subject: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.

 

Hi All,

We are comparing Spark 2.4.5 and Spark 3(without enabling spark 3 additional 
features) with TPCDS queries and found that Spark 3's performance is reduced to 
at-least 30-40% compared to Spark 2.4.5.

 

E.g.,

Data size used: 1 TB

Spark 2.4.5 finishes Q4 in 1.5 min, but Spark 3.* takes at least 2.5 min.

 

Note: We tested this on the same cluster with the same data size, and we 
ensured that the parameters we passed are one and the same for Spark 2.4* and 
Spark 3.*.

 

It would be helpful to know if any of you have also encountered the same issue in 
your benchmarking activities. If so, please share your input on what could be the 
reason behind this poor performance.


 

-- 

Senthil kumar




 

-- 

Senthil kumar



RE: [DISCUSS][CORE] Exposing application status metrics via a source

2018-09-14 Thread Luca Canali
Hi Stavros, All,

Interesting topic. I'll add here some thoughts and personal opinions on it: I too 
find the metrics system quite useful for the use case of building Grafana 
dashboards, as opposed to scraping logs and/or using the Event Listener 
infrastructure, as you mentioned in your mail.
A few additional points in favour of Dropwizard metrics for me are:

-  Regarding the metrics defined on the ExecutorSource, I believe they 
have better scalability compared to standard Task Metrics, as the Dropwizard 
metrics go directly from the executors to the sink(s) rather than passing via the 
driver through the ListenerBus.

-  Another advantage that I see is that Dropwizard metrics make it easy 
to expose information not otherwise available from the EventLog/Listener 
events, such as executor.jvmCpuTime (SPARK-25228).

I'd like to add some feedback and random thoughts based on recent work on 
SPARK-25228, SPARK-22190, SPARK-25277, and SPARK-25285:

-  The “Dropwizard metrics” space currently appears a bit “crowded”; we could 
probably profit from adding a few configuration parameters to turn some of the 
metrics on/off as needed (I see that this point is also raised in the discussion 
in your PR 22381).

-  Another point is that the metrics instrumentation is a bit scattered around 
the code; it would be nice to have a central point where the available metrics 
are exposed (maybe just in the documentation).

-  Testing of new metrics seems to be a bit of a manual process at the moment 
(at least it was for me), which could be improved. Related to that, I believe 
that some recent work on adding new metrics has ended up with a minor side 
effect/issue; details are in SPARK-25277.

Best regards,
Luca

From: Stavros Kontopoulos 
Sent: Wednesday, September 12, 2018 22:35
To: Dev 
Subject: [DISCUSS][CORE] Exposing application status metrics via a source

Hi all,

I have a PR https://github.com/apache/spark/pull/22381 that exposes application 
status
metrics (related jira: SPARK-25394).

So far, metrics tooling needs to scrape the metrics REST API to get metrics like 
job delay, stages failed, stages completed, etc.
From a devops perspective it is good to standardize on a unified way of gathering 
metrics.
The need came up on the K8s side, where the JMX Prometheus exporter is commonly 
used to scrape metrics for several components such as Kafka and Cassandra, but 
the need is not limited to that.
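
For reference, a minimal sketch of enabling the built-in JMX sink in the metrics 
configuration, which a JMX Prometheus exporter could then scrape (the sink name 
"jmx" is just a label):

```
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
```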

Check the comment here:
"The rest api is great for UI and consolidated analytics, but monitoring 
through it is not as straightforward as when the data emits directly from the 
source like this. There is all kinds of nice context that we get when the data 
from this spark node is collected directly from the node itself, and not 
proxied through another collector / reporter. It is easier to build a 
monitoring data model across the cluster when node, jmx, pod, resource 
manifests, and spark data all align by virtue of coming from the same 
collector. Building a similar view of the cluster just from the rest api, as a 
comparison, is simply harder and quite challenging to do in general purpose 
terms."

The PR is OK to be merged, but the major concern here is the mirroring of the 
metrics. I think that mirroring is OK, since people may not want to check the 
UI and just want to integrate with JMX only (my use case) and gather 
metrics in Grafana (a common case out there).

Do any of the committers or the community have an opinion on this?
Is there agreement about moving on with this? Note that the addition does 
not change much and can always be refactored if we come up with a new plan for 
the metrics story in the future.

Thanks,
Stavros


Run an OS command or script supplied by the user at the start of each executor

2017-05-12 Thread Luca Canali
Hi,

I have recently experimented with a few ways to run OS commands from the 
executors (in a YARN deployment) for a specific use case where we want to 
interact with an external system of interest for our environment. From that 
experience I thought that having the possibility to spawn a script at the start 
of each executor could be quite handy in a few cases, and maybe more people are 
interested. For example, I am thinking of cases such as interacting with external 
systems/APIs, injecting custom configurations via scripts distributed to the 
executors, and/or spawning custom monitoring tasks, etc.
They are probably all niche cases, but the feature seems quite easy to implement.
I just wanted to check with the list whether something like this has already come 
up in the past and/or whether there are thoughts about it or details that I have 
overlooked.
My simple proof of concept for implementing a "startup command" on the 
executors can be found at:
https://github.com/LucaCanali/spark/commit/e294a1f0d55af115f45fa6d2d7dcf81f751955fa
I can put all this in a Jira in case people here think it makes sense.

Thanks,
Luca

