Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-10 Thread L. C. Hsieh
+1

On Sun, Dec 10, 2023 at 6:15 PM Kent Yao  wrote:
>
> +1 (non-binding)
>
> Kent Yao
>
> > Yuming Wang wrote on Mon, Dec 11, 2023 at 09:33:
> >
> > +1
> >
> > On Mon, Dec 11, 2023 at 5:55 AM Dongjoon Hyun  wrote:
> >>
> >> +1
> >>
> >> Dongjoon
> >>
> >> On 2023/12/08 21:41:00 Dongjoon Hyun wrote:
> >> > Please vote on releasing the following candidate as Apache Spark version
> >> > 3.3.4.
> >> >
> >> > The vote is open until December 15th 1AM (PST) and passes if a majority +1
> >> > PMC votes are cast, with a minimum of 3 +1 votes.
> >> >
> >> > [ ] +1 Release this package as Apache Spark 3.3.4
> >> > [ ] -1 Do not release this package because ...
> >> >
> >> > To learn more about Apache Spark, please see https://spark.apache.org/
> >> >
> >> > The tag to be voted on is v3.3.4-rc1 (commit
> >> > 18db204995b32e87a650f2f09f9bcf047ddafa90)
> >> > https://github.com/apache/spark/tree/v3.3.4-rc1
> >> >
> >> > The release files, including signatures, digests, etc. can be found at:
> >> >
> >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-bin/
> >> >
> >> >
> >> > Signatures used for Spark RCs can be found in this file:
> >> >
> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> >
> >> >
> >> > The staging repository for this release can be found at:
> >> >
> >> > https://repository.apache.org/content/repositories/orgapachespark-1451/
> >> >
> >> >
> >> > The documentation corresponding to this release can be found at:
> >> >
> >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-docs/
> >> >
> >> >
> >> > The list of bug fixes going into 3.3.4 can be found at the following URL:
> >> >
> >> > https://issues.apache.org/jira/projects/SPARK/versions/12353505
> >> >
> >> >
> >> > This release is using the release script of the tag v3.3.4-rc1.
> >> >
> >> >
> >> > FAQ
> >> >
> >> >
> >> > =
> >> >
> >> > How can I help test this release?
> >> >
> >> > =
> >> >
> >> >
> >> >
> >> > If you are a Spark user, you can help us test this release by taking
> >> > an existing Spark workload, running it on this release candidate, and
> >> > reporting any regressions.
> >> >
> >> >
> >> >
> >> > If you're working in PySpark, you can set up a virtual env and install
> >> > the current RC and see if anything important breaks. In Java/Scala,
> >> > you can add the staging repository to your project's resolvers and test
> >> > with the RC (make sure to clean up the artifact cache before/after so
> >> > you don't end up building with an out-of-date RC going forward).
> >> >
> >> >
> >> >
> >> > ===
> >> >
> >> > What should happen to JIRA tickets still targeting 3.3.4?
> >> >
> >> > ===
> >> >
> >> >
> >> >
> >> > The current list of open tickets targeted at 3.3.4 can be found at:
> >> >
> >> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> >> > Version/s" = 3.3.4
> >> >
> >> >
> >> > Committers should look at those and triage. Extremely important bug
> >> > fixes, documentation, and API tweaks that impact compatibility should
> >> > be worked on immediately. Everything else, please retarget to an
> >> > appropriate release.
> >> >
> >> >
> >> >
> >> > ==
> >> >
> >> > But my bug isn't fixed?
> >> >
> >> > ==
> >> >
> >> >
> >> >
> >> > In order to make timely releases, we will typically not hold the
> >> > release unless the bug in question is a regression from the previous
> >> > release. That being said, if there is a regression that has not been
> >> > correctly targeted, please ping me or a committer to help target the
> >> > issue.
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Algolia search on website is broken

2023-12-10 Thread Gengliang Wang
Hi Nick,

Thank you for reporting the issue with our web crawler.

I've found that the issue was due to a change (specifically, pull request
#40269) in the website's HTML structure, where the JavaScript selector
".container-wrapper" is now ".container". I've updated the crawler
accordingly, and it's working properly now.

Gengliang

On Sun, Dec 10, 2023 at 8:15 AM Nicholas Chammas 
wrote:

> Pinging Gengliang and Xiao about this, per these docs.
>
> It looks like fixing this problem requires access to the Algolia Crawler
> Admin Console.
>
>
> On Dec 5, 2023, at 11:28 AM, Nicholas Chammas 
> wrote:
>
> Should I report this instead on Jira? Apologies if the dev list is not the
> right place.
>
> Search on the website appears to be broken. For example, here is a search
> for “analyze”:
>
> [image: Image 12-5-23 at 11.26 AM.jpeg]
>
> And here is the same search using DDG.
>
> Nick
>
>
>


Re: When and how does Spark use metastore statistics?

2023-12-10 Thread Nicholas Chammas
I’ve done some reading and have a slightly better understanding of statistics 
now.

Every implementation of LeafNode.computeStats offers its own way to get
statistics:

- LocalRelation estimates the size of the relation directly from the row count.
- HiveTableRelation pulls those statistics from the catalog or metastore.
- DataSourceV2Relation delegates the job of computing statistics to the
  underlying data source.

There are a lot of details I’m still fuzzy on, but I think that’s the gist of
things.

Would it make sense to add a paragraph or two to the SQL performance tuning
page covering statistics at a high level? Something that briefly explains:

- what statistics are and how Spark uses them to optimize plans
- the various ways Spark computes or loads statistics (catalog, data source,
  runtime, etc.)
- how to gather catalog statistics (i.e. a pointer to ANALYZE TABLE)
- how to check statistics on an object (i.e. DESCRIBE EXTENDED) and as part
  of an optimized plan (i.e. .explain(mode="cost"))
- what the cost-based optimizer does and how to enable it

Would this be a welcome addition to the project’s documentation? I’m happy to
work on this.



> On Dec 5, 2023, at 12:12 PM, Nicholas Chammas  
> wrote:
> 
> I’m interested in improving some of the documentation relating to the table 
> and column statistics that get stored in the metastore, and how Spark uses 
> them.
> 
> But I’m not clear on a few things, so I’m writing to you with some questions.
> 
> 1. The documentation for spark.sql.autoBroadcastJoinThreshold implies
> that it depends on table statistics to work, but it’s not clear. Is it 
> accurate to say that unless you have run ANALYZE on the tables participating 
> in a join, spark.sql.autoBroadcastJoinThreshold cannot impact the execution 
> plan?
> 
> 2. As a follow-on to the above question, the adaptive version of 
> autoBroadcastJoinThreshold, namely 
> spark.sql.adaptive.autoBroadcastJoinThreshold, may still kick in, because it 
> depends only on runtime statistics and not statistics in the metastore. Is 
> that correct? I am assuming that “runtime statistics” are gathered on the fly 
> by Spark, but I would like to mention this in the docs briefly somewhere.
> 
> 3. The documentation for spark.sql.inMemoryColumnarStorage.compressed mentions
> “statistics”, but it’s not clear what kind of statistics we’re talking about. 
> Are those runtime statistics, metastore statistics (that depend on you 
> running ANALYZE), or something else?
> 
> 4. The documentation for ANALYZE TABLE
> states that the collected statistics help the optimizer "find a better query 
> execution plan”. I wish we could link to something from here with more 
> explanation. Currently, spark.sql.autoBroadcastJoinThreshold is the only 
> place where metastore statistics are explicitly referenced as impacting the 
> execution plan. Surely there must be other places, no? Would it be 
> appropriate to mention the cost-based optimizer framework somehow? It doesn’t
> appear to have any public documentation outside of Jira.
> 
> Any pointers or information you can provide would be very helpful. Again, I 
> am interested in contributing some documentation improvements relating to 
> statistics, but there is a lot I’m not sure about.
> 
> Nick
> 



Disabling distributing local conf file during spark-submit

2023-12-10 Thread Eugene Miretsky
Hello,

It looks like local conf archives always get copied to the target (HDFS)
every time a job is submitted:

   1. Other files/archives don't get sent if they are local - would it make
   sense to allow skipping upload of the local conf files as well?
   2. The archive seems to get copied on every 'distribute' call, which can
   happen multiple times per spark-submit job (at least that's what I got
   from reading the code) - is that the intention?


The motivation for my questions is:

   1. In some cases, spark-submit may not have direct access to HDFS, and
   hence cannot upload the files.
   2. What would be the use case for distributing a custom config to the
   YARN cluster? The cluster already has all the relevant YARN, Hadoop, and
   Spark config. If anything, letting the end user override the configs
   seems dangerous (e.g. if they override resource limits).

Cheers,
Eugene


Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-10 Thread Kent Yao
+1 (non-binding)

Kent Yao




Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-10 Thread Yuming Wang
+1



Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-10 Thread Dongjoon Hyun
+1

Dongjoon




Re: Spark on Yarn with Java 17

2023-12-10 Thread Jason Xu
Dongjoon and Luca, it's great to learn that there is a way to run different
JVM versions for Spark and Hadoop binaries. I had concerns about Java
compatibility issues without this solution. Thank you!

Luca, thank you for providing a how-to guide for this. It's really helpful!
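For the archives, the setup being described (Spark on Java 17 while the YARN/Hadoop daemons stay on Java 8) can be sketched roughly as follows. The paths and application class are placeholders; spark.executorEnv.* and spark.yarn.appMasterEnv.* are Spark's standard mechanisms for setting environment variables on the executors and the YARN Application Master:

```shell
# Sketch: submit a Spark 3.3+ job that runs on Java 17, on a YARN cluster
# whose own daemons (NameNode/ResourceManager/NodeManager) still run Java 8.
# Assumes Java 17 is unpacked at the same path on every node.
JAVA17_HOME=/usr/lib/jvm/java-17-openjdk

export JAVA_HOME="$JAVA17_HOME"   # JVM used by spark-submit / the driver

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.JAVA_HOME="$JAVA17_HOME" \
  --conf spark.executorEnv.JAVA_HOME="$JAVA17_HOME" \
  --class org.example.MyApp \
  my-app.jar                      # placeholder application and jar
```

Luca's linked guide covers variations of this (e.g. distributing the JDK itself as an archive) in more detail.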

On Sat, Dec 9, 2023 at 1:39 AM Luca Canali  wrote:

> Jason, In case you need a pointer on how to run Spark with a version of
> Java different than the version used by the Hadoop processes, as indicated
> by Dongjoon, this is an example of what we do on our Hadoop clusters:
> https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_Set_Java_Home_Howto.md
>
>
>
> Best,
>
> Luca
>
>
>
> *From:* Dongjoon Hyun 
> *Sent:* Saturday, December 9, 2023 09:39
> *To:* Jason Xu 
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Spark on Yarn with Java 17
>
>
>
> Please try Apache Spark 3.3+ (SPARK-33772) with Java 17 on your cluster
> simply, Jason.
>
> I believe you can set up for your Spark 3.3+ jobs to run with Java 17
> while your cluster(DataNode/NameNode/ResourceManager/NodeManager) is still
> sitting on Java 8.
>
> Dongjoon.
>
>
>
> On Fri, Dec 8, 2023 at 11:12 PM Jason Xu  wrote:
>
> Dongjoon, thank you for the fast response!
>
>
>
> Apache Spark 4.0.0 depends on only Apache Hadoop client library.
>
> To better understand your answer, does that mean a Spark application built
> with Java 17 can successfully run on a Hadoop cluster on version 3.3 and
> Java 8 runtime?
>
>
>
> On Fri, Dec 8, 2023 at 4:33 PM Dongjoon Hyun  wrote:
>
> Hi, Jason.
>
> Apache Spark 4.0.0 depends on only Apache Hadoop client library.
>
> You can track all `Apache Spark 4` activities including Hadoop dependency
> here.
>
> https://issues.apache.org/jira/browse/SPARK-44111
> (Prepare Apache Spark 4.0.0)
>
> According to the release history, the original suggested timeline was
> June, 2024.
> - Spark 1: 2014.05 (1.0.0) ~ 2016.11 (1.6.3)
> - Spark 2: 2016.07 (2.0.0) ~ 2021.05 (2.4.8)
> - Spark 3: 2020.06 (3.0.0) ~ 2026.xx (3.5.x)
> - Spark 4: 2024.06 (4.0.0, NEW)
>
> Thanks,
> Dongjoon.
>
> On 2023/12/08 23:50:15 Jason Xu wrote:
> > Hi Spark devs,
> >
> > According to the Spark 3.5 release notes, Spark 4 will no longer support
> > Java 8 and 11
> > (https://spark.apache.org/releases/spark-release-3-5-0.html#upcoming-removal).
> >
> > My company is using Spark on Yarn with Java 8 now. When considering a
> > future upgrade to Spark 4, one issue we face is that the latest version of
> > Hadoop (3.3) does not yet support Java 17. There is an open ticket
> > (HADOOP-17177) for this issue, which has been open for over two years.
> >
> > My question is: Does the release of Spark 4 depend on the availability of
> > Java 17 support in Hadoop? Additionally, do we have a rough estimate for
> > the release of Spark 4? Thanks!
> >
> >
> > Cheers,
> >
> > Jason Xu
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Algolia search on website is broken

2023-12-10 Thread Nicholas Chammas
Pinging Gengliang and Xiao about this, per these docs.

It looks like fixing this problem requires access to the Algolia Crawler
Admin Console.


> On Dec 5, 2023, at 11:28 AM, Nicholas Chammas  
> wrote:
> 
> Should I report this instead on Jira? Apologies if the dev list is not the 
> right place.
> 
> Search on the website appears to be broken. For example, here is a search for 
> “analyze”:
> 

> 
> And here is the same search using DDG.
> 
> Nick
> 



unsubscribe

2023-12-10 Thread bruce COTTMAN






unsubscribe

2023-12-10 Thread Stevens, Clay
Clay


unsubscribe

2023-12-10 Thread Rajanikant V