from:"Sean Owen"

Re: [VOTE] Officialy Deprecate GraphX in Spark 4

2024-09-30 Thread Sean Owen

For reasons in the previous thread, yes +1 to deprecation

On Mon, Sep 30, 2024 at 1:02 PM Holden Karau  wrote:

> I think it has been de-facto deprecated, we haven’t updated it
> meaningfully in several years. I think removing the API would be excessive
> but deprecating it would give us the flexibility to remove it in the not
> too distant future.
>
> That being said this is not a vote to remove GraphX, I think that whenever
> that time comes (if it does) we should have a separate vote
>
> This VOTE will be open for a little more than one week, ending on October
> 8th*. To vote reply with:
> +1 Deprecate GraphX
> 0 I’m indifferent
> -1 Don’t deprecate GraphX because ABC
>
> If you have a binding vote to simplify you tallying at the end please mark
> your vote with a *.
>
> (*mostly because I’m going camping for my birthday)
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> 
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-09-30 Thread Sean Owen

I support deprecating GraphX because:

   - GraphFrames supersedes it, really
   - No maintainers and no reason to believe there will be - we can take
   the last 5+ years as thorough evidence
   - Low (but not trivial) docs hits compared to other modules:
   
https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?period=year&date=2024-09-29&idSite=40&category=General_Actions&subcategory=General_Pages
   - If it *exists* in 4.x then it has to live as long as 4.x does, and
   that's already a super long time (4+ years?); deprecating is just a step to
   removing it in 5.x. (Well, we *can* take a decision to remove it in some
   4.x version if it's really a problem, but deprecating well in advance is a
   prerequisite.

There is one problem: deprecated in favor of what? GraphFrames. But,
GraphFrames uses GraphX :)  But it is likewise in a similar bucket.
*Maintained* but no active development; not sure about usage. So I think
this is kind of "deprecated without replacement".

But we're only talking about deprecating here, which I think more
accurately communicates its state to users than not doing so.


On Mon, Sep 30, 2024 at 12:20 PM Mich Talebzadeh 
wrote:

> Hi,
>
> These are my Views:
>
> 1. Deprecation Consideration: I lean towards the idea of officially
> deprecating GraphX, given the lack of active development and community
> engagement over the past few years as you alluded. This would set clear
> expectations for users about its future and encourage them to explore
> alternatives that are actively maintained.
>
> 2. User Input: It would be prudent to gather feedback from those currently
> utilizing GraphX. Their insights could help us understand whether they find
> the functionality sufficient as-is or if they have specific needs that
> remain unaddressed.
>
> 3. Search for Maintainers: While I believe deprecation is a prudent step,
> I also think we should issue a call for new maintainers before making any
> final decisions. If there are individuals or teams willing to invest in
> GraphX, it may still have a place in our ecosystem.
>
> Ultimately, I feel that we should prioritize the health of the Spark
> ecosystem and ensure that we are investing resources into actively
> maintained components.
>
> HTH
>
> Mich Talebzadeh
>
> Architect | Data Engineer | Data Science | Financial Crime
> PhD  Imperial College
> London 
>
> London, United Kingdom
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Sun, 29 Sept 2024 at 21:39, Holden Karau 
> wrote:
>
>> Since we're getting close to cutting a 4.0 branch I'd like to float the
>> idea of officially deprecating Graph X. What that would mean (to me) is we
>> would update the docs to indicate that Graph X is deprecated and it's APIs
>> may be removed at anytime in the future.
>>
>> Alternatively, we could mark it as "unmaintained and in search of
>> maintainers" with a note that if no maintainers are found, we may remove it
>> in a future minor version.
>>
>> Looking at the source graph X, I don't see any meaningful active
>> development going back over three years*. There is even a thread on user@
>> from 2017 asking if graph X is maintained anymore, with no response from
>> the developers.
>>
>> Now I'm open to the idea that GraphX is stable and "works as is" and
>> simply doesn't require modifications but given the user thread I'm a little
>> concerned here about bringing this API with us into Spark 4 if we don't
>> have anyone signed up to maintain it.
>>
>> * Excluding globally applied changes
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> 
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>

Re: [VOTE] Deprecate SparkR

2024-08-21 Thread Sean Owen

+1

On Wed, Aug 21, 2024, 11:40 AM Shivaram Venkataraman <
shivaram.venkatara...@gmail.com> wrote:

> Hi all
>
> Based on the previous discussion thread [1], I hereby call a vote to
> deprecate the SparkR module in Apache Spark with the upcoming Spark 4
> release and remove it in the next major release Spark 5.
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don’t think this is a good idea because ..
>
> This vote will be open for the next 72 hours
>
> Thanks
> Shivaram
>
> [1] https://lists.apache.org/thread/qjgsgxklvpvyvbzsx1qr8o533j4zjlm5
>

Re: [VOTE] Archive Spark Documentations in Apache Archives

2024-08-12 Thread Sean Owen

He did already; see the preceding thread here on dev@.

You can figure the size that moves out of the repo from the docs sizes:

9.9M ./0.6.0
 10M ./0.6.1
 10M ./0.6.2
 15M ./0.7.0
 16M ./0.7.2
 16M ./0.7.3
 20M ./0.8.0
 20M ./0.8.1
 38M ./0.9.0
 38M ./0.9.1
 38M ./0.9.2
 36M ./1.0.0
 38M ./1.0.1
 38M ./1.0.2
 48M ./1.1.0
 48M ./1.1.1
 73M ./1.2.0
 73M ./1.2.1
 74M ./1.2.2
 69M ./1.3.0
 73M ./1.3.1
 68M ./1.4.0
 70M ./1.4.1
 80M ./1.5.0
 78M ./1.5.1
 78M ./1.5.2
 87M ./1.6.0
 87M ./1.6.1
 87M ./1.6.2
 86M ./1.6.3
117M ./2.0.0
119M ./2.0.0-preview
118M ./2.0.1
118M ./2.0.2
121M ./2.1.0
121M ./2.1.1
122M ./2.1.2
122M ./2.1.3
130M ./2.2.0
131M ./2.2.1
132M ./2.2.2
131M ./2.2.3
141M ./2.3.0
141M ./2.3.1
141M ./2.3.2
142M ./2.3.3
142M ./2.3.4
145M ./2.4.0
146M ./2.4.1
145M ./2.4.2
144M ./2.4.3
145M ./2.4.4
143M ./2.4.5
143M ./2.4.6
143M ./2.4.7
143M ./2.4.8
197M ./3.0.0
185M ./3.0.0-preview
197M ./3.0.0-preview2
198M ./3.0.1
198M ./3.0.2
205M ./3.0.3
239M ./3.1.1
239M ./3.1.2
239M ./3.1.3
840M ./3.2.0
842M ./3.2.1
282M ./3.2.2
244M ./3.2.3
282M ./3.2.4
295M ./3.3.0
297M ./3.3.1
297M ./3.3.2
297M ./3.3.3
297M ./3.3.4
314M ./3.4.0
314M ./3.4.1
328M ./3.4.2
324M ./3.4.3
1.1G ./3.5.0
1.2G ./3.5.1
1.1G ./4.0.0-preview1

On Mon, Aug 12, 2024 at 5:16 PM Mich Talebzadeh 
wrote:

> Hi Kent,
>
> Can you if possible provide a heuristic estimate of space reduction your
> proposal is going to achieve?
>
> Thanks
>
> Mich Talebzadeh,
>
> Architect | Data Engineer | Data Science | Financial Crime
> PhD  Imperial College
> London 
> London, United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 12 Aug 2024 at 14:55, Mich Talebzadeh 
> wrote:
>
>> Hello,
>>
>> On the face of it, this email contains many references, making it
>> difficult to follow. Perhaps, a simpler explanation could improve voting
>> participation.
>>
>> The STAR methodology can be helpful in understanding and evaluating this
>> proposal. STAR stands for Situation, Task, Action, Result.
>>
>> Let us have a look at this
>>
>> *S*ituation:
>>
>>- The Spark website repository is reaching its storage limit on
>>GitHub-hosted runners.
>>
>> *T*ask:
>>
>>- Reduce storage usage without compromising access to documentation.
>>
>> *A*ction:(proposed)
>>
>>- Move documentation releases from the dev directory to the
>>release directory within the Apache distribution.
>>- Leverage the Apache Archives service to create permanent links for
>>the documentation.
>>- Upload older website-hosted documentation manually via SVN.
>>- Optionally, delete old documentation and update links/use
>>redirection as needed.
>>
>> *Result:*
>>
>>- Reduced storage usage on GitHub-hosted runners.
>>- Permanent, publicly accessible links for Spark documentation via
>>the Apache Archives.
>>- Potential need for manual upload of older documentation and link
>>updates.
>>
>>
>> Consider including an estimated storage reduction achieved through this
>> approach.
>> Overall, the proposal offers a viable solution for managing Spark
>> documentation while reducing storage concerns. However, addressing the
>> potential complexity of managing older documentation versions is crucial.
>>
>> +1 for me
>>
>> Mich Talebzadeh,
>>
>> Architect | Data Engineer | Data Science | Financial Crime
>> PhD  Imperial
>> College London 
>> London, United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Mon, 12 Aug 2024 at 10:09, Kent Yao  wrote:
>>
>>> Archive Spark Documentations in Apache Archives
>>>
>>> Hi dev,
>>>
>>> To address the issue of the Spark website repository size
>>> reaching the storage limit for GitHub-hosted runners [1], I suggest
>>> enhancing step [2] in our release process by relocating the
>>> documentation releases from the dev[3] directory to the release
>>> di

Re: [VOTE] Archive Spark Documentations in Apache Archives

2024-08-12 Thread Sean Owen

+1 with the following clarifications, for my benefit:

Once we upload to release, and it's copied by archive, we delete from
release right? I know we are meant to keep the files in release minimal as
they're mirrored to all ASF mirrors. But if we're uploading some batches
and deleting them after, that seems OK. I don't know of another way to
upload directly to archive.

We'd also need to create a new link to archived docs on the ASF website
somehow.

On Mon, Aug 12, 2024 at 4:09 AM Kent Yao  wrote:

> Archive Spark Documentations in Apache Archives
>
> Hi dev,
>
> To address the issue of the Spark website repository size
> reaching the storage limit for GitHub-hosted runners [1], I suggest
> enhancing step [2] in our release process by relocating the
> documentation releases from the dev[3] directory to the release
> directory[4]. Then it would captured by the Apache Archives
> service[5] to create permanent links, which would be alternative
> endpoints for our documentation, like
>
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/_site/index.html
> for
> https://spark.apache.org/docs/3.5.2/index.html
>
> Note that the previous example still uses the staging repository,
> which will become
> https://archive.apache.org/dist/spark/docs/3.5.2/index.html.
>
> For older releases hosted on the Spark website [6], we also need to
> upload them via SVN manually.
>
> After that, when we reach the threshold again, we can delete some of
> the old ones on page [6], and update their links on page [7] or use
> redirection.
>
> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-49209
>
> Please vote on the idea of  Archive Spark Documentations in
> Apache Archives for the next 72 hours:
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Bests,
> Kent Yao
>
> [1] https://lists.apache.org/thread/o0w4gqoks23xztdmjjj26jkp1yyg2bvq
> [2]
> https://spark.apache.org/release-process.html#upload-to-apache-release-directory
> [3] https://dist.apache.org/repos/dist/dev/spark/v3.5.2-rc5-docs/
> [4] https://dist.apache.org/repos/dist/release/spark/docs/3.5.2
> [5] https://archive.apache.org/dist/spark/
> [6] https://github.com/apache/spark-website/tree/asf-site/site/docs
> [7] https://spark.apache.org/documentation.html
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen

I don't think that's the issue - it's the size of what is cloned into a
container during the GitHub actions runs. Doesnt matter how it is stored.

They are not large files, either.

On Thu, Aug 8, 2024, 4:34 PM Mich Talebzadeh 
wrote:

>
> Maybe you should look into deploying GitHub Large File Storage
> <https://docs.github.com/en/repositories/working-with-files/managing-large-files/configuring-git-large-file-storage>
> (LFS).  If applicable, store large documentation files in LFS to reduce the
> repository size.
>
> HTH
>
> Mich Talebzadeh,
>
> Architect | Data Engineer | Data Science | Financial Crime
> PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
> London <https://en.wikipedia.org/wiki/Imperial_College_London>
> London, United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Thu, 8 Aug 2024 at 22:02, Sean Owen  wrote:
>
>> That seems a ltle bit too much to me. I could see people still on a
>> recent version that just want to see docs or compare/contrast docs for
>> changes.
>> Removing the versions that seem to have ~0 traffic would remove, it
>> seems, like 80% of the .html files (and replace them with a compressed
>> archive that's smaller), which seems 'enough' for the moment?
>>
>> But if docs releases are 1GB going forward, this will be an issue again
>> soon.
>>
>> On Thu, Aug 8, 2024 at 1:25 PM Wenchen Fan  wrote:
>>
>>> It makes sense to me to only keep the doc files for the latest
>>> maintenance release. i.e. remove the docs for 3.5.0 and only keep 3.5.1.
>>>
>>>>
>>>>
>>>>

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen

That seems a ltle bit too much to me. I could see people still on a
recent version that just want to see docs or compare/contrast docs for
changes.
Removing the versions that seem to have ~0 traffic would remove, it seems,
like 80% of the .html files (and replace them with a compressed archive
that's smaller), which seems 'enough' for the moment?

But if docs releases are 1GB going forward, this will be an issue again
soon.

On Thu, Aug 8, 2024 at 1:25 PM Wenchen Fan  wrote:

> It makes sense to me to only keep the doc files for the latest
> maintenance release. i.e. remove the docs for 3.5.0 and only keep 3.5.1.
>
>>
>>
>>

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen

Whoa! Is there any clear reason why 3.5 docs are so big? 1GB of docs / 10x
jump seems crazy. Maybe we need to investigate and fix that also.

I take it that the problem is the size of the repo once it's cloned into
the docker container. Removing the .html files helps that, but, then we
don't have .html docs in the published site!
We can generate them in the build process, but I presume it's waaay too
long to rebuild docs for every release every time.

I do support at *least* tarring up old .html docs from old releases (<3.0?)
and making them available somehow on the site, so that they're accessible
if needed.

Analytics says that page views for docs before 3.1 are quite minimal,
probably hundreds of views this year at best vs 10M total views:
https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?idSite=40&period=year&date=2024-08-07&category=General_Actions&subcategory=General_Pages

On Thu, Aug 8, 2024 at 12:42 PM Dongjoon Hyun 
wrote:

> The culprit seems to be PySpark 3.5 documentation which grows 11x times at
> 3.5+
>
> $ du -h 3.4.3/api/python | tail -n1
>  84M 3.4.3/api/python
>
> $ du -h 3.5.1/api/python | tail -n1
> 943M 3.5.1/api/python
>
> Since we will generate big documents for 3.5.x, 4.0.0-preview, 4.0.x,
> 4.1.x, the proposed tarball idea sounds promising to me too.
>
> $ ls -alh 3.5.1.tgz
> -rw-r--r--  1 dongjoon  staff   103M Aug  8 10:22 3.5.1.tgz
>
> Specifically, shall we keep HTML files for only the latest version of live
> releases, e.g. 3.4.3, 3.5.1, and 4.0.0-preview1?
>
> In other words, all 0.x ~ 3.4.2 and 3.5.1 will be tarball files in the
> current status.
>
> Dongjoon.
>
>
> On Thu, Aug 8, 2024 at 10:01 AM Sean Owen  wrote:
>
>> I agree with 'archiving', but what does that mean? delete from the repo
>> and site?
>> While I really doubt people are looking for docs for, say, 0.5.0, it'd be
>> a big jump to totally remove it.
>>
>> What if we made a compressed tarball of old docs and put that in the
>> repo, linked to it, and removed the docs files for many old releases?
>> It's still in the repo and will be in the container when docs are built,
>> but, compressed would be much smaller.
>> That could buy a significant amount of time.
>>
>> On Thu, Aug 8, 2024 at 7:06 AM Kent Yao  wrote:
>>
>>> Hi dev,
>>>
>>> The current size of the spark-website repository is approximately 16GB,
>>> exceeding the storage limit of GitHub-hosted runners.  The GitHub actions
>>> have been failing recently in the actions/checkout step caused by
>>> 'No space left on device' errors.
>>>
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> overlay  73G   58G   16G  80% /
>>> tmpfs64M 0   64M   0% /dev
>>> tmpfs   7.9G 0  7.9G   0% /sys/fs/cgroup
>>> shm  64M 0   64M   0% /dev/shm
>>> /dev/root73G   58G   16G  80% /__w
>>> tmpfs   1.6G  1.2M  1.6G   1% /run/docker.sock
>>> tmpfs   7.9G 0  7.9G   0% /proc/acpi
>>> tmpfs   7.9G 0  7.9G   0% /proc/scsi
>>> tmpfs   7.9G 0  7.9G   0% /sys/firmware
>>>
>>>
>>> The documentation for each version contributes the most volume. Since
>>> version
>>>  3.5.0, the documentation size has grown 3-4 times larger than the
>>> size of 3.4.x,
>>>  with more than 1GB.
>>>
>>>
>>> 9.9M ./0.6.0
>>>  10M ./0.6.1
>>>  10M ./0.6.2
>>>  15M ./0.7.0
>>>  16M ./0.7.2
>>>  16M ./0.7.3
>>>  20M ./0.8.0
>>>  20M ./0.8.1
>>>  38M ./0.9.0
>>>  38M ./0.9.1
>>>  38M ./0.9.2
>>>  36M ./1.0.0
>>>  38M ./1.0.1
>>>  38M ./1.0.2
>>>  48M ./1.1.0
>>>  48M ./1.1.1
>>>  73M ./1.2.0
>>>  73M ./1.2.1
>>>  74M ./1.2.2
>>>  69M ./1.3.0
>>>  73M ./1.3.1
>>>  68M ./1.4.0
>>>  70M ./1.4.1
>>>  80M ./1.5.0
>>>  78M ./1.5.1
>>>  78M ./1.5.2
>>>  87M ./1.6.0
>>>  87M ./1.6.1
>>>  87M ./1.6.2
>>>  86M ./1.6.3
>>> 117M ./2.0.0
>>> 119M ./2.0.0-preview
>>> 118M ./2.0.1
>>> 118M ./2.0.2
>>> 121M ./2.1.0
>>> 121M ./2.1.1
>>> 122M ./2.1.2
>>> 122M ./2.1.3
>>> 130M ./2.2.0
>>> 131M ./2.2.1
>>> 132M ./2.2.2
>>> 131M ./2.2.3
>>> 141M ./2.3.0
>>> 141M ./2.3.1
>>> 141M ./2.3.2
>>> 142

Re: [DISCUSS] Using Github Issues for Spark-Connect-Go _only_ issues.

2024-08-08 Thread Sean Owen

Oh nice if that has changed. Id personally prefer switching all of Spark to
GitHub issues for simplicity but maybe that's a big lift. And a separate
question.

On Thu, Aug 8, 2024, 9:12 AM Martin Grund  wrote:

> Mich, yes, the goal is to make it easier for folks to contribute to the Go
> client, and my discussion is related to the
> https://github.com/apache/spark-connect-go repository only and thanks a
> lot for the feedback. My assumption is that we will monitor the GH issues
> in the same way as we do for the Jira issues (e.g. they can go to the same
> mailing list etc) The feedback that you provided is really helpful as it
> outlines what we have to do for release management of the Go client as
> well.
>
> Sean, according to https://infra.apache.org/services.html#issue-tracking
> projects can use JIRA and GH issues, see as well here for the self-serve
> instructions - https://infra.apache.org/request-bug-tracker.html
>
> Please keep the feedback coming.
>
> On Thu, Aug 8, 2024 at 2:43 PM Sean Owen  wrote:
>
>> This is still part of the Apache Spark project, conceptually?
>> IIRC Apache projects still need to use JIRA, so we can't do this.
>>
>> On Thu, Aug 8, 2024 at 5:08 AM Mich Talebzadeh 
>> wrote:
>>
>>> Hi Martin,
>>>
>>> If I understood it correctly, your  proposal suggests centralizing issue
>>> tracking for the Spark Connect Go client on GitHub Issues, instead of using
>>> both Jira and GitHub.? The primary motivation is to simplify the
>>> contribution process for developers?
>>>
>>> Few points if I may:
>>>
>>>
>>>- How will critical or high-priority issues be handled within GitHub
>>>Issues?
>>>- What mechanisms will be in place to ensure timely response and
>>>resolution of issues?
>>>- How will the participants measure and track issue resolution and
>>>development progress?
>>>- What is the plan for migrating existing Jira issues to GitHub
>>>Issues?
>>>
>>>
>>> HTH,
>>>
>>> Mich Talebzadeh,
>>> Architect | Data Engineer | Data Science | Financial Crime
>>>
>>> PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial
>>> College London <https://en.wikipedia.org/wiki/Imperial_College_London>
>>> London, United Kingdom
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>
>>>
>>> On Thu, 8 Aug 2024 at 06:54, Martin Grund 
>>> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I wanted to start a discussion for the following proposal: To make it
>>>> easier for folks to contribute to the Spark Connect Go client, I was
>>>> contemplating not requiring them to deal with two accounts (one for Jira)
>>>> and one for Gihutb but allow using GitHub Issues for bugs and issues that
>>>> are specific to *only* the Spark Connect Go client.
>>>>
>>>> Jira will still be used for issues that span core Spark. This allows us
>>>> to easily label issues for starter tasks in one place, for example.
>>>>
>>>> I am explicitly not proposing to change the behavior for the main Spark
>>>> repository; here, the existing procedure remains.
>>>>
>>>> What do you think?
>>>>
>>>> Martin
>>>>
>>>

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen

I agree with 'archiving', but what does that mean? delete from the repo and
site?
While I really doubt people are looking for docs for, say, 0.5.0, it'd be a
big jump to totally remove it.

What if we made a compressed tarball of old docs and put that in the repo,
linked to it, and removed the docs files for many old releases?
It's still in the repo and will be in the container when docs are built,
but, compressed would be much smaller.
That could buy a significant amount of time.

On Thu, Aug 8, 2024 at 7:06 AM Kent Yao  wrote:

> Hi dev,
>
> The current size of the spark-website repository is approximately 16GB,
> exceeding the storage limit of GitHub-hosted runners.  The GitHub actions
> have been failing recently in the actions/checkout step caused by
> 'No space left on device' errors.
>
> Filesystem  Size  Used Avail Use% Mounted on
> overlay  73G   58G   16G  80% /
> tmpfs64M 0   64M   0% /dev
> tmpfs   7.9G 0  7.9G   0% /sys/fs/cgroup
> shm  64M 0   64M   0% /dev/shm
> /dev/root73G   58G   16G  80% /__w
> tmpfs   1.6G  1.2M  1.6G   1% /run/docker.sock
> tmpfs   7.9G 0  7.9G   0% /proc/acpi
> tmpfs   7.9G 0  7.9G   0% /proc/scsi
> tmpfs   7.9G 0  7.9G   0% /sys/firmware
>
>
> The documentation for each version contributes the most volume. Since
> version
>  3.5.0, the documentation size has grown 3-4 times larger than the
> size of 3.4.x,
>  with more than 1GB.
>
>
> 9.9M ./0.6.0
>  10M ./0.6.1
>  10M ./0.6.2
>  15M ./0.7.0
>  16M ./0.7.2
>  16M ./0.7.3
>  20M ./0.8.0
>  20M ./0.8.1
>  38M ./0.9.0
>  38M ./0.9.1
>  38M ./0.9.2
>  36M ./1.0.0
>  38M ./1.0.1
>  38M ./1.0.2
>  48M ./1.1.0
>  48M ./1.1.1
>  73M ./1.2.0
>  73M ./1.2.1
>  74M ./1.2.2
>  69M ./1.3.0
>  73M ./1.3.1
>  68M ./1.4.0
>  70M ./1.4.1
>  80M ./1.5.0
>  78M ./1.5.1
>  78M ./1.5.2
>  87M ./1.6.0
>  87M ./1.6.1
>  87M ./1.6.2
>  86M ./1.6.3
> 117M ./2.0.0
> 119M ./2.0.0-preview
> 118M ./2.0.1
> 118M ./2.0.2
> 121M ./2.1.0
> 121M ./2.1.1
> 122M ./2.1.2
> 122M ./2.1.3
> 130M ./2.2.0
> 131M ./2.2.1
> 132M ./2.2.2
> 131M ./2.2.3
> 141M ./2.3.0
> 141M ./2.3.1
> 141M ./2.3.2
> 142M ./2.3.3
> 142M ./2.3.4
> 145M ./2.4.0
> 146M ./2.4.1
> 145M ./2.4.2
> 144M ./2.4.3
> 145M ./2.4.4
> 143M ./2.4.5
> 143M ./2.4.6
> 143M ./2.4.7
> 143M ./2.4.8
> 197M ./3.0.0
> 185M ./3.0.0-preview
> 197M ./3.0.0-preview2
> 198M ./3.0.1
> 198M ./3.0.2
> 205M ./3.0.3
> 239M ./3.1.1
> 239M ./3.1.2
> 239M ./3.1.3
> 840M ./3.2.0
> 842M ./3.2.1
> 282M ./3.2.2
> 244M ./3.2.3
> 282M ./3.2.4
> 295M ./3.3.0
> 297M ./3.3.1
> 297M ./3.3.2
> 297M ./3.3.3
> 297M ./3.3.4
> 314M ./3.4.0
> 314M ./3.4.1
> 328M ./3.4.2
> 324M ./3.4.3
> 1.1G ./3.5.0
> 1.2G ./3.5.1
> 1.1G ./4.0.0-preview1
>
> I'm concerned about publishing the documentation for version 3.5.2
> to the asf-site. So, I have merged PR[2] to eliminate this potential
> blocker.
>
> Considering that the problem still exists, should we temporarily archive
> some of the outdated version documents? For example, only keep
> the latest version for each feature release in the asf-site branch. Or,
> Do you have any other suggestions?
>
>
> Bests,
> Kent Yao
>
>
> [1]
> https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories
> [2] https://github.com/apache/spark-website/pull/543
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

Re: [DISCUSS] Using Github Issues for Spark-Connect-Go _only_ issues.

2024-08-08 Thread Sean Owen

This is still part of the Apache Spark project, conceptually?
IIRC Apache projects still need to use JIRA, so we can't do this.

On Thu, Aug 8, 2024 at 5:08 AM Mich Talebzadeh 
wrote:

> Hi Martin,
>
> If I understood it correctly, your  proposal suggests centralizing issue
> tracking for the Spark Connect Go client on GitHub Issues, instead of using
> both Jira and GitHub.? The primary motivation is to simplify the
> contribution process for developers?
>
> Few points if I may:
>
>
>- How will critical or high-priority issues be handled within GitHub
>Issues?
>- What mechanisms will be in place to ensure timely response and
>resolution of issues?
>- How will the participants measure and track issue resolution and
>development progress?
>- What is the plan for migrating existing Jira issues to GitHub Issues?
>
>
> HTH,
>
> Mich Talebzadeh,
> Architect | Data Engineer | Data Science | Financial Crime
>
> PhD  Imperial College
> London 
> London, United Kingdom
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Thu, 8 Aug 2024 at 06:54, Martin Grund 
> wrote:
>
>> Hi folks,
>>
>> I wanted to start a discussion for the following proposal: To make it
>> easier for folks to contribute to the Spark Connect Go client, I was
>> contemplating not requiring them to deal with two accounts (one for Jira)
>> and one for Gihutb but allow using GitHub Issues for bugs and issues that
>> are specific to *only* the Spark Connect Go client.
>>
>> Jira will still be used for issues that span core Spark. This allows us
>> to easily label issues for starter tasks in one place, for example.
>>
>> I am explicitly not proposing to change the behavior for the main Spark
>> repository; here, the existing procedure remains.
>>
>> What do you think?
>>
>> Martin
>>
>

Re: [DISCUSS] Why do we remove RDD usage and RDD-backed code?

2024-07-29 Thread Sean Owen

Also from ASF community perspective -

I think all are agreed this was merged too fast. But, I'm missing where
this is somehow due to the needs of a single vendor. Where is this related
to file systems or keys?
did I miss it from another discussion or PR, or is this actually about a
different issue?

Otherwise I don't see what this lecture is about. The issue that was raised
(existing Spark Session) is, I agree, not an issue.


On Mon, Jul 29, 2024 at 12:43 PM Steve Loughran 
wrote:

>
> I'm going to join in from an ASF community perspective.
>
> Nobody should be making fundamental changes to an ASF code base with a PR
> up and then merged two hours later because of the needs of a single vendor
> of a downstream product. This doesn't even give people in different time
> zones the chance to review it. It goes completely against the concept of
> "community" and replaces it with private problems, not shared with anyone,
> and large pieces of development work to address them without any
> opportunity for others to improve. Pieces of work which presumably must
> have been ongoing for some days.
>
> I know doing stuff in public is time-consuming as you have to spend a lot
> of time chasing reviews, but collaboration is essential as it ensures that
> changes meet the needs of a broader community than one single vendor.
> Avoiding that is exclusively and unhealthy for a project.
>
> If the databricks products have some problem resolving user:key secrets in
> paths in the virtual file system, that will be good to know, especially the
> what and the why -as others may encounter it too. At the very least: others
> should know what to do so as to avoid getting into the same situation.
>
> If you want more nimble development, well, closed source gives you that.
> Switching to commit-then-review on specific ASF repos is also allowed,
> despite the inherent risks. We use it for some of her hadoop release
> packaging/testing for a rapid iteration of release process automation and
> validation code.
>
> Anyway, the patch has been reverted and discussions are now ongoing, as
> they should have been from the outset.
>
> Steve
>
>
> On Wed, 24 Jul 2024 at 01:29, Hyukjin Kwon  wrote:
>
>> There is always a running session. I replied in the PR.
>>
>> On Tue, 23 Jul 2024 at 23:32, Dongjoon Hyun  wrote:
>>
>>> I'm bumping up this thread because the overhead bites us back already.
>>> Here is a commit merged 3 hours ago.
>>>
>>> https://github.com/apache/spark/pull/47453
>>> [SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in
>>> spark ML reader/writer
>>>
>>> In short, unlike the original PRs' claims, this commit starts to create
>>> `SparkSession` in this layer. Although I understand the reason why Hyukjin
>>> and Martin claims that `SparkSession` will be there in any way, this is an
>>> architectural change which we need to decide explicitly, not implicitly.
>>>
>>> > On 2024/07/13 05:33:32 Hyukjin Kwon wrote:
>>> > We actually get the active Spark session so it doesn't cause overhead.
>>> Also
>>> > even we create, it will create once which should be pretty trivial
>>> overhead.
>>>
>>> If this architectural change is required inevitably and needs to happen
>>> in Apache Spark 4.0.0. Can we have a dev-document about this? If there is
>>> no proper place, we can add it to the ML migration guide simply.
>>>
>>> Dongjoon.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread Sean Owen

Yeah let's get that fix in, but it seems to be a minor test only issue so
should not block release.

On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:

> Very sorry. When I was fixing `SPARK-45242 (
> https://github.com/apache/spark/pull/43594)`
> <https://github.com/apache/spark/pull/43594)>, I noticed that its
> `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I
> didn't realize that it had also been merged into branch-3.5, so I didn't
> advocate for SPARK-45357 to be backported to branch-3.5.
>
>
>
> As far as I know, the condition to trigger this test failure is: when
> using Maven to test the `connect` module, if  `sparkTestRelation` in
> `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized,
> then the `id` of `sparkTestRelation` will no longer be 0. So, I think this
> is indeed related to the order in which Maven executes the test cases in
> the `connect` module.
>
>
>
> I have submitted a backport PR
> <https://github.com/apache/spark/pull/45141> to branch-3.5, and if
> necessary, we can merge it to fix this test issue.
>
>
>
> Jie Yang
>
>
>
> *发件人**: *Jungtaek Lim 
> *日期**: *2024年2月16日 星期五 22:15
> *收件人**: *Sean Owen , Rui Wang 
> *抄送**: *dev 
> *主题**: *Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
>
>
>
> I traced back relevant changes and got a sense of what happened.
>
>
>
> Yangjie figured out the issue via link
> <https://mailshield.baidu.com/check?q=8dOSfwXDFpe5HSp%2b%2bgCPsNQ52B7S7TAFG56Vj3tiFgMkCyOrQEGbg03AVWDX5bwwyIW7sZx3JZox3w8Jz1iw%2bPjaOZYmLWn2>.
> It's a tricky issue according to the comments from Yangjie - the test is
> dependent on ordering of execution for test suites. He said it does not
> fail in sbt, hence CI build couldn't catch it.
>
> He fixed it via link
> <https://mailshield.baidu.com/check?q=ojK3dg%2fDFf3xmQ8SPzsIou3EKaE1ZePctdB%2fUzhWmewnZb5chnQM1%2f8D1JDJnkxF>,
> but we missed that the offending commit was also ported back to 3.5 as
> well, hence the fix wasn't ported back to 3.5.
>
>
>
> Surprisingly, I can't reproduce locally even with maven. In my attempt to
> reproduce, SparkConnectProtoSuite was executed at
> third, SparkConnectStreamingQueryCacheSuite, and ExecuteEventsManagerSuite,
> and then SparkConnectProtoSuite. Maybe very specific to the environment,
> not just maven? My env: MBP M1 pro chip, MacOS 14.3.1, Openjdk 17.0.9. I
> used build/mvn (Maven 3.8.8).
>
>
>
> I'm not 100% sure this is something we should fail the release as it's a
> test only and sounds very environment dependent, but I'll respect your call
> on vote.
>
>
>
> Btw, looks like Rui also made a relevant fix via link
> <https://mailshield.baidu.com/check?q=TUbVzroxG%2fbi2P4qN0kbggzXuPzSN%2bKDoUFGhS9xMet8aXVw6EH0rMr1MKJqp2E2>
>  (not
> to fix the failing test but to fix other issues), but this also wasn't
> ported back to 3.5. @Rui Wang  Do you think this is
> a regression issue and warrants a new RC?
>
>
>
>
>
> On Fri, Feb 16, 2024 at 11:38 AM Sean Owen  wrote:
>
> Is anyone seeing this Spark Connect test failure? then again, I have some
> weird issue with this env that always fails 1 or 2 tests that nobody else
> can replicate.
>
>
>
> - Test observe *** FAILED ***
>   == FAIL: Plans do not match ===
>   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS
> max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric,
> [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L],
> 44
>+- LocalRelation , [id#0, name#0]
>   +- LocalRelation , [id#0, name#0]
> (PlanTest.scala:179)
>
>
>
> On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim 
> wrote:
>
> DISCLAIMER: RC for Apache Spark 3.5.1 starts with RC2 as I lately figured
> out doc generation issue after tagging RC1.
>
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.5.1.
>
> The vote is open until February 18th 9AM (PST) and passes if a majority +1
> PMC votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
> <https://mailshield.baidu.com/check?q=iR6md5rYrz%2bpTPJlEXXlR6NN3aGjunZT0DADO3Pcgs0%3d>
>
> The tag to be voted on is v3.5.1-rc2 (commit
> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
> https://github.com/apache/spark/tree/v3.5.1-rc2
> <https://mailshield.baidu.com/check?q=BMfFodF3wXGjeH1b9pbW8V4xeWam1vqNNCMtg1lcpC0d4WtLLiIr8UPiFKSwNMjbEy0AJw%3d%3d>

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-15 Thread Sean Owen

Is anyone seeing this Spark Connect test failure? then again, I have some
weird issue with this env that always fails 1 or 2 tests that nobody else
can replicate.

- Test observe *** FAILED ***
  == FAIL: Plans do not match ===
  !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS
max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric,
[min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L],
44
   +- LocalRelation , [id#0, name#0]
+- LocalRelation , [id#0, name#0]
(PlanTest.scala:179)

On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim 
wrote:

> DISCLAIMER: RC for Apache Spark 3.5.1 starts with RC2 as I lately figured
> out doc generation issue after tagging RC1.
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.5.1.
>
> The vote is open until February 18th 9AM (PST) and passes if a majority +1
> PMC votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.5.1-rc2 (commit
> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
> https://github.com/apache/spark/tree/v3.5.1-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1452/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-docs/
>
> The list of bug fixes going into 3.5.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353495
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC via "pip install
> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/pyspark-3.5.1.tar.gz
> "
> and see if anything important breaks.
> In the Java/Scala, you can add the staging repository to your projects
> resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.5.1?
> ===
>
> The current list of open tickets targeted at 3.5.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>

Re: Removing Kinesis in Spark 4

2024-01-20 Thread Sean Owen

I'm not aware of much usage. but that doesn't mean a lot.

FWIW, in the past month or so, the Kinesis docs page got about 700 views,
compared to about 1400 for Kafka
https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?idSite=40&period=range&date=2023-12-15,2024-01-20&category=General_Actions&subcategory=Actions_SubmenuPageTitles

Those are "low" in general, compared to the views for streaming pages,
which got tens of thousands of views.

I do feel like it's unmaintained, and do feel like it might be a stretch to
leave it lying around until Spark 5.
It's not exactly unused though.

I would not object to removing it unless there is some voice of support
here.

On Sat, Jan 20, 2024 at 10:38 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> From the dev thread: What else could be removed in Spark 4?
> 
>
> On Aug 17, 2023, at 1:44 AM, Yang Jie  wrote:
>
> I would like to know how we should handle the two Kinesis-related modules
> in Spark 4.0. They have a very low frequency of code updates, and because
> the corresponding tests are not continuously executed in any GitHub Actions
> pipeline, so I think they significantly lack quality assurance. On top of
> that, I am not certain if the test cases, which require AWS credentials in
> these modules, get verified during each Spark version release.
>
>
> Did we ever reach a decision about removing Kinesis in Spark 4?
>
> I was cleaning up some docs related to Kinesis and came across a reference
> to some Java API docs that I could not find
> . And
> looking around I came across both this email thread and this thread on
> JIRA
> 
>  about
> potentially removing Kinesis.
>
> But as far as I can tell we haven’t made a clear decision one way or the
> other.
>
> Nick
>
>

Re: Regression? - UIUtils::formatBatchTime - [SPARK-46611][CORE] Remove ThreadLocal by replace SimpleDateFormat with DateTimeFormatter

2024-01-08 Thread Sean Owen

Agreed, that looks wrong. From the code, it seems that "timezone" is only
used for testing, though apparently no test caught this. I'll submit a PR
to patch it in any event: https://github.com/apache/spark/pull/44619

On Mon, Jan 8, 2024 at 1:33 AM Janda Martin  wrote:

> I think that
>  [SPARK-46611][CORE] Remove ThreadLocal by replace SimpleDateFormat with
> DateTimeFormatter
>
>   introduced regression in UIUtils::formatBatchTime when timezone is
> defined.
>
> DateTimeFormatter is thread-safe and immutable according to JavaDoc so
> method DateTimeFormatter::withZone returns new instance when zone is
> changed.
>
> Following code has no effect:
>   val oldTimezones = (batchTimeFormat.getZone,
> batchTimeFormatWithMilliseconds.getZone)
>   if (timezone != null) {
>   val zoneId = timezone.toZoneId
>   batchTimeFormat.withZone(zoneId)
>   batchTimeFormatWithMilliseconds.withZone(zoneId)
> }
>
> Suggested fix:
> introduce local variables for "batchTimeFormat" and
> "batchTimeFormatWithMilliseconds" and remove "oldTimezones" and "finally"
> block.
>
>   I hope that I'm right. I just read the code. I didn't make any tests.
>
>  Thank you
>Martin
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

Re: Should Spark 4.x use Java modules (those you define with module-info.java sources)?

2023-12-04 Thread Sean Owen

It already does. I think that's not the same idea?

On Mon, Dec 4, 2023, 8:12 PM Almog Tavor  wrote:

> I think Spark should start shading it’s problematic deps similar to how
> it’s done in Flink
>
> On Mon, 4 Dec 2023 at 2:57 Sean Owen  wrote:
>
>> I am not sure we can control that - the Scala _x.y suffix has particular
>> meaning in the Scala ecosystem for artifacts and thus the naming of .jar
>> files. And we need to work with the Scala ecosystem.
>>
>> What can't handle these files, Spring Boot? does it somehow assume the
>> .jar file name relates to Java modules?
>>
>> By the by, Spark 4 is already moving to the jakarta.* packages for
>> similar reasons.
>>
>> I don't think Spark does or can really leverage Java modules. It started
>> waaay before that and expect that it has some structural issues that are
>> incompatible with Java modules, like multiple places declaring code in the
>> same Java package.
>>
>> As in all things, if there's a change that doesn't harm anything else and
>> helps support for Java modules, sure, suggest it. If it has the conflicts I
>> think it will, probably not possible and not really a goal I think.
>>
>>
>> On Sun, Dec 3, 2023 at 11:30 AM Marc Le Bihan 
>> wrote:
>>
>>> Hello,
>>>
>>> Last month, I've attempted the experience of upgrading my
>>> Spring-Boot 2 Java project, that relies heavily on Spark 3.4.2, to
>>> Spring-Boot 3. It didn't succeed yet, but was informative.
>>>
>>> Spring-Boot 2 → 3 means especially javax.* becoming jakarka.* :
>>> javax.activation, javax.ws.rs, javax.persistence, javax.validation,
>>> javax.servlet... all of these have to change their packages and
>>> dependencies.
>>> Apart of that, they were some trouble with ANTLR 4 against ANTLR 3,
>>> and few things with SFL4 and Log4J.
>>>
>>> It was not easy, and I guessed that going into modules could be a
>>> key. But when I'm near the Spark submodules of my project, it fail with
>>> messages such as:
>>> package org.apache.spark.sql.types is declared in the unnamed
>>> module, but module fr.ecoemploi.outbound.spark.core does not read it
>>>
>>> But I can't handle the spark dependencies easily, because they have
>>> an "invalid name" for Java. It's a matter that it doesn't want the "_" that
>>> is in the "_2.13" suffix of the jars.
>>> [WARNING] Can't extract module name from
>>> breeze-macros_2.13-2.1.0.jar: breeze.macros.2.13: Invalid module name: '2'
>>> is not a Java identifier
>>> [WARNING] Can't extract module name from
>>> spark-tags_2.13-3.4.2.jar: spark.tags.2.13: Invalid module name: '2' is not
>>> a Java identifier
>>> [WARNING] Can't extract module name from
>>> spark-unsafe_2.13-3.4.2.jar: spark.unsafe.2.13: Invalid module name: '2' is
>>> not a Java identifier
>>> [WARNING] Can't extract module name from
>>> spark-mllib_2.13-3.4.2.jar: spark.mllib.2.13: Invalid module name: '2' is
>>> not a Java identifier
>>> [... around 30 ...]
>>>
>>> I think that changing the naming pattern of the Spark jars for the
>>> 4.x could be a good idea,
>>> but beyond that, what about attempting to integrate Spark into
>>> modules, it's submodules defining module-info.java?
>>>
>>> Is it something that you think that [must | should | might | should
>>> not | must not] be done?
>>>
>>> Regards,
>>>
>>> Marc Le Bihan
>>>
>>

Re: Should Spark 4.x use Java modules (those you define with module-info.java sources)?

2023-12-03 Thread Sean Owen

I am not sure we can control that - the Scala _x.y suffix has particular
meaning in the Scala ecosystem for artifacts and thus the naming of .jar
files. And we need to work with the Scala ecosystem.

What can't handle these files, Spring Boot? does it somehow assume the .jar
file name relates to Java modules?

By the by, Spark 4 is already moving to the jakarta.* packages for similar
reasons.

I don't think Spark does or can really leverage Java modules. It started
waaay before that and expect that it has some structural issues that are
incompatible with Java modules, like multiple places declaring code in the
same Java package.

As in all things, if there's a change that doesn't harm anything else and
helps support for Java modules, sure, suggest it. If it has the conflicts I
think it will, probably not possible and not really a goal I think.

On Sun, Dec 3, 2023 at 11:30 AM Marc Le Bihan  wrote:

> Hello,
>
> Last month, I've attempted the experience of upgrading my Spring-Boot
> 2 Java project, that relies heavily on Spark 3.4.2, to Spring-Boot 3. It
> didn't succeed yet, but was informative.
>
> Spring-Boot 2 → 3 means especially javax.* becoming jakarka.* :
> javax.activation, javax.ws.rs, javax.persistence, javax.validation,
> javax.servlet... all of these have to change their packages and
> dependencies.
> Apart of that, they were some trouble with ANTLR 4 against ANTLR 3,
> and few things with SFL4 and Log4J.
>
> It was not easy, and I guessed that going into modules could be a key.
> But when I'm near the Spark submodules of my project, it fail with messages
> such as:
> package org.apache.spark.sql.types is declared in the unnamed
> module, but module fr.ecoemploi.outbound.spark.core does not read it
>
> But I can't handle the spark dependencies easily, because they have an
> "invalid name" for Java. It's a matter that it doesn't want the "_" that is
> in the "_2.13" suffix of the jars.
> [WARNING] Can't extract module name from
> breeze-macros_2.13-2.1.0.jar: breeze.macros.2.13: Invalid module name: '2'
> is not a Java identifier
> [WARNING] Can't extract module name from
> spark-tags_2.13-3.4.2.jar: spark.tags.2.13: Invalid module name: '2' is not
> a Java identifier
> [WARNING] Can't extract module name from
> spark-unsafe_2.13-3.4.2.jar: spark.unsafe.2.13: Invalid module name: '2' is
> not a Java identifier
> [WARNING] Can't extract module name from
> spark-mllib_2.13-3.4.2.jar: spark.mllib.2.13: Invalid module name: '2' is
> not a Java identifier
> [... around 30 ...]
>
> I think that changing the naming pattern of the Spark jars for the 4.x
> could be a good idea,
> but beyond that, what about attempting to integrate Spark into
> modules, it's submodules defining module-info.java?
>
> Is it something that you think that [must | should | might | should
> not | must not] be done?
>
> Regards,
>
> Marc Le Bihan
>

Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Sean Owen

I think it's the same, and always has been - yes you don't have a
guaranteed ordering unless an operation produces a specific ordering. Could
be the result of order by, yes; I believe you would be guaranteed that
reading input files results in data in the order they appear in the file,
etc. 1:1 operations like map() don't change ordering. But not the result of
a shuffle, for example. So yeah anything like limit or head might give
different results in the future (or simply on different cluster setups with
different parallelism, etc). The existence of operations like offset
doesn't contradict that. Maybe that's totally fine in some situations (ex:
I just want to display some sample rows) but otherwise yeah you've always
had to state your ordering for "first" or "nth" to have a guaranteed result.

On Mon, Sep 18, 2023 at 10:48 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I’ve always considered DataFrames to be logically equivalent to SQL tables
> or queries.
>
> In SQL, the result order of any query is implementation-dependent without
> an explicit ORDER BY clause. Technically, you could run `SELECT * FROM
> table;` 10 times in a row and get 10 different orderings.
>
> I thought the same applied to DataFrames, but the docstring for the
> recently added method DataFrame.offset
> 
>  implies
> otherwise.
>
> This example will work fine in practice, of course. But if DataFrames are
> technically unordered without an explicit ordering clause, then in theory a
> future implementation change may result in “Bob" being the “first” row in
> the DataFrame, rather than “Tom”. That would make the example incorrect.
>
> Is that not the case?
>
> Nick
>
>

Re: [DISCUSS] SPIP: Python Stored Procedures

2023-08-31 Thread Sean Owen

I think you're talking past Hyukjin here.

I think the response is: none of that is managed by Pyspark now, and this
proposal does not change that. Your current interpreter and environment is
used to execute the stored procedure, which is just Python code. It's on
you to bring an environment that runs the code correctly. This is just the
same as how running any python code works now.

I think you have exactly the same problems with UDFs now, and that's all a
real problem, just not something Spark has ever tried to solve for you.
Think of this as exactly like: I have a bit of python code I import as a
function and share across many python workloads. Just, now that chunk is
stored as a 'stored procedure'.

I agree this raises the same problem in new ways - now, you are storing and
sharing a chunk of code across many workloads. There is more potential for
compatibility and environment problems, as all of that is simply punted to
the end workloads. But, it's not different from importing common code and
the world doesn't fall apart.

On Wed, Aug 30, 2023 at 11:16 PM Alexander Shorin  wrote:

>
> Which Python version will run that stored procedure?
>>
>> All Python versions supported in PySpark
>>
>
> Where in stored procedure defines the exact python version which will run
> the code? That was the question.
>
>
>> How to manage external dependencies?
>>
>> Existing way we have
>> https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
>> .
>> In fact, this will use the external dependencies within your Python
>> interpreter so you can use all existing conda or venvs.
>>
> Current proposal solves this issue nohow (the stored code doesn't provide
> any manifest about its dependencies and what is required to run it). So
> feels like it's better to stay with UDF since they are under control and
> their behaviour is predictable. Did I miss something?
>
> How to test it via a common CI process?
>>
>> Existing way of PySpark unittests, see
>> https://github.com/apache/spark/tree/master/python/pyspark/tests
>>
> Sorry, but this wouldn't work since stored procedure thing requires some
> specific definition and this code will not be stored as regular python
> code. Do you have any examples how to test stored python procedures as a
> unit e.g. without spark?
>
> How to manage versions and do upgrades? Migrations?
>>
>> This is a new feature so no migration is needed. We will keep the
>> compatibility according to the sember we follow.
>>
> Question was not about spark, but about stored procedures itself. Any
> guidelines which will not copy flaws of other systems?
>
> Current Python UDF solution handles these problems in a good way since
>> they delegate them to project level.
>>
>> Current UDF solution cannot handle stored procedures because UDF is on
>> the worker side. This is Driver side.
>>
> How so? Currently it works and we never faced such issue. May be you
> should have the same Python code also on the driver side? But such trivial
> idea doesn't require new feature on Spark since you already have to ship
> that code somehow.
>
> --
> ,,,^..^,,,
>

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Sean Owen

It worked fine after I ran it again I included "package test" instead of
"test" (I had previously run "install") +1

On Wed, Aug 30, 2023 at 6:06 AM yangjie01  wrote:

> Hi, Sean
>
>
>
> I have performed testing with Java 17 and Scala 2.13 using maven (`mvn
> clean install` and `mvn package test`), and have not encountered the issue
> you mentioned.
>
>
>
> The test for the connect module depends on the `spark-protobuf` module to
> complete the `package,` was it successful? Or could you provide the test
> command for me to verify?
>
>
>
> Thanks,
>
> Jie Yang
>
>
>
> *发件人**: *Dipayan Dev 
> *日期**: *2023年8月30日 星期三 17:01
> *收件人**: *Sean Owen 
> *抄送**: *Yuanjian Li , Spark dev list <
> dev@spark.apache.org>
> *主题**: *Re: [VOTE] Release Apache Spark 3.5.0 (RC3)
>
>
>
> Can we fix this bug in Spark 3.5.0?
>
> https://issues.apache.org/jira/browse/SPARK-44884
> <https://mailshield.baidu.com/check?q=cuZ00%2b0zbrN1TxhY0HTgyAub3lGN0J5FSjbfsBPL0yoIU71LdJTYoAVapkFmUjxgZT0WPdJBLus%3d>
>
>
>
>
> On Wed, Aug 30, 2023 at 11:51 AM Sean Owen  wrote:
>
> It looks good except that I'm getting errors running the Spark Connect
> tests at the end (Java 17, Scala 2.13) It looks like I missed something
> necessary to build; is anyone getting this?
>
>
>
> [ERROR] [Error]
> /tmp/spark-3.5.0/connector/connect/server/target/generated-test-sources/protobuf/java/org/apache/spark/sql/protobuf/protos/TestProto.java:9:46:
>  error: package org.sparkproject.spark_protobuf.protobuf does not exist
>
>
>
> On Tue, Aug 29, 2023 at 11:25 AM Yuanjian Li 
> wrote:
>
> Please vote on releasing the following candidate(RC3) as Apache Spark
> version 3.5.0.
>
>
>
> The vote is open until 11:59pm Pacific time *Aug 31st* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
>
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
> <https://mailshield.baidu.com/check?q=eJcUboQ1HRRomPZKEwRzpl69wA8DbI%2fNIiRNsQ%3d%3d>
>
>
>
> The tag to be voted on is v3.5.0-rc3 (commit
> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>
> https://github.com/apache/spark/tree/v3.5.0-rc3
> <https://mailshield.baidu.com/check?q=M8bk44BhojXSL5a%2bfp%2fAiXPgzvf1z8IY9RiBF4qXAQxEMaMvBeSTzrTW4aDYfv61SNEvZQ%3d%3d>
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
> <https://mailshield.baidu.com/check?q=Y5B1AfmG5NfNnTciPizGUdNVAVSofSiQkkSPsdSlVX%2fPPccSlHQtGK4nriJZRzVyOyOEL1evkXHLFUDt%2fF%2fl9Q%3d%3d>
>
>
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> <https://mailshield.baidu.com/check?q=E6fHbSXEWw02TTJBpc3bfA9mi7ea0YiWcNHkm%2fDJxwlaWinGnMdaoO1PahHhgj00vKwcbElpuHA%3d>
>
>
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1447
> <https://mailshield.baidu.com/check?q=RKosLPjotKC8t%2fbhRUl%2fPI4aNpBuK2BpNhu6N7dXyO7vfBBIc2nx2st8hHY8kR%2f%2byciK%2bMWsc9QPqZCv6O3A2prmaWrVFOSOjhTPWA%3d%3d>
>
>
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
> <https://mailshield.baidu.com/check?q=UisDsKXdd3IJ4Kv657YN4LyF4nLuG%2bzB3bin1GDxnnjSLLtyS4sJmD%2f3asF8Ihv6p62TDzMlUG%2fg5wYGfJ0EfUSOJL0%3d>
>
>
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
> <https://mailshield.baidu.com/check?q=rOHxO3EFdnYTS41rF0m9qsTrteyGHUmLHghEJgmTMLY2%2bhbNu4VZqqsL4J8TXbsKbVjS4fDayxhT%2fqjJjgSX8zM00bc%3d>
>
>
>
> This release is using the release script of the tag v3.5.0-rc3.
>
>
>
> FAQ
>
>
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
>
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your projects resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Sean Owen

It looks good except that I'm getting errors running the Spark Connect
tests at the end (Java 17, Scala 2.13) It looks like I missed something
necessary to build; is anyone getting this?

[ERROR] [Error]
/tmp/spark-3.5.0/connector/connect/server/target/generated-test-sources/protobuf/java/org/apache/spark/sql/protobuf/protos/TestProto.java:9:46:
 error: package org.sparkproject.spark_protobuf.protobuf does not exist

On Tue, Aug 29, 2023 at 11:25 AM Yuanjian Li  wrote:

> Please vote on releasing the following candidate(RC3) as Apache Spark
> version 3.5.0.
>
> The vote is open until 11:59pm Pacific time Aug 31st and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.5.0-rc3 (commit
> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>
> https://github.com/apache/spark/tree/v3.5.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1447
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
> This release is using the release script of the tag v3.5.0-rc3.
>
>
> FAQ
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your projects resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of date RC going forward).
>
> ===
>
> What should happen to JIRA tickets still targeting 3.5.0?
>
> ===
>
> The current list of open tickets targeted at 3.5.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.0
>
> Committers should look at those and triage. Extremely important bug
>
> fixes, documentation, and API tweaks that impact compatibility should
>
> be worked on immediately. Everything else please retarget to an
>
> appropriate release.
>
> ==
>
> But my bug isn't fixed?
>
> ==
>
> In order to make timely releases, we will typically not hold the
>
> release unless the bug in question is a regression from the previous
>
> release. That being said, if there is something which is a regression
>
> that has not been correctly targeted please ping me or a committer to
>
> help target the issue.
>
> Thanks,
>
> Yuanjian Li
>

Re: [VOTE] Release Apache Spark 3.5.0 (RC2)

2023-08-19 Thread Sean Owen

+1 this looks better to me. Works with Scala 2.13 / Java 17 for me.

On Sat, Aug 19, 2023 at 3:23 AM Yuanjian Li  wrote:

> Please vote on releasing the following candidate(RC2) as Apache Spark
> version 3.5.0.
>
> The vote is open until 11:59pm Pacific time Aug 23th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.5.0-rc2 (commit
> 010c4a6a05ff290bec80c12a00cd1bdaed849242):
>
> https://github.com/apache/spark/tree/v3.5.0-rc2
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1446
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc2-docs/
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
> This release is using the release script of the tag v3.5.0-rc2.
>
>
> FAQ
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your projects resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of date RC going forward).
>
> ===
>
> What should happen to JIRA tickets still targeting 3.5.0?
>
> ===
>
> The current list of open tickets targeted at 3.5.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.0
>
> Committers should look at those and triage. Extremely important bug
>
> fixes, documentation, and API tweaks that impact compatibility should
>
> be worked on immediately. Everything else please retarget to an
>
> appropriate release.
>
> ==
>
> But my bug isn't fixed?
>
> ==
>
> In order to make timely releases, we will typically not hold the
>
> release unless the bug in question is a regression from the previous
>
> release. That being said, if there is something which is a regression
>
> that has not been correctly targeted please ping me or a committer to
>
> help target the issue.
>
> Thanks,
>
> Yuanjian Li
>

Re: Question about ARRAY_INSERT between Spark and Databricks

2023-08-13 Thread Sean Owen

There shouldn't be any difference here. In fact, I get the results you list
for 'spark' from Databricks. It's possible the difference is a bug fix
along the way that is in the Spark version you are using locally but not in
the DBR you are using. But, yeah seems to work as. you say.

If you're asking about the Spark semantics being 1-indexed vs 0-indexed?
some comments here:
https://github.com/apache/spark/pull/38867#discussion_r1097054656

On Sun, Aug 13, 2023 at 7:28 AM Ran Tao  wrote:

> Hi, devs.
>
> I found that the  ARRAY_INSERT[1] function (from spark 3.4.0) has
> different semantics with databricks[2].
>
> e.g.
>
> // spark
> SELECT array_insert(array('a', 'b', 'c'), -1, 'z');
>  ["a","b","z","c"]
>
> // databricks
> SELECT array_insert(array('a', 'b', 'c'), -1, 'z');
>  ["a","b","c","z"]
>
> // spark
> SELECT array_insert(array('a', 'b', 'c'), -5, 'z');
> ["z",null,null,"a","b","c"]
>
> // databricks
> SELECT array_insert(array('a', 'b', 'c'), -5, 'z');
>  ["z",NULL,"a","b","c"]
>
> It looks like that inserting negative index is more reasonable in
> Databricks.
>
> Of cause, I read the source code of spark, and I can understand the logic
> of spark, but my question is whether spark is designed like this on purpose?
>
>
> [1] https://spark.apache.org/docs/latest/api/sql/index.html#array_insert
> [2]
> https://docs.databricks.com/en/sql/language-manual/functions/array_insert.html
>
>
> Best Regards,
> Ran Tao
> https://github.com/chucheng92
>

What else could be removed in Spark 4?

2023-08-07 Thread Sean Owen

While we're noodling on the topic, what else might be worth removing in
Spark 4?

For example, looks like we're finally hitting problems supporting Java 8
through 21 all at once, related to Scala 2.13.x updates. It would be
reasonable to require Java 11, or even 17, as a baseline for the multi-year
lifecycle of Spark 4.

Dare I ask: drop Scala 2.12? supporting 2.12 / 2.13 / 3.0 might get hard
otherwise.

There was a good discussion about whether old deprecated methods should be
removed. They can't be removed at other times, but, doesn't mean they all
*should* be. createExternalTable was brought up as a first example. What
deprecated methods are worth removing?

There's Mesos support, long since deprecated, which seems like something to
prune.

Are there old Hive/Hadoop version combos we should just stop supporting?

Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-06 Thread Sean Owen

Let's keep testing 3.5.0 of course while that change is going in. (See
https://github.com/apache/spark/pull/42364#issuecomment-1666878287 )

Otherwise testing is pretty much as usual, except I get this test failure
in Connect, which is new. Anyone else? this is Java 8, Scala 2.13, Debian
12.

- from_protobuf_messageClassName_options *** FAILED ***
  org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS]
Could not load Protobuf class with name
org.apache.spark.connect.proto.StorageLevel.
org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf
Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar
with Protobuf classes needs to be shaded (com.google.protobuf.* -->
org.sparkproject.spark_protobuf.protobuf.*).
  at
org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3554)
  at
org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:198)
  at
org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:156)
  at
org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58)
  at
org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57)
  at
org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43)
  at
org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42)
  at
org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
  at
org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:73)
  at scala.collection.immutable.List.map(List.scala:246)

On Sat, Aug 5, 2023 at 5:42 PM Sean Owen  wrote:

> I'm still testing other combinations, but it looks like tests fail on Java
> 17 after building with Java 8, which should be a normal supported
> configuration.
> This is described at https://github.com/apache/spark/pull/41943 and looks
> like it is resolved by moving back to Scala 2.13.8 for now.
> Unless I'm missing something we need to fix this for 3.5 or it's not clear
> the build will run on Java 17.
>
> On Fri, Aug 4, 2023 at 5:45 PM Yuanjian Li  wrote:
>
>> Please vote on releasing the following candidate(RC1) as Apache Spark
>> version 3.5.0.
>>
>> The vote is open until 11:59pm Pacific time Aug 9th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.0
>>
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.5.0-rc1 (commit
>> 7e862c01fc9a1d3b47764df8b6a4b5c4cafb0807):
>>
>> https://github.com/apache/spark/tree/v3.5.0-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>>
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1444
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-docs/
>>
>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>
>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>
>> This release is using the release script of the tag v3.5.0-rc1.
>>
>>
>> FAQ
>>
>> =
>>
>> How can I help test this release?
>>
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>>
>> an existing Spark workload and running on this release candidate, then
>>
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>>
>> the current RC and see if anything important breaks, in the Java/Scala
>>
>> you can add the staging repository to your projects resolvers and test
>>
>> with the RC (make sure to clean up the artifact cache before/after so
>>
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>>
>> What should happen to JIRA tickets still targeting 3.5.0?
>>
>> ===
>>
>> The current list of open

Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-05 Thread Sean Owen

I'm still testing other combinations, but it looks like tests fail on Java
17 after building with Java 8, which should be a normal supported
configuration.
This is described at https://github.com/apache/spark/pull/41943 and looks
like it is resolved by moving back to Scala 2.13.8 for now.
Unless I'm missing something we need to fix this for 3.5 or it's not clear
the build will run on Java 17.

On Fri, Aug 4, 2023 at 5:45 PM Yuanjian Li  wrote:

> Please vote on releasing the following candidate(RC1) as Apache Spark
> version 3.5.0.
>
> The vote is open until 11:59pm Pacific time Aug 9th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.5.0-rc1 (commit
> 7e862c01fc9a1d3b47764df8b6a4b5c4cafb0807):
>
> https://github.com/apache/spark/tree/v3.5.0-rc1
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1444
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-docs/
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
> This release is using the release script of the tag v3.5.0-rc1.
>
>
> FAQ
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your projects resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of date RC going forward).
>
> ===
>
> What should happen to JIRA tickets still targeting 3.5.0?
>
> ===
>
> The current list of open tickets targeted at 3.5.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.0
>
> Committers should look at those and triage. Extremely important bug
>
> fixes, documentation, and API tweaks that impact compatibility should
>
> be worked on immediately. Everything else please retarget to an
>
> appropriate release.
>
> ==
>
> But my bug isn't fixed?
>
> ==
>
> In order to make timely releases, we will typically not hold the
>
> release unless the bug in question is a regression from the previous
>
> release. That being said, if there is something which is a regression
>
> that has not been correctly targeted please ping me or a committer to
>
> help target the issue.
>
> Thanks,
>
> Yuanjian Li
>
>

Re: [VOTE] SPIP: XML data source support

2023-07-28 Thread Sean Owen

+1 I think that porting the package 'as is' into Spark is probably
worthwhile.
That's relatively easy; the code is already pretty battle-tested and not
that big and even originally came from Spark code, so is more or less
similar already.

One thing it never got was DSv2 support, which means XML reading would
still be somewhat behind other formats. (I was not able to implement it.)
This isn't a necessary goal right now, but would be possibly part of the
logic of moving it into the Spark code base.

On Fri, Jul 28, 2023 at 5:38 PM Sandip Agarwala
 wrote:

> Dear Spark community,
>
> I would like to start the vote for "SPIP: XML data source support".
>
> XML is a widely used data format. An external spark-xml package (
> https://github.com/databricks/spark-xml) is available to read and write
> XML data in spark. Making spark-xml built-in will provide a better user
> experience for Spark SQL and structured streaming. The proposal is to
> inline code from the spark-xml package.
>
> SPIP link:
>
> https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing
>
> JIRA:
> https://issues.apache.org/jira/browse/SPARK-44265
>
> Discussion Thread:
> https://lists.apache.org/thread/q32hxgsp738wom03mgpg9ykj9nr2n1fh
>
> Please vote on the SPIP for the next 72 hours:
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because __.
>
> Thanks, Sandip
>

Re: Spark 3.0.0 EOL

2023-07-26 Thread Sean Owen

There aren't "LTS" releases, though you might expect the last 3.x release
will see maintenance releases longer. See end of
https://spark.apache.org/versioning-policy.html

On Wed, Jul 26, 2023 at 3:56 AM Manu Zhang  wrote:

> Will Apache Spark 3.5 be a LTS version?
>
> Thanks,
> Manu
>
> On Mon, Jul 24, 2023 at 4:26 PM Dongjoon Hyun 
> wrote:
>
>> As Hyukjin replied, Apache Spark 3.0.0 is already in EOL status.
>>
>> To Pralabh, FYI, in the community,
>>
>> - Apache Spark 3.2 also reached the EOL already.
>>   https://lists.apache.org/thread/n4mdfwr5ksgpmrz0jpqp335qpvormos1
>>
>> If you are considering Apache Spark 4, here is the other 3.x timeline,
>>
>> - Apache Spark 3.3 => December, 2023.
>> - Apache Spark 3.4 => October, 2024
>> - Upcoming Apache Spark 3.5 => 18 months from the release
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Mon, Jul 24, 2023 at 12:21 AM Hyukjin Kwon 
>> wrote:
>>
>>>
>>> It's already EOL
>>>
>>> On Mon, Jul 24, 2023 at 4:17 PM Pralabh Kumar 
>>> wrote:
>>>
 Hi Dev Team

 If possible , can you please provide the Spark 3.0.0 EOL timelines .

 Regards
 Pralabh Kumar

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-16 Thread Sean Owen

On Fri, Jun 16, 2023 at 3:58 PM Dongjoon Hyun 
wrote:

> I started the thread about already publicly visible version issues
> according to the ASF PMC communication guideline. It's no confidential,
> personal, or security-related stuff. Are you insisting this is confidential?
>

Discussion about a particular company should be on private@ - this is IMHO
like "personnel matters", in the doc you link. The principle is that
discussing whether an entity is doing something right or wrong is better in
private, because, hey, if the conclusion is "nothing's wrong here" then you
avoid disseminating any implication to the contrary.

I agreed with you, there's some value in discussing the general issue on
dev@. (I even said who the company was, though, it was I think clear before)

But, your thread title here is: "Apache Spark PMC asks Databricks to
differentiate its Spark version string"
(You separately claim this vote is about whether the PMC has a role here,
but, that's plainly not how this thread begins.)

Given that this has stopped being about ASF policy, and seems to be about
taking some action related to a company, I find it inappropriate again for
dev@, for exactly the reason I gave above. We have a PMC member repeating
this claim over and over, without support. This is why we don't do this in
public.

> May I ask which relevant context you are insisting not to receive
> specifically? I gave the specific examples (UI/logs/screenshot), and got
> the specific legal advice from `legal-discuss@` and replied why the
> version should be different.
>

It is the thread I linked in my reply:
https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb
This has already been discussed at length, and you're aware of it, but,
didn't mention it. I think that's critical; your text contains no problem
statement at all by itself.

Since we're here, fine: I vote -1, simply because this states no reason for
the action at all.
If we assume the thread ^^^ above is the extent of the logic, then, -1 for
the following reasons:
- Relevant ASF policy seems to say this is fine, as argued at
https://lists.apache.org/thread/p15tc772j9qwyvn852sh8ksmzrol9cof
- There is no argument any of this has caused a problem for the community
anyway; there is just nothing to 'fix'

I would again ask we not simply repeat the same thread again.

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-16 Thread Sean Owen

As we noted in the last thread, this discussion should have been on private@
to begin with, but, the ship has sailed.

You are suggesting that non-PMC members vote on whether the PMC has to do
something? No, that's not how anything works here.
It's certainly the PMC that decides what to put in the board report, or
take action on behalf of the project.

This doesn't make sense here. Frankly, repeating this publicly without
relevant context, and avoiding the response you already got, is
inappropriate.

You may call a PMC vote on whether there's even an issue here, sure. If you
pursue it, you should explain specifically what the issue is w.r.t. policy,
and argue against the response you've already received.
We put valid issues in the board report, for sure. We do not include
invalid issues in the board report. That part needs no decision from anyone.

On Fri, Jun 16, 2023 at 3:08 PM Dongjoon Hyun 
wrote:

> No, this is a vote on dev@ intentionally as a part of our previous
> thread, "ASF policy violation and Scala version issues" (
> https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb)
>
> > did you mean this for the PMC list?
>
> I clearly started the thread with the following.
> > - Apache Spark PMC should include this incident report and the result in
> the next Apache Spark Quarterly Report (August).
>
> However, there is a perspective that this is none of Apache Spark PMC's
> role here.
>
> That's the rationale of this vote.
>
> This vote is whether this is Apache Spark PMC's role or not.
>
> Dongjoon.
>

Re: [VOTE] Apache Spark PMC asks Databricks to differentiate its Spark version string

2023-06-16 Thread Sean Owen

What does a vote on dev@ mean? did you mean this for the PMC list?

Dongjoon - this offers no rationale about "why". The more relevant thread
begins here:
https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb but it
likewise never got to connecting a specific observation to policy. Could
you explain your logic more concretely? otherwise this is still going
nowhere.

On Fri, Jun 16, 2023 at 2:53 PM Dongjoon Hyun  wrote:

> Please vote on the following statement. The vote is open until June 23th
> 1AM (PST) and passes if a majority +1 PMC votes are cast, with a minimum of
> 3 +1 votes.
>
> Apache Spark PMC asks Databricks to differentiate its Spark
> version string to avoid confusions because Apache Spark PMC
> is responsible for ensuring to follow ASF requirements[1] and
> respects ASF's legal advice [2, 3],
>
> [ ] +1 Yes
> [ ] -1 No because ...
>
> 
> 1. https://www.apache.org/foundation/governance/pmcs#organization
> 2. https://lists.apache.org/thread/mzhggd0rpz8t4d7vdsbhkp38mvd3lty4
> 3. https://www.apache.org/foundation/marks/downstream.html#source
>

Re: JDK version support policy?

2023-06-08 Thread Sean Owen

Noted, but for that you'd simply run your app on Java 17. If Spark works,
and your app's dependencies work on Java 17 because you compile it for 17
(and jakarta.* classes for example) then there's no issue.

On Thu, Jun 8, 2023 at 3:13 AM Martin Andersson 
wrote:

> There are some reasons to drop Java 11 as well. Java 17 included a large
> change, breaking backwards compatibility with their transition from Java
> EE to Jakarta EE
> <https://blogs.oracle.com/javamagazine/post/transition-from-java-ee-to-jakarta-ee>.
> This means that any users using Spark 4.0 together with Spring 6.x or any
> recent version of servlet containers such as Tomcat or Jetty will
> experience issues. (For security reasons it's beneficial to float your
> dependencies to the latest version of these libraries/frameworks)
>
> I'm not explicitly saying Java 11 should be dropped in Spark 4, just
> thought I'd bring this issue to your attention.
>
> Best Regards, Martin
> --
> *From:* Jungtaek Lim 
> *Sent:* Wednesday, June 7, 2023 23:19
> *To:* Sean Owen 
> *Cc:* Dongjoon Hyun ; Holden Karau <
> hol...@pigscanfly.ca>; dev 
> *Subject:* Re: JDK version support policy?
>
>
> EXTERNAL SENDER. Do not click links or open attachments unless you
> recognize the sender and know the content is safe. DO NOT provide your
> username or password.
>
> +1 to drop Java 8 but +1 to set the lowest support version to Java 11.
>
> Considering the phase for only security updates, 11 LTS would not be EOLed
> in very long time. Unless that’s coupled with other deps which require
> bumping JDK version (hope someone can bring up lists), it doesn’t seem to
> buy much. And given the strong backward compatibility JDK provides, that’s
> less likely.
>
> Purely from the project’s source code view, does anyone know how much
> benefits we can leverage for picking up 17 rather than 11? I lost the
> track, but some of their proposals are more likely catching up with other
> languages, which don’t make us be happy since Scala provides them for years.
>
> 2023년 6월 8일 (목) 오전 2:35, Sean Owen 님이 작성:
>
> I also generally perceive that, after Java 9, there is much less breaking
> change. So working on Java 11 probably means it works on 20, or can be
> easily made to without pain. Like I think the tweaks for Java 17 were quite
> small.
>
> Targeting Java >11 excludes Java 11 users and probably wouldn't buy much.
> Keeping the support probably doesn't interfere with working on much newer
> JVMs either.
>
> On Wed, Jun 7, 2023, 12:29 PM Holden Karau  wrote:
>
> So JDK 11 is still supported in open JDK until 2026, I'm not sure if we're
> going to see enough folks moving to JRE17 by the Spark 4 release unless we
> have a strong benefit from dropping 11 support I'd be inclined to keep it.
>
> On Tue, Jun 6, 2023 at 9:08 PM Dongjoon Hyun  wrote:
>
> I'm also +1 on dropping both Java 8 and 11 in Apache Spark 4.0, too.
>
> Dongjoon.
>
> On 2023/06/07 02:42:19 yangjie01 wrote:
> > +1 on dropping Java 8 in Spark 4.0, and I even hope Spark 4.0 can only
> support Java 17 and the upcoming Java 21.
> >
> > 发件人: Denny Lee 
> > 日期: 2023年6月7日 星期三 07:10
> > 收件人: Sean Owen 
> > 抄送: David Li , "dev@spark.apache.org" <
> dev@spark.apache.org>
> > 主题: Re: JDK version support policy?
> >
> > +1 on dropping Java 8 in Spark 4.0, saying this as a fan of the
> fast-paced (positive) updates to Arrow, eh?!
> >
> > On Tue, Jun 6, 2023 at 4:02 PM Sean Owen  sro...@gmail.com>> wrote:
> > I haven't followed this discussion closely, but I think we could/should
> drop Java 8 in Spark 4.0, which is up next after 3.5?
> >
> > On Tue, Jun 6, 2023 at 2:44 PM David Li  lidav...@apache.org>> wrote:
> > Hello Spark developers,
> >
> > I'm from the Apache Arrow project. We've discussed Java version support
> [1], and crucially, whether to continue supporting Java 8 or not. As Spark
> is a big user of Arrow in Java, I was curious what Spark's policy here was.
> >
> > If Spark intends to stay on Java 8, for instance, we may also want to
> stay on Java 8 or otherwise provide some supported version of Arrow for
> Java 8.
> >
> > We've seen dependencies dropping or planning to drop support. gRPC may
> drop Java 8 at any time [2], possibly this September [3], which may affect
> Spark (due to Spark Connect). And today we saw that Arrow had issues
> running tests with Mockito on Java 20, but we couldn't update Mockito since
> it had dropped Java 8 support. (We pinned the JDK version in that CI
> pipeline fo

Re: JDK version support policy?

2023-06-07 Thread Sean Owen

I also generally perceive that, after Java 9, there is much less breaking
change. So working on Java 11 probably means it works on 20, or can be
easily made to without pain. Like I think the tweaks for Java 17 were quite
small.

Targeting Java >11 excludes Java 11 users and probably wouldn't buy much.
Keeping the support probably doesn't interfere with working on much newer
JVMs either.

On Wed, Jun 7, 2023, 12:29 PM Holden Karau  wrote:

> So JDK 11 is still supported in open JDK until 2026, I'm not sure if we're
> going to see enough folks moving to JRE17 by the Spark 4 release unless we
> have a strong benefit from dropping 11 support I'd be inclined to keep it.
>
> On Tue, Jun 6, 2023 at 9:08 PM Dongjoon Hyun  wrote:
>
>> I'm also +1 on dropping both Java 8 and 11 in Apache Spark 4.0, too.
>>
>> Dongjoon.
>>
>> On 2023/06/07 02:42:19 yangjie01 wrote:
>> > +1 on dropping Java 8 in Spark 4.0, and I even hope Spark 4.0 can only
>> support Java 17 and the upcoming Java 21.
>> >
>> > 发件人: Denny Lee 
>> > 日期: 2023年6月7日 星期三 07:10
>> > 收件人: Sean Owen 
>> > 抄送: David Li , "dev@spark.apache.org" <
>> dev@spark.apache.org>
>> > 主题: Re: JDK version support policy?
>> >
>> > +1 on dropping Java 8 in Spark 4.0, saying this as a fan of the
>> fast-paced (positive) updates to Arrow, eh?!
>> >
>> > On Tue, Jun 6, 2023 at 4:02 PM Sean Owen > sro...@gmail.com>> wrote:
>> > I haven't followed this discussion closely, but I think we could/should
>> drop Java 8 in Spark 4.0, which is up next after 3.5?
>> >
>> > On Tue, Jun 6, 2023 at 2:44 PM David Li > lidav...@apache.org>> wrote:
>> > Hello Spark developers,
>> >
>> > I'm from the Apache Arrow project. We've discussed Java version support
>> [1], and crucially, whether to continue supporting Java 8 or not. As Spark
>> is a big user of Arrow in Java, I was curious what Spark's policy here was.
>> >
>> > If Spark intends to stay on Java 8, for instance, we may also want to
>> stay on Java 8 or otherwise provide some supported version of Arrow for
>> Java 8.
>> >
>> > We've seen dependencies dropping or planning to drop support. gRPC may
>> drop Java 8 at any time [2], possibly this September [3], which may affect
>> Spark (due to Spark Connect). And today we saw that Arrow had issues
>> running tests with Mockito on Java 20, but we couldn't update Mockito since
>> it had dropped Java 8 support. (We pinned the JDK version in that CI
>> pipeline for now.)
>> >
>> > So at least, I am curious if Arrow could start the long process of
>> migrating Java versions without impacting Spark, or if we should continue
>> to cooperate. Arrow Java doesn't see quite so much activity these days, so
>> it's not quite critical, but it's possible that these dependency issues
>> will start to affect us more soon. And looking forward, Java is working on
>> APIs that should also allow us to ditch the --add-opens flag requirement
>> too.
>> >
>> > [1]: https://lists.apache.org/thread/phpgpydtt3yrgnncdyv4qdq1gf02s0yj<
>> https://mailshield.baidu.com/check?q=Nz%2bGj2hdKguk92URjA7sg0PfbSN%2fXUIMgrHTmW45gOOKEr3Shre45B7TRzhEpb%2baVsnyuRL%2fl%2f0cu7IVGHunSGDVnxM%3d
>> >
>> > [2]:
>> https://github.com/grpc/proposal/blob/master/P5-jdk-version-support.md<
>> https://mailshield.baidu.com/check?q=s89S3eo8GCJkV7Mpx7aG1SXId7uCRYGjQMA6DeLuX9duS86LhIODZMJfeFdGMWdFzJ8S7minyHoC7mCrzHagbJXCXYTBH%2fpZBpfTbw%3d%3d
>> >
>> > [3]: https://github.com/grpc/grpc-java/issues/9386<
>> https://mailshield.baidu.com/check?q=R0HtWZIkY5eIxpz8jtqHLzd0ugNbcaXIKW2LbUUxpIn0t9Y9yAhuHPuZ4buryfNwRnnJTA%3d%3d
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>

Re: ASF policy violation and Scala version issues

2023-06-07 Thread Sean Owen

Hi Dongjoon, I think this conversation is not advancing anymore. I
personally consider the matter closed unless you can find other support or
respond with more specifics. While this perhaps should be on private@, I
think it's not wrong as an instructive discussion on dev@.

I don't believe you've made a clear argument about the problem, or how it
relates specifically to policy. Nevertheless I will show you my logic.

You are asserting that a vendor cannot call a product Apache Spark 3.4.0 if
it omits a patch updating a Scala maintenance version. This difference has
no known impact on usage, as far as I can tell.

Let's see what policy requires:

1/ All source code changes must meet at least one of the acceptable changes
criteria set out below:
- The change has accepted by the relevant Apache project community for
inclusion in a future release. Note that the process used to accept changes
and how that acceptance is documented varies between projects.
- A change is a fix for an undisclosed security issue; and the fix is not
publicly disclosed as as security fix; and the Apache project has been
notified of the both issue and the proposed fix; and the PMC has rejected
neither the vulnerability report nor the proposed fix.
- A change is a fix for a bug; and the Apache project has been notified of
both the bug and the proposed fix; and the PMC has rejected neither the bug
report nor the proposed fix.
- Minor changes (e.g. alterations to the start-up and shutdown scripts,
configuration files, file layout etc.) to integrate with the target
platform providing the Apache project has not objected to those changes.

The change you cite meets the 4th point, minor change, made for integration
reasons. There is no known technical objection; this was after all at one
point the state of Apache Spark.

2/ A version number must be used that both clearly differentiates it from
an Apache Software Foundation release and clearly identifies the Apache
Software Foundation version on which the software is based.

Keep in mind the product here is not "Apache Spark", but the "Databricks
Runtime 13.1 (including Apache Spark 3.4.0)". That is, there is far more
than a version number differentiating this product from Apache Spark. There
is no standalone distribution of Apache Spark anywhere here. I believe that
easily matches the intent.

3/ The documentation must clearly identify the Apache Software Foundation
version on which the software is based.

Clearly, yes.

4/ The end user expects that the distribution channel will back-port fixes.
It is not necessary to back-port all fixes. Selection of fixes to back-port
must be consistent with the update policy of that distribution channel.

I think this is safe to say too. Indeed this explicitly contemplates not
back-porting a change.

Backing up, you can see from this document that the spirit of it is: don't
include changes in your own Apache Foo x.y that aren't wanted by the
project, and still call it Apache Foo x.y. I don't believe your case
matches this spirit either.

I do think it's not crazy to suggest, hey vendor, would you call this
"Apache Spark + patches" or ".vendor123". But that's at best a suggestion,
and I think it does nothing in particular for users. You've made the
suggestion, and I do not see some police action from the PMC must follow.

I think you're simply objecting to a vendor choice, but that is not
on-topic here unless you can specifically rebut the reasoning above and
show it's connected.

On Wed, Jun 7, 2023 at 11:02 AM Dongjoon Hyun  wrote:

> Sean, it seems that you are confused here. We are not talking about your
> upper system (the notebook environment). We are talking about the
> submodule, "Apache Spark 3.4.0-databricks". Whatever you call it, both of
> us knows "Apache Spark 3.4.0-databricks" is different from "Apache Spark
> 3.4.0". You should not use "3.4.0" at your subsystem.
>
> > This also is aimed at distributions of "Apache Foo", not products that
> > "include Apache Foo", which are clearly not Apache Foo.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

Re: ASF policy violation and Scala version issues

2023-06-07 Thread Sean Owen

(With consent, shall we move this to the PMC list?)

No, I don't think that's what this policy says.

First, could you please be more specific here? why do you think a certain
release is at odds with this?
Because so far you've mentioned, I think, not taking a Scala maintenance
release update.

But this says things like:

The source code on which the software is based must either be identical to
an Apache Software Foundation source code release or all of the following
must also be true:
  ...
  - The end user expects that the distribution channel will back-port
fixes. It is not necessary to back-port all fixes. Selection of fixes to
back-port must be consistent with the update policy of that distribution
channel.

That describes what you're talking about.

This also is aimed at distributions of "Apache Foo", not products that
"include Apache Foo", which are clearly not Apache Foo.
The spirit of it is, more generally: don't keep new features and fixes to
yourself. That does not seem to apply here.

On Tue, Jun 6, 2023 at 11:34 PM Dongjoon Hyun 
wrote:

> Hi, All and Matei (as the Chair of Spark PMC).
>
> For the ASF policy violation part, here is a legal recommendation
> documentation (draft) from `legal-discuss@`.
>
> https://www.apache.org/foundation/marks/downstream.html#source
>
> > A version number must be used that both clearly differentiates it from
> an Apache Software Foundation release and clearly identifies the Apache
> Software Foundation version on which the software is based.
>
> In short, Databricks should not claim its product like "Apache Spark
> 3.4.0". The version number should clearly differentiate it from Apache
> Spark 3.4.0. I hope we can conclude this together in this way and move our
> focus forward to the other remaining issues.
>
> To Matei, could you do the legal follow-up officially with Databricks with
> the above info?
>
> If there is a person to do this, I believe you are the best person to
> drive this.
>
> Thank you in advance.
>
> Dongjoon.
>
>
> On Tue, Jun 6, 2023 at 2:49 PM Dongjoon Hyun  wrote:
>
>> It goes to "legal-discuss@".
>>
>> https://lists.apache.org/thread/mzhggd0rpz8t4d7vdsbhkp38mvd3lty4
>>
>> I hope we can conclude the legal part clearly and shortly in one way or
>> another which we will follow with confidence.
>>
>> Dongjoon
>>
>> On 2023/06/06 20:06:42 Dongjoon Hyun wrote:
>> > Thank you, Sean, Mich, Holden, again.
>> >
>> > For this specific part, let's ask the ASF board via bo...@apache.org to
>> > find a right answer because it's a controversial legal issue here.
>> >
>> > > I think you'd just prefer Databricks make a different choice, which is
>> > legitimate, but, an issue to take up with Databricks, not here.
>> >
>> > Dongjoon.
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

Re: JDK version support policy?

2023-06-06 Thread Sean Owen

I haven't followed this discussion closely, but I think we could/should
drop Java 8 in Spark 4.0, which is up next after 3.5?

On Tue, Jun 6, 2023 at 2:44 PM David Li  wrote:

> Hello Spark developers,
>
> I'm from the Apache Arrow project. We've discussed Java version support
> [1], and crucially, whether to continue supporting Java 8 or not. As Spark
> is a big user of Arrow in Java, I was curious what Spark's policy here was.
>
> If Spark intends to stay on Java 8, for instance, we may also want to stay
> on Java 8 or otherwise provide some supported version of Arrow for Java 8.
>
> We've seen dependencies dropping or planning to drop support. gRPC may
> drop Java 8 at any time [2], possibly this September [3], which may affect
> Spark (due to Spark Connect). And today we saw that Arrow had issues
> running tests with Mockito on Java 20, but we couldn't update Mockito since
> it had dropped Java 8 support. (We pinned the JDK version in that CI
> pipeline for now.)
>
> So at least, I am curious if Arrow could start the long process of
> migrating Java versions without impacting Spark, or if we should continue
> to cooperate. Arrow Java doesn't see quite so much activity these days, so
> it's not quite critical, but it's possible that these dependency issues
> will start to affect us more soon. And looking forward, Java is working on
> APIs that should also allow us to ditch the --add-opens flag requirement
> too.
>
> [1]: https://lists.apache.org/thread/phpgpydtt3yrgnncdyv4qdq1gf02s0yj
> [2]:
> https://github.com/grpc/proposal/blob/master/P5-jdk-version-support.md
> [3]: https://github.com/grpc/grpc-java/issues/9386
>

Re: ASF policy violation and Scala version issues

2023-06-05 Thread Sean Owen

I think the issue is whether a distribution of Spark is so materially
different from OSS that it causes problems for the larger community of
users. There's a legitimate question of whether such a thing can be called
"Apache Spark + changes", as describing it that way becomes meaningfully
inaccurate. And if it's inaccurate, then it's a trademark usage issue, and
a matter for the PMC to act on. I certainly recall this type of problem
from the early days of Hadoop - the project itself had 2 or 3 live branches
in development (was it 0.20.x vs 0.23.x vs 1.x? YARN vs no YARN?) picked up
by different vendors and it was unclear what "Apache Hadoop" meant in a
vendor distro. Or frankly, upstream.

In comparison, variation in Scala maintenance release seems trivial. I'm
not clear from the thread what actual issue this causes to users. Is there
more to it - does this go hand in hand with JDK version and Ammonite, or
are those separate? What's an example of the practical user issue. Like, I
compile vs Spark 3.4.0 and because of Scala version differences it doesn't
run on some vendor distro? That's not great, but seems like a vendor
problem. Unless you tell me we are getting tons of bug reports to OSS Spark
as a result or something.

Is the implication that something in OSS Spark is being blocked to prefer
some set of vendor choices? because the changes you're pointing to seem to
be going into Apache Spark, actually. It'd be more useful to be specific
and name names at this point, seems fine.

The rest of this is just a discussion about Databricks choices. (If it's
not clear, I'm at Databricks but do not work on the Spark distro). We can
discuss but it seems off-topic _if_ it can't be connected to a problem for
OSS Spark. Anyway:

If it helps, _some_ important patches are described at
https://docs.databricks.com/release-notes/runtime/maintenance-updates.html
; I don't think this is exactly hidden.

Out of curiosity, how would you describe this software in the UI instead?
"3.4.0" is shorthand, because this is a little dropdown menu; the terminal
output is likewise not a place to list all patches. You would propose
requesting calling this "3.4.0 + patches"? That's the best I can think of,
but I don't think it addresses what you're getting at anyway. I think you'd
just prefer Databricks make a different choice, which is legitimate, but,
an issue to take up with Databricks, not here.

On Mon, Jun 5, 2023 at 6:58 PM Dongjoon Hyun 
wrote:

> Hi, Sean.
>
> "+ patches" or "powered by Apache Spark 3.4.0" is not a problem as you
> mentioned. For the record, I also didn't bring up any old story here.
>
> > "Apache Spark 3.4.0 + patches"
>
> However, "including Apache Spark 3.4.0" still causes confusion even in a
> different way because of those missing patches, SPARK-40436 (Upgrade Scala
> to 2.12.17) and SPARK-39414 (Upgrade Scala to 2.12.16). Technically,
> Databricks Runtime doesn't include Apache Spark 3.4.0 while it claims it to
> the users.
>
> [image: image.png]
>
> It's a sad story from the Apache Spark Scala perspective because the users
> cannot even try to use the correct Scala 2.12.17 version in the runtime.
>
> All items I've shared are connected via a single theme, hurting Apache
> Spark Scala users.
> From (1) building Spark, (2) creating a fragmented Scala Spark runtime
> environment and (3) hidden user-facing documentation.
>
> Of course, I don't think those are designed in an organized way
> intentionally. It just happens at the same time.
>
> Based on your comments, let me ask you two questions. (1) When Databricks
> builds its internal Spark from its private code repository, is it a company
> policy to always expose "Apache 3.4.0" to the users like the following by
> ignoring all changes (whatever they are). And, (2) Do you insist that it is
> normative and clear to the users and the community?
>
> > - The runtime logs "23/06/05 04:23:27 INFO SparkContext: Running Spark
> version 3.4.0"
> > - UI shows Apache Spark logo and `3.4.0`.
>
>>
>>

Re: ASF policy violation and Scala version issues

2023-06-05 Thread Sean Owen

On Mon, Jun 5, 2023 at 12:01 PM Dongjoon Hyun 
wrote:

> 1. For the naming, yes, but the company should use different version
> numbers instead of the exact "3.4.0". As I shared the screenshot in my
> previous email, the company exposes "Apache Spark 3.4.0" exactly because
> they build their distribution without changing their version number at all.
>

I don't believe this is supported by guidance on the underlying issue here,
which is trademark. There is nothing wrong with nominative use, and I think
that's what this is. A thing can be "Apache Spark 3.4.0 + patches" and be
described that way.
Calling it "Apache Spark 3.4.0.vendor123" is argubaly more confusing IMHO,
as there is no such Apache Spark version.

> 2. According to
> https://mvnrepository.com/artifact/org.apache.spark/spark-core,
> all the other companies followed  "Semantic Versioning" or added
> additional version numbers at their distributions, didn't they? AFAIK,
> nobody claims to take over the exact, "3.4.0" version string, in source
> code level like this company.
>

Here you're talking about software artifact numbering, for companies that
were also releasing their own maintenance branch of OSS. That pretty much
requires some sub-versioning scheme. I think that's fine too, although as
above I think this is arguably _worse_ w.r.t. reuse of the Apache name and
namespace.
I'm not aware of any policy on this, and don't find this problematic
myself. Doesn't mean it's right, but does mean implicitly this has never
before been viewed as an issue?

The one I'm aware of was releasing a product "including Apache Spark 2.0"
before it existed, which does seem to potentially cause confusion, and that
was addressed.

Can you describe what policy is violated? we can disagree about what we'd
prefer or not, but the question is, what if anything is disallowed? I'm not
seeing that.

> 3. This company not only causes the 'Scala Version Segmentation'
> environment in a subtle way, but also defames Apache Spark 3.4.0 by
> removing many bug fixes of SPARK-40436 (Upgrade Scala to 2.12.17) and
> SPARK-39414 (Upgrade Scala to 2.12.16) for some unknown reason. Apparently,
> this looks like not a superior version of Apache Spark 3.4.0. For me, it's
> the inferior version. If a company disagrees with Scala 2.12.17 for some
> internal reason, they are able to stick to 2.12.15, of course. However,
> Apache Spark PMC should not allow them to lie to the customers that "Apache
> Spark 3.4.0" uses Scala 2.12.15 by default. That's the reason why I
> initiated this email because I'm considering this as a serious blocker to
> make Apache Spark Scala improvement.
> - https://github.com/scala/scala/releases/tag/v2.12.17 (21 Merged PRs)
> - https://github.com/scala/scala/releases/tag/v2.12.16 (68 Merged PRs)
>

To be clear, this seems unrelated to your first two points above?

I'm having trouble following what you are arguing here. You are saying a
vendor release based on "Apache Spark 3.4.0" is not the same in some
material way that you don't like. That's a fine position to take, but I
think the product is still substantially describable as "Apache Spark
3.4.0 + patches". You can take up the issue with the vendor.

But more importantly, I am not seeing how that constrains anything in
Apache Spark? those updates were merged to OSS. But even taking up the
point you describe, why is the scala maintenance version even such a
material issue that is so severe it warrants PMC action?

Could you connect the dots a little more?

>

Re: ASF policy violation and Scala version issues

2023-06-05 Thread Sean Owen

1/ Regarding naming - I believe releasing "Apache Foo X.Y + patches" is
acceptable, if it is substantially Apache Foo X.Y. This is common practice
for downstream vendors. It's fair nominative use. The principle here is
consumer confusion. Is anyone substantially misled? Here I don't think so.
I know that we have in the past decided it would not be OK, for example, to
release a product with "Apache Spark 4.0" now as there is no such release,
even building from master. A vendor should elaborate the changes
somewhere, ideally. I'm sure this one is about Databricks but I'm also sure
Cloudera, Hortonworks, etc had Spark releases with patches, too.

2a/ That issue seems to be about just flipping which code sample is shown
by default. It seemed widely agree that this would slightly help more users
than it harms. I agree with the change and don't see a need to escalate.
the question of further Python parity is a big one but is separate.

2b/ If a single dependency blocks important updates, yeah it's fair to
remove it, IMHO. I wouldn't remove in 3.5 unless the other updates are
critical, and it's not clear they are. In 4.0 yes.

2c/ Scala 2.13 is already supported in 3.x, and does not require 4.0. This
was about what the default non-Scala release convenience binaries use.
Sticking to 2.12 in 3.x doesn't seem like an issue, even desirable.

2d/ Same as 2b

3/ I don't think 1/ is an incident. Yes to moving towards 4.0 after 3.5,
IMHO, and to removing Ammonite in 4.0 if there is no resolution forthcoming

On Mon, Jun 5, 2023 at 2:46 AM Dongjoon Hyun 
wrote:

> Hi, All and Matei (as the Chair of Apache Spark PMC).
>
> Sorry for a long email, I want to share two topics and corresponding
> action items.
> You can go to "Section 3: Action Items" directly for the conclusion.
>
>
> ### 1. ASF Policy Violation ###
>
> ASF has a rule for "MAY I CALL MY MODIFIED CODE 'APACHE'?"
>
> https://www.apache.org/foundation/license-faq.html#Name-changes
>
> For example, when we call `Apache Spark 3.4.0`, it's supposed to be the
> same with one of our official distributions.
>
> https://downloads.apache.org/spark/spark-3.4.0/
>
> Specifically, in terms of the Scala version, we believe it should have
> Scala 2.12.17 because of 'SPARK-40436 Upgrade Scala to 2.12.17'.
>
> There is a company claiming something non-Apache like "Apache Spark 3.4.0
> minus SPARK-40436" with the name "Apache Spark 3.4.0."
>
> - The company website shows "X.Y (includes Apache Spark 3.4.0, Scala
> 2.12)"
> - The runtime logs "23/06/05 04:23:27 INFO SparkContext: Running Spark
> version 3.4.0"
> - UI shows Apache Spark logo and `3.4.0`.
> - However, Scala Version is '2.12.15'
>
> [image: Screenshot 2023-06-04 at 9.37.16 PM.png][image: Screenshot
> 2023-06-04 at 10.14.45 PM.png]
>
> Lastly, this is not a single instance. For example, the same company also
> claims "Apache Spark 3.3.2" with a mismatched Scala version.
>
>
> ### 2. Scala Issues ###
>
> In addition to (1), although we proceeded with good intentions and great
> care
> including dev mailing list discussion, there are several concerning areas
> which
> need more attention and our love.
>
> a) Scala Spark users will experience UX inconvenience from Spark 3.5.
>
> SPARK-42493 Make Python the first tab for code examples
>
> For the record, we discussed it here.
> - https://lists.apache.org/thread/1p8s09ysrh4jqsfd47qdtrl7rm4rrs05
>   "[DISCUSS] Show Python code examples first in Spark documentation"
>
> b) Scala version upgrade is blocked by the Ammonite library dev cycle
> currently.
>
> Although we discussed it here and it had good intentions,
> the current master branch cannot use the latest Scala.
>
> - https://lists.apache.org/thread/4nk5ddtmlobdt8g3z8xbqjclzkhlsdfk
> "Ammonite as REPL for Spark Connect"
>  SPARK-42884 Add Ammonite REPL integration
>
> Specifically, the following are blocked and I'm monitoring the
> Ammonite repository.
> - SPARK-40497 Upgrade Scala to 2.13.11
> - SPARK-43832 Upgrade Scala to 2.12.18
> - According to https://github.com/com-lihaoyi/Ammonite/issues ,
>   Scala 3.3.0 LTS support also looks infeasible.
>
> Although we may be able to wait for a while, there are two fundamental
> solutions
> to unblock this situation in a long-term maintenance perspective.
> - Replace it with a Scala-shell based implementation
> - Move `connector/connect/client/jvm/pom.xml` outside from Spark repo.
>Maybe, we can put it into the new repo like Rust and Go client.
>
> c) Scala 2.13 and above needs Apache Spark 4.0.
>
> In "Apache Spark 3.5.0 Expectations?" and "Apache Spark 4.0
> Timeframe?" threads,
> we discussed Spark 3.5.0 scope and decided to revert
> "SPARK-43836 Make Scala 2.13 as default in Spark 3.5".
> Apache Spark 4.0.0 is the only way to support Scala 2.13 or higher.
>
> - https://lists.apache.org/thread/3x6dh17bmy20n3frtt3crgxjydnxh2o0
> ("Apache Spark 3.5.0 E

Re: Apache Spark 3.5.0 Expectations (?)

2023-05-29 Thread Sean Owen

It does seem risky; there are still likely libs out there that don't cross
compile for 2.13. I would make it the default at 4.0, myself.

On Mon, May 29, 2023 at 7:16 PM Hyukjin Kwon  wrote:

> While I support going forward with a higher version, actually using Scala
> 2.13 by default is a big deal especially in a way that:
>
>- Users would likely download the built-in version assuming that it’s
>backward binary compatible.
>- PyPI doesn't allow specifying the Scala version, meaning that users
>wouldn’t have a way to 'pip install pyspark' based on Scala 2.12.
>
> I wonder if it’s safer to do it in Spark 4 (which I believe will be
> discussed soon).
>
>
> On Mon, 29 May 2023 at 13:21, Jia Fan  wrote:
>
>> Thanks Dongjoon!
>> There are some ticket I want to share.
>> SPARK-39420 Support ANALYZE TABLE on v2 tables
>> SPARK-42750 Support INSERT INTO by name
>> SPARK-43521 Support CREATE TABLE LIKE FILE
>>
>> Dongjoon Hyun  于2023年5月29日周一 08:42写道：
>>
>>> Hi, All.
>>>
>>> Apache Spark 3.5.0 is scheduled for August (1st Release Candidate) and
>>> currently a few notable things are under discussions in the mailing list.
>>>
>>> I believe it's a good time to share a short summary list (containing
>>> both completed and in-progress items) to give a highlight in advance and to
>>> collect your targets too.
>>>
>>> Please share your expectations or working items if you want to
>>> prioritize them more in the community in Apache Spark 3.5.0 timeframe.
>>>
>>> (Sorted by ID)
>>> SPARK-40497 Upgrade Scala 2.13.11
>>> SPARK-42452 Remove hadoop-2 profile from Apache Spark 3.5.0
>>> SPARK-42913 Upgrade to Hadoop 3.3.5 (aws-java-sdk-bundle: 1.12.262 ->
>>> 1.12.316)
>>> SPARK-43024 Upgrade Pandas to 2.0.0
>>> SPARK-43200 Remove Hadoop 2 reference in docs
>>> SPARK-43347 Remove Python 3.7 Support
>>> SPARK-43348 Support Python 3.8 in PyPy3
>>> SPARK-43351 Add Spark Connect Go prototype code and example
>>> SPARK-43379 Deprecate old Java 8 versions prior to 8u371
>>> SPARK-43394 Upgrade to Maven 3.8.8
>>> SPARK-43436 Upgrade to RocksDbjni 8.1.1.1
>>> SPARK-43446 Upgrade to Apache Arrow 12.0.0
>>> SPARK-43447 Support R 4.3.0
>>> SPARK-43489 Remove protobuf 2.5.0
>>> SPARK-43519 Bump Parquet to 1.13.1
>>> SPARK-43581 Upgrade kubernetes-client to 6.6.2
>>> SPARK-43588 Upgrade to ASM 9.5
>>> SPARK-43600 Update K8s doc to recommend K8s 1.24+
>>> SPARK-43738 Upgrade to DropWizard Metrics 4.2.18
>>> SPARK-43831 Build and Run Spark on Java 21
>>> SPARK-43832 Upgrade to Scala 2.12.18
>>> SPARK-43836 Make Scala 2.13 as default in Spark 3.5
>>> SPARK-43842 Upgrade gcs-connector to 2.2.14
>>> SPARK-43844 Update to ORC 1.9.0
>>> UMBRELLA: Add SQL functions into Scala, Python and R API
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>> PS. The above is not a list of release blockers. Instead, it could be a
>>> nice-to-have from someone's perspective.
>>>
>>

Re: Spark 3.4.0 with Hadoop2.7 cannot be downloaded

2023-04-20 Thread Sean Owen

We just removed it now, yes.

On Thu, Apr 20, 2023 at 9:08 AM Emil Ejbyfeldt
 wrote:

> Hi,
>
> I think this is expected as it was dropped from the release process in
> https://issues.apache.org/jira/browse/SPARK-40651
>
> Also I don't see a Hadoop2.7 option when selecting Spark 3.4.0 on
> https://spark.apache.org/downloads.html
> Not really sure why you could be seeing that.
>
> Best,
> Emil
>
>
> On 20/04/2023 08:23, Enrico Minack wrote:
> > Hi,
> >
> > selecting Spark 3.4.0 with Hadoop2.7 at
> > https://spark.apache.org/downloads.html leads to
> >
> >
> https://www.apache.org/dyn/closer.lua/spark/spark-3.4.0/spark-3.4.0-bin-hadoop2.tgz
> >
> > saying:
> >
> > The requested file or directory is *not* on the mirrors.
> >
> > The object is in not in our archive https://archive.apache.org/dist/
> >
> > Is this expected?
> >
> > Enrico
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-10 Thread Sean Owen

+1 from me

On Sun, Apr 9, 2023 at 7:19 PM Dongjoon Hyun  wrote:

> I'll start with my +1.
>
> I verified the checksum, signatures of the artifacts, and documentations.
> Also, ran the tests with YARN and K8s modules.
>
> Dongjoon.
>
> On 2023/04/09 23:46:10 Dongjoon Hyun wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 3.2.4.
> >
> > The vote is open until April 13th 1AM (PST) and passes if a majority +1
> PMC
> > votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.2.4
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see https://spark.apache.org/
> >
> > The tag to be voted on is v3.2.4-rc1 (commit
> > 0ae10ac18298d1792828f1d59b652ef17462d76e)
> > https://github.com/apache/spark/tree/v3.2.4-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1442/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-docs/
> >
> > The list of bug fixes going into 3.2.4 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12352607
> >
> > This release is using the release script of the tag v3.2.4-rc1.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with a out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.2.4?
> > ===
> >
> > The current list of open tickets targeted at 3.2.4 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > Version/s" = 3.2.4
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-08 Thread Sean Owen

+1 form me, same result as last time.

On Fri, Apr 7, 2023 at 6:30 PM Xinrong Meng 
wrote:

> Please vote on releasing the following candidate(RC7) as Apache Spark
> version 3.4.0.
>
> The vote is open until 11:59pm Pacific time *April 12th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.4.0-rc7 (commit
> 87a5442f7ed96b11051d8a9333476d080054e5a0):
> https://github.com/apache/spark/tree/v3.4.0-rc7
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1441
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc7.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Meng
>

Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-03-30 Thread Sean Owen

+1 same result from me as last time.

On Thu, Mar 30, 2023 at 3:21 AM Xinrong Meng 
wrote:

> Please vote on releasing the following candidate(RC5) as Apache Spark
> version 3.4.0.
>
> The vote is open until 11:59pm Pacific time *April 4th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v3.4.0-rc5* (commit
> f39ad617d32a671e120464e4a75986241d72c487):
> https://github.com/apache/spark/tree/v3.4.0-rc5
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1439
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc5.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Meng
>
>

Re: [VOTE] Release Apache Spark 3.4.0 (RC3)

2023-03-09 Thread Sean Owen

If the issue were just tags, then you can simply delete the tag and re-tag
the right commit. That doesn't change a commit log.
But is the issue that the relevant commits aren't in branch-3.4? Like I
don't see the usual release commits in
https://github.com/apache/spark/commits/branch-3.4
Yeah OK that needs a re-do.

We can still test this release.
It works for me, except that I still get the weird infinite-compile-loop
issue that doesn't seem to be related to Spark. The Spark Connect parts
seem to work.

On Thu, Mar 9, 2023 at 3:25 PM Dongjoon Hyun 
wrote:

> No~ We cannot in the AS-IS commit log status because it's screwed already
> as Emil wrote.
> Did you check the branch-3.2 commit log, Sean?
>
> Dongjoon.
>
>
> On Thu, Mar 9, 2023 at 11:42 AM Sean Owen  wrote:
>
>> We can just push the tags onto the branches as needed right? No need to
>> roll a new release
>>
>> On Thu, Mar 9, 2023, 1:36 PM Dongjoon Hyun 
>> wrote:
>>
>>> Yes, I also confirmed that the v3.4.0-rc3 tag is invalid.
>>>
>>> I guess we need RC4.
>>>
>>> Dongjoon.
>>>
>>> On Thu, Mar 9, 2023 at 7:13 AM Emil Ejbyfeldt
>>>  wrote:
>>>
>>>> It might being caused by the v3.4.0-rc3 tag not being part of the 3.4
>>>> branch branch-3.4:
>>>>
>>>> $ git log --pretty='format:%d %h' --graph origin/branch-3.4  v3.4.0-rc3
>>>> | head -n 10
>>>> *  (HEAD, origin/branch-3.4) e38e619946
>>>> *  f3e69a1fe2
>>>> *  74cf1a32b0
>>>> *  0191a5bde0
>>>> *  afced91348
>>>> | *  (tag: v3.4.0-rc3) b9be9ce15a
>>>> |/
>>>> *  006e838ede
>>>> *  fc29b07a31
>>>> *  8655dfe66d
>>>>
>>>>
>>>> Best,
>>>> Emil
>>>>
>>>> On 09/03/2023 15:50, yangjie01 wrote:
>>>> > HI, all
>>>> >
>>>> > I can't git check out the tag of v3.4.0-rc3. At the same time, there
>>>> is
>>>> > the following information on the Github page.
>>>> >
>>>> > Does anyone else have the same problem?
>>>> >
>>>> > Yang Jie
>>>> >
>>>> > *发件人**: *Xinrong Meng 
>>>> > *日期**: *2023年3月9日星期四20:05
>>>> > *收件人**: *dev 
>>>> > *主题**: *[VOTE] Release Apache Spark 3.4.0 (RC3)
>>>> >
>>>> > Please vote on releasing the following candidate(RC3) as Apache Spark
>>>> > version 3.4.0.
>>>> >
>>>> > The vote is open until 11:59pm Pacific time *March 14th* and passes
>>>> if a
>>>> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>> >
>>>> > [ ] +1 Release this package as Apache Spark 3.4.0
>>>> > [ ] -1 Do not release this package because ...
>>>> >
>>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>>> > <
>>>> https://mailshield.baidu.com/check?q=eJcUboQ1HRRomPZKEwRzpl69wA8DbI%2fNIiRNsQ%3d%3d
>>>> >
>>>> >
>>>> > The tag to be voted on is *v3.4.0-rc3* (commit
>>>> > b9be9ce15a82b18cca080ee365d308c0820a29a9):
>>>> > https://github.com/apache/spark/tree/v3.4.0-rc3
>>>> > <
>>>> https://mailshield.baidu.com/check?q=ScnsHLDD3dexVfW9cjs3GovMbG2LLAZqBLq9cA8V%2fTOpCQ1LdeNWoD0%2fy7eVo%2b3de8Rk%2bQ%3d%3d
>>>> >
>>>> >
>>>> > The release files, including signatures, digests, etc. can be found
>>>> at:
>>>> > https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc3-bin/
>>>> > <
>>>> https://mailshield.baidu.com/check?q=U%2fLs35p0l%2bUUTclb%2blAPSYb%2bALxMfer1Jc%2b3i965Bjh2CxHpG45RFLW0NqSwMx00Ci3MRMz%2b7mTmcKUIa27Pww%3d%3d
>>>> >
>>>> >
>>>> > Signatures used for Spark RCs can be found in this file:
>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>> > <
>>>> https://mailshield.baidu.com/check?q=E6fHbSXEWw02TTJBpc3bfA9mi7ea0YiWcNHkm%2fDJxwlaWinGnMdaoO1PahHhgj00vKwcbElpuHA%3d
>>>> >
>>>> >
>>>> > The staging repository for this release can be found at:
>>>> >
>>>> https://repository.apache.org/content/repositories/orgapachespark-1437
>>>> > <
>>>> https://mailshield.baidu.com/check?q=otrdG4krOioiB1q4MH%2fIEA444B80s7LLO8D2IdosERiNzIymKGZ2D1jV4O0JA9%2fRVf

Re: [VOTE] Release Apache Spark 3.4.0 (RC3)

2023-03-09 Thread Sean Owen

We can just push the tags onto the branches as needed right? No need to
roll a new release

On Thu, Mar 9, 2023, 1:36 PM Dongjoon Hyun  wrote:

> Yes, I also confirmed that the v3.4.0-rc3 tag is invalid.
>
> I guess we need RC4.
>
> Dongjoon.
>
> On Thu, Mar 9, 2023 at 7:13 AM Emil Ejbyfeldt
>  wrote:
>
>> It might being caused by the v3.4.0-rc3 tag not being part of the 3.4
>> branch branch-3.4:
>>
>> $ git log --pretty='format:%d %h' --graph origin/branch-3.4  v3.4.0-rc3
>> | head -n 10
>> *  (HEAD, origin/branch-3.4) e38e619946
>> *  f3e69a1fe2
>> *  74cf1a32b0
>> *  0191a5bde0
>> *  afced91348
>> | *  (tag: v3.4.0-rc3) b9be9ce15a
>> |/
>> *  006e838ede
>> *  fc29b07a31
>> *  8655dfe66d
>>
>>
>> Best,
>> Emil
>>
>> On 09/03/2023 15:50, yangjie01 wrote:
>> > HI, all
>> >
>> > I can't git check out the tag of v3.4.0-rc3. At the same time, there is
>> > the following information on the Github page.
>> >
>> > Does anyone else have the same problem?
>> >
>> > Yang Jie
>> >
>> > *发件人**: *Xinrong Meng 
>> > *日期**: *2023年3月9日星期四20:05
>> > *收件人**: *dev 
>> > *主题**: *[VOTE] Release Apache Spark 3.4.0 (RC3)
>> >
>> > Please vote on releasing the following candidate(RC3) as Apache Spark
>> > version 3.4.0.
>> >
>> > The vote is open until 11:59pm Pacific time *March 14th* and passes if
>> a
>> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.4.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> > <
>> https://mailshield.baidu.com/check?q=eJcUboQ1HRRomPZKEwRzpl69wA8DbI%2fNIiRNsQ%3d%3d
>> >
>> >
>> > The tag to be voted on is *v3.4.0-rc3* (commit
>> > b9be9ce15a82b18cca080ee365d308c0820a29a9):
>> > https://github.com/apache/spark/tree/v3.4.0-rc3
>> > <
>> https://mailshield.baidu.com/check?q=ScnsHLDD3dexVfW9cjs3GovMbG2LLAZqBLq9cA8V%2fTOpCQ1LdeNWoD0%2fy7eVo%2b3de8Rk%2bQ%3d%3d
>> >
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc3-bin/
>> > <
>> https://mailshield.baidu.com/check?q=U%2fLs35p0l%2bUUTclb%2blAPSYb%2bALxMfer1Jc%2b3i965Bjh2CxHpG45RFLW0NqSwMx00Ci3MRMz%2b7mTmcKUIa27Pww%3d%3d
>> >
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> > <
>> https://mailshield.baidu.com/check?q=E6fHbSXEWw02TTJBpc3bfA9mi7ea0YiWcNHkm%2fDJxwlaWinGnMdaoO1PahHhgj00vKwcbElpuHA%3d
>> >
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1437
>> > <
>> https://mailshield.baidu.com/check?q=otrdG4krOioiB1q4MH%2fIEA444B80s7LLO8D2IdosERiNzIymKGZ2D1jV4O0JA9%2fRVfJje3xu6%2b33PB24x0R5V8ArX6BnzcYSkG5cHg%3d%3d
>> >
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc3-docs/
>> > <
>> https://mailshield.baidu.com/check?q=wA4vz1x6jiz0lcn1hQ0AhAiPk3gdFJbs7dSHwusppbgB4ph846QORuIJQzNRr8GzerMucW3FL7ADPE3radzpmm3er3g%3d
>> >
>> >
>> > The list of bug fixes going into 3.4.0 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12351465
>> > <
>> https://mailshield.baidu.com/check?q=hdSxPMAr37WGNHJRNA4Mh1JkSlqjUL%2bM8BgEclwc23ePHCBzkAjvhgnZa0N7SPRWAcgfoLXjX43CxJXmKnDj0LIElJs%3d
>> >
>> >
>> > This release is using the release script of the tag v3.4.0-rc3.
>> >
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with a out of date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.4.0?
>> > ===
>> > The current list of open tickets targeted at 3.4.0 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK
>> > <
>> https://mailshield.baidu.com/check?q=4UUpJqq41y71Gnuj0qTUYo6hTjqzT7oytN6x%2fvgC5XUtQUC8MfJ77tj7K70O%2f1QMmNoa1A%3d%3d>
>>  and
>> search for "Target Version/s" = 3.4.0
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> > In order to make ti

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-03 Thread Sean Owen

Oh OK, I thought this RC was meant to fix that.

On Fri, Mar 3, 2023 at 12:35 AM Jonathan Kelly 
wrote:

> I see that one too but have not investigated it myself. In the RC1 thread,
> it was mentioned that this occurs when running the tests via Maven but not
> via SBT. Does the test class path get set up differently when running via
> SBT vs. Maven?
>
> On Thu, Mar 2, 2023 at 5:37 PM Sean Owen  wrote:
>
>> Thanks, that's good to know. The workaround (deleting the thriftserver
>> target dir) works for me. Who knows?
>>
>> But I'm also still seeing:
>>
>> - simple udf *** FAILED ***
>>   io.grpc.StatusRuntimeException: INTERNAL:
>> org.apache.spark.sql.ClientE2ETestSuite
>>   at io.grpc.Status.asRuntimeException(Status.java:535)
>>   at
>> io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
>>   at org.apache.spark.sql.connect.client.SparkResult.org
>> $apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:61)
>>   at
>> org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:106)
>>   at
>> org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:123)
>>   at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2426)
>>   at org.apache.spark.sql.Dataset.withResult(Dataset.scala:2747)
>>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2425)
>>   at
>> org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$8(ClientE2ETestSuite.scala:85)
>>   at
>> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>>
>> On Thu, Mar 2, 2023 at 4:38 PM Jonathan Kelly 
>> wrote:
>>
>>> Yes, this issue has driven me quite crazy as well! I hit this issue for
>>> a long time when compiling the master branch and running tests. Strangely,
>>> it would only occur, as you say, when running the tests and not during an
>>> initial build that skips running the tests. (However, I have seen instances
>>> where it does occur even in the initial build with tests skipped, but only
>>> on AWS CodeBuild, not when building locally or on Amazon Linux.)
>>>
>>> I thought for a long time that I was alone in this bizarre issue, but I
>>> eventually found sbt#6183 <https://github.com/sbt/sbt/issues/6183> and
>>> SPARK-41063 <https://issues.apache.org/jira/browse/SPARK-41063>, but
>>> both are unfortunately still open.
>>>
>>> I found at one point that the issue magically disappeared once
>>> [SPARK-41408] <https://issues.apache.org/jira/browse/SPARK-41408>[BUILD]
>>> Upgrade scala-maven-plugin to 4.8.0
>>> <https://github.com/apache/spark/commit/a3a755d36136295473a4873a6df33c295c29213e>
>>>  was
>>> merged, but then it cropped back up again at some point after that, and I
>>> used git bisect to find that the issue appeared again when [SPARK-27561]
>>> <https://issues.apache.org/jira/browse/SPARK-27561>[SQL] Support
>>> implicit lateral column alias resolution on Project
>>> <https://github.com/apache/spark/commit/7e9b88bfceb86d3b32e82a86b672aab3c74def8c>
>>>  was
>>> merged. This commit didn't even directly affect anything in
>>> hive-thriftserver, but it does make some pretty big changes to pretty core
>>> classes in sql/catalyst, so it's not too surprising that this could trigger
>>> an issue that seems to have to do with "very complicated inheritance
>>> hierarchies involving both Java and Scala", which is a phrase mentioned on
>>> sbt#6183 <https://github.com/sbt/sbt/issues/6183>.
>>>
>>> One thing that I did find to help was to
>>> delete sql/hive-thriftserver/target between building Spark and running the
>>> tests. This helps in my builds where the issue only occurs during the
>>> testing phase and not during the initial build phase, but of course it
>>> doesn't help in my builds where the issue occurs during that first build
>>> phase.
>>>
>>> ~ Jonathan Kelly
>>>
>>> On Thu, Mar 2, 2023 at 1:47 PM Sean Owen  wrote:
>>>
>>>> Has anyone seen this behavior -- I've never seen it before. The Hive
>>>> thriftserver module for me just goes into an infinite loop when running
>>>> tests:
>>>>
>>>> ...
>>>> [INFO] done compiling
>>>> [INFO] compiling 22 Scala sources and 24 Java sources to
>>>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
>>>> ...
>>>>

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-02 Thread Sean Owen

Thanks, that's good to know. The workaround (deleting the thriftserver
target dir) works for me. Who knows?

But I'm also still seeing:

- simple udf *** FAILED ***
  io.grpc.StatusRuntimeException: INTERNAL:
org.apache.spark.sql.ClientE2ETestSuite
  at io.grpc.Status.asRuntimeException(Status.java:535)
  at
io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
  at org.apache.spark.sql.connect.client.SparkResult.org
$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:61)
  at
org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:106)
  at
org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:123)
  at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2426)
  at org.apache.spark.sql.Dataset.withResult(Dataset.scala:2747)
  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2425)
  at
org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$8(ClientE2ETestSuite.scala:85)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)

On Thu, Mar 2, 2023 at 4:38 PM Jonathan Kelly 
wrote:

> Yes, this issue has driven me quite crazy as well! I hit this issue for a
> long time when compiling the master branch and running tests. Strangely, it
> would only occur, as you say, when running the tests and not during an
> initial build that skips running the tests. (However, I have seen instances
> where it does occur even in the initial build with tests skipped, but only
> on AWS CodeBuild, not when building locally or on Amazon Linux.)
>
> I thought for a long time that I was alone in this bizarre issue, but I
> eventually found sbt#6183 <https://github.com/sbt/sbt/issues/6183> and
> SPARK-41063 <https://issues.apache.org/jira/browse/SPARK-41063>, but both
> are unfortunately still open.
>
> I found at one point that the issue magically disappeared once
> [SPARK-41408] <https://issues.apache.org/jira/browse/SPARK-41408>[BUILD]
> Upgrade scala-maven-plugin to 4.8.0
> <https://github.com/apache/spark/commit/a3a755d36136295473a4873a6df33c295c29213e>
>  was
> merged, but then it cropped back up again at some point after that, and I
> used git bisect to find that the issue appeared again when [SPARK-27561]
> <https://issues.apache.org/jira/browse/SPARK-27561>[SQL] Support implicit
> lateral column alias resolution on Project
> <https://github.com/apache/spark/commit/7e9b88bfceb86d3b32e82a86b672aab3c74def8c>
>  was
> merged. This commit didn't even directly affect anything in
> hive-thriftserver, but it does make some pretty big changes to pretty core
> classes in sql/catalyst, so it's not too surprising that this could trigger
> an issue that seems to have to do with "very complicated inheritance
> hierarchies involving both Java and Scala", which is a phrase mentioned on
> sbt#6183 <https://github.com/sbt/sbt/issues/6183>.
>
> One thing that I did find to help was to
> delete sql/hive-thriftserver/target between building Spark and running the
> tests. This helps in my builds where the issue only occurs during the
> testing phase and not during the initial build phase, but of course it
> doesn't help in my builds where the issue occurs during that first build
> phase.
>
> ~ Jonathan Kelly
>
> On Thu, Mar 2, 2023 at 1:47 PM Sean Owen  wrote:
>
>> Has anyone seen this behavior -- I've never seen it before. The Hive
>> thriftserver module for me just goes into an infinite loop when running
>> tests:
>>
>> ...
>> [INFO] done compiling
>> [INFO] compiling 22 Scala sources and 24 Java sources to
>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
>> ...
>> [INFO] done compiling
>> [INFO] compiling 22 Scala sources and 9 Java sources to
>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
>> ...
>> [WARNING] [Warn]
>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:25:29:
>>  [deprecation] GnuParser in org.apache.commons.cli has been deprecated
>> [WARNING] [Warn]
>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HiveAuthFactory.java:333:18:
>>  [deprecation] authorize(UserGroupInformation,String,Configuration) in
>> ProxyUsers has been deprecated
>> [WARNING] [Warn]
>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java:110:16:
>>  [deprecation] HIVE_SERVER2_THRIFT_HTTP_COOKIE_IS_SECURE in ConfVars has
>> been deprecated
>> [WARNING] [Warn]
>> /mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apach

Re: [VOTE] Release Apache Spark 3.4.0 (RC2)

2023-03-02 Thread Sean Owen

Has anyone seen this behavior -- I've never seen it before. The Hive
thriftserver module for me just goes into an infinite loop when running
tests:

...
[INFO] done compiling
[INFO] compiling 22 Scala sources and 24 Java sources to
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
...
[INFO] done compiling
[INFO] compiling 22 Scala sources and 9 Java sources to
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/target/scala-2.12/classes
...
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:25:29:
 [deprecation] GnuParser in org.apache.commons.cli has been deprecated
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HiveAuthFactory.java:333:18:
 [deprecation] authorize(UserGroupInformation,String,Configuration) in
ProxyUsers has been deprecated
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java:110:16:
 [deprecation] HIVE_SERVER2_THRIFT_HTTP_COOKIE_IS_SECURE in ConfVars has
been deprecated
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftHttpServlet.java:553:53:
 [deprecation] HttpUtils in javax.servlet.http has been deprecated
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:185:24:
 [deprecation] OptionBuilder in org.apache.commons.cli has been deprecated
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:187:10:
 [static] static method should be qualified by type name, OptionBuilder,
instead of by an expression
[WARNING] [Warn]
/mnt/data/testing/spark-3.4.0/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:197:26:
 [deprecation] GnuParser in org.apache.commons.cli has been deprecated
...

... repeated over and over.

On Thu, Mar 2, 2023 at 6:04 AM Xinrong Meng 
wrote:

> Please vote on releasing the following candidate(RC2) as Apache Spark
> version 3.4.0.
>
> The vote is open until 11:59pm Pacific time *March 7th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v3.4.0-rc2* (commit
> 759511bb59b206ac5ff18f377c239a2f38bf5db6):
> https://github.com/apache/spark/tree/v3.4.0-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1436
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc2-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc2.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Me

Re: [Question] LimitedInputStream license issue in Spark source.

2023-03-01 Thread Sean Owen

Right, it contains ALv2 licensed code attributed to two authors - some is
from Guava, some is from Apache Spark contributors.
I thought this is how we should handle this. It's not feasible to go line
by line and say what came from where.

On Wed, Mar 1, 2023 at 1:33 AM Dongjoon Hyun 
wrote:

> May I ask why do you thinkn in that way? Could you elaborate a little more
> about your concerns if you mean it from a legal perspective?
>
> > The ASF header states "Licensed to the Apache Software Foundation (ASF)
> under one or more contributor license agreements.”
> > I ‘m not sure this is true with this file even though both Spark and
> this file are under the ALv2 license.
>
> On Tue, Feb 28, 2023 at 11:26 PM Justin Mclean 
> wrote:
>
>> Hi,
>>
>> The issue is not the original header it is the addition of the ASF
>> header. The ASF header states "Licensed to the Apache Software Foundation
>> (ASF) under one or more contributor license agreements.” I ‘m not sure this
>> is true with this file even though both Spark and this file are under the
>> ALv2 license.
>>
>> Kind Regards,
>> Justin
>
>

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Sean Owen

FWIW I agree with this.

On Wed, Feb 22, 2023 at 2:59 PM Allan Folting  wrote:

> Hi all,
>
> I would like to propose that we show Python code examples first in the
> Spark documentation where we have multiple programming language examples.
> An example is on the Quick Start page:
> https://spark.apache.org/docs/latest/quick-start.html
>
> I propose this change because Python has become more popular than the
> other languages supported in Apache Spark. There are a lot more users of
> Spark in Python than Scala today and Python attracts a broader set of new
> users.
> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.
>
> Also, this change aligns with Python already being the first tab on our
> home page:
> https://spark.apache.org/
>
> Anyone who wants to use another language can still just click on the other
> tabs.
>
> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide page
> as a first step:
> https://github.com/apache/spark/pull/40087
>
>
> I would appreciate it if you could share your thoughts on this proposal.
>
>
> Thanks a lot,
> Allan Folting
>

Re: [DISCUSS] Make release cadence predictable

2023-02-15 Thread Sean Owen

I don't think there is a delay per se, because there is no hard release
date to begin with, to delay with respect to. It's been driven by, "feels
like enough stuff has gone in" and "someone is willing to roll a release",
and that happens more like every 8-9 months. This would be a shift not only
in expectation - lower the threshold for 'enough stuff has gone in' to
probably match a 6 month cadence - but also a shift in policy to a release
train-like process. If something isn't ready then it just waits another 6
months.

You're right, the problem is kind of - what is something is in process in a
half-baked state? you don't really want to release half a thing, nor do you
want to develop it quite separately from the master branch.
It is worth asking what prompts this, too. Just, we want to release earlier
and more often?

On Wed, Feb 15, 2023 at 1:19 PM Maciej  wrote:

> Hi,
>
> Sorry for a silly question, but do we know what exactly caused these
> delays? Are these avoidable?
>
> It is not a systematic observation, but my general impression is that we
> rarely delay for sake of individual features, unless there is some soft
> consensus about their importance. Arguably, these could be postponed,
> assuming we can adhere to the schedule.
>
> And then, we're left with large, multi-task features. A lot can be done
> with proper timing and design, but in our current process there is no way
> to guarantee that each of these can be delivered within given time window.
> How are we going to handle these? Delivering half-baked things is hardly
> satisfying solution and more rigid schedule can only increase pressure on
> maintainers. Do we plan to introduce something like feature branches for
> these, to isolate upcoming release in case of delay?
>
> On 2/14/23 19:53, Dongjoon Hyun wrote:
>
> +1 for Hyukjin and Sean's opinion.
>
> Thank you for initiating this discussion.
>
> If we have a fixed-predefined regular 6-month, I believe we can persuade
> the incomplete features to wait for next releases more easily.
>
> In addition, I want to add the first RC1 date requirement because RC1
> always did a great job for us.
>
> I guess `branch-cut + 1M (no later than 1month)` could be the reasonable
> deadline.
>
> Thanks,
> Dongjoon.
>
>
> On Tue, Feb 14, 2023 at 6:33 AM Sean Owen  wrote:
>
>> I'm fine with shifting to a stricter cadence-based schedule. Sometimes,
>> it'll mean some significant change misses a release rather than delays it.
>> If people are OK with that discipline, sure.
>> A hard 6-month cycle would mean the minor releases are more frequent and
>> have less change in them. That's probably OK. We could also decide to
>> choose a longer cadence like 9 months, but I don't know if that's better.
>> I assume maintenance releases would still be as-needed, and major
>> releases would also work differently - probably no 4.0 until next year at
>> the earliest.
>>
>> On Tue, Feb 14, 2023 at 3:01 AM Hyukjin Kwon  wrote:
>>
>>> Hi all,
>>>
>>> *TL;DR*: Branch cut for every 6 months (January and July).
>>>
>>> I would like to discuss/propose to make our release cadence predictable.
>>> In our documentation, we mention as follows:
>>>
>>> In general, feature (“minor”) releases occur about every 6 months. Hence,
>>> Spark 2.3.0 would generally be released about 6 months after 2.2.0.
>>>
>>> However, the reality is slightly different. Here is the time it took for
>>> the recent releases:
>>>
>>>- Spark 3.3.0 took 8 months
>>>- Spark 3.2.0 took 7 months
>>>- Spark 3.1 took 9 months
>>>
>>> Here are problems caused by such delay:
>>>
>>>- The whole related schedules are affected in all downstream
>>>projects, vendors, etc.
>>>- It makes the release date unpredictable to the end users.
>>>- Developers as well as the release managers have to rush because of
>>>the delay, which prevents us from focusing on having a proper
>>>regression-free release.
>>>
>>> My proposal is to branch cut every 6 months (January and July that
>>> avoids the public holidays / vacation period in general) so the release can
>>> happen twice
>>> every year regardless of the actual release date.
>>> I believe it both makes the release cadence predictable, and relaxes the
>>> burden about making releases.
>>>
>>> WDYT?
>>>
>>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
>

Re: [DISCUSS] Make release cadence predictable

2023-02-14 Thread Sean Owen

I'm fine with shifting to a stricter cadence-based schedule. Sometimes,
it'll mean some significant change misses a release rather than delays it.
If people are OK with that discipline, sure.
A hard 6-month cycle would mean the minor releases are more frequent and
have less change in them. That's probably OK. We could also decide to
choose a longer cadence like 9 months, but I don't know if that's better.
I assume maintenance releases would still be as-needed, and major releases
would also work differently - probably no 4.0 until next year at the
earliest.

On Tue, Feb 14, 2023 at 3:01 AM Hyukjin Kwon  wrote:

> Hi all,
>
> *TL;DR*: Branch cut for every 6 months (January and July).
>
> I would like to discuss/propose to make our release cadence predictable.
> In our documentation, we mention as follows:
>
> In general, feature (“minor”) releases occur about every 6 months. Hence,
> Spark 2.3.0 would generally be released about 6 months after 2.2.0.
>
> However, the reality is slightly different. Here is the time it took for
> the recent releases:
>
>- Spark 3.3.0 took 8 months
>- Spark 3.2.0 took 7 months
>- Spark 3.1 took 9 months
>
> Here are problems caused by such delay:
>
>- The whole related schedules are affected in all downstream projects,
>vendors, etc.
>- It makes the release date unpredictable to the end users.
>- Developers as well as the release managers have to rush because of
>the delay, which prevents us from focusing on having a proper
>regression-free release.
>
> My proposal is to branch cut every 6 months (January and July that avoids
> the public holidays / vacation period in general) so the release can happen
> twice
> every year regardless of the actual release date.
> I believe it both makes the release cadence predictable, and relaxes the
> burden about making releases.
>
> WDYT?
>

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-13 Thread Sean Owen

Agree, just, if it's such a tiny change, and it actually fixes the issue,
maybe worth getting that into 3.3.x. I don't feel strongly.

On Mon, Feb 13, 2023 at 11:19 AM L. C. Hsieh  wrote:

> If it is not supported in Spark 3.3.x, it looks like an improvement at
> Spark 3.4.
> For such cases we usually do not back port. I think this is also why
> the PR did not back port when it was merged.
>
> I'm okay if there is consensus to back port it.
>
>

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-13 Thread Sean Owen

Does that change change the result for Spark 3.3.x?
It looks like we do not support Python 3.11 in Spark 3.3.x, which is one
answer to whether this should be changed now.
But if that's the only change that matters for Python 3.11 and makes it
work, sure I think we should back-port. It doesn't necessarily block a
release but if that's the case, it seems OK to include to me in a next RC.

On Mon, Feb 13, 2023 at 10:53 AM Bjørn Jørgensen 
wrote:

> There is a fix for python 3.11 https://github.com/apache/spark/pull/38987
> We should have this in more branches.
>
> man. 13. feb. 2023 kl. 09:39 skrev Bjørn Jørgensen <
> bjornjorgen...@gmail.com>:
>
>> On manjaro it is Python 3.10.9
>>
>> On ubuntu it is Python 3.11.1
>>
>> man. 13. feb. 2023 kl. 03:24 skrev yangjie01 :
>>
>>> Which Python version do you use for testing? When I use the latest
>>> Python 3.11, I can reproduce similar test failures (43 tests of sql module
>>> fail), but when I use python 3.10, they will succeed
>>>
>>>
>>>
>>> YangJie
>>>
>>>
>>>
>>> *发件人**: *Bjørn Jørgensen 
>>> *日期**: *2023年2月13日 星期一 05:09
>>> *收件人**: *Sean Owen 
>>> *抄送**: *"L. C. Hsieh" , Spark dev list <
>>> dev@spark.apache.org>
>>> *主题**: *Re: [VOTE] Release Spark 3.3.2 (RC1)
>>>
>>>
>>>
>>> Tried it one more time and the same result.
>>>
>>>
>>>
>>> On another box with Manjaro
>>>
>>> 
>>> [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
>>> [INFO]
>>> [INFO] Spark Project Parent POM ... SUCCESS
>>> [01:50 min]
>>> [INFO] Spark Project Tags . SUCCESS [
>>> 17.359 s]
>>> [INFO] Spark Project Sketch ... SUCCESS [
>>> 12.517 s]
>>> [INFO] Spark Project Local DB . SUCCESS [
>>> 14.463 s]
>>> [INFO] Spark Project Networking ... SUCCESS
>>> [01:07 min]
>>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>>>  9.013 s]
>>> [INFO] Spark Project Unsafe ... SUCCESS [
>>>  8.184 s]
>>> [INFO] Spark Project Launcher . SUCCESS [
>>> 10.454 s]
>>> [INFO] Spark Project Core . SUCCESS
>>> [23:58 min]
>>> [INFO] Spark Project ML Local Library . SUCCESS [
>>> 21.218 s]
>>> [INFO] Spark Project GraphX ... SUCCESS
>>> [01:24 min]
>>> [INFO] Spark Project Streaming  SUCCESS
>>> [04:57 min]
>>> [INFO] Spark Project Catalyst . SUCCESS
>>> [08:00 min]
>>> [INFO] Spark Project SQL .. SUCCESS [
>>>  01:02 h]
>>> [INFO] Spark Project ML Library ... SUCCESS
>>> [14:38 min]
>>> [INFO] Spark Project Tools  SUCCESS [
>>>  4.394 s]
>>> [INFO] Spark Project Hive . SUCCESS
>>> [53:43 min]
>>> [INFO] Spark Project REPL . SUCCESS
>>> [01:16 min]
>>> [INFO] Spark Project Assembly . SUCCESS [
>>>  2.186 s]
>>> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [
>>> 16.150 s]
>>> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS
>>> [01:34 min]
>>> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS
>>> [32:55 min]
>>> [INFO] Spark Project Examples . SUCCESS [
>>> 23.800 s]
>>> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [
>>>  7.301 s]
>>> [INFO] Spark Avro . SUCCESS
>>> [01:19 min]
>>> [INFO]
>>> 
>>> [INFO] BUILD SUCCESS
>>> [INFO]
>>> 
>>> [INFO] Total time:  03:31 h
>>> [INFO] Finished at: 2023-02-12T21:54:20+01:00
>>> [INFO]
>>> 
>>> [bjorn@amd7g spark-3.3.2]$  java -version
>>

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-11 Thread Sean Owen

+1 The tests and all results were the same as ever for me (Java 11, Scala
2.13, Ubuntu 22.04)
I also didn't see that issue ... maybe somehow locale related? which could
still be a bug.

On Sat, Feb 11, 2023 at 8:49 PM L. C. Hsieh  wrote:

> Thank you for testing it.
>
> I was going to run it again but still didn't see any errors.
>
> I also checked CI (and looked again now) on branch-3.3 before cutting RC.
>
> BTW, I didn't find an actual test failure (i.e. "- test_name ***
> FAILED ***") in the log file.
>
> Maybe it is due to the dev env? What dev env you're using to run the test?
>
>
> On Sat, Feb 11, 2023 at 8:58 AM Bjørn Jørgensen
>  wrote:
> >
> >
> > ./build/mvn clean package
> >
> > Run completed in 1 hour, 18 minutes, 29 seconds.
> > Total number of tests run: 11652
> > Suites: completed 516, aborted 0
> > Tests: succeeded 11609, failed 43, canceled 8, ignored 57, pending 0
> > *** 43 TESTS FAILED ***
> > [INFO]
> 
> > [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
> > [INFO]
> > [INFO] Spark Project Parent POM ... SUCCESS [
> 3.418 s]
> > [INFO] Spark Project Tags . SUCCESS [
> 17.845 s]
> > [INFO] Spark Project Sketch ... SUCCESS [
> 20.791 s]
> > [INFO] Spark Project Local DB . SUCCESS [
> 16.527 s]
> > [INFO] Spark Project Networking ... SUCCESS
> [01:03 min]
> > [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
> 9.914 s]
> > [INFO] Spark Project Unsafe ... SUCCESS [
> 12.007 s]
> > [INFO] Spark Project Launcher . SUCCESS [
> 7.620 s]
> > [INFO] Spark Project Core . SUCCESS
> [40:04 min]
> > [INFO] Spark Project ML Local Library . SUCCESS [
> 29.997 s]
> > [INFO] Spark Project GraphX ... SUCCESS
> [02:33 min]
> > [INFO] Spark Project Streaming  SUCCESS
> [05:51 min]
> > [INFO] Spark Project Catalyst . SUCCESS
> [13:29 min]
> > [INFO] Spark Project SQL .. FAILURE [
> 01:25 h]
> > [INFO] Spark Project ML Library ... SKIPPED
> > [INFO] Spark Project Tools  SKIPPED
> > [INFO] Spark Project Hive . SKIPPED
> > [INFO] Spark Project REPL . SKIPPED
> > [INFO] Spark Project Assembly . SKIPPED
> > [INFO] Kafka 0.10+ Token Provider for Streaming ... SKIPPED
> > [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
> > [INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
> > [INFO] Spark Project Examples . SKIPPED
> > [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
> > [INFO] Spark Avro . SKIPPED
> > [INFO]
> 
> > [INFO] BUILD FAILURE
> > [INFO]
> 
> > [INFO] Total time:  02:30 h
> > [INFO] Finished at: 2023-02-11T17:32:45+01:00
> >
> > lør. 11. feb. 2023 kl. 06:01 skrev L. C. Hsieh :
> >>
> >> Please vote on releasing the following candidate as Apache Spark
> version 3.3.2.
> >>
> >> The vote is open until Feb 15th 9AM (PST) and passes if a majority +1
> >> PMC votes are cast, with a minimum of 3 +1 votes.
> >>
> >> [ ] +1 Release this package as Apache Spark 3.3.2
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see https://spark.apache.org/
> >>
> >> The tag to be voted on is v3.3.2-rc1 (commit
> >> 5103e00c4ce5fcc4264ca9c4df12295d42557af6):
> >> https://github.com/apache/spark/tree/v3.3.2-rc1
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1433/
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.2-rc1-docs/
> >>
> >> The list of bug fixes going into 3.3.2 can be found at the following
> URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/12352299
> >>
> >> This release is using the release script of the tag v3.3.2-rc1.
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >>
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark work

Re: Building Spark to run PySpark Tests?

2023-01-19 Thread Sean Owen

It's not clear what error you're facing from this info (ConnectionError
could mean lots of things), so would be hard to generalize answers. How
much mem do you have on your Mac?
-Xmx2g sounds low, but also probably doesn't matter much.
Spark builds work on my Mac, FWIW.

On Thu, Jan 19, 2023 at 10:15 AM Adam Chhina  wrote:

> Hmm, would there be a list of common env issues that would interfere with
> builds? Looking up the error message, it seemed like often the issue was
> OOM by the JVM process. I’m not sure if that’s what’s happening here, since
> during the build and setting up the tests the config should have allocated
> enough memory?
>
> I’ve been just trying to follow the build docs, and so far I’m running as
> such:
>
> > git clone --branch v3.2.3 https://github.com/apache/spark.git
> > cd spark
> > export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g” // was
> unset, but set to be safe
> > export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES // I saw in the
> developer tools that some pyspark tests were having issues on macOS
> > export JAVA_HOME=`/usr/libexec/java_home -v 11`
> > ./build/mvn -DskipTests clean package -Phive
> > ./python/run-tests --python-executables --testnames
> ‘pyspark.tests.test_broadcast'
>
> > java -version
>
> openjdk version "11.0.17" 2022-10-18
>
> OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>
> OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>
>
> > OS
>
> Ventura 13.1 (22C65)
>
>
> Best,
>
>
> Adam Chhina
>
> On Jan 18, 2023, at 6:50 PM, Sean Owen  wrote:
>
> Release _branches_ are tested as commits arrive to the branch, yes. That's
> what you see at https://github.com/apache/spark/actions
> Released versions are fixed, they don't change, and were also manually
> tested before release, so no they are not re-tested; there is no need.
>
> You presumably have some local env issue, because the source of Spark
> 3.2.3 was passing CI/CD at time of release as well as manual tests of the
> PMC.
>
>
> On Wed, Jan 18, 2023 at 5:24 PM Adam Chhina  wrote:
>
>> Hi Sean,
>>
>> That’s fair in regards to 3.3.x being the current release branch. I’m not
>> familiar with the testing schedule, but I had assumed all currently
>> supported release versions would have some nightly/weekly tests ran; is
>> that not the case? I only ask, as when I when I’m seeing these test
>> failures, I assumed these were either known/unknown from some recurring
>> testing pipeline.
>>
>> Also, unfortunately using v3.2.3 also had the same test failures.
>>
>> > git clone --branch v3.2.3 https://github.com/apache/spark.git
>>
>> I’ve posted the traceback below for one of the ran tests. At the end it
>> mentioned to check the logs - `see logs`. However I wasn’t sure whether
>> that just meant the traceback or some more detailed logs elsewhere? I
>> wasn’t able to see any files that looked relevant running `find . -name
>> “*logs*”` afterwards. Sorry if I’m missing something obvious.
>>
>> ```
>> test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest)
>> ... ERROR
>> test_broadcast_value_against_gc
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_no_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_with_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>>
>> ==
>> ERROR: test_broadcast_with_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest)
>> --
>> Traceback (most recent call last):
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in
>> test_broadcast_with_encryption
>> self._test_multiple_broadcasts(("spark.io.encryption.enabled",
>> "true"))
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in
>> _test_multiple_broadcasts
>> conf = SparkConf()
>>   File "$path/spark/python/pyspark/conf.py", line 120, in __init__
>> self._jconf = _jvm.SparkConf(loadDefaults)
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
>> 1709, in __getattr__
>> answer = self._gateway_client.send_command(
>>   File
>> "$path/spark/python/lib/py4j-0.

Re: Can you create an apache jira account for me? Thanks very much!

2023-01-19 Thread Sean Owen

I can help offline. Send me your preferred JIRA user name.

On Thu, Jan 19, 2023 at 7:12 AM Wei Yan  wrote:

> When I tried to sign up through this site:
> https://issues.apache.org/jira/secure/Signup!default.jspa
> I got an error message:"Sorry, you can't sign up to this Jira site at the
> moment as it's private."
> and I got a suggestion:"If you think you should be able to sign up then
> you should let the Jira administrator know".
> So I think I need some help.
>
>
>

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen

Release _branches_ are tested as commits arrive to the branch, yes. That's
what you see at https://github.com/apache/spark/actions
Released versions are fixed, they don't change, and were also manually
tested before release, so no they are not re-tested; there is no need.

You presumably have some local env issue, because the source of Spark 3.2.3
was passing CI/CD at time of release as well as manual tests of the PMC.


On Wed, Jan 18, 2023 at 5:24 PM Adam Chhina  wrote:

> Hi Sean,
>
> That’s fair in regards to 3.3.x being the current release branch. I’m not
> familiar with the testing schedule, but I had assumed all currently
> supported release versions would have some nightly/weekly tests ran; is
> that not the case? I only ask, as when I when I’m seeing these test
> failures, I assumed these were either known/unknown from some recurring
> testing pipeline.
>
> Also, unfortunately using v3.2.3 also had the same test failures.
>
> > git clone --branch v3.2.3 https://github.com/apache/spark.git
>
> I’ve posted the traceback below for one of the ran tests. At the end it
> mentioned to check the logs - `see logs`. However I wasn’t sure whether
> that just meant the traceback or some more detailed logs elsewhere? I
> wasn’t able to see any files that looked relevant running `find . -name
> “*logs*”` afterwards. Sorry if I’m missing something obvious.
>
> ```
> test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest)
> ... ERROR
> test_broadcast_value_against_gc
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_value_driver_encryption
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_value_driver_no_encryption
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_with_encryption
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>
> ==
> ERROR: test_broadcast_with_encryption
> (pyspark.tests.test_broadcast.BroadcastTest)
> --
> Traceback (most recent call last):
>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in
> test_broadcast_with_encryption
> self._test_multiple_broadcasts(("spark.io.encryption.enabled", "true"))
>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in
> _test_multiple_broadcasts
> conf = SparkConf()
>   File "$path/spark/python/pyspark/conf.py", line 120, in __init__
> self._jconf = _jvm.SparkConf(loadDefaults)
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
> 1709, in __getattr__
> answer = self._gateway_client.send_command(
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
> 1036, in send_command
> connection = self._get_connection()
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
> 284, in _get_connection
> connection = self._create_new_connection()
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
> 291, in _create_new_connection
> connection.connect_to_java_server()
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
> 438, in connect_to_java_server
> self.socket.connect((self.java_address, self.java_port))
> ConnectionRefusedError: [Errno 61] Connection refused
>
> --
> Ran 7 tests in 12.950s
>
> FAILED (errors=7)
> sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>
>
> Had test failures in pyspark.tests.test_broadcast with
> /usr/local/bin/python3; see logs.
> ```
>
> Best,
>
> Adam Chhina
>
> On Jan 18, 2023, at 5:03 PM, Sean Owen  wrote:
>
> That isn't the released version either, but rather the head of the 3.2
> branch (which is beyond 3.2.3).
> You may want to check out the v3.2.3 tag instead:
> https://github.com/apache/spark/tree/v3.2.3
> ... instead of 3.2.1.
> But note of course the 3.3.x is the current release branch anyway.
>
> Hard to say what the error is without seeing more of the error log.
>
> That final warning is fine, just means you are using Java 11+.
>
>
> On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina  wrote:
>
>> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>>
>> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>>
>> Ah, so the old failing tests are passing now, but I am seeing failures in
>> `pyspark.tests.test_broadc

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen

That isn't the released version either, but rather the head of the 3.2
branch (which is beyond 3.2.3).
You may want to check out the v3.2.3 tag instead:
https://github.com/apache/spark/tree/v3.2.3
... instead of 3.2.1.
But note of course the 3.3.x is the current release branch anyway.

Hard to say what the error is without seeing more of the error log.

That final warning is fine, just means you are using Java 11+.


On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina  wrote:

> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>
> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>
> Ah, so the old failing tests are passing now, but I am seeing failures in 
> `pyspark.tests.test_broadcast`
> such as  `test_broadcast_value_against_gc`, with a majority of them
> failing due to `ConnectionRefusedError: [Errno 61] Connection refused`.
> Maybe these tests are not mean to be ran locally, and only in the pipeline?
>
> Also, I see this warning that mentions to notify the maintainers here:
>
> ```
>
> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>
> WARNING: An illegal reflective access operation has occurred
>
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
> (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor
> java.nio.DirectByteBuffer(long,int)
> ```
>
> FWIW, not sure if this matters, but python executable used for running
> these tests is `Python 3.10.9` under `/user/local/bin/python3`.
>
> Best,
>
> Adam Chhina
>
> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen 
> wrote:
>
> Replace
> > > git clone g...@github.com:apache/spark.git
> > > git checkout -b spark-321 v3.2.1
>
> with
> git clone --branch branch-3.2 https://github.com/apache/spark.git
> This will give you branch 3.2 as today, what I suppose you call upstream
>
> https://github.com/apache/spark/commits/branch-3.2
> and right now all tests in github action are passed :)
>
>
> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen :
>
>> Never seen those, but it's probably a difference in pandas, numpy
>> versions. You can see the current CICD test results in GitHub Actions. But,
>> you want to use release versions, not an RC. 3.2.1 is not the latest
>> version, and it's possible the tests were actually failing in the RC.
>>
>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina  wrote:
>>
>>> Bump,
>>>
>>> Just trying to see where I can find what tests are known failing for a
>>> particular release, to ensure I’m building upstream correctly following the
>>> build docs. I figured this would be the best place to ask as it pertains to
>>> building and testing upstream (also more than happy to provide a PR for any
>>> docs if required afterwards), however if there would be a more appropriate
>>> place, please let me know.
>>>
>>> Best,
>>>
>>> Adam Chhina
>>>
>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina 
>>> wrote:
>>> >
>>> > As part of an upgrade I was looking to run upstream PySpark unit tests
>>> on `v3.2.1-rc2` before I applied some downstream patches and tested those.
>>> However, I'm running into some issues with failing unit tests, which I'm
>>> not sure are failing upstream or due to some step I missed in the build.
>>> >
>>> > The current failing tests (at least so far, since I believe the python
>>> script exits on test failure):
>>> > ```
>>> > ==
>>> > FAIL: test_train_prediction
>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>> > Test that error on test data improves as model is trained.
>>> > --
>>> > Traceback (most recent call last):
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 474, in test_train_prediction
>>> > eventually(condition, timeout=180.0)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>> 86, in eventually
>>> > lastValue = condition()
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 469, in condition
>>> > self.assertGreater(errors[1] - errors[-1], 2)
>>> > AssertionError: 1.8960983527735014 not greater than 2
>>> >
>

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen

Never seen those, but it's probably a difference in pandas, numpy versions.
You can see the current CICD test results in GitHub Actions. But, you want
to use release versions, not an RC. 3.2.1 is not the latest version, and
it's possible the tests were actually failing in the RC.

On Wed, Jan 18, 2023, 10:57 AM Adam Chhina  wrote:

> Bump,
>
> Just trying to see where I can find what tests are known failing for a
> particular release, to ensure I’m building upstream correctly following the
> build docs. I figured this would be the best place to ask as it pertains to
> building and testing upstream (also more than happy to provide a PR for any
> docs if required afterwards), however if there would be a more appropriate
> place, please let me know.
>
> Best,
>
> Adam Chhina
>
> > On Dec 27, 2022, at 11:37 AM, Adam Chhina  wrote:
> >
> > As part of an upgrade I was looking to run upstream PySpark unit tests
> on `v3.2.1-rc2` before I applied some downstream patches and tested those.
> However, I'm running into some issues with failing unit tests, which I'm
> not sure are failing upstream or due to some step I missed in the build.
> >
> > The current failing tests (at least so far, since I believe the python
> script exits on test failure):
> > ```
> > ==
> > FAIL: test_train_prediction
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> > Test that error on test data improves as model is trained.
> > --
> > Traceback (most recent call last):
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 474, in test_train_prediction
> > eventually(condition, timeout=180.0)
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86,
> in eventually
> > lastValue = condition()
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 469, in condition
> > self.assertGreater(errors[1] - errors[-1], 2)
> > AssertionError: 1.8960983527735014 not greater than 2
> >
> > ==
> > FAIL: test_parameter_accuracy
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> > Test that the final value of weights is close to the desired value.
> > --
> > Traceback (most recent call last):
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 229, in test_parameter_accuracy
> > eventually(condition, timeout=60.0, catch_assertions=True)
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91,
> in eventually
> > raise lastValue
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82,
> in eventually
> > lastValue = condition()
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 226, in condition
> > self.assertAlmostEqual(rel, 0.1, 1)
> > AssertionError: 0.23052813480829393 != 0.1 within 1 places
> (0.13052813480829392 difference)
> >
> > ==
> > FAIL: test_training_and_prediction
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> > Test that the model improves on toy data with no. of batches
> > --
> > Traceback (most recent call last):
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 334, in test_training_and_prediction
> > eventually(condition, timeout=180.0)
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93,
> in eventually
> > raise AssertionError(
> > AssertionError: Test failed due to timeout after 180 sec, with last
> condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
> 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
> 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64,
> 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
> >
> > --
> > Ran 13 tests in 661.536s
> >
> > FAILED (failures=3, skipped=1)
> >
> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with
> /usr/local/bin/python3; see logs.
> > ```
> >
> > Here's how I'm currently building Spark, I was using the
> [building-spark](https://spark.apache.org/docs/3..1/building-spark.html)
> docs as a reference.
> > ```
> > > git clone g...@github.com:apache/spark.git
> > > git checkout -b spark-321 v3.2.1
> > > ./build/mvn -DskipTests clean package -Phive
> > > export JAVA_HOME=$(path/to/jdk/11)
> > > ./python/run-tests
> > ```
> >
> > Current Java version
> > ```
> > java -

Re: [VOTE] Release Spark 3.2.3 (RC1)

2022-11-15 Thread Sean Owen

+1 from me, at least from my testing. Java 8 + Scala 2.12 and Java 8 +
Scala 2.13 worked for me, and I didn't see a test hang. I am testing with
Python 3.10 FWIW.

On Tue, Nov 15, 2022 at 6:37 AM Yang,Jie(INF)  wrote:

> Hi, all
>
>
>
> I test v3.2.3 with following command:
>
>
>
> ```
>
> dev/change-scala-version.sh 2.13
>
> build/mvn clean install -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
> -Pscala-2.13 -fn
>
> ```
>
>
>
> The testing environment is:
>
>
>
> OS: CentOS 6u3 Final
>
> Java: zulu 11.0.17
>
> Python: 3.9.7
>
> Scala: 2.13
>
>
>
> The above test command has been executed twice, and all times hang in the
> following stack:
>
>
>
> ```
>
> "ScalaTest-main-running-JoinSuite" #1 prio=5 os_prio=0 cpu=312870.06ms
> elapsed=1552.65s tid=0x7f2ddc02d000 nid=0x7132 waiting on condition
> [0x7f2de3929000]
>
>java.lang.Thread.State: WAITING (parking)
>
>at jdk.internal.misc.Unsafe.park(java.base@11.0.17/Native Method)
>
>- parking to wait for  <0x000790d00050> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>
>at java.util.concurrent.locks.LockSupport.park(java.base@11.0.17
> /LockSupport.java:194)
>
>at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.17
> /AbstractQueuedSynchronizer.java:2081)
>
>at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.17
> /LinkedBlockingQueue.java:433)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:275)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$Lambda$9429/0x000802269840.apply(Unknown
> Source)
>
>at
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:228)
>
>- locked <0x000790d00208> (a java.lang.Object)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:370)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecute(AdaptiveSparkPlanExec.scala:355)
>
>at
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
>
>at
> org.apache.spark.sql.execution.SparkPlan$$Lambda$8573/0x000801f99c40.apply(Unknown
> Source)
>
>at
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
>
>at
> org.apache.spark.sql.execution.SparkPlan$$Lambda$8574/0x000801f9a040.apply(Unknown
> Source)
>
>at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>
>at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
>
>at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
>
>at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:172)
>
>- locked <0x000790d00218> (a
> org.apache.spark.sql.execution.QueryExecution)
>
>at
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:171)
>
>at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3247)
>
>- locked <0x000790d002d8> (a org.apache.spark.sql.Dataset)
>
>at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3245)
>
>at
> org.apache.spark.sql.QueryTest$.$anonfun$getErrorMessageInCheckAnswer$1(QueryTest.scala:265)
>
>at
> org.apache.spark.sql.QueryTest$$$Lambda$8564/0x000801f94440.apply$mcJ$sp(Unknown
> Source)
>
>at
> scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.scala:17)
>
>at
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>
>at
> org.apache.spark.sql.QueryTest$.getErrorMessageInCheckAnswer(QueryTest.scala:265)
>
>at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:242)
>
>at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:151)
>
>at org.apache.spark.sql.JoinSuite.checkAnswer(JoinSuite.scala:58)
>
>at
> org.apache.spark.sql.JoinSuite.$anonfun$new$138(JoinSuite.scala:1062)
>
>at
> org.apache.spark.sql.JoinSuite$$Lambda$2827/0x0008013d5840.apply$mcV$sp(Unknown
> Source)
>
>at
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
>
>at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>
>at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>
>at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>
>at org.scalatest.Transformer.apply(Transformer.scala:22)
>
>at org.scalatest.Transformer.apply(Transformer.scala:20)
>
>at
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.

Re: CVE-2022-42889

2022-10-27 Thread Sean Owen

Right. It seems there is only one direct use of that part of commons-text,
and it is not applied to user-supplied inputs (reads and substitutes into
error message templates).
At a glance I do not see how it would affect Spark; it's not impossible
that it does. In any event, commons-text is being updated anyway in branch
3.2 and later, so this will be updated in maintained branches eventually.
It missed the 3.3.1 release, but my message is, it's also not even clear it
matters to Spark.

I don't think this would become a Spark CVE; it affects commons-text.
Sometimes CVEs note other affected software products when they are
widely-used and very directly affected. But typically they would not list
every single downstream user, let alone generate separate CVEs, and in any
event here I do not see an argument that it affects Spark anyway.

On Thu, Oct 27, 2022 at 10:08 AM Pastrana, Rodrigo (RIS-BCT) <
rodrigo.pastr...@lexisnexisrisk.com> wrote:

> Thanks Sean,
>
> I assume Spark’s not affected because it either doesn’t reference the
> affected API(s) or because it does not unsafely utilize user input through
> the vulnerable API(s), but is there an official statement about this from
> Spark?
>
> We weren’t able to find references to 2022-42889 here:
> https://spark.apache.org/security.html (likely because Spark determined
> it is not affected?)
>
>
>
> *From:* Sean Owen 
> *Sent:* Thursday, October 27, 2022 10:27 AM
> *To:* Pastrana, Rodrigo (RIS-BCT)
> 
> *Cc:* dev@spark.apache.org
> *Subject:* Re: CVE-2022-42889
>
>
>
> You don't often get email from sro...@gmail.com. Learn why this is
> important <https://aka.ms/LearnAboutSenderIdentification>
>
>  External email: use caution 
>
>
>
> Probably a few months between maintenance releases.
>
> It does not appear to affect Spark, however.
>
>
>
> On Thu, Oct 27, 2022 at 9:24 AM Pastrana, Rodrigo (RIS-BCT) <
> rodrigo.pastr...@lexisnexisrisk.com.invalid> wrote:
>
> Hello,
>
> This issue (SPARK-40801)
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FSPARK-40801&data=05%7C01%7CRodrigo.Pastrana%40lexisnexisrisk.com%7C507dc12538bf44d2646d08dab8276a76%7C9274ee3f94254109a27f9fb15c10675d%7C0%7C0%7C638024776687375556%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=wZV1KpRw248DOPuWkJ2qjDNK9DwN4zFIgL6z2g0MOkw%3D&reserved=0>
> which addresses CVE-2022-42889 doesn’t seem to have been included in the
> latest release (3.3.1
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fspark.apache.org%2Freleases%2Fspark-release-3-3-1.html&data=05%7C01%7CRodrigo.Pastrana%40lexisnexisrisk.com%7C507dc12538bf44d2646d08dab8276a76%7C9274ee3f94254109a27f9fb15c10675d%7C0%7C0%7C638024776687375556%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=aJXVwPl36j83CFFM%2F1jKDhSIm7mCNwRozMpXCt8dvDQ%3D&reserved=0>
> ).
>
> Is there a way to estimate a timeline for the first release which includes
> that change (likely 3.3.2)? Much appreciation!
>
>
> --
>
> The information contained in this e-mail message is intended only for the
> personal and confidential use of the recipient(s) named above. This message
> may be an attorney-client communication and/or work product and as such is
> privileged and confidential. If the reader of this message is not the
> intended recipient or an agent responsible for delivering it to the
> intended recipient, you are hereby notified that you have received this
> document in error and that any review, dissemination, distribution, or
> copying of this message is strictly prohibited. If you have received this
> communication in error, please notify us immediately by e-mail, and delete
> the original message.
>
>
> --
> The information contained in this e-mail message is intended only for the
> personal and confidential use of the recipient(s) named above. This message
> may be an attorney-client communication and/or work product and as such is
> privileged and confidential. If the reader of this message is not the
> intended recipient or an agent responsible for delivering it to the
> intended recipient, you are hereby notified that you have received this
> document in error and that any review, dissemination, distribution, or
> copying of this message is strictly prohibited. If you have received this
> communication in error, please notify us immediately by e-mail, and delete
> the original message.
>

Re: CVE-2022-42889

2022-10-27 Thread Sean Owen

Probably a few months between maintenance releases.
It does not appear to affect Spark, however.

On Thu, Oct 27, 2022 at 9:24 AM Pastrana, Rodrigo (RIS-BCT)
 wrote:

> Hello,
>
> This issue (SPARK-40801)
>  which addresses
> CVE-2022-42889 doesn’t seem to have been included in the latest release (
> 3.3.1 ).
>
> Is there a way to estimate a timeline for the first release which includes
> that change (likely 3.3.2)? Much appreciation!
>
> --
> The information contained in this e-mail message is intended only for the
> personal and confidential use of the recipient(s) named above. This message
> may be an attorney-client communication and/or work product and as such is
> privileged and confidential. If the reader of this message is not the
> intended recipient or an agent responsible for delivering it to the
> intended recipient, you are hereby notified that you have received this
> document in error and that any review, dissemination, distribution, or
> copying of this message is strictly prohibited. If you have received this
> communication in error, please notify us immediately by e-mail, and delete
> the original message.
>

Re: Apache Spark 3.2.3 Release?

2022-10-18 Thread Sean Owen

OK by me, if someone is willing to drive it.

On Tue, Oct 18, 2022 at 11:47 AM Chao Sun  wrote:

> Hi All,
>
> It's been more than 3 months since 3.2.2 (tagged at Jul 11) was
> released There are now 66 patches accumulated in branch-3.2, including
> 2 correctness issues.
>
> Is it a good time to start a new release? If there's no objection, I'd
> like to volunteer as the release manager for the 3.2.3 release, and
> start preparing the first RC next week.
>
> # Correctness issues
>
> SPARK-39833Filtered parquet data frame count() and show() produce
> inconsistent results when spark.sql.parquet.filterPushdown is true
> SPARK-40002.   Limit improperly pushed down through window using ntile
> function
>
> Best,
> Chao
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

Re: [VOTE] Release Spark 3.3.1 (RC4)

2022-10-17 Thread Sean Owen

+1 from me, same as last time

On Sun, Oct 16, 2022 at 9:14 PM Yuming Wang  wrote:

> Please vote on releasing the following candidate as Apache Spark version 
> 3.3.1.
>
> The vote is open until 11:59pm Pacific time October 21th and passes if a 
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org
>
> The tag to be voted on is v3.3.1-rc4 (commit 
> fbbcf9434ac070dd4ced4fb9efe32899c6db12a9):
> https://github.com/apache/spark/tree/v3.3.1-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-bin
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1430
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-docs
>
> The list of bug fixes going into 3.3.1 can be found at the following URL:
> https://s.apache.org/ttgz6
>
> This release is using the release script of the tag v3.3.1-rc4.
>
>
> FAQ
>
> ==
> What happened to v3.3.1-rc3?
> ==
> A performance regression(SPARK-40703) was found after tagging v3.3.1-rc3, 
> which the Iceberg community hopes Spark 3.3.1 could fix.
> So we skipped the vote on v3.3.1-rc3.
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.1?
> ===
> The current list of open tickets targeted at 3.3.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 3.3.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>

Re: [VOTE] Release Spark 3.3.1 (RC2)

2022-10-11 Thread Sean Owen

Actually yeah that is how the release vote works by default at Apache:
https://www.apache.org/foundation/voting.html#ReleaseVotes

However I would imagine there is broad consent to just roll another RC if
there's any objection or -1. We could formally re-check the votes, as I
think the +1s would agree, but think we've defaulted into accepting a
'veto' if there are otherwise no objections.

On Tue, Oct 11, 2022 at 2:01 PM Jonathan Kelly 
wrote:

> Hi, Yuming,
>
> In your original email, you said that the vote "passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes". There were four +1 votes
> (all from PMC members) and one -1 (also from a PMC member), so shouldn't
> the vote pass because both requirements (majority +1 and minimum of 3 +1
> votes) were met?
>
> I don't personally mind either way if the vote is considered passed or
> failed (and I see you've already cut the v3.3.1-rc3 tag but haven't started
> the new vote yet), but I just wanted to ask for clarification on the
> requirements.
>
> Thank you,
> Jonathan Kelly
>
>
> On Wed, Oct 5, 2022 at 7:49 PM Yuming Wang  wrote:
>
>> Hi All,
>>
>> Thank you all for testing and voting!
>>
>> There's a -1 vote here, so I think this RC fails. I will prepare for
>> RC3 soon.
>>
>> On Tue, Oct 4, 2022 at 6:34 AM Mridul Muralidharan 
>> wrote:
>>
>>> +1 from me, with a few comments.
>>>
>>> I saw the following failures, are these known issues/flakey tests ?
>>>
>>> * PersistenceEngineSuite.ZooKeeperPersistenceEngine
>>> Looks like a port conflict issue from a quick look into logs (conflict
>>> with starting admin port at 8080) - is this expected behavior for the test ?
>>> I worked around it by shutting down the process which was using the port
>>> - though did not investigate deeply.
>>>
>>> * org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite was aborted
>>> It is expecting these artifacts in $HOME/.m2/repository
>>>
>>> 1. tomcat#jasper-compiler;5.5.23!jasper-compiler.jar
>>> 2. tomcat#jasper-runtime;5.5.23!jasper-runtime.jar
>>> 3. commons-el#commons-el;1.0!commons-el.jar
>>> 4. org.apache.hive#hive-exec;2.3.7!hive-exec.jar
>>>
>>> I worked around it by adding them locally explicitly - we should
>>> probably  add them as test dependency ?
>>> Not sure if this changed in this release though (I had cleaned my local
>>> .m2 recently)
>>>
>>> Other than this, rest looks good to me.
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Wed, Sep 28, 2022 at 2:56 PM Sean Owen  wrote:
>>>
>>>> +1 from me, same result as last RC.
>>>>
>>>> On Wed, Sep 28, 2022 at 12:21 AM Yuming Wang  wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark version 
>>>>> 3.3.1.
>>>>>
>>>>> The vote is open until 11:59pm Pacific time October 3th and passes if a 
>>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 3.3.1
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org
>>>>>
>>>>> The tag to be voted on is v3.3.1-rc2 (commit 
>>>>> 1d3b8f7cb15283a1e37ecada6d751e17f30647ce):
>>>>> https://github.com/apache/spark/tree/v3.3.1-rc2
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-bin
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1421
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-docs
>>>>>
>>>>> The list of bug fixes going into 3.3.1 can be found at the following URL:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12351710
>>>>>
>>>>> This release is using the release script of the tag v3.3.1-rc2.
>>>>>
>>>&

Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Sean Owen

I'm OK with this. It simplifies maintenance a bit, and specifically may
allow us to finally move off of the ancient version of Guava (?)

On Mon, Oct 3, 2022 at 10:16 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> I'm wondering if the following Apache Spark Hadoop2 Binary Distribution
> is still used by someone in the community or not. If it's not used or not
> useful,
> we may remove it from Apache Spark 3.4.0 release.
>
>
> https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz
>
> Here is the background of this question.
> Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache
> Spark community has been building and releasing with Java 8 only.
> I believe that the user applications also use Java8+ in these days.
> Recently, I received the following message from the Hadoop PMC.
>
>   > "if you really want to claim hadoop 2.x compatibility, then you have to
>   > be building against java 7". Otherwise a lot of people with hadoop 2.x
>   > clusters won't be able to run your code. If your projects are java8+
>   > only, then they are implicitly hadoop 3.1+, no matter what you use
>   > in your build. Hence: no need for branch-2 branches except
>   > to complicate your build/test/release processes [1]
>
> If Hadoop2 binary distribution is no longer used as of today,
> or incomplete somewhere due to Java 8 building, the following three
> existing alternative Hadoop 3 binary distributions could be
> the better official solution for old Hadoop 2 clusters.
>
> 1) Scala 2.12 and without-hadoop distribution
> 2) Scala 2.12 and Hadoop 3 distribution
> 3) Scala 2.13 and Hadoop 3 distribution
>
> In short, is there anyone who is using Apache Spark 3.3.0 Hadoop2 Binary
> distribution?
>
> Dongjoon
>
> [1]
> https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247
>

Re: [VOTE] Release Spark 3.3.1 (RC2)

2022-09-28 Thread Sean Owen

+1 from me, same result as last RC.

On Wed, Sep 28, 2022 at 12:21 AM Yuming Wang  wrote:

> Please vote on releasing the following candidate as Apache Spark version 
> 3.3.1.
>
> The vote is open until 11:59pm Pacific time October 3th and passes if a 
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org
>
> The tag to be voted on is v3.3.1-rc2 (commit 
> 1d3b8f7cb15283a1e37ecada6d751e17f30647ce):
> https://github.com/apache/spark/tree/v3.3.1-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-bin
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1421
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-docs
>
> The list of bug fixes going into 3.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351710
>
> This release is using the release script of the tag v3.3.1-rc2.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.1?
> ===
> The current list of open tickets targeted at 3.3.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 3.3.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>
>

Re: Why are hash functions seeded with 42?

2022-09-26 Thread Sean Owen

Oh yeah I get why we love to pick 42 for random things. I'm guessing it was
a bit of an oversight here as the 'seed' is directly initial state and 0
makes much more sense.

On Mon, Sep 26, 2022, 7:24 PM Nicholas Gustafson 
wrote:

> I don’t know the reason, however would offer a hunch that perhaps it’s a
> nod to Douglas Adams (author of The Hitchhiker’s Guide to the Galaxy).
>
>
> https://news.mit.edu/2019/answer-life-universe-and-everything-sum-three-cubes-mathematics-0910
>
> On Sep 26, 2022, at 16:59, Sean Owen  wrote:
>
> 
> OK, it came to my attention today that hash functions in spark, like
> xxhash64, actually always seed with 42:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L655
>
> This is an issue if you want the hash of some value in Spark to match the
> hash you compute with xxhash64 somewhere else, and, AFAICT most any other
> impl will start with seed=0.
>
> I'm guessing there wasn't a *great* reason for this, just seemed like 42
> was a nice default seed. And we can't change it now without maybe subtly
> changing program behaviors. And, I am guessing it's messy to let the
> function now take a seed argument, esp. in SQL.
>
> So I'm left with, I guess we should doc that? I can do it if so.
> And just a cautionary tale I guess, for hash function users.
>
>

Why are hash functions seeded with 42?

2022-09-26 Thread Sean Owen

OK, it came to my attention today that hash functions in spark, like
xxhash64, actually always seed with 42:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L655

This is an issue if you want the hash of some value in Spark to match the
hash you compute with xxhash64 somewhere else, and, AFAICT most any other
impl will start with seed=0.

I'm guessing there wasn't a *great* reason for this, just seemed like 42
was a nice default seed. And we can't change it now without maybe subtly
changing program behaviors. And, I am guessing it's messy to let the
function now take a seed argument, esp. in SQL.

So I'm left with, I guess we should doc that? I can do it if so.
And just a cautionary tale I guess, for hash function users.

Re: [VOTE] Release Spark 3.3.1 (RC1)

2022-09-17 Thread Sean Owen

+1 LGTM. I tested Scala 2.13 + Java 11 on Ubuntu 22.04. I get the same
results as usual.

On Sat, Sep 17, 2022 at 2:42 AM Yuming Wang  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.3.1.
>
> The vote is open until 11:59pm Pacific time September 22th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org
>
> The tag to be voted on is v3.3.1-rc1 (commit
> ea1a426a889626f1ee1933e3befaa975a2f0a072):
> https://github.com/apache/spark/tree/v3.3.1-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc1-bin
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1418
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc1-docs
>
> The list of bug fixes going into 3.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351710
>
> This release is using the release script of the tag v3.3.1-rc1.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.1?
> ===
> The current list of open tickets targeted at 3.3.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>

Re: Time for Spark 3.3.1 release?

2022-09-14 Thread Sean Owen

Yeah we're not going to make convenience binaries for all possible
combinations. It's a pretty good assumption that anyone moving to later
Scala versions is also off old Hadoop versions.
You can of course build the combo you like.

On Wed, Sep 14, 2022 at 11:26 AM Denis Bolshakov 
wrote:

> Unfortunately it's for hadoop 3 only.
>
> ср, 14 сент. 2022 г., 19:04 Dongjoon Hyun :
>
>> Hi, Denis.
>>
>> Apache Spark community already provides both Scala 2.12 and 2.13
>> pre-built distributions.
>> Please check the distribution site and Apache Spark download page.
>>
>> https://dlcdn.apache.org/spark/spark-3.3.0/
>>
>> spark-3.3.0-bin-hadoop3-scala2.13.tgz
>> spark-3.3.0-bin-hadoop3.tgz
>>
>> [image: Screenshot 2022-09-14 at 9.03.27 AM.png]
>>
>> Dongjoon.
>>
>> On Wed, Sep 14, 2022 at 12:31 AM Denis Bolshakov <
>> bolshakov.de...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> It would be great if it's possible to provide a spark distro for both
>>> scala 2.12 and scala 2.13.
>>>
>>> It will encourage spark users to switch to scala 2.13.
>>>
>>> I know that spark jar artifacts available for both scala versions, but
>>> it does not make sense to migrate to scala 2.13 while there is no spark
>>> distro for this version.
>>>
>>> Kind regards,
>>> Denis
>>>
>>> On Tue, 13 Sept 2022 at 17:38, Yuming Wang  wrote:
>>>
 Thank you all.

 I will be preparing 3.3.1 RC1 soon.

 On Tue, Sep 13, 2022 at 12:09 PM John Zhuge  wrote:

> +1
>
> On Mon, Sep 12, 2022 at 9:08 PM Yang,Jie(INF) 
> wrote:
>
>> +1
>>
>>
>>
>> Thanks Yuming ~
>>
>>
>>
>> *发件人**: *Hyukjin Kwon 
>> *日期**: *2022年9月13日 星期二 08:19
>> *收件人**: *Gengliang Wang 
>> *抄送**: *"L. C. Hsieh" , Dongjoon Hyun <
>> dongjoon.h...@gmail.com>, Yuming Wang , dev <
>> dev@spark.apache.org>
>> *主题**: *Re: Time for Spark 3.3.1 release?
>>
>>
>>
>> +1
>>
>>
>>
>> On Tue, 13 Sept 2022 at 06:45, Gengliang Wang 
>> wrote:
>>
>> +1.
>>
>> Thank you, Yuming!
>>
>>
>>
>> On Mon, Sep 12, 2022 at 12:10 PM L. C. Hsieh 
>> wrote:
>>
>> +1
>>
>> Thanks Yuming!
>>
>> On Mon, Sep 12, 2022 at 11:50 AM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>> >
>> > +1
>> >
>> > Thanks,
>> > Dongjoon.
>> >
>> > On Mon, Sep 12, 2022 at 6:38 AM Yuming Wang 
>> wrote:
>> >>
>> >> Hi, All.
>> >>
>> >>
>> >>
>> >> Since Apache Spark 3.3.0 tag creation (Jun 10), new 138 patches
>> including 7 correctness patches arrived at branch-3.3.
>> >>
>> >>
>> >>
>> >> Shall we make a new release, Apache Spark 3.3.1, as the second
>> release at branch-3.3? I'd like to volunteer as the release manager for
>> Apache Spark 3.3.1.
>> >>
>> >>
>> >>
>> >> All changes:
>> >>
>> >> https://github.com/apache/spark/compare/v3.3.0...branch-3.3
>> 
>> >>
>> >>
>> >>
>> >> Correctness issues:
>> >>
>> >> SPARK-40149: Propagate metadata columns through Project
>> >>
>> >> SPARK-40002: Don't push down limit through window using ntile
>> >>
>> >> SPARK-39976: ArrayIntersect should handle null in left expression
>> correctly
>> >>
>> >> SPARK-39833: Disable Parquet column index in DSv1 to fix a
>> correctness issue in the case of overlapping partition and data columns
>> >>
>> >> SPARK-39061: Set nullable correctly for Inline output attributes
>> >>
>> >> SPARK-39887: RemoveRedundantAliases should keep aliases that make
>> the output of projection nodes unique
>> >>
>> >> SPARK-38614: Don't push down limit through window that's using
>> percent_rank
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
> John Zhuge
>

>>>
>>> --
>>> //with Best Regards
>>> --Denis Bolshakov
>>> e-mail: bolshakov.de...@gmail.com
>>>
>>

Re: Support for spark-packages.org

2022-09-13 Thread Sean Owen

I think that in practice it is unsupported at this point. I'd just release
your packages on Github / Maven Central / Pypi.

On Tue, Sep 13, 2022 at 3:36 AM Enrico Minack 
wrote:

> Hi devs,
>
> I understand that spark-packages.org is not associated with Apache and
> Apache Spark, but hosted by Databricks. Does anyone have any pointers on
> how to get support? The e-mail address feedb...@spark-packages.org does
> not respond.
>
> I found a few "missing features" that block me from registering my
> packages / releases.
>
> Thanks a lot,
> Enrico
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

Re: Java object serialization error, java.io.InvalidClassException: org.apache.spark.deploy.ApplicationDescription; local class incompatible

2022-08-25 Thread Sean Owen

This suggests you have mixed two versions of Spark libraries. You probably
packaged Spark itself in your Spark app?

On Thu, Aug 25, 2022 at 4:56 PM Elliot Metsger  wrote:

> Elliot Metsger 
> 9:48 AM (7 hours ago)
> to dev
> Howdy folks,
>
> Relative newbie to Spark, and super new to Beam.  (I've asked this
> question on Beam lists, but this seems like a Spark-related issue so I'm
> trying my query here, too).  I'm attempting to get a simple Beam pipeline
> (using the Go SDK) running on Spark. There seems to be an incompatibility
> between Java components related to object serializations which prevents a
> simple "hello world" pipeline from executing successfully.  I'm really
> looking for some direction on where to look, so if anyone has any pointers,
> it is appreciated!
>
> When I submit the job via the go sdk, it errors out on the Spark side with:
> [8:59 AM] 22/08/25 12:45:59 ERROR TransportRequestHandler: Error while
> invoking RpcHandler#receive() for one-way message.
> java.io.InvalidClassException:
> org.apache.spark.deploy.ApplicationDescription; local class incompatible:
> stream classdesc serialVersionUID = 6543101073799644159, local class
> serialVersionUID = 1574364215946805297
> I’m using apache/beam_spark_job_server:2.41.0 and apache/spark:latest.
>  (docker-compose[0], hello world wordcount example pipeline[1]).
>
> It appears that the org.apache.spark.deploy.ApplicationDescription object
> (or something in its graph) doesn't explicitly assign a serialVersionUID.
>
> This simple repo[2] should demonstrate the issue.  Any pointers would be
> appreciated!
>
> [0]: https://github.com/emetsger/beam-test/blob/develop/docker-compose.yml
> [1]:
> https://github.com/emetsger/beam-test/blob/develop/debugging_wordcount.go
> [2]: https://github.com/emetsger/beam-test
>

Re: Update Spark 3.4 Release Window?

2022-07-20 Thread Sean Owen

I don't know any better than others when it will actually happen, though
historically, it's more like 7-8 months between minor releases. I might
therefore expect a release more like February 2023, and work backwards from
there. Doesn't really matter, this is just a public guess and can be
changed.

On Wed, Jul 20, 2022 at 3:27 PM Xinrong Meng 
wrote:

> Hi All,
>
> Since Spark 3.3.0 was released on June 16, 2022, shall we update the
> release window https://spark.apache.org/versioning-policy.html for Spark
> 3.4?
>
> A proposal is as follows:
>
> | October 15th 2022 | Code freeze. Release branch cut.
> | Late October 2022 | QA period. Focus on bug fixes, tests, stability and
> docs. Generally, no new features merged.
> | November 2022 | Release candidates (RC), voting, etc. until final
> release passes
>
> Thanks!
>
> Xinrong Meng
>
>

CVE-2022-33891: Apache Spark shell command injection vulnerability via Spark UI

2022-07-17 Thread Sean Owen

Severity: important

Description:

The Apache Spark UI offers the possibility to enable ACLs via the
configuration option spark.acls.enable. With an authentication filter, this
checks whether a user has access permissions to view or modify the
application. If ACLs are enabled, a code path in HttpSecurityFilter can
allow someone to perform impersonation by providing an arbitrary user name.
A malicious user might then be able to reach a permission check function
that will ultimately build a Unix shell command based on their input, and
execute it. This will result in arbitrary shell command execution as the
user Spark is currently running as. This affects Apache Spark versions
3.0.3 and earlier, versions 3.1.1 to 3.1.2, and versions 3.2.0 to 3.2.1.

This issue is being tracked as SPARK-38992

Mitigation:

Upgrade to supported Apache Spark maintenance release 3.1.3, 3.2.2, or
3.3.0 or later

Credit:

 Kostya Kortchinsky (Databricks)

Re: [VOTE] Release Spark 3.2.2 (RC1)

2022-07-11 Thread Sean Owen

Is anyone seeing this error? I'm on OpenJDK 8 on a Mac:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000101ca8ace, pid=11962, tid=0x1603
#
# JRE version: OpenJDK Runtime Environment (8.0_322) (build
1.8.0_322-bre_2022_02_28_15_01-b00)
# Java VM: OpenJDK 64-Bit Server VM (25.322-b00 mixed mode bsd-amd64
compressed oops)
# Problematic frame:
# V  [libjvm.dylib+0x549ace]
#
# Failed to write core dump. Core dumps have been disabled. To enable core
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /private/tmp/spark-3.2.2/sql/core/hs_err_pid11962.log
ColumnVectorSuite:
- boolean
- byte
Compiled method (nm)  885897 75403 n 0   sun.misc.Unsafe::putShort
(native)
 total in heap  [0x000102fdaa10,0x000102fdad48] = 824
 relocation [0x000102fdab38,0x000102fdab78] = 64
 main code  [0x000102fdab80,0x000102fdad48] = 456
Compiled method (nm)  885897 75403 n 0   sun.misc.Unsafe::putShort
(native)
 total in heap  [0x000102fdaa10,0x000102fdad48] = 824
 relocation [0x000102fdab38,0x000102fdab78] = 64
 main code  [0x000102fdab80,0x000102fdad48] = 456

On Mon, Jul 11, 2022 at 4:58 PM Dongjoon Hyun 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.2.2.
>
> The vote is open until July 15th 1AM (PST) and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.2.2-rc1 (commit
> 78a5825fe266c0884d2dd18cbca9625fa258d7f7):
> https://github.com/apache/spark/tree/v3.2.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1409/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.2-rc1-docs/
>
> The list of bug fixes going into 3.2.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351232
>
> This release is using the release script of the tag v3.2.2-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.2?
> ===
>
> The current list of open tickets targeted at 3.2.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> Dongjoon
>

Re: Javascript Based UDFs

2022-06-27 Thread Sean Owen

I don't know how these frameworks work, but I'd hope that it takes
JavaScript and gives you some invokeable object. It's just Java executing
like anything else, nothing special, no special consideration for being in
a task? Why would it need another JVM?

On Mon, Jun 27, 2022, 11:18 PM Matt Hawes  wrote:

> Thanks for the reply! I had originally thought that this would incur a
> cost of spinning up a VM every time the UDF is called but thinking about it
> again you might be right. I guess if I make the VM accessible via a
> transient property on the UDF class then it would only be initialized once
> per executor right? Or would it be once per task?
>
> I also was worried that this would mean you end up paying a lot in SerDe
> cost if you send each row over to the VM one by one?
>
> On Mon, Jun 27, 2022 at 10:02 PM Sean Owen  wrote:
>
>> Rather than reimplement a new UDF, why not indeed just use an embedded
>> interpreter? if something can turn javascript into something executable you
>> can wrap that in a normal Java/Scala UDF and go.
>>
>> On Mon, Jun 27, 2022 at 10:42 PM Matt Hawes  wrote:
>>
>>> Hi all, I'm thinking about trying to implement the ability to write
>>> spark UDFs using javascript.
>>>
>>> For the use case I have in mind, a lot of the code is already written in
>>> javascript and so it would be very convenient to be able to call this
>>> directly from spark.
>>>
>>> I wanted to post here first before I start digging into the UDF code to
>>> see if anyone has attempted this already or if people have thoughts on it.
>>> I couldn't find anything in the Jira. I'd be especially appreciative of any
>>> pointers towards relevant sections of the code to get started!
>>>
>>> My rough plan is to do something similar to how python UDFs work (as I
>>> understand them). I.e. call out to a javascript process, potentially just
>>> something in GraalJs for example: https://github.com/oracle/graaljs.
>>>
>>> I understand that there's probably a long discussion to be had here with
>>> regards to making this part of Spark core, but I wanted to start that
>>> discussion. :)
>>>
>>> Best,
>>> Matt
>>>
>>>

Re: Javascript Based UDFs

2022-06-27 Thread Sean Owen

Rather than reimplement a new UDF, why not indeed just use an embedded
interpreter? if something can turn javascript into something executable you
can wrap that in a normal Java/Scala UDF and go.

On Mon, Jun 27, 2022 at 10:42 PM Matt Hawes  wrote:

> Hi all, I'm thinking about trying to implement the ability to write spark
> UDFs using javascript.
>
> For the use case I have in mind, a lot of the code is already written in
> javascript and so it would be very convenient to be able to call this
> directly from spark.
>
> I wanted to post here first before I start digging into the UDF code to
> see if anyone has attempted this already or if people have thoughts on it.
> I couldn't find anything in the Jira. I'd be especially appreciative of any
> pointers towards relevant sections of the code to get started!
>
> My rough plan is to do something similar to how python UDFs work (as I
> understand them). I.e. call out to a javascript process, potentially just
> something in GraalJs for example: https://github.com/oracle/graaljs.
>
> I understand that there's probably a long discussion to be had here with
> regards to making this part of Spark core, but I wanted to start that
> discussion. :)
>
> Best,
> Matt
>
>

Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread Sean Owen

+1 still looks good, same as last results.

On Thu, Jun 9, 2022 at 11:27 PM Maxim Gekk
 wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time June 14th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc6 (commit
> f74867bddfbcdd4d08076db36851e88b15e66556):
> https://github.com/apache/spark/tree/v3.3.0-rc6
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1407
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc6.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>

Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-04 Thread Sean Owen

+1 looks good now on Scala 2.13

On Sat, Jun 4, 2022 at 9:51 AM Maxim Gekk 
wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time June 8th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc5 (commit
> 7cf29705272ab8e8c70e8885a3664ad8ae3cd5e9):
> https://github.com/apache/spark/tree/v3.3.0-rc5
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1406
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc5.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>

Re: [VOTE] Release Spark 3.3.0 (RC4)

2022-06-03 Thread Sean Owen

Ah yeah, I think it's this change from 15 hrs ago. That needs to be .toSeq:

https://github.com/apache/spark/commit/4a0f0ff6c22b85cb0fc1eef842da8dbe4c90543a#diff-01813c3e2e933ed573e4a93750107f004a86e587330cba5e91b5052fa6ade2a5R146

On Fri, Jun 3, 2022 at 4:13 PM Sean Owen  wrote:

> In Scala 2.13, I'm getting errors like this:
>
>  analyzer should replace current_timestamp with literals *** FAILED ***
>   java.lang.ClassCastException: class scala.collection.mutable.ArrayBuffer
> cannot be cast to class scala.collection.immutable.Seq
> (scala.collection.mutable.ArrayBuffer and scala.collection.immutable.Seq
> are in unnamed module of loader 'app')
>   at
> org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite.literals(ComputeCurrentTimeSuite.scala:146)
> ...
> - analyzer should replace current_date with literals *** FAILED ***
>   java.lang.ClassCastException: class scala.collection.mutable.ArrayBuffer
> cannot be cast to class scala.collection.immutable.Seq
> (scala.collection.mutable.ArrayBuffer and scala.collection.immutable.Seq
> are in unnamed module of loader 'app')
> ...
>
> I haven't investigated yet, just flagging in case anyone knows more about
> it immediately.
>
>
> On Fri, Jun 3, 2022 at 9:54 AM Maxim Gekk
>  wrote:
>
>> Please vote on releasing the following candidate as
>> Apache Spark version 3.3.0.
>>
>> The vote is open until 11:59pm Pacific time June 7th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.3.0-rc4 (commit
>> 4e3599bc11a1cb0ea9fc819e7f752d2228e54baf):
>> https://github.com/apache/spark/tree/v3.3.0-rc4
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc4-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1405
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc4-docs/
>>
>> The list of bug fixes going into 3.3.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>
>> This release is using the release script of the tag v3.3.0-rc4.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.3.0?
>> ===
>> The current list of open tickets targeted at 3.3.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.3.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>

Re: [VOTE] Release Spark 3.3.0 (RC4)

2022-06-03 Thread Sean Owen

In Scala 2.13, I'm getting errors like this:

 analyzer should replace current_timestamp with literals *** FAILED ***
  java.lang.ClassCastException: class scala.collection.mutable.ArrayBuffer
cannot be cast to class scala.collection.immutable.Seq
(scala.collection.mutable.ArrayBuffer and scala.collection.immutable.Seq
are in unnamed module of loader 'app')
  at
org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite.literals(ComputeCurrentTimeSuite.scala:146)
...
- analyzer should replace current_date with literals *** FAILED ***
  java.lang.ClassCastException: class scala.collection.mutable.ArrayBuffer
cannot be cast to class scala.collection.immutable.Seq
(scala.collection.mutable.ArrayBuffer and scala.collection.immutable.Seq
are in unnamed module of loader 'app')
...

I haven't investigated yet, just flagging in case anyone knows more about
it immediately.


On Fri, Jun 3, 2022 at 9:54 AM Maxim Gekk 
wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time June 7th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc4 (commit
> 4e3599bc11a1cb0ea9fc819e7f752d2228e54baf):
> https://github.com/apache/spark/tree/v3.3.0-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1405
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc4-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc4.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>

Re: [VOTE] Release Spark 3.3.0 (RC3)

2022-05-25 Thread Sean Owen

+1 works for me as usual, with Java 8 + Scala 2.12, Java 11 + Scala 2.13.

On Tue, May 24, 2022 at 12:14 PM Maxim Gekk
 wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time May 27th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc3 (commit
> a7259279d07b302a51456adb13dc1e41a6fd06ed):
> https://github.com/apache/spark/tree/v3.3.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1404
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc3-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc3.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>

Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-16 Thread Sean Owen

I'm still seeing failures related to the function registry, like:

ExpressionsSchemaSuite:
- Check schemas for expression examples *** FAILED ***
  396 did not equal 398 Expected 396 blocks in result file but got 398. Try
regenerating the result files. (ExpressionsSchemaSuite.scala:161)

- SPARK-14415: All functions should have own descriptions *** FAILED ***
  "Function: bloom_filter_aggClass:
org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage:
N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in
the result) (QueryTest.scala:54)

There seems to be consistently a difference of 2 in the list of expected
functions and actual. I haven't looked closely, don't know this code. I'm
on Ubuntu 22.04. Anyone else seeing something like this? Wondering if it's
something weird to do with case sensitivity, hidden files lurking
somewhere, etc.

I suspect it's not a 'real' error as the Linux-based testers work fine, but
I also can't think of why this is failing.



On Mon, May 16, 2022 at 7:44 AM Maxim Gekk
 wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time May 19th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc2 (commit
> c8c657b922ac8fd8dcf9553113e11a80079db059):
> https://github.com/apache/spark/tree/v3.3.0-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1403
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc2.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>

Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-10 Thread Sean Owen

There's a -1 vote here, so I think this RC fails anyway.

On Fri, May 6, 2022 at 10:30 AM Gengliang Wang  wrote:

> Hi Maxim,
>
> Thanks for the work!
> There is a bug fix from Bruce merged on branch-3.3 right after the RC1 is
> cut:
> SPARK-39093: Dividing interval by integral can result in codegen
> compilation error
> <https://github.com/apache/spark/commit/fd998c8a6783c0c8aceed8dcde4017cd479e42c8>
>
> So -1 from me. We should have RC2 to include the fix.
>
> Thanks
> Gengliang
>
> On Fri, May 6, 2022 at 6:15 PM Maxim Gekk
>  wrote:
>
>> Hi Dongjoon,
>>
>>  > https://issues.apache.org/jira/projects/SPARK/versions/12350369
>> > Since RC1 is started, could you move them out from the 3.3.0 milestone?
>>
>> I have removed the 3.3.0 label from Fix version(s). Thank you, Dongjoon.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>>
>> On Fri, May 6, 2022 at 11:06 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, Sean.
>>> It's interesting. I didn't see those failures from my side.
>>>
>>> Hi, Maxim.
>>> In the following link, there are 17 in-progress and 6 to-do JIRA issues
>>> which look irrelevant to this RC1 vote.
>>>
>>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>
>>> Since RC1 is started, could you move them out from the 3.3.0 milestone?
>>> Otherwise, we cannot distinguish new real blocker issues from those
>>> obsolete JIRA issues.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Thu, May 5, 2022 at 11:46 AM Adam Binford  wrote:
>>>
>>>> I looked back at the first one (SPARK-37618), it expects/assumes a 0022
>>>> umask to correctly test the behavior. I'm not sure how to get that to not
>>>> fail or be ignored with a more open umask.
>>>>
>>>> On Thu, May 5, 2022 at 1:56 PM Sean Owen  wrote:
>>>>
>>>>> I'm seeing test failures; is anyone seeing ones like this? This is
>>>>> Java 8 / Scala 2.12 / Ubuntu 22.04:
>>>>>
>>>>> - SPARK-37618: Sub dirs are group writable when removing from shuffle
>>>>> service enabled *** FAILED ***
>>>>>   [OWNER_WRITE, GROUP_READ, GROUP_WRITE, GROUP_EXECUTE, OTHERS_READ,
>>>>> OWNER_READ, OTHERS_EXECUTE, OWNER_EXECUTE] contained GROUP_WRITE
>>>>> (DiskBlockManagerSuite.scala:155)
>>>>>
>>>>> - Check schemas for expression examples *** FAILED ***
>>>>>   396 did not equal 398 Expected 396 blocks in result file but got
>>>>> 398. Try regenerating the result files. (ExpressionsSchemaSuite.scala:161)
>>>>>
>>>>>  Function 'bloom_filter_agg', Expression class
>>>>> 'org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregate'
>>>>> "" did not start with "
>>>>>   Examples:
>>>>>   " (ExpressionInfoSuite.scala:142)
>>>>>
>>>>> On Thu, May 5, 2022 at 6:01 AM Maxim Gekk
>>>>>  wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>  version 3.3.0.
>>>>>>
>>>>>> The vote is open until 11:59pm Pacific time May 10th and passes if a
>>>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 3.3.0
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>
>>>>>> The tag to be voted on is v3.3.0-rc1 (commit
>>>>>> 482b7d54b522c4d1e25f3e84eabbc78126f22a3d):
>>>>>> https://github.com/apache/spark/tree/v3.3.0-rc1
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-bin/
>>>>>>
>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1402

Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-05 Thread Sean Owen

I'm seeing test failures; is anyone seeing ones like this? This is Java 8 /
Scala 2.12 / Ubuntu 22.04:

- SPARK-37618: Sub dirs are group writable when removing from shuffle
service enabled *** FAILED ***
  [OWNER_WRITE, GROUP_READ, GROUP_WRITE, GROUP_EXECUTE, OTHERS_READ,
OWNER_READ, OTHERS_EXECUTE, OWNER_EXECUTE] contained GROUP_WRITE
(DiskBlockManagerSuite.scala:155)

- Check schemas for expression examples *** FAILED ***
  396 did not equal 398 Expected 396 blocks in result file but got 398. Try
regenerating the result files. (ExpressionsSchemaSuite.scala:161)

 Function 'bloom_filter_agg', Expression class
'org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregate'
"" did not start with "
  Examples:
  " (ExpressionInfoSuite.scala:142)

On Thu, May 5, 2022 at 6:01 AM Maxim Gekk 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.3.0.
>
> The vote is open until 11:59pm Pacific time May 10th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc1 (commit
> 482b7d54b522c4d1e25f3e84eabbc78126f22a3d):
> https://github.com/apache/spark/tree/v3.3.0-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1402
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc1.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>

Re: CVE-2020-13936

2022-05-05 Thread Sean Owen

This is a Velocity issue. Spark doesn't use it, although it looks like Avro
does. From reading the CVE, I do not believe it would impact Avro's usage -
velocity templates it may use for codegen aren't exposed that I know of. Is
there a known relationship to Spark here? That is the key question in
security questions like this.

In any event, to pursue an update, it would likely have to start by
updating Avro if it hasn't already, and if it has, pursue upgrading Avro in
Spark -- if the supported Hadoop versions work with it.

On Thu, May 5, 2022 at 12:32 PM Pralabh Kumar 
wrote:

> Hi Dev Team
>
> Please let me know if  there is a jira to track this CVE changes with
> respect to Spark  . Searched jira but couldn't find anything.
>
> Please help
>
> Regards
> Pralabh Kumar
>

Re: CVE-2021-22569

2022-05-04 Thread Sean Owen

Sure, did you search the JIRA?
https://issues.apache.org/jira/browse/SPARK-38340

Does this affect Spark's usage of protobuf?

Looks like it can't be updated to 3.x -- this is really not a dependency of
Spark but underlying dependencies.
Feel free to re-attempt a change that might work, at least with Hadoop 3 if
possible.

On Wed, May 4, 2022 at 10:46 AM Pralabh Kumar 
wrote:

> Hi Dev Team
>
> Spark is using protobuf 2.5.0 which is vulnerable to CVE-2021-22569. CVE
> recommends to use protobuf 3.19.2
>
> Please let me know , if there is a jira to track the update w.r.t CVE and
> Spark or should I create the one ?
>
> Regards
> Pralabh Kumar
>

Re: CVE -2020-28458, How to upgrade datatables dependency

2022-04-16 Thread Sean Owen

FWIW here's an update to 1.10.25: https://github.com/apache/spark/pull/36226


On Wed, Apr 13, 2022 at 8:28 AM Sean Owen  wrote:

> You can see the files in
> core/src/main/resources/org/apache/spark/ui/static - you can try dropping
> in the new minified versions and see if the UI is OK.
> You can open a pull request if it works to update it, in case this affects
> Spark.
> It looks like the smaller upgrade to 1.10.22 is also sufficient.
>
> On Wed, Apr 13, 2022 at 7:43 AM Pralabh Kumar 
> wrote:
>
>> Hi Dev Team
>>
>> Spark 3.2 (and 3.3 might also) have CVE 2020-28458.  Therefore  in my
>> local repo of Spark I would like to update DataTables to 1.11.5.
>>
>> Can you please help me to point out where I should upgrade DataTables
>> dependency ?.
>>
>> Regards
>> Pralabh Kumar
>>
>

Re: CVE-2021-38296: Apache Spark Key Negotiation Vulnerability - 2.4 Backport?

2022-04-14 Thread Sean Owen

It does affect 2.4.x, yes. 2.4.x was EOL a while ago, so there wouldn't be
a new release of 2.4.x in any event. It's recommended to update instead, at
least to 3.1.3.

On Thu, Apr 14, 2022 at 12:07 PM Chris Nauroth  wrote:

> A fix for CVE-2021-38296 was committed and released in Apache Spark 3.1.3.
> I'm curious, is the issue relevant to the 2.4 version line, and if so, are
> there any plans for a backport?
>
> https://lists.apache.org/thread/70x8fw2gx3g9ty7yk0f2f1dlpqml2smd
>
> Chris Nauroth
>

Re: CVE -2020-28458, How to upgrade datatables dependency

2022-04-13 Thread Sean Owen

You can see the files in core/src/main/resources/org/apache/spark/ui/static
- you can try dropping in the new minified versions and see if the UI is
OK.
You can open a pull request if it works to update it, in case this affects
Spark.
It looks like the smaller upgrade to 1.10.22 is also sufficient.

On Wed, Apr 13, 2022 at 7:43 AM Pralabh Kumar 
wrote:

> Hi Dev Team
>
> Spark 3.2 (and 3.3 might also) have CVE 2020-28458.  Therefore  in my
> local repo of Spark I would like to update DataTables to 1.11.5.
>
> Can you please help me to point out where I should upgrade DataTables
> dependency ?.
>
> Regards
> Pralabh Kumar
>

Re: Spark 3.0.1 and spark 3.2 compatibility

2022-04-07 Thread Sean Owen

(Don't cross post please)
Generally you definitely want to compile and test vs what you're running on.
There shouldn't be many binary or source incompatibilities -- these are
avoided in a major release where possible. So it may need no code change.
But I would certainly recompile just on principle!

On Thu, Apr 7, 2022 at 12:28 PM Pralabh Kumar 
wrote:

> Hi spark community
>
> I have quick question .I am planning to migrate from spark 3.0.1 to spark
> 3.2.
>
> Do I need to recompile my application with 3.2 dependencies or application
> compiled with 3.0.1 will work fine on 3.2 ?
>
>
> Regards
> Pralabh kumar
>
>

Re: Deluge of GitBox emails

2022-04-04 Thread Sean Owen

https://issues.apache.org/jira/browse/INFRA-23082 for those following.

On Mon, Apr 4, 2022 at 9:32 AM Nicholas Chammas 
wrote:

> I’m not familiar with GitBox, but it must be an independent thing. When
> you participate in a PR, GitHub emails you notifications directly.
>
> The GitBox emails, on the other hand, are going to the dev list. They seem
> like something setup as a repo-wide setting, or perhaps as an Apache bot
> that monitors repo activity and converts it into emails. (I’ve seen other
> projects -- I think Hadoop -- where GitHub activity is converted into
> comments on Jira.
>
> Turning off these GitBox emails should not have in impact on the usual
> GitHub emails we are all already familiar with.
>
>
> On Apr 4, 2022, at 9:47 AM, Sean Owen  wrote:
>
> I think this must be related to the Gitbox migration that just happened.
> It does seem like I'm getting more emails - some are on PRs I'm attached
> to, but some I don't recognize. The thing is, I'm not yet clear if they
> duplicate the normal Github emails - that is if we turn them off do we have
> anything?
>
> On Mon, Apr 4, 2022 at 8:44 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I assume I’m not the only one getting these new emails from GitBox. Is
>> there a story behind that that I missed?
>>
>> I’d rather not get these emails on the dev list. I assume most of the
>> list would agree with me.
>>
>> GitHub has a good set of options for following activity on the repo.
>> People who want to follow conversations can easily do that without
>> involving the whole dev list.
>>
>> Do we know who is responsible for these GitBox emails? Perhaps we need to
>> file an Apache INFRA ticket?
>>
>> Nick
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>

Re: Deluge of GitBox emails

2022-04-04 Thread Sean Owen

I think this must be related to the Gitbox migration that just happened. It
does seem like I'm getting more emails - some are on PRs I'm attached to,
but some I don't recognize. The thing is, I'm not yet clear if they
duplicate the normal Github emails - that is if we turn them off do we have
anything?

On Mon, Apr 4, 2022 at 8:44 AM Nicholas Chammas 
wrote:

> I assume I’m not the only one getting these new emails from GitBox. Is
> there a story behind that that I missed?
>
> I’d rather not get these emails on the dev list. I assume most of the list
> would agree with me.
>
> GitHub has a good set of options for following activity on the repo.
> People who want to follow conversations can easily do that without
> involving the whole dev list.
>
> Do we know who is responsible for these GitBox emails? Perhaps we need to
> file an Apache INFRA ticket?
>
> Nick
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

Re: Tools for regression testing

2022-03-24 Thread Sean Owen

Hm, then what are you looking for besides all the tests in Spark?

On Thu, Mar 24, 2022, 2:34 PM Mich Talebzadeh 
wrote:

> Thanks
>
> I know what unit testing is. The question was not about unit testing. it
> was specific to regression testing
> 
>  artifacts .
>
>
> cheers,
>
>
> Mich
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 24 Mar 2022 at 19:02, Bjørn Jørgensen 
> wrote:
>
>> Yes, Spark uses unit tests.
>>
>> https://app.codecov.io/gh/apache/spark
>>
>> https://en.wikipedia.org/wiki/Unit_testing
>>
>>
>>
>> man. 21. mar. 2022 kl. 15:46 skrev Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> Hi,
>>>
>>> As a matter of interest do Spark releases deploy a specific regression
>>> testing tool?
>>>
>>> Thanks
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-23 Thread Sean Owen

Well, yes, but if it requires a Kafka server-side update, it does, and that
is out of scope for us to document.
It is important that we document if and how (if we know) the client update
will impact existing Kafka installations (does it require a server-side
update or not?), and document the change itself for sure along with any
Spark-side migration notes.

On Fri, Mar 18, 2022 at 8:47 PM Jungtaek Lim 
wrote:

> The thing is, it is “us” who upgrades Kafka client and makes possible
> divergence between client and broker in end users’ production env.
>
> Someone can claim that end users can downgrade the kafka-client artifact
> when building their app so that the version can be matched, but we don’t
> test anything against downgrading kafka-client version for kafka connector.
> That sounds to me we defer our work to end users.
>
> It sounds to me “someone” should refer to us, and then it is no longer a
> matter of “help”. It is a matter of “responsibility”, as you said.
>
> 2022년 3월 18일 (금) 오후 10:15, Sean Owen 님이 작성:
>
>> I think we can assume that someone upgrading Kafka will be responsible
>> for thinking through the breaking changes. We can help by listing anything
>> we know could affect Spark-Kafka usage and calling those out in a release
>> note, for sure. I don't think we need to get into items that would affect
>> Kafka usage itself; focus on the connector-related issues.
>>
>> On Fri, Mar 18, 2022 at 5:15 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> CORRECTION: in option 2, we enumerate KIPs which may bring
>>> incompatibility with older brokers (not all KIPs).
>>>
>>> On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi dev,
>>>>
>>>> I would like to initiate the discussion about how to deal with the
>>>> migration guide on upgrading Kafka to 3.1 (from 2.8.1) in upcoming Spark
>>>> 3.3.
>>>>
>>>> We didn't care much about the upgrade of Kafka dependency since our
>>>> belief on Kafka client has been that the new Kafka client version should
>>>> have no compatibility issues with older brokers. Based on semantic
>>>> versioning, upgrading major versions rings an alarm for me.
>>>>
>>>> I haven't gone through changes that happened between versions, but
>>>> found one KIP (KIP-679
>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-679%3A+Producer+will+enable+the+strongest+delivery+guarantee+by+default>)
>>>> which may not work with older brokers with specific setup. (It's described
>>>> in the "Compatibility, Deprecation, and Migration Plan" section of the 
>>>> KIP).
>>>>
>>>> This may not be problematic for the users who upgrade both client and
>>>> broker altogether, but end users of Spark may be unlikely the case.
>>>> Computation engines are relatively easier to upgrade. Storage systems
>>>> aren't. End users would think the components are independent.
>>>>
>>>> I looked through the notable changes in the Kafka doc, and it does
>>>> mention this KIP, but it just says the default config has changed and
>>>> doesn't mention about the impacts. There is a link to
>>>> KIP, that said, everyone needs to read through the KIP wiki page for
>>>> details.
>>>>
>>>> Based on the context, what would be the best way to notice end users
>>>> for the major version upgrade of Kafka? I can imagine several options
>>>> including...
>>>>
>>>> 1. Explicitly mention that Spark 3.3 upgrades Kafka to 3.1 with linking
>>>> the noticeable changes in the Kafka doc in the migration guide.
>>>> 2. Do 1 & spend more effort to read through all KIPs and check
>>>> "Compatibility, Deprecation, and Migration Plan" section, and enumerate all
>>>> KIPs (or even summarize) in the migration guide.
>>>> 3. Do 2 & actively override the default configs to be compatible with
>>>> older versions if the change of the default configs in Kafka 3.0 is
>>>> backward incompatible. End users should set these configs explicitly to
>>>> override them back.
>>>> 4. Do not care. End users can indicate the upgrade in the release note,
>>>> and we expect end users to actively check the notable changes (& KIPs) from
>>>> Kafka doc.
>>>> 5. Options not described above...
>>>>
>>>> Please take a look and provide your voice on this.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> ps. Probably this would be applied to all non-bugfix versions of
>>>> dependency upgrades. We may still want to be pragmatic, e.g. pass-through
>>>> for minor versions, though.
>>>>
>>>

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Sean Owen

I think we can assume that someone upgrading Kafka will be responsible for
thinking through the breaking changes. We can help by listing anything we
know could affect Spark-Kafka usage and calling those out in a release
note, for sure. I don't think we need to get into items that would affect
Kafka usage itself; focus on the connector-related issues.

On Fri, Mar 18, 2022 at 5:15 AM Jungtaek Lim 
wrote:

> CORRECTION: in option 2, we enumerate KIPs which may bring incompatibility
> with older brokers (not all KIPs).
>
> On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim 
> wrote:
>
>> Hi dev,
>>
>> I would like to initiate the discussion about how to deal with the
>> migration guide on upgrading Kafka to 3.1 (from 2.8.1) in upcoming Spark
>> 3.3.
>>
>> We didn't care much about the upgrade of Kafka dependency since our
>> belief on Kafka client has been that the new Kafka client version should
>> have no compatibility issues with older brokers. Based on semantic
>> versioning, upgrading major versions rings an alarm for me.
>>
>> I haven't gone through changes that happened between versions, but found
>> one KIP (KIP-679
>> )
>> which may not work with older brokers with specific setup. (It's described
>> in the "Compatibility, Deprecation, and Migration Plan" section of the KIP).
>>
>> This may not be problematic for the users who upgrade both client and
>> broker altogether, but end users of Spark may be unlikely the case.
>> Computation engines are relatively easier to upgrade. Storage systems
>> aren't. End users would think the components are independent.
>>
>> I looked through the notable changes in the Kafka doc, and it does
>> mention this KIP, but it just says the default config has changed and
>> doesn't mention about the impacts. There is a link to
>> KIP, that said, everyone needs to read through the KIP wiki page for
>> details.
>>
>> Based on the context, what would be the best way to notice end users for
>> the major version upgrade of Kafka? I can imagine several options
>> including...
>>
>> 1. Explicitly mention that Spark 3.3 upgrades Kafka to 3.1 with linking
>> the noticeable changes in the Kafka doc in the migration guide.
>> 2. Do 1 & spend more effort to read through all KIPs and check
>> "Compatibility, Deprecation, and Migration Plan" section, and enumerate all
>> KIPs (or even summarize) in the migration guide.
>> 3. Do 2 & actively override the default configs to be compatible with
>> older versions if the change of the default configs in Kafka 3.0 is
>> backward incompatible. End users should set these configs explicitly to
>> override them back.
>> 4. Do not care. End users can indicate the upgrade in the release note,
>> and we expect end users to actively check the notable changes (& KIPs) from
>> Kafka doc.
>> 5. Options not described above...
>>
>> Please take a look and provide your voice on this.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> ps. Probably this would be applied to all non-bugfix versions of
>> dependency upgrades. We may still want to be pragmatic, e.g. pass-through
>> for minor versions, though.
>>
>

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1013 matches

Mail list logo