[ANNOUNCE] Apache Spark 3.4.3 released

2024-04-18 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.3!

Spark 3.4.3 is a maintenance release containing many fixes, including
fixes in the security and correctness domains. This release is based on the
branch-3.4 maintenance branch of Spark. We strongly
recommend all 3.4 users to upgrade to this stable release.

To download Spark 3.4.3, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-4-3.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


Re: [DISCUSS] MySQL version support policy

2024-03-25 Thread Dongjoon Hyun
Hi, Cheng.

Thank you for the suggestion. Your suggestion seems to have at least two
themes.

A. Adding a new Apache Spark community policy (contract) to guarantee
support for MySQL LTS versions.
B. Dropping support for non-LTS MySQL versions (8.3/8.2/8.1).

And it raises three questions for me.

1. For (A), do you mean that MySQL LTS versions are not properly supported
by Apache Spark releases because the test suite is inadequate?
2. For (B), why does Apache Spark need to drop non-LTS MySQL support?
3. What about MariaDB? Do we need to stick to some versions?

To be clear, if needed, we can easily add daily GitHub Actions CI jobs, like
the Python CI (Python 3.8/3.10/3.11/3.12).

- https://github.com/apache/spark/blob/master/.github/workflows/build_python.yml

Thanks,
Dongjoon.


On Sun, Mar 24, 2024 at 10:29 PM Cheng Pan  wrote:

> Hi, Spark community,
>
> I noticed that the Spark JDBC connector MySQL dialect is now testing against
> 8.3.0 [1], a non-LTS version.
>
> MySQL recently changed its version policy [2], which is now very similar to
> the Java version policy. In short, 5.5, 5.6, 5.7, and 8.0 are LTS versions;
> 8.1, 8.2, and 8.3 are non-LTS; and the next LTS version is 8.4.
>
> I would say that MySQL is one of the most important pieces of infrastructure
> today. I checked the AWS RDS MySQL [4] and Azure Database for MySQL [5]
> version support policies, and both only support 5.7 and 8.0.
>
> Also, Spark officially only supports LTS Java versions, like JDK 17 and
> 21, but not 22. I would recommend using MySQL 8.0 for testing until the
> next MySQL LTS version (8.4) is available.
>
> Additional discussion can be found at [3]
>
> [1] https://issues.apache.org/jira/browse/SPARK-47453
> [2]
> https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
> [3] https://github.com/apache/spark/pull/45581
> [4] https://aws.amazon.com/rds/mysql/
> [5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy
>
> Thanks,
> Cheng Pan
>
>
>
>
>
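
For illustration, a minimal PySpark sketch of the kind of JDBC read that the
MySQL dialect discussed above exercises. The host, database, table, and
credentials are hypothetical, and the MySQL Connector/J driver jar is assumed
to be on the classpath (e.g., via --jars):

    from pyspark.sql import SparkSession

    # Minimal sketch: read a MySQL table through Spark's generic JDBC source.
    # Host, database, table, and credentials are hypothetical; the MySQL
    # Connector/J driver jar is assumed to be on the classpath (e.g. --jars).
    spark = SparkSession.builder.appName("mysql-jdbc-read-sketch").getOrCreate()

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://mysql-host:3306/testdb")
        .option("dbtable", "orders")
        .option("user", "spark_user")
        .option("password", "secret")
        .option("driver", "com.mysql.cj.jdbc.Driver")
        .load()
    )

    df.printSchema()
    print(df.count())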


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Dongjoon Hyun
BTW, Jungtaek.

The PySpark documentation seems to show the wrong branch. At the moment, it
says `master`.

https://spark.apache.org/docs/3.5.1/api/python/index.html

PySpark Overview
<https://spark.apache.org/docs/3.5.1/api/python/index.html#pyspark-overview>

   Date: Feb 24, 2024 Version: master

[image: Screenshot 2024-02-29 at 21.12.24.png]


Could you do the follow-up, please?

Thank you in advance.

Dongjoon.


On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:

> Excellent work, congratulations!
>
> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
> wrote:
>
>> Congratulations!
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>
>>> Congratulations!
>>>
>>>
>>>
>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> We are happy to announce the availability of Spark 3.5.1!
>>>
>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>> strongly
>>> recommend all 3.5 users to upgrade to this stable release.
>>>
>>> To download Spark 3.5.1, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>
>>> We would like to acknowledge all community members for contributing to
>>> this
>>> release. This release would not have been possible without you.
>>>
>>> Jungtaek Lim
>>>
>>> ps. Yikun is helping us through releasing the official docker image for
>>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>>>
>>>
>
> --
> John Zhuge
>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Dongjoon Hyun
Congratulations!

Bests,
Dongjoon.

On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:

> Congratulations!
>
>
>
> At 2024-02-28 17:43:25, "Jungtaek Lim" 
> wrote:
>
> Hi everyone,
>
> We are happy to announce the availability of Spark 3.5.1!
>
> Spark 3.5.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.5 maintenance branch of Spark. We strongly
> recommend all 3.5 users to upgrade to this stable release.
>
> To download Spark 3.5.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-5-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Jungtaek Lim
>
> ps. Yikun is helping us through releasing the official docker image for
> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>
>


[ANNOUNCE] Apache Spark 3.3.4 released

2023-12-16 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.3.4!

Spark 3.3.4 is the last maintenance release based on the
branch-3.3 maintenance branch of Spark. It contains many fixes,
including fixes in the security and correctness domains. We strongly
recommend all 3.3 users to upgrade to this or a later stable release.

To download Spark 3.3.4, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-3-4.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.

Dongjoon Hyun


[FYI] SPARK-45981: Improve Python language test coverage

2023-12-01 Thread Dongjoon Hyun
Hi, All.

As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community
now has test coverage for all supported Python versions, starting today.

- https://github.com/apache/spark/actions/runs/7061665420

Here is a summary.

1. Main CI: All PRs and commits on `master` branch are tested with Python
3.9.
2. Daily CI:
https://github.com/apache/spark/actions/workflows/build_python.yml
- PyPy 3.8
- Python 3.10
- Python 3.11
- Python 3.12

This is a great addition for PySpark 4.0+ users and an extensible framework
for all future Python versions.

Thank you all for making this happen together!

Best,
Dongjoon.


[ANNOUNCE] Apache Spark 3.4.2 released

2023-11-30 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.2!

Spark 3.4.2 is a maintenance release containing many fixes, including
fixes in the security and correctness domains. This release is based on the
branch-3.4 maintenance branch of Spark. We strongly
recommend all 3.4 users to upgrade to this stable release.

To download Spark 3.4.2, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-4-2.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


[ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.1!

Spark 3.4.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.4 maintenance branch of Spark. We strongly
recommend all 3.4 users to upgrade to this stable release.

To download Spark 3.4.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-4-1.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


[ANNOUNCE] Apache Spark 3.2.4 released

2023-04-13 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.2.4!

Spark 3.2.4 is a maintenance release containing stability fixes. This
release is based on the branch-3.2 maintenance branch of Spark. We strongly
recommend all 3.2 users to upgrade to this stable release.

To download Spark 3.2.4, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-2-4.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


Re: Slack for PySpark users

2023-04-03 Thread Dongjoon Hyun
Thank you, Denny.

May I interpret your comment as a request to support multiple channels in
ASF too?

> because it would allow us to create multiple channels for different topics


Any other reasons?

Dongjoon.


On Mon, Apr 3, 2023 at 5:31 PM Denny Lee  wrote:

> I do think creating a new Slack channel would be helpful because it would
> allow us to create multiple channels for different topics - streaming,
> graph, ML, etc.
>
> We would need a volunteer core to maintain it so we can keep the spirit
> and letter of ASF / code of conduct.  I’d be glad to volunteer to keep this
> active.
>
>
>
> On Mon, Apr 3, 2023 at 16:46 Dongjoon Hyun 
> wrote:
>
>> Shall we summarize the discussion so far?
>>
>> To sum up, "ASF Slack" vs "3rd-party Slack" was the real background to
>> initiate this thread instead of "Slack" vs "Mailing list"?
>>
>> If ASF Slack provides what you need, is it better than creating a
>> new Slack channel?
>>
>> Or, is there another reason for us to create a new Slack channel?
>>
>> Dongjoon.
>>
>>
>> On Mon, Apr 3, 2023 at 3:27 PM Mich Talebzadeh 
>> wrote:
>>
>>> I agree, whatever individual sentiments are.
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 3 Apr 2023 at 23:21, Jungtaek Lim 
>>> wrote:
>>>
>>>> Just to be clear, if there is no strong volunteer to make the new
>>>> community channel stay active, I'd probably be OK to not fork the channel.
>>>> You can see a strong counter example from #spark channel in ASF. It is the
>>>> place where there are only questions and promos but zero answers. I see
>>>> volunteers here demanding for another channel, so I want to see us go with
>>>> the most preferred way for these volunteers.
>>>>
>>>> User mailing list does not go in a good shape. I hope we give another
>>>> try with recent technology to see whether we can gain traction - if we
>>>> fail, the user mailing list will still be there.
>>>>
>>>> On Tue, Apr 4, 2023 at 7:04 AM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> The number of subscribers doesn't give any meaningful value. Please
>>>>> look into the number of mails being sent to the list.
>>>>>
>>>>> https://lists.apache.org/list.html?user@spark.apache.org
>>>>> The latest month there were more than 200 emails being sent was Feb
>>>>> 2022, more than a year ago. It was more than 1k in 2016, and more than 2k
>>>>> in 2015 and earlier.
>>>>> Let's face the fact. User mailing list is dying, even before we start
>>>>> discussion about alternative communication methods.
>>>>>
>>>>> Users never go with the way if it's just because PMC members (or ASF)
>>>>> have preference. They are going with the way they are convenient.
>>>>>
>>>>> Same applies here - if ASF Slack requires a restricted invitation
>>>>> mechanism then it won't work. Looks like there is a link for an 
>>>>> invitation,
>>>>> but we are also talking about the cost as well.
>>>>> https://cwiki.apache.org/confluence/display/INFRA/Slack+Guest+Invites
>>>>> As long as we are being serious about the cost, I don't think we are
>>>>> going to land in the way "users" are convenient.
>>>>>
>>>>> On Tue, Apr 4, 2023 at 4:59 AM Dongjoon Hyun 
>>>>> wrote:
>>>>>
>>>>>> As Mich Talebzadeh pointed out, Apache Spark has an official Slack
>>>>>> channel.
>>>>>>
>>>>>> > It's unavoidable if "users" prefer to use an alternat

Re: Slack for PySpark users

2023-04-03 Thread Dongjoon Hyun
Shall we summarize the discussion so far?

To sum up, was "ASF Slack" vs. "3rd-party Slack" the real motivation for
starting this thread, rather than "Slack" vs. "mailing list"?

If ASF Slack provides what you need, is it better than creating a new Slack
channel?

Or, is there another reason for us to create a new Slack channel?

Dongjoon.


On Mon, Apr 3, 2023 at 3:27 PM Mich Talebzadeh 
wrote:

> I agree, whatever individual sentiments are.
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 3 Apr 2023 at 23:21, Jungtaek Lim 
> wrote:
>
>> Just to be clear, if there is no strong volunteer to make the new
>> community channel stay active, I'd probably be OK to not fork the channel.
>> You can see a strong counter example from #spark channel in ASF. It is the
>> place where there are only questions and promos but zero answers. I see
>> volunteers here demanding for another channel, so I want to see us go with
>> the most preferred way for these volunteers.
>>
>> User mailing list does not go in a good shape. I hope we give another try
>> with recent technology to see whether we can gain traction - if we fail,
>> the user mailing list will still be there.
>>
>> On Tue, Apr 4, 2023 at 7:04 AM Jungtaek Lim 
>> wrote:
>>
>>> The number of subscribers doesn't give any meaningful value. Please look
>>> into the number of mails being sent to the list.
>>>
>>> https://lists.apache.org/list.html?user@spark.apache.org
>>> The latest month there were more than 200 emails being sent was Feb
>>> 2022, more than a year ago. It was more than 1k in 2016, and more than 2k
>>> in 2015 and earlier.
>>> Let's face the fact. User mailing list is dying, even before we start
>>> discussion about alternative communication methods.
>>>
>>> Users never go with the way if it's just because PMC members (or ASF)
>>> have preference. They are going with the way they are convenient.
>>>
>>> Same applies here - if ASF Slack requires a restricted invitation
>>> mechanism then it won't work. Looks like there is a link for an invitation,
>>> but we are also talking about the cost as well.
>>> https://cwiki.apache.org/confluence/display/INFRA/Slack+Guest+Invites
>>> As long as we are being serious about the cost, I don't think we are
>>> going to land in the way "users" are convenient.
>>>
>>> On Tue, Apr 4, 2023 at 4:59 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> As Mich Talebzadeh pointed out, Apache Spark has an official Slack
>>>> channel.
>>>>
>>>> > It's unavoidable if "users" prefer to use an alternative
>>>> communication mechanism rather than the user mailing list.
>>>>
>>>> The following is the number of people in the official channels.
>>>>
>>>> - user@spark.apache.org has 4519 subscribers.
>>>> - d...@spark.apache.org has 3149 subscribers.
>>>> - ASF Official Slack channel has 602 subscribers.
>>>>
>>>> May I ask if the users prefer to use the ASF Official Slack channel
>>>> than the user mailing list?
>>>>
>>>> Dongjoon.
>>>>
>>>>
>>>>
>>>> On Thu, Mar 30, 2023 at 9:10 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> I'm reading through the page "Briefing: The Apache Way", and in the
>>>>> section of "Open Communications", restriction of communication inside ASF
>>>>> INFRA (mailing list) is more about code and decision-making.
>>>>>
>>>>> https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define
>>>>>
>>>>> It's unavoidable if "users" prefer to use an alternative communication
>>>>> mechanism rather than the user mailing list. Before Stack Overflow days,
>>>>> there had been a meanin

Re: Slack for PySpark users

2023-04-03 Thread Dongjoon Hyun
Do you think there is a way to bring it back into the official ASF-provided
Slack channel?

Dongjoon.

On Mon, Apr 3, 2023 at 2:18 PM Mich Talebzadeh 
wrote:

>
> I for myself prefer to use the newly formed slack.
>
> sparkcommunitytalk.slack.com
>
> In summary, it may be a good idea to take a tour of it and see for
> yourself. Topics are sectioned as per user requests.
>
> I trust this answers your question.
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 3 Apr 2023 at 20:59, Dongjoon Hyun 
> wrote:
>
>> As Mich Talebzadeh pointed out, Apache Spark has an official Slack
>> channel.
>>
>> > It's unavoidable if "users" prefer to use an alternative communication
>> mechanism rather than the user mailing list.
>>
>> The following is the number of people in the official channels.
>>
>> - user@spark.apache.org has 4519 subscribers.
>> - d...@spark.apache.org has 3149 subscribers.
>> - ASF Official Slack channel has 602 subscribers.
>>
>> May I ask if the users prefer to use the ASF Official Slack channel
>> than the user mailing list?
>>
>> Dongjoon.
>>
>>
>>
>> On Thu, Mar 30, 2023 at 9:10 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> I'm reading through the page "Briefing: The Apache Way", and in the
>>> section of "Open Communications", restriction of communication inside ASF
>>> INFRA (mailing list) is more about code and decision-making.
>>>
>>> https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define
>>>
>>> It's unavoidable if "users" prefer to use an alternative communication
>>> mechanism rather than the user mailing list. Before Stack Overflow days,
>>> there had been a meaningful number of questions around user@. It's just
>>> impossible to let them go back and post to the user mailing list.
>>>
>>> We just need to make sure it is not the purpose of employing Slack to
>>> move all discussions about developments, direction of the project, etc
>>> which must happen in dev@/private@. The purpose of Slack thread here
>>> does not seem to aim to serve the purpose.
>>>
>>>
>>> On Fri, Mar 31, 2023 at 7:00 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Good discussions and proposals.all around.
>>>>
>>>> I have used slack in anger on a customer site before. For small and
>>>> medium size groups it is good and affordable. Alternatives have been
>>>> suggested as well so those who like investigative search can agree and come
>>>> up with a freebie one.
>>>> I am inclined to agree with Bjorn that this slack has more social
>>>> dimensions than the mailing list. It is akin to a sports club using
>>>> WhatsApp groups for communication. Remember we were originally looking for
>>>> space for webinars, including Spark on Linkedin that Denney Lee suggested.
>>>> I think Slack and mailing groups can coexist happily. On a more serious
>>>> note, when I joined the user group back in 2015-2016, there was a lot of
>>>> traffic. Currently we hardly get many mails daily <> less than 5. So having
>>>> a slack type medium may improve members participation.
>>>>
>>>> so +1 for me as well.
>>>>
>>>> Mich Talebzadeh,
>>>> Lead Solutions Architect/Engineering Lead
>>>> Palantir Technologies Limited
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relyi

Re: Slack for PySpark users

2023-04-03 Thread Dongjoon Hyun
>>>> official policy around it.
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>> On Thu, Mar 30, 2023 at 4:23 PM Bjørn Jørgensen <
>>>> bjornjorgen...@gmail.com> wrote:
>>>>
>>>>> I like the idea of having a talk channel. It can make it easier for
>>>>> everyone to say hello. Or to dare to ask about small or big matters that
>>>>> you would not have dared to ask about before on mailing lists.
>>>>> But then there is the price and what is the best for an open source
>>>>> project.
>>>>>
>>>>> The price for using slack is expensive.
>>>>> Right now for those that have join spark slack
>>>>> $8.75 USD
>>>>> 72 members
>>>>> 1 month
>>>>> $630 USD
>>>>>
>>>>> https://app.slack.com/plans/T04URTRBZ1R/checkout/form?entry_point=hero_banner_upgrade_cta=2
>>>>>
>>>>> And they - slack does not have an option for open source projects.
>>>>>
>>>>> There seems to be some alternatives for open source software. I have
>>>>> not tried it.
>>>>> Like https://www.rocket.chat/blog/slack-open-source-alternatives
>>>>>
>>>>> 
>>>>>
>>>>>
>>>>> rocket chat is open source https://github.com/RocketChat/Rocket.Chat
>>>>>
>>>>> tor. 30. mar. 2023 kl. 18:54 skrev Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com>:
>>>>>
>>>>>> Hi Dongjoon
>>>>>>
>>>>>> to your points if I may
>>>>>>
>>>>>> - Do you have any reference from other official ASF-related Slack
>>>>>> channels?
>>>>>>No, I don't have any reference from other official ASF-related
>>>>>> Slack channels because I don't think that matters. However, I stand
>>>>>> corrected
>>>>>> - To be clear, I intentionally didn't refer to any specific mailing
>>>>>> list because we didn't set up any rule here yet.
>>>>>>fair enough
>>>>>>
>>>>>> going back to your original point
>>>>>>
>>>>>> ..There is a concern expressed by ASF board because recent Slack
>>>>>> activities created an isolated silo outside of ASF mailing list 
>>>>>> archive...
>>>>>> Well, there are activities on Spark and indeed other open source
>>>>>> software everywhere. One way or other they do help getting community
>>>>>> (inside the user groups and other) to get interested and involved. Slack
>>>>>> happens to be one of them.
>>>>>> I am of the opinion that creating such silos is already a reality and
>>>>>> we ought to be pragmatic. Unless there is an overriding reason, we should
>>>>>> embrace it as slack can co-exist with the other mailing lists and 
>>>>>> channels
>>>>>> like linkedin etc.
>>>>>>
>>>>>> Hope this clarifies my position
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Lead Solutions Architect/Engineering Lead
>>>>>> Palantir Technologies Limited
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, 30 Mar 2023 at 17:28, Dongjoon Hyun 
>>>>>> wrote:
>>>>>>
>>>>>>> To Mich.
>>>>>>> - Do you have any reference from other official ASF-related Slack
>>>>>>> 

Re: Slack for PySpark users

2023-03-30 Thread Dongjoon Hyun
To Mich.
- Do you have any reference from other official ASF-related Slack channels?
- To be clear, I intentionally didn't refer to any specific mailing list
because we haven't set up any rules here yet.

To Xiao. I understand what you mean. That's the reason why I added Matei
from your side.
> I did not see an objection from the ASF board.

There is an ongoing discussion about communication channels outside ASF
email, specifically concerning Slack.
Please hold off on any official action on this topic until we know how to
support it seamlessly.

Dongjoon.


On Thu, Mar 30, 2023 at 9:21 AM Xiao Li  wrote:

> Hi, Dongjoon,
>
> The other communities (e.g., Pinot, Druid, Flink) created their own Slack
> workspaces last year. I did not see an objection from the ASF board. At the
> same time, Slack workspaces are very popular and useful in most non-ASF
> open source communities. TBH, we are kind of late. I think we can do the
> same in our community?
>
> We can follow the guide when the ASF has an official process for ASF
> archiving. Since our PMC are the owner of the slack workspace, we can make
> a change based on the policy. WDYT?
>
> Xiao
>
>
> On Thu, Mar 30, 2023 at 09:03, Dongjoon Hyun  wrote:
>
>> Hi, Xiao and all.
>>
>> (cc Matei)
>>
>> Please hold on the vote.
>>
>> There is a concern expressed by ASF board because recent Slack activities
>> created an isolated silo outside of ASF mailing list archive.
>>
>> We need to establish a way to embrace it back to ASF archive before
>> starting anything official.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Wed, Mar 29, 2023 at 11:32 PM Xiao Li  wrote:
>>
>>> +1
>>>
>>> + @d...@spark.apache.org 
>>>
>>> This is a good idea. The other Apache projects (e.g., Pinot, Druid,
>>> Flink) have created their own dedicated Slack workspaces for faster
>>> communication. We can do the same in Apache Spark. The Slack workspace will
>>> be maintained by the Apache Spark PMC. I propose to initiate a vote for the
>>> creation of a new Apache Spark Slack workspace. Does that sound good?
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Mar 28, 2023 at 07:07, Mich Talebzadeh  wrote:
>>>
>>>> I created one at slack called pyspark
>>>>
>>>>
>>>> Mich Talebzadeh,
>>>> Lead Solutions Architect/Engineering Lead
>>>> Palantir Technologies Limited
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 28 Mar 2023 at 03:52, asma zgolli  wrote:
>>>>
>>>>> +1 good idea, I d like to join as well.
>>>>>
>>>>> On Tue, Mar 28, 2023 at 04:09, Winston Lai  wrote:
>>>>>
>>>>>> Please let us know when the channel is created. I'd like to join :)
>>>>>>
>>>>>> Thank You & Best Regards
>>>>>> Winston Lai
>>>>>> --
>>>>>> *From:* Denny Lee 
>>>>>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>>>>>> *To:* Hyukjin Kwon 
>>>>>> *Cc:* keen ; user@spark.apache.org <
>>>>>> user@spark.apache.org>
>>>>>> *Subject:* Re: Slack for PySpark users
>>>>>>
>>>>>> +1 I think this is a great idea!
>>>>>>
>>>>>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>> Yeah, actually I think we should better have a slack channel so we
>>>>>> can easily discuss with users and developers.
>>>>>>
>>>>>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>>>>>
>>>>>> Hi all,
>>>>>> I really like *Slack *as communication channel for a tech community.
>>>>>> There is a Slack workspace for *delta lake users* (
>>>>>> https://go.delta.io/slack) that I enjoy a lot.
>>>>>> I was wondering if there is something similar for PySpark users.
>>>>>>
>>>>>> If not, would there be anything wrong with creating a new
>>>>>> Slack workspace for PySpark users? (when explicitly mentioning that this 
>>>>>> is
>>>>>> *not* officially part of Apache Spark)?
>>>>>>
>>>>>> Cheers
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Asma ZGOLLI
>>>>>
>>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>>
>>>>>


Re: Slack for PySpark users

2023-03-30 Thread Dongjoon Hyun
Hi, Xiao and all.

(cc Matei)

Please hold on the vote.

There is a concern expressed by the ASF board because recent Slack activities
created an isolated silo outside of the ASF mailing list archive.

We need to establish a way to bring it back into the ASF archive before
starting anything official.

Bests,
Dongjoon.



On Wed, Mar 29, 2023 at 11:32 PM Xiao Li  wrote:

> +1
>
> + @d...@spark.apache.org 
>
> This is a good idea. The other Apache projects (e.g., Pinot, Druid, Flink)
> have created their own dedicated Slack workspaces for faster communication.
> We can do the same in Apache Spark. The Slack workspace will be maintained
> by the Apache Spark PMC. I propose to initiate a vote for the creation of a
> new Apache Spark Slack workspace. Does that sound good?
>
> Cheers,
>
> Xiao
>
>
>
>
>
>
>
> On Tue, Mar 28, 2023 at 07:07, Mich Talebzadeh  wrote:
>
>> I created one at slack called pyspark
>>
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 28 Mar 2023 at 03:52, asma zgolli  wrote:
>>
>>> +1 good idea, I d like to join as well.
>>>
>>> On Tue, Mar 28, 2023 at 04:09, Winston Lai  wrote:
>>>
 Please let us know when the channel is created. I'd like to join :)

 Thank You & Best Regards
 Winston Lai
 --
 *From:* Denny Lee 
 *Sent:* Tuesday, March 28, 2023 9:43:08 AM
 *To:* Hyukjin Kwon 
 *Cc:* keen ; user@spark.apache.org <
 user@spark.apache.org>
 *Subject:* Re: Slack for PySpark users

 +1 I think this is a great idea!

 On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
 wrote:

 Yeah, actually I think we should better have a slack channel so we can
 easily discuss with users and developers.

 On Tue, 28 Mar 2023 at 03:08, keen  wrote:

 Hi all,
 I really like *Slack *as communication channel for a tech community.
 There is a Slack workspace for *delta lake users* (
 https://go.delta.io/slack) that I enjoy a lot.
 I was wondering if there is something similar for PySpark users.

 If not, would there be anything wrong with creating a new
 Slack workspace for PySpark users? (when explicitly mentioning that this is
 *not* officially part of Apache Spark)?

 Cheers
 Martin


>>>
>>> --
>>> Asma ZGOLLI
>>>
>>> Ph.D. in Big Data - Applied Machine Learning
>>>
>>>


Re: SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-18 Thread Dongjoon Hyun
Thank you for considering me, but may I ask what made you think of putting me
there, Mich? I'm curious about your reasoning.

> I have put dongjoon.hyun as a shepherd.

BTW, unfortunately, I cannot help with that due to ongoing personal
matters. I'll adjust the JIRA first.

Thanks,
Dongjoon.


On Sat, Feb 18, 2023 at 10:51 AM Mich Talebzadeh 
wrote:

> https://issues.apache.org/jira/browse/SPARK-42485
>
>
> Spark Structured Streaming is a very useful tool in dealing with Event
> Driven Architecture. In an Event Driven Architecture, there is generally a
> main loop that listens for events and then triggers a call-back function
> when one of those events is detected. In a streaming application the
> application waits to receive the source messages in a set interval or
> whenever they happen and reacts accordingly.
>
> There are occasions when you may want to stop the Spark program
> gracefully. Gracefully meaning that the Spark application handles the last
> streaming message completely and then terminates. This is
> different from invoking interrupts such as CTRL-C.
>
> Of course one can terminate the process based on the following
>
>1. query.awaitTermination() # Waits for the termination of this query,
>with stop() or with error
>
>
>    2. query.awaitTermination(timeoutMs) # Returns true if this query is
>    terminated within the timeout in milliseconds.
>
> So the first one above waits until an interrupt signal is received. The
> second one will count the timeout and will exit when the timeout in
> milliseconds is reached.
>
> The issue is that one needs to predict how long the streaming job needs to
> run. Clearly, any interrupt at the terminal or OS level (kill process) may
> leave the processing terminated without proper completion of the
> streaming process.
>
> I have devised a method that allows one to terminate the spark application
> internally after processing the last received message. Within say 2 seconds
> of the confirmation of shutdown, the process will invoke a graceful
> shutdown.
>
> This new feature proposes a solution that gracefully handles the work for
> the message currently being processed, waits for it to complete, and
> shuts down the streaming process for a given topic without loss of data or
> orphaned transactions.
>
>
> I have put dongjoon.hyun as a shepherd. Kindly advise me if that is the
> correct approach.
>
> JIRA ticket https://issues.apache.org/jira/browse/SPARK-42485
>
> SPIP doc: TBC
>
> Discussion thread: in
>
> https://lists.apache.org/list.html?d...@spark.apache.org
>
>
> Thanks.
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
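
For illustration, a minimal PySpark Structured Streaming sketch of the
graceful-shutdown idea described in the quoted SPIP above, assuming a
hypothetical sentinel-file path as the shutdown signal (the SPIP does not
prescribe this exact mechanism):

    import os
    import time

    from pyspark.sql import SparkSession

    # Sketch of a graceful shutdown: instead of killing the driver, an operator
    # creates a sentinel file; the driver notices it, lets the in-flight
    # micro-batch finish, then stops the query and the session cleanly.
    SHUTDOWN_MARKER = "/tmp/stop_spark_stream"  # hypothetical signal path

    spark = SparkSession.builder.appName("graceful-shutdown-sketch").getOrCreate()

    # A rate source stands in for the real topic in this sketch.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    query = (
        events.writeStream
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/graceful-demo")
        .start()
    )

    # Poll instead of blocking forever in awaitTermination(); check the marker
    # between short waits so the current batch completes before stopping.
    while query.isActive:
        if query.awaitTermination(2):  # True if the query ended within 2 seconds
            break
        if os.path.exists(SHUTDOWN_MARKER):
            # Let the active trigger finish, then stop the query gracefully.
            while query.status.get("isTriggerActive", False):
                time.sleep(1)
            query.stop()

    spark.stop()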


Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread Dongjoon Hyun
Thank you, Chao!

On Wed, Nov 30, 2022 at 8:16 AM Yang,Jie(INF)  wrote:

> Thanks, Chao!
>
>
>
> *From:* Maxim Gekk 
> *Date:* Wednesday, November 30, 2022, 19:40
> *To:* Jungtaek Lim 
> *Cc:* Wenchen Fan , Chao Sun ,
> dev , user 
> *Subject:* Re: [ANNOUNCE] Apache Spark 3.2.3 released
>
>
>
> Thank you, Chao!
>
>
>
> On Wed, Nov 30, 2022 at 12:42 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
> Thanks Chao for driving the release!
>
>
>
> On Wed, Nov 30, 2022 at 6:03 PM Wenchen Fan  wrote:
>
> Thanks, Chao!
>
>
>
> On Wed, Nov 30, 2022 at 1:33 AM Chao Sun  wrote:
>
> We are happy to announce the availability of Apache Spark 3.2.3!
>
> Spark 3.2.3 is a maintenance release containing stability fixes. This
> release is based on the branch-3.2 maintenance branch of Spark. We strongly
> recommend all 3.2 users to upgrade to this stable release.
>
> To download Spark 3.2.3, head over to the download page:
> https://spark.apache.org/downloads.html
> 
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-2-3.html
> 
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Chao
>
>
>


Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Dongjoon Hyun
It's great. Thank you so much, Yuming!

Dongjoon

On Tue, Oct 25, 2022 at 11:23 PM Yuming Wang  wrote:

> We are happy to announce the availability of Apache Spark 3.3.1!
>
> Spark 3.3.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.3 maintenance branch of Spark. We strongly
> recommend all 3.3 users to upgrade to this stable release.
>
> To download Spark 3.3.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-3-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
>
>


[ANNOUNCE] Apache Spark 3.2.2 released

2022-07-17 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.2.2!

Spark 3.2.2 is a maintenance release containing stability fixes. This
release is based on the branch-3.2 maintenance branch of Spark. We strongly
recommend all 3.2 users to upgrade to this stable release.

To download Spark 3.2.2, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-2-2.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread Dongjoon Hyun
Thank you again, Huaxin!

Dongjoon.

On Fri, Jan 28, 2022 at 6:23 PM DB Tsai  wrote:

> Thank you, Huaxin for the 3.2.1 release!
>
> Sent from my iPhone
>
> On Jan 28, 2022, at 5:45 PM, Chao Sun  wrote:
>
> 
> Thanks Huaxin for driving the release!
>
> On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng 
> wrote:
>
>> It's Great!
>> Congrats and thanks, huaxin!
>>
>>
>> -- Original Message --
>> *From:* "huaxin gao" ;
>> *Sent:* Saturday, January 29, 2022, 9:07 AM
>> *To:* "dev";"user";
>> *Subject:* [ANNOUNCE] Apache Spark 3.2.1 released
>>
>> We are happy to announce the availability of Spark 3.2.1!
>>
>> Spark 3.2.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.2 maintenance branch of Spark. We
>> strongly
>> recommend all 3.2 users to upgrade to this stable release.
>>
>> To download Spark 3.2.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-2-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Huaxin Gao
>>
>


Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Dongjoon Hyun
Thank you so much, Gengliang and all!

Dongjoon.

On Tue, Oct 19, 2021 at 8:48 AM Xiao Li  wrote:

> Thank you, Gengliang!
>
> Congrats to our community and all the contributors!
>
> Xiao
>
>> On Tue, Oct 19, 2021 at 8:26 AM, Henrik Peng  wrote:
>
>> Congrats and thanks!
>>
>>
>> On Tue, Oct 19, 2021 at 10:16 PM, Gengliang Wang wrote:
>>
>>> Hi all,
>>>
>>> Apache Spark 3.2.0 is the third release of the 3.x line. With tremendous
>>> contribution from the open-source community, this release managed to
>>> resolve in excess of 1,700 Jira tickets.
>>>
>>> We'd like to thank our contributors and users for their contributions
>>> and early feedback to this release. This release would not have been
>>> possible without you.
>>>
>>> To download Spark 3.2.0, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-2-0.html
>>>
>>


Re: [ANNOUNCE] Apache Spark 3.0.3 released

2021-06-25 Thread Dongjoon Hyun
Thank you, Yi!



On Thu, Jun 24, 2021 at 10:52 PM Yi Wu  wrote:

> We are happy to announce the availability of Spark 3.0.3!
>
> Spark 3.0.3 is a maintenance release containing stability fixes. This
> release is based on the branch-3.0 maintenance branch of Spark. We strongly
> recommend all 3.0 users to upgrade to this stable release.
>
> To download Spark 3.0.3, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-0-3.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Yi
>
>


Re: Missing module spark-hadoop-cloud in Maven central

2021-06-21 Thread Dongjoon Hyun
Hi, Stephen and Steve.

The Apache Spark community has started to publish it as a snapshot, and Apache
Spark 3.2.0 will be the first release to include it.

- https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-hadoop-cloud_2.12/3.2.0-SNAPSHOT/

Please check the snapshot artifacts and file an Apache Spark JIRA if you hit
any issues.

Bests,
Dongjoon.

On 2021/06/02 19:05:29, Steve Loughran  wrote: 
> off the record: Really irritates me too, as it forces me to do local builds
> even though I shouldn't have to. Sometimes I do that for other reasons, but
> still.
> 
> Getting the cloud-storage module in was hard enough at the time that I
> wasn't going to push harder; I essentially stopped trying to get one in to
> spark after that and effectively being told to go and play in my own fork
> (*).
> 
> https://github.com/apache/spark/pull/12004#issuecomment-259020494
> 
> Given that effort almost failed, to then say "now include the artifact and
> releases" wasn't something I was going to do; I had everything I needed for
> my own build, and trying to add new PRs struck me as an exercise in
> confrontation and futility
> 
> Sean, if I do submit a PR which makes hadoop-cloud default on the right
> versions, but strips out the dependencies on the final tarball, would that
> get some attention?
> 
> (*) Sean of course, was a notable exception and very supportive.
> 
> 
> 
> 
> 
> 
> 
> On Wed, 2 Jun 2021 at 00:56, Stephen Coy  wrote:
> 
> > I have been building Apache Spark from source just so I can get this
> > dependency.
> >
> >
> >1. git checkout v3.1.1
> >2. dev/make-distribution.sh --name hadoop-cloud-3.2 --tgz -Pyarn
> >-Phadoop-3.2  -Pyarn -Phadoop-cloud
> >-Phive-thriftserver  -Dhadoop.version=3.2.0
> >
> >
> > It is kind of a nuisance having to do this though.
> >
> > Steve C
> >
> >
> > On 31 May 2021, at 10:34 pm, Sean Owen  wrote:
> >
> > I know it's not enabled by default when the binary artifacts are built,
> > but not exactly sure why it's not built separately at all. It's almost a
> > dependencies-only pom artifact, but there are two source files. Steve do
> > you have an angle on that?
> >
> > On Mon, May 31, 2021 at 5:37 AM Erik Torres  wrote:
> >
> >> Hi,
> >>
> >> I'm following this documentation
> >> 
> >>  to
> >> configure my Spark-based application to interact with Amazon S3. However, I
> >> cannot find the spark-hadoop-cloud module in Maven central for the
> >> non-commercial distribution of Apache Spark. From the documentation I would
> >> expect that I can get this module as a Maven dependency in my project.
> >> However, I ended up building the spark-hadoop-cloud module from the Spark's
> >> code
> >> 
> >> .
> >>
> >> Is this the expected way to setup the integration with Amazon S3? I think
> >> I'm missing something here.
> >>
> >> Thanks in advance!
> >>
> >> Erik
> >>
> >
> > This email contains confidential information of and is the copyright of
> > Infomedia. It must not be forwarded, amended or disclosed without consent
> > of the sender. If you received this message by mistake, please advise the
> > sender and delete all copies. Security of transmission on the internet
> > cannot be guaranteed, could be infected, intercepted, or corrupted and you
> > should ensure you have suitable antivirus protection in place. By sending
> > us your or any third party personal details, you consent to (or confirm you
> > have obtained consent from such third parties) to Infomedia’s privacy
> > policy. http://www.infomedia.com.au/privacy-policy/
> >
> 
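
For illustration, a minimal PySpark sketch of the S3 access that the
spark-hadoop-cloud module enables through the s3a connector. The bucket,
paths, and credential settings are illustrative assumptions, and a Spark
build or classpath that already includes the hadoop-cloud/hadoop-aws
dependencies discussed above is presumed:

    from pyspark.sql import SparkSession

    # Sketch: read and write Parquet on S3 through the s3a filesystem, which is
    # what the hadoop-cloud / hadoop-aws dependencies provide. Prefer instance
    # profiles or a credentials provider over hard-coded keys in practice.
    spark = (
        SparkSession.builder
        .appName("s3a-read-write-sketch")
        .config("spark.hadoop.fs.s3a.access.key", "AKIA...")   # illustrative
        .config("spark.hadoop.fs.s3a.secret.key", "...")       # illustrative
        .getOrCreate()
    )

    # Hypothetical bucket and prefixes.
    df = spark.read.parquet("s3a://example-bucket/input/events/")
    df.groupBy("event_type").count().write.mode("overwrite").parquet(
        "s3a://example-bucket/output/event_counts/"
    )

    spark.stop()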




[ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Dongjoon Hyun
We are happy to announce the availability of Spark 3.1.2!

Spark 3.1.2 is a maintenance release containing stability fixes. This
release is based on the branch-3.1 maintenance branch of Spark. We strongly
recommend all 3.1 users to upgrade to this stable release.

To download Spark 3.1.2, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-1-2.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Dongjoon Hyun
It took a long time. Thank you, Hyukjin and all!

Bests,
Dongjoon.

On Wed, Mar 3, 2021 at 3:23 AM Gabor Somogyi 
wrote:

> Good to hear and great work Hyukjin! 
>
> On Wed, 3 Mar 2021, 11:15 Jungtaek Lim, 
> wrote:
>
>> Thanks Hyukjin for driving the huge release, and thanks everyone for
>> contributing the release!
>>
>> On Wed, Mar 3, 2021 at 6:54 PM angers zhu  wrote:
>>
>>> Great work, Hyukjin !
>>>
>>> Bests,
>>> Angers
>>>
>>> On Wed, Mar 3, 2021 at 5:02 PM, Wenchen Fan  wrote:
>>>
 Great work and congrats!

 On Wed, Mar 3, 2021 at 3:51 PM Kent Yao  wrote:

> Congrats, all!
>
> Bests,
> *Kent Yao *
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> *a spark enthusiast*
> *kyuubi is a
> unified multi-tenant JDBC interface for large-scale data processing and
> analytics, built on top of Apache Spark .*
> *spark-authorizer A
> Spark SQL extension which provides SQL Standard Authorization for **Apache
> Spark .*
> *spark-postgres  A library
> for reading data from and transferring data to Postgres / Greenplum with
> Spark SQL and DataFrames, 10~100x faster.*
> *spark-func-extras A
> library that brings excellent and useful functions from various modern
> database management systems to Apache Spark .*
>
>
>
> On 03/3/2021 15:11,Takeshi Yamamuro
>  wrote:
>
> Great work and Congrats, all!
>
> Bests,
> Takeshi
>
> On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan 
> wrote:
>
>>
>> Thanks Hyukjin and congratulations everyone on the release !
>>
>> Regards,
>> Mridul
>>
>> On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:
>>
>>> Great work, Hyukjin!
>>>
>>> On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon 
>>> wrote:
>>>
 We are excited to announce Spark 3.1.1 today.

 Apache Spark 3.1.1 is the second release of the 3.x line. This
 release adds
 Python type annotations and Python dependency management support as
 part of Project Zen.
 Other major updates include improved ANSI SQL compliance support,
 history server support
 in structured streaming, the general availability (GA) of
 Kubernetes and node decommissioning
 in Kubernetes and Standalone. In addition, this release continues
 to focus on usability, stability,
 and polish while resolving around 1500 tickets.

 We'd like to thank our contributors and users for their
 contributions and early feedback to
 this release. This release would not have been possible without you.

 To download Spark 3.1.1, head over to the download page:
 http://spark.apache.org/downloads.html

 To view the release notes:
 https://spark.apache.org/releases/spark-release-3-1-1.html


>
> --
> ---
> Takeshi Yamamuro
>
>


[ANNOUNCE] Announcing Apache Spark 3.0.2

2021-02-19 Thread Dongjoon Hyun
We are happy to announce the availability of Spark 3.0.2!

Spark 3.0.2 is a maintenance release containing stability fixes.
This release is based on the branch-3.0 maintenance branch of Spark.
We strongly recommend all 3.0 users to upgrade to this stable release.

To download Spark 3.0.2, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-0-2.html

We would like to acknowledge all community members for contributing
to this release. This release would not have been possible without you.

Dongjoon Hyun


[UPDATE] Apache Spark 3.1.0 Release Window

2020-10-12 Thread Dongjoon Hyun
Hi, All.

The Apache Spark 3.1.0 release window was adjusted today as follows.
Please check the latest information on the official website.

-
https://github.com/apache/spark-website/commit/0cd0bdc80503882b4737db7e77cc8f9d17ec12ca
- https://spark.apache.org/versioning-policy.html

Bests,
Dongjoon.


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-07 Thread Dongjoon Hyun
Thank you so much for your feedback, Koert.

Yes, SPARK-20202 was created in April 2017
and has been targeted for 3.1.0 since Nov 2019.

However, I believe Apache Spark 3.1.0 (Hadoop 3.2/Hive 2.3 distribution)
will work with old Hadoop 2.x clusters
if you isolate the classpath via SPARK-31960.

SPARK-31960 Only populate Hadoop classpath for no-hadoop build

Could you try with snapshot build?

Bests,
Dongjoon.




On Wed, Oct 7, 2020 at 3:24 PM Koert Kuipers  wrote:

> it seems to me with SPARK-20202 we are no longer planning to support
> hadoop2 + hive 1.2. is that correct?
>
> so basically spark 3.1 will no longer run on say CDH 5.x or HDP2.x with
> hive?
>
> my use case is building spark 3.1 and launching on these existing
> clusters that are not managed by me. e.g. i do not use the spark version
> provided by cloudera.
> however there are workarounds for me (using older spark version to extract
> out of hive, then switch to newer spark version) so i am not too worried
> about this. just making sure i understand.
>
> thanks
>
> On Sat, Oct 3, 2020 at 8:17 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> As of today, master branch (Apache Spark 3.1.0) resolved
>> 852+ JIRA issues and 606+ issues are 3.1.0-only patches.
>> According to the 3.1.0 release window, branch-3.1 will be
>> created on November 1st and enters QA period.
>>
>> Here are some notable updates I've been monitoring.
>>
>> *Language*
>> 01. SPARK-25075 Support Scala 2.13
>>   - Since SPARK-32926, Scala 2.13 build test has
>> become a part of GitHub Action jobs.
>>   - After SPARK-33044, Scala 2.13 test will be
>> a part of Jenkins jobs.
>> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
>> 03. SPARK-32082 Project Zen: Improving Python usability
>>   - 7 of 16 issues are resolved.
>> 04. SPARK-32073 Drop R < 3.5 support
>>   - This is done for Spark 3.0.1 and 3.1.0.
>>
>> *Dependency*
>> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>>   - This changes the default dist. for better cloud support
>> 06. SPARK-32981 Remove hive-1.2 distribution
>> 07. SPARK-20202 Remove references to org.spark-project.hive
>>   - This will remove Hive 1.2.1 from source code
>> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>>
>> *Core*
>> 09. SPARK-27495 Support Stage level resource conf and scheduling
>>   - 11 of 15 issues are resolved
>> 10. SPARK-25299 Use remote storage for persisting shuffle data
>>   - 8 of 14 issues are resolved
>>
>> *Resource Manager*
>> 11. SPARK-33005 Kubernetes GA preparation
>>   - It is on the way and we are waiting for more feedback.
>>
>> *SQL*
>> 12. SPARK-30648/SPARK-32346 Support filters pushdown
>>   to JSON/Avro
>> 13. SPARK-32948/SPARK-32958 Add Json expression optimizer
>> 14. SPARK-12312 Support JDBC Kerberos w/ keytab
>>   - 11 of 17 issues are resolved
>> 15. SPARK-27589 DSv2 was mostly completed in 3.0
>>   and added more features in 3.1 but still we missed
>>   - All built-in DataSource v2 write paths are disabled
>> and v1 write is used instead.
>>   - Support partition pruning with subqueries
>>   - Support bucketing
>>
>> We still have one month before the feature freeze
>> and starting QA. If you are working for 3.1,
>> please consider the timeline and share your schedule
>> with the Apache Spark community. For the other stuff,
>> we can put it into 3.2 release scheduled in June 2021.
>>
>> Last but not least, I want to emphasize (7) once again.
>> We need to remove the forked unofficial Hive eventually.
>> Please let us know your reasons if you need to build
>> from Apache Spark 3.1 source code for Hive 1.2.
>>
>> https://github.com/apache/spark/pull/29936
>>
>> As I wrote in the above PR description, for old releases,
>> Apache Spark 2.4(LTS) and 3.0 (~2021.12) will provide
>> Hive 1.2-based distribution.
>>
>> Bests,
>> Dongjoon.
>>
>


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Dongjoon Hyun
Regarding Xiao's comment, I want to point out that Apache Spark 3.1.0 is
different from 2.3 or 2.4.

Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.

- Apache Spark 2.0.0 was released on July 26, 2016.
- Apache Spark 2.1.0 was released on December 28, 2016.

Bests,
Dongjoon.


On Sun, Oct 4, 2020 at 10:53 AM Dongjoon Hyun 
wrote:

> Thank you all.
>
> BTW, Xiao and Mridul, I'm wondering what date you have in your mind
> specifically.
>
> Usually, `Christmas and New Year season` doesn't give us much additional
> time.
>
> If you think so, could you make a PR for Apache Spark website according
> to your expectation?
>
> https://spark.apache.org/versioning-policy.html
>
> Bests,
> Dongjoon.
>
>
> On Sun, Oct 4, 2020 at 7:18 AM Mridul Muralidharan 
> wrote:
>
>>
>> +1 on pushing the branch cut for increased dev time to match previous
>> releases.
>>
>> Regards,
>> Mridul
>>
>> On Sat, Oct 3, 2020 at 10:22 PM Xiao Li  wrote:
>>
>>> Thank you for your updates.
>>>
>>> Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date of
>>> the 3.1 branch cut, the feature development time window is less than 5
>>> months. This is shorter than what we did in Spark 2.3 and 2.4 releases.
>>>
>>> Below are three highly desirable feature work I am watching. Hopefully,
>>> we can finish them before the branch cut.
>>>
>>>- Support push-based shuffle to improve shuffle efficiency:
>>>https://issues.apache.org/jira/browse/SPARK-30602
>>>- Unify create table syntax:
>>>https://issues.apache.org/jira/browse/SPARK-31257
>>>- Bloom filter join:
>>>https://issues.apache.org/jira/browse/SPARK-32268
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>>
>>> On Sat, Oct 3, 2020 at 5:41 PM, Hyukjin Kwon  wrote:
>>>
>>>> Nice summary. Thanks Dongjoon. One minor correction -> I believe we
>>>> dropped R 3.5 and below at branch 2.4 as well.
>>>>
>>>> On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, 
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> As of today, master branch (Apache Spark 3.1.0) resolved
>>>>> 852+ JIRA issues and 606+ issues are 3.1.0-only patches.
>>>>> According to the 3.1.0 release window, branch-3.1 will be
>>>>> created on November 1st and enters QA period.
>>>>>
>>>>> Here are some notable updates I've been monitoring.
>>>>>
>>>>> *Language*
>>>>> 01. SPARK-25075 Support Scala 2.13
>>>>>   - Since SPARK-32926, Scala 2.13 build test has
>>>>> become a part of GitHub Action jobs.
>>>>>   - After SPARK-33044, Scala 2.13 test will be
>>>>> a part of Jenkins jobs.
>>>>> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
>>>>> 03. SPARK-32082 Project Zen: Improving Python usability
>>>>>   - 7 of 16 issues are resolved.
>>>>> 04. SPARK-32073 Drop R < 3.5 support
>>>>>   - This is done for Spark 3.0.1 and 3.1.0.
>>>>>
>>>>> *Dependency*
>>>>> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>>>>>   - This changes the default dist. for better cloud support
>>>>> 06. SPARK-32981 Remove hive-1.2 distribution
>>>>> 07. SPARK-20202 Remove references to org.spark-project.hive
>>>>>   - This will remove Hive 1.2.1 from source code
>>>>> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>>>>>
>>>>> *Core*
>>>>> 09. SPARK-27495 Support Stage level resource conf and scheduling
>>>>>   - 11 of 15 issues are resolved
>>>>> 10. SPARK-25299 Use remote storage for persisting shuffle data
>>>>>   - 8 of 14 issues are resolved
>>>>>
>>>>> *Resource Manager*
>>>>> 11. SPARK-33005 Kubernetes GA preparation
>>>>>   - It is on the way and we are waiting for more feedback.
>>>>>
>>>>> *SQL*
>>>>> 12. SPARK-30648/SPARK-32346 Support filters pushdown
>>>>>   to JSON/Avro
>>>>> 13. SPARK-32948/SPARK-32958 Add Json expression optimizer
>>>>> 14. SPARK-12312 Support JDBC Kerberos w/ keytab
>>>>>   - 11 of 17 issues are resolved
>>>>> 15. SPARK-27589 DSv2 was mostly completed in 3.0
>>>>>   and added more features in 3.1 but still we missed
>>>>>   - All built-in DataSource v2 write paths are disabled
>>>>> and v1 write is used instead.
>>>>>   - Support partition pruning with subqueries
>>>>>   - Support bucketing
>>>>>
>>>>> We still have one month before the feature freeze
>>>>> and starting QA. If you are working for 3.1,
>>>>> please consider the timeline and share your schedule
>>>>> with the Apache Spark community. For the other stuff,
>>>>> we can put it into 3.2 release scheduled in June 2021.
>>>>>
>>>>> Last but not least, I want to emphasize (7) once again.
>>>>> We need to remove the forked unofficial Hive eventually.
>>>>> Please let us know your reasons if you need to build
>>>>> from Apache Spark 3.1 source code for Hive 1.2.
>>>>>
>>>>> https://github.com/apache/spark/pull/29936
>>>>>
>>>>> As I wrote in the above PR description, for old releases,
>>>>> Apache Spark 2.4(LTS) and 3.0 (~2021.12) will provide
>>>>> Hive 1.2-based distribution.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Dongjoon Hyun
Thank you all.

BTW, Xiao and Mridul, I'm wondering what date you have in mind,
specifically.

Usually, `Christmas and New Year season` doesn't give us much additional
time.

If you think so, could you make a PR for the Apache Spark website according to
your expectation?

https://spark.apache.org/versioning-policy.html

Bests,
Dongjoon.


On Sun, Oct 4, 2020 at 7:18 AM Mridul Muralidharan  wrote:

>
> +1 on pushing the branch cut for increased dev time to match previous
> releases.
>
> Regards,
> Mridul
>
> On Sat, Oct 3, 2020 at 10:22 PM Xiao Li  wrote:
>
>> Thank you for your updates.
>>
>> Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date of
>> the 3.1 branch cut, the feature development time window is less than 5
>> months. This is shorter than what we did in Spark 2.3 and 2.4 releases.
>>
>> Below are three highly desirable feature work I am watching. Hopefully,
>> we can finish them before the branch cut.
>>
>>- Support push-based shuffle to improve shuffle efficiency:
>>https://issues.apache.org/jira/browse/SPARK-30602
>>- Unify create table syntax:
>>https://issues.apache.org/jira/browse/SPARK-31257
>>- Bloom filter join: https://issues.apache.org/jira/browse/SPARK-32268
>>
>> Thanks,
>>
>> Xiao
>>
>>
>> On Sat, Oct 3, 2020 at 5:41 PM Hyukjin Kwon  wrote:
>>
>>> Nice summary. Thanks Dongjoon. One minor correction -> I believe we
>>> dropped R 3.5 and below at branch 2.4 as well.
>>>
>>> On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, 
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> As of today, master branch (Apache Spark 3.1.0) resolved
>>>> 852+ JIRA issues and 606+ issues are 3.1.0-only patches.
>>>> According to the 3.1.0 release window, branch-3.1 will be
>>>> created on November 1st and enters QA period.
>>>>
>>>> Here are some notable updates I've been monitoring.
>>>>
>>>> *Language*
>>>> 01. SPARK-25075 Support Scala 2.13
>>>>   - Since SPARK-32926, Scala 2.13 build test has
>>>> become a part of GitHub Action jobs.
>>>>   - After SPARK-33044, Scala 2.13 test will be
>>>> a part of Jenkins jobs.
>>>> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
>>>> 03. SPARK-32082 Project Zen: Improving Python usability
>>>>   - 7 of 16 issues are resolved.
>>>> 04. SPARK-32073 Drop R < 3.5 support
>>>>   - This is done for Spark 3.0.1 and 3.1.0.
>>>>
>>>> *Dependency*
>>>> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>>>>   - This changes the default dist. for better cloud support
>>>> 06. SPARK-32981 Remove hive-1.2 distribution
>>>> 07. SPARK-20202 Remove references to org.spark-project.hive
>>>>   - This will remove Hive 1.2.1 from source code
>>>> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>>>>
>>>> *Core*
>>>> 09. SPARK-27495 Support Stage level resource conf and scheduling
>>>>   - 11 of 15 issues are resolved
>>>> 10. SPARK-25299 Use remote storage for persisting shuffle data
>>>>   - 8 of 14 issues are resolved
>>>>
>>>> *Resource Manager*
>>>> 11. SPARK-33005 Kubernetes GA preparation
>>>>   - It is on the way and we are waiting for more feedback.
>>>>
>>>> *SQL*
>>>> 12. SPARK-30648/SPARK-32346 Support filters pushdown
>>>>   to JSON/Avro
>>>> 13. SPARK-32948/SPARK-32958 Add Json expression optimizer
>>>> 14. SPARK-12312 Support JDBC Kerberos w/ keytab
>>>>   - 11 of 17 issues are resolved
>>>> 15. SPARK-27589 DSv2 was mostly completed in 3.0
>>>>   and added more features in 3.1 but still we missed
>>>>   - All built-in DataSource v2 write paths are disabled
>>>> and v1 write is used instead.
>>>>   - Support partition pruning with subqueries
>>>>   - Support bucketing
>>>>
>>>> We still have one month before the feature freeze
>>>> and starting QA. If you are working for 3.1,
>>>> please consider the timeline and share your schedule
>>>> with the Apache Spark community. For the other stuff,
>>>> we can put it into 3.2 release scheduled in June 2021.
>>>>
>>>> Last but not least, I want to emphasize (7) once again.
>>>> We need to remove the forked unofficial Hive eventually.
>>>> Please let us know your reasons if you need to build
>>>> from Apache Spark 3.1 source code for Hive 1.2.
>>>>
>>>> https://github.com/apache/spark/pull/29936
>>>>
>>>> As I wrote in the above PR description, for old releases,
>>>> Apache Spark 2.4(LTS) and 3.0 (~2021.12) will provide
>>>> Hive 1.2-based distribution.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>


Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Dongjoon Hyun
Hi, All.

As of today, the master branch (Apache Spark 3.1.0) has resolved
852+ JIRA issues, and 606+ of those are 3.1.0-only patches.
According to the 3.1.0 release window, branch-3.1 will be
created on November 1st and enter the QA period.

Here are some notable updates I've been monitoring.

*Language*
01. SPARK-25075 Support Scala 2.13
  - Since SPARK-32926, Scala 2.13 build test has
become a part of GitHub Action jobs.
  - After SPARK-33044, Scala 2.13 test will be
a part of Jenkins jobs.
02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
03. SPARK-32082 Project Zen: Improving Python usability
  - 7 of 16 issues are resolved.
04. SPARK-32073 Drop R < 3.5 support
  - This is done for Spark 3.0.1 and 3.1.0.

*Dependency*
05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
  - This changes the default dist. for better cloud support
06. SPARK-32981 Remove hive-1.2 distribution
07. SPARK-20202 Remove references to org.spark-project.hive
  - This will remove Hive 1.2.1 from source code
08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)

*Core*
09. SPARK-27495 Support Stage level resource conf and scheduling
  - 11 of 15 issues are resolved
10. SPARK-25299 Use remote storage for persisting shuffle data
  - 8 of 14 issues are resolved

*Resource Manager*
11. SPARK-33005 Kubernetes GA preparation
  - It is on the way and we are waiting for more feedback.

*SQL*
12. SPARK-30648/SPARK-32346 Support filters pushdown
  to JSON/Avro
13. SPARK-32948/SPARK-32958 Add Json expression optimizer
14. SPARK-12312 Support JDBC Kerberos w/ keytab
  - 11 of 17 issues are resolved
15. SPARK-27589 DSv2 was mostly completed in 3.0
  and added more features in 3.1, but we are still missing:
  - All built-in DataSource v2 write paths are disabled
and the v1 write path is used instead.
  - Support partition pruning with subqueries
  - Support bucketing

We still have one month before the feature freeze
and the start of QA. If you are working on 3.1 items,
please consider the timeline and share your schedule
with the Apache Spark community. Everything else
can go into the 3.2 release scheduled for June 2021.

Last but not least, I want to emphasize (7) once again.
We need to remove the forked, unofficial Hive eventually.
Please let us know your reasons if you still need to build
Apache Spark 3.1 from source for Hive 1.2.

https://github.com/apache/spark/pull/29936

As I wrote in the above PR description, for old releases,
Apache Spark 2.4 (LTS) and 3.0 (~2021.12) will provide
a Hive 1.2-based distribution.

Bests,
Dongjoon.


Re: [ANNOUNCE] Announcing Apache Spark 3.0.1

2020-09-11 Thread Dongjoon Hyun
It's great. Thank you, Ruifeng!

Bests,
Dongjoon.

On Fri, Sep 11, 2020 at 1:54 AM 郑瑞峰  wrote:

> Hi all,
>
> We are happy to announce the availability of Spark 3.0.1!
> Spark 3.0.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.0 maintenance branch of Spark. We strongly
> recommend all 3.0 users to upgrade to this stable release.
>
> To download Spark 3.0.1, head over to the download page:
> http://spark.apache.org/downloads.html
>
> Note that you might need to clear your browser cache or to use
> `Private`/`Incognito` mode according to your browsers.
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-0-1.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>
>
> Thanks,
> Ruifeng Zheng
>
>


Re: [ANNOUNCE] Apache Spark 2.4.6 released

2020-06-10 Thread Dongjoon Hyun
Thank you so much, Holden! :)

On Wed, Jun 10, 2020 at 6:59 PM Hyukjin Kwon  wrote:

> Yay!
>
> On Thu, Jun 11, 2020 at 10:38 AM Holden Karau wrote:
>
>> We are happy to announce the availability of Spark 2.4.6!
>>
>> Spark 2.4.6 is a maintenance release containing stability, correctness,
>> and security fixes.
>> This release is based on the branch-2.4 maintenance branch of Spark. We
>> strongly recommend all 2.4 users to upgrade to this stable release.
>>
>> To download Spark 2.4.6, head over to the download page:
>> http://spark.apache.org/downloads.html
>> Spark 2.4.6 is also available in Maven Central, PyPI, and CRAN.
>>
>> Note that you might need to clear your browser cache or
>> to use `Private`/`Incognito` mode according to your browsers.
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2.4.6.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>


Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Ur, are you comparing the number of SELECT statements with TRIM to the number
of CREATE statements with `CHAR`?

> I looked up our usage logs (sorry I can't share this publicly) and trim
has at least four orders of magnitude higher usage than char.

We need to discuss further what to do. This thread is exactly what I
expected. :)

> BTW I'm not opposing us sticking to SQL standard (I'm in general for it).
I was merely pointing out that if we deviate away from SQL standard in any
way we are considered "wrong" or "incorrect". That argument itself is
flawed when plenty of other popular database systems also deviate away from
the standard on this specific behavior.

Bests,
Dongjoon.

On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin  wrote:

> BTW I'm not opposing us sticking to SQL standard (I'm in general for it).
> I was merely pointing out that if we deviate away from SQL standard in any
> way we are considered "wrong" or "incorrect". That argument itself is
> flawed when plenty of other popular database systems also deviate away from
> the standard on this specific behavior.
>
>
>
>
> On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin  wrote:
>
>> I looked up our usage logs (sorry I can't share this publicly) and trim
>> has at least four orders of magnitude higher usage than char.
>>
>>
>> On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun 
>> wrote:
>>
>>> Thank you, Stephen and Reynold.
>>>
>>> To Reynold.
>>>
>>> The way I see the following is a little different.
>>>
>>>   > CHAR is an undocumented data type without clearly defined
>>> semantics.
>>>
>>> Let me describe in Apache Spark User's View point.
>>>
>>> Apache Spark started to claim `HiveContext` (and `hql/hiveql` function)
>>> at Apache Spark 1.x without much documentation. In addition, there still
>>> exists an effort which is trying to keep it in 3.0.0 age.
>>>
>>>https://issues.apache.org/jira/browse/SPARK-31088
>>>Add back HiveContext and createExternalTable
>>>
>>> Historically, we tried to make many SQL-based customer migrate their
>>> workloads from Apache Hive into Apache Spark through `HiveContext`.
>>>
>>> Although Apache Spark didn't have a good document about the inconsistent
>>> behavior among its data sources, Apache Hive has been providing its
>>> documentation and many customers rely the behavior.
>>>
>>>   -
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
>>>
>>> At that time, frequently in on-prem Hadoop clusters by well-known
>>> vendors, many existing huge tables were created by Apache Hive, not Apache
>>> Spark. And, Apache Spark is used for boosting SQL performance with its
>>> *caching*. This was true because Apache Spark was added into the
>>> Hadoop-vendor products later than Apache Hive.
>>>
>>> Until the turning point at Apache Spark 2.0, we tried to catch up more
>>> features to be consistent at least with Hive tables in Apache Hive and
>>> Apache Spark because two SQL engines share the same tables.
>>>
>>> For the following, technically, while Apache Hive doesn't changed its
>>> existing behavior in this part, Apache Spark evolves inevitably by moving
>>> away from the original Apache Spark old behaviors one-by-one.
>>>
>>>   >  the value is already fucked up
>>>
>>> The following is the change log.
>>>
>>>   - When we switched the default value of `convertMetastoreParquet`.
>>> (at Apache Spark 1.2)
>>>   - When we switched the default value of `convertMetastoreOrc` (at
>>> Apache Spark 2.4)
>>>   - When we switched `CREATE TABLE` itself. (Change `TEXT` table to
>>> `PARQUET` table at Apache Spark 3.0)
>>>
>>> To sum up, this has been a well-known issue in the community and among
>>> the customers.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy 
>>> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I’m kind of new around here, but I have had experience with all of all
>>>> the so called “big iron” databases such as Oracle, IBM DB2 and Microsoft
>>>> SQL Server as well as Postgresql.
>>>>
>>>> They all support the notion of “ANSI padding” for CHAR columns - which
>>>> means that such columns are always space padded, and they default to having
>>>> this enabled (for

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Thank you, Stephen and Reynold.

To Reynold.

The way I see the following is a little different.

  > CHAR is an undocumented data type without clearly defined semantics.

Let me describe in Apache Spark User's View point.

Apache Spark started to offer `HiveContext` (and the `hql/hiveql` functions) in
Apache Spark 1.x without much documentation. In addition, there is still
an effort trying to keep it alive in the 3.0.0 era.

   https://issues.apache.org/jira/browse/SPARK-31088
   Add back HiveContext and createExternalTable

Historically, we tried to help many SQL-based customers migrate their
workloads from Apache Hive to Apache Spark through `HiveContext`.

Although Apache Spark didn't have good documentation about the inconsistent
behavior among its data sources, Apache Hive has been providing its
documentation, and many customers rely on that behavior.

  -
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

At that time, in on-prem Hadoop clusters from well-known vendors, it was
common that the existing huge tables had been created by Apache Hive, not
Apache Spark, and Apache Spark was used for boosting SQL performance with its
*caching*. This was because Apache Spark was added into the Hadoop vendors'
products later than Apache Hive.

Until the turning point at Apache Spark 2.0, we tried to catch up on more
features to stay consistent, at least for Hive tables, between Apache Hive and
Apache Spark, because the two SQL engines share the same tables.

Regarding the following, technically, while Apache Hive hasn't changed its
existing behavior in this area, Apache Spark has inevitably evolved by moving
away from its original behaviors one by one.

  >  the value is already fucked up

The following is the change log.

  - When we switched the default value of `convertMetastoreParquet`.
(at Apache Spark 1.2)
  - When we switched the default value of `convertMetastoreOrc` (at
Apache Spark 2.4)
  - When we switched `CREATE TABLE` itself. (Change `TEXT` table to
`PARQUET` table at Apache Spark 3.0)
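
For reference, each of these defaults can be checked from spark-sql with a
plain SET query. The session below is only an illustrative sketch; the values
shown depend on the Spark version in use.

spark-sql> SET spark.sql.hive.convertMetastoreParquet;
spark.sql.hive.convertMetastoreParquet  true
spark-sql> SET spark.sql.hive.convertMetastoreOrc;
spark.sql.hive.convertMetastoreOrc  true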

To sum up, this has been a well-known issue in the community and among the
customers.

Bests,
Dongjoon.

On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy  wrote:

> Hi there,
>
> I’m kind of new around here, but I have had experience with all of all the
> so called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL
> Server as well as Postgresql.
>
> They all support the notion of “ANSI padding” for CHAR columns - which
> means that such columns are always space padded, and they default to having
> this enabled (for ANSI compliance).
>
> MySQL also supports it, but it defaults to leaving it disabled for
> historical reasons not unlike what we have here.
>
> In my opinion we should push toward standards compliance where possible
> and then document where it cannot work.
>
> If users don’t like the padding on CHAR columns then they should change to
> VARCHAR - I believe that was its purpose in the first place, and it does
> not dictate any sort of “padding".
>
> I can see why you might “ban” the use of CHAR columns where they cannot be
> consistently supported, but VARCHAR is a different animal and I would
> expect it to work consistently everywhere.
>
>
> Cheers,
>
> Steve C
>
> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun 
> wrote:
>
> Hi, Reynold.
> (And +Michael Armbrust)
>
> If you think so, do you think it's okay that we change the return value
> silently? Then, I'm wondering why we reverted `TRIM` functions then?
>
> > Are we sure "not padding" is "incorrect"?
>
> Bests,
> Dongjoon.
>
>
> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> 100% agree with Reynold.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin  wrote:
>>
>>> Are we sure "not padding" is "incorrect"?
>>>
>>> I don't know whether ANSI SQL actually requires padding, but plenty of
>>> databases don't actually pad.
>>>
>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html
>>>  :
>>> "Snowflake currently deviates from common CHAR semantics in that strings

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Hi, Reynold.
(And +Michael Armbrust)

If you think so, do you think it's okay that we change the return value
silently? Then I'm wondering why we reverted the `TRIM` functions.

> Are we sure "not padding" is "incorrect"?

Bests,
Dongjoon.


On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta 
wrote:

> Hi,
>
> 100% agree with Reynold.
>
>
> Regards,
> Gourav Sengupta
>
> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin  wrote:
>
>> Are we sure "not padding" is "incorrect"?
>>
>> I don't know whether ANSI SQL actually requires padding, but plenty of
>> databases don't actually pad.
>>
>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html
>> <https://docs.snowflake.net/manuals/sql-reference/data-types-text.html#:~:text=CHAR%20%2C%20CHARACTER,(1)%20is%20the%20default.=Snowflake%20currently%20deviates%20from%20common,space%2Dpadded%20at%20the%20end.>
>>  :
>> "Snowflake currently deviates from common CHAR semantics in that strings
>> shorter than the maximum length are not space-padded at the end."
>>
>> MySQL:
>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>
>>
>>
>>
>>
>>
>>
>>
>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun 
>> wrote:
>>
>>> Hi, Reynold.
>>>
>>> Please see the following for the context.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-31136
>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax"
>>>
>>> I raised the above issue according to the new rubric, and the banning
>>> was the proposed alternative to reduce the potential issue.
>>>
>>> Please give us your opinion since it's still PR.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin  wrote:
>>>
>>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell
>>>> out of both new and old users?
>>>>
>>>> For old users, their old code that was working for char(3) would now
>>>> stop working.
>>>>
>>>> For new users, depending on whether the underlying metastore char(3) is
>>>> either supported but different from ansi Sql (which is not that big of a
>>>> deal if we explain it) or not supported.
>>>>
>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
> >>>>> Apache Spark has suffered from a known consistency issue on
>>>>> `CHAR` type behavior among its usages and configurations. However, the
>>>>> evolution direction has been gradually moving forward to be consistent
> >>>>> inside Apache Spark because we don't have `CHAR` officially. The following
>>>>> is the summary.
>>>>>
>>>>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different
>>>>> result.
>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>>>> Hive behavior.)
>>>>>
>>>>> spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>>
>>>>> spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>> spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>> spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>>
>>>>> spark-sql> SELECT a, length(a) FROM t1;
>>>>> a   3
>>>>> spark-sql> SELECT a, length(a) FROM t2;
>>>>> a   3
>>>>> spark-sql> SELECT a, length(a) FROM t3;
>>>>> a 2
>>>>>
>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to
>>>>> Hive behavior.)
>>>>>
>>>>> spark-sql> SELECT a, length(a) FROM t1;
>>>>> a   3
>>>>> spark-sql> SELECT a, length(a) FROM t2;
>>>>> a 2
>>>>> spark-sql> SELECT a, length(a) FROM t3;
>>>>> a 2
>>>>>
>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause)
>>>>> became consistent.
>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>>>> fallback to Hive behavior.)
>>>>>
>>>>> spark-sql> SELECT a, length(a) FROM t1;
>>>>> a 2
>>>>> spark-sql> SELECT a, length(a) FROM t2;
>>>>> a 2
>>>>> spark-sql> SELECT a, length(a) FROM t3;
>>>>> a 2
>>>>>
>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in
>>>>> the following syntax to be safe.
>>>>>
>>>>> CREATE TABLE t(a CHAR(3));
>>>>> https://github.com/apache/spark/pull/27902
>>>>>
>>>>> This email is sent out to inform you based on the new policy we voted.
> >>>>> The recommendation is to always use Apache Spark's native type `String`.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> References:
>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>
>>>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE
>>>>> TABLE syntax", 2019/12/06
>>>>>
>>>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>>
>>>>
>>


Re: FYI: The evolution on `CHAR` type behavior

2020-03-15 Thread Dongjoon Hyun
Hi, Reynold.

Please see the following for the context.

https://issues.apache.org/jira/browse/SPARK-31136
"Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
syntax"

I raised the above issue according to the new rubric, and the ban was
proposed as an alternative to reduce the potential issue.

Please give us your opinion since it's still a PR.

Bests,
Dongjoon.

On Sat, Mar 14, 2020 at 17:54 Reynold Xin  wrote:

> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
> of both new and old users?
>
> For old users, their old code that was working for char(3) would now stop
> working.
>
> For new users, depending on whether the underlying metastore char(3) is
> either supported but different from ansi Sql (which is not that big of a
> deal if we explain it) or not supported.
>
> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Apache Spark has suffered from a known consistency issue on `CHAR`
>> type behavior among its usages and configurations. However, the evolution
>> direction has been gradually moving forward to be consistent inside Apache
>> Spark because we don't have `CHAR` officially. The following is the summary.
>>
>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different result.
>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>> Hive behavior.)
>>
>> spark-sql> CREATE TABLE t1(a CHAR(3));
>> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>
>> spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>> spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>> spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>
>> spark-sql> SELECT a, length(a) FROM t1;
>> a   3
>> spark-sql> SELECT a, length(a) FROM t2;
>> a   3
>> spark-sql> SELECT a, length(a) FROM t3;
>> a 2
>>
>> Since 2.4.0, `STORED AS ORC` became consistent.
>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>> behavior.)
>>
>> spark-sql> SELECT a, length(a) FROM t1;
>> a   3
>> spark-sql> SELECT a, length(a) FROM t2;
>> a 2
>> spark-sql> SELECT a, length(a) FROM t3;
>> a 2
>>
>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
>> consistent.
>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>> fallback to Hive behavior.)
>>
>> spark-sql> SELECT a, length(a) FROM t1;
>> a 2
>> spark-sql> SELECT a, length(a) FROM t2;
>> a 2
>> spark-sql> SELECT a, length(a) FROM t3;
>> a 2
>>
>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
>> following syntax to be safe.
>>
>> CREATE TABLE t(a CHAR(3));
>> https://github.com/apache/spark/pull/27902
>>
>> This email is sent out to inform you based on the new policy we voted.
>> The recommendation is to always use Apache Spark's native type `String`.
>>
>> Bests,
>> Dongjoon.
>>
>> References:
>> 1. "CHAR implementation?", 2017/09/15
>>
>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>> syntax", 2019/12/06
>>
>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>
>


FYI: The evolution on `CHAR` type behavior

2020-03-14 Thread Dongjoon Hyun
Hi, All.

Apache Spark has suffered from a known consistency issue on `CHAR`
type behavior among its usages and configurations. However, the evolution
direction has been gradually moving toward being consistent inside Apache
Spark because we don't have `CHAR` officially. The following is the summary.

With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different result.
(`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to Hive
behavior.)

spark-sql> CREATE TABLE t1(a CHAR(3));
spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;

spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
spark-sql> INSERT INTO TABLE t3 SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t1;
a   3
spark-sql> SELECT a, length(a) FROM t2;
a   3
spark-sql> SELECT a, length(a) FROM t3;
a 2

Since 2.4.0, `STORED AS ORC` became consistent.
(`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
behavior.)

spark-sql> SELECT a, length(a) FROM t1;
a   3
spark-sql> SELECT a, length(a) FROM t2;
a 2
spark-sql> SELECT a, length(a) FROM t3;
a 2

Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
consistent.
(`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
fallback to Hive behavior.)

spark-sql> SELECT a, length(a) FROM t1;
a 2
spark-sql> SELECT a, length(a) FROM t2;
a 2
spark-sql> SELECT a, length(a) FROM t3;
a 2
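
For anyone who needs the old Hive-padded behavior back, the fallback
configurations mentioned above can be set explicitly. This is only a
hypothetical sketch; the configuration names are the ones quoted in the
corresponding paragraphs, and the last one only affects how newly created
tables are handled.

spark-sql> SET spark.sql.hive.convertMetastoreParquet=false;
spark-sql> SET spark.sql.hive.convertMetastoreOrc=false;
spark-sql> SET spark.sql.legacy.createHiveTableByDefault.enabled=true;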

In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
following syntax to be safe.

CREATE TABLE t(a CHAR(3));
https://github.com/apache/spark/pull/27902

This email is sent out to inform you based on the new policy we voted on.
The recommendation is to always use Apache Spark's native type `String`.
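
For example, here is a minimal sketch of the recommended approach (the table
name `t4` is made up for illustration, and the output line is indicative only):

spark-sql> CREATE TABLE t4(a STRING);
spark-sql> INSERT INTO TABLE t4 SELECT 'a ';
spark-sql> SELECT a, length(a) FROM t4;
a 2
-- STRING stores the value as-is (no padding, no truncation), so the result is
-- the same regardless of the underlying data source.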

Bests,
Dongjoon.

References:
1. "CHAR implementation?", 2017/09/15

https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
syntax", 2019/12/06

https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E


Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-08 Thread Dongjoon Hyun
There was a typo in one URL. The correct release note URL is here.

https://spark.apache.org/releases/spark-release-2-4-5.html



On Sat, Feb 8, 2020 at 5:22 PM Dongjoon Hyun 
wrote:

> We are happy to announce the availability of Spark 2.4.5!
>
> Spark 2.4.5 is a maintenance release containing stability fixes. This
> release is based on the branch-2.4 maintenance branch of Spark. We strongly
> recommend all 2.4 users to upgrade to this stable release.
>
> To download Spark 2.4.5, head over to the download page:
> http://spark.apache.org/downloads.html
>
> Note that you might need to clear your browser cache or
> to use `Private`/`Incognito` mode according to your browsers.
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2.4.5.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Dongjoon Hyun
>


[ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-08 Thread Dongjoon Hyun
We are happy to announce the availability of Spark 2.4.5!

Spark 2.4.5 is a maintenance release containing stability fixes. This
release is based on the branch-2.4 maintenance branch of Spark. We strongly
recommend all 2.4 users to upgrade to this stable release.

To download Spark 2.4.5, head over to the download page:
http://spark.apache.org/downloads.html

Note that you might need to clear your browser cache or
to use `Private`/`Incognito` mode according to your browsers.

To view the release notes:
https://spark.apache.org/releases/spark-release-2.4.5.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


Re: [ANNOUNCE] Announcing Apache Spark 3.0.0-preview2

2019-12-24 Thread Dongjoon Hyun
Indeed! Thank you again, Yuming and all.

Bests,
Dongjoon.


On Tue, Dec 24, 2019 at 13:38 Takeshi Yamamuro 
wrote:

> Great work, Yuming!
>
> Bests,
> Takeshi
>
> On Wed, Dec 25, 2019 at 6:00 AM Xiao Li  wrote:
>
>> Thank you all. Happy Holidays!
>>
>> Xiao
>>
>> On Tue, Dec 24, 2019 at 12:53 PM Yuming Wang  wrote:
>>
>>> Hi all,
>>>
>>> To enable wide-scale community testing of the upcoming Spark 3.0
>>> release, the Apache Spark community has posted a new preview release of
>>> Spark 3.0. This preview is *not a stable release in terms of either API
>>> or functionality*, but it is meant to give the community early access
>>> to try the code that will become Spark 3.0. If you would like to test the
>>> release, please download it, and send feedback using either the mailing
>>> lists  or JIRA
>>> 
>>> .
>>>
>>> There are a lot of exciting new features added to Spark 3.0, including
>>> Dynamic Partition Pruning, Adaptive Query Execution, Accelerator-aware
>>> Scheduling, Data Source API with Catalog Supports, Vectorization in SparkR,
>>> support of Hadoop 3/JDK 11/Scala 2.12, and many more. For a full list of
>>> major features and changes in Spark 3.0.0-preview2, please check the thread(
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-3-0-preview-release-feature-list-and-major-changes-td28050.html
>>>  and
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-3-0-preview-release-2-td28491.html
>>> ).
>>>
>>> We'd like to thank our contributors and users for their contributions
>>> and early feedback to this release. This release would not have been
>>> possible without you.
>>>
>>> To download Spark 3.0.0-preview2, head over to the download page:
>>> https://archive.apache.org/dist/spark/spark-3.0.0-preview2
>>>
>>> Happy Holidays.
>>>
>>> Yuming
>>>
>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> 
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: [VOTE] Shall we release ORC 1.4.5rc1?

2019-12-06 Thread Dongjoon Hyun
+1 for Apache ORC 1.4.5 release.

Thank you for making the release.

I'd like to mention some notable changes here.
Apache ORC 1.4.5 is not a drop-in replacement for 1.4.4 because of the
following.

  ORC-498: ReaderImpl and RecordReaderImpl open separate file handles.

Applications should be updated accordingly. Otherwise, file handle leaks occur.
For example, Apache Spark 2.3.5-SNAPSHOT is currently using v1.4.4 and will
not work with v1.4.5.

In short, there is a breaking change between v1.4.4 and v1.4.5, similar to the
breaking change between v1.5.5 and v1.5.6.
For the required change, please refer to Owen's Apache Spark upgrade patch.

  [SPARK-28208][BUILD][SQL] Upgrade to ORC 1.5.6 including closing the
ORC readers

https://github.com/apache/spark/commit/dfb0a8bb048d43f8fd1fb05b1027bd2fc7438dbc

Bests,
Dongjoon.


On Fri, Dec 6, 2019 at 4:19 PM Alan Gates  wrote:

> +1.  Did a build on ubuntu 16, checked the signatures and hashes.  Reviewed
> the license changes.
>
> Alan.
>
> On Fri, Dec 6, 2019 at 1:41 PM Owen O'Malley 
> wrote:
>
> > All,
> >Ok, I backported a few more fixes in to rc1:
> >
> >- ORC-480
> >- ORC-552
> >- ORC-576
> >
> >
> > Should we release the following artifacts as ORC 1.4.5?
> >
> > tar: http://home.apache.org/~omalley/orc-1.4.5/
> > tag: https://github.com/apache/orc/releases/tag/release-1.4.5rc1
> > jiras: https://issues.apache.org/jira/browse/ORC/fixforversion/12345479
> >
> > Thanks!
> >
>


[ANNOUNCE] Announcing Apache Spark 2.4.4

2019-09-01 Thread Dongjoon Hyun
We are happy to announce the availability of Spark 2.4.4!

Spark 2.4.4 is a maintenance release containing stability fixes. This
release is based on the branch-2.4 maintenance branch of Spark. We strongly
recommend all 2.4 users to upgrade to this stable release.

To download Spark 2.4.4, head over to the download page:
http://spark.apache.org/downloads.html

Note that you might need to clear your browser cache or
to use `Private`/`Incognito` mode according to your browsers.

To view the release notes:
https://spark.apache.org/releases/spark-release-2-4-4.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


[VOTE][RESULT] Spark 2.4.4 (RC3)

2019-08-30 Thread Dongjoon Hyun
Hi, All.

The vote passes. Thanks to all who helped with this 2.4.4 release!
It was a very intensive vote, with +11 (including +8 PMC votes) and no -1.
I'll follow up later with a release announcement once everything is
published.

+1 (* = binding):

Dongjoon Hyun
Kazuaki Ishizaki
Sean Owen*
Wenchen Fan*
DB Tsai*
Holden Karau*
Marcelo Vanzin*
Takeshi Yamamuro
Hyukjin Kwon*
Felix Cheung*
Xiao Li*

+0: None

-1: None

Bests,
Dongjoon.


JDK11 Support in Apache Spark

2019-08-24 Thread Dongjoon Hyun
Hi, All.

Thanks to your many, many contributions,
the Apache Spark master branch started to pass on JDK11 as of today
(with the `hadoop-3.2` profile: Apache Hadoop 3.2 and Hive 2.3.6).


https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/
(JDK11 is used for building and testing.)

We already verified all UTs (including PySpark/SparkR) before.

Please feel free to use JDK11 to build/test/run the `master` branch and
share your experience, including any issues. It will help the Apache Spark 3.0.0
release.

For the follow-ups, please follow
https://issues.apache.org/jira/browse/SPARK-24417 .
The next step is `how to support JDK8/JDK11 together in a single artifact`.

Bests,
Dongjoon.


Re: Release Apache Spark 2.4.4

2019-08-14 Thread Dongjoon Hyun
Thank you, DB, Takeshi, Hyukjin, Sean, Kazuaki, Holden, Wenchen!
I'll create the tag for 2.4.4-rc1 next Monday.

For SPARK-27234, it looks like that to me, too.

Thanks,
Dongjoon.


On Wed, Aug 14, 2019 at 9:13 AM Holden Karau  wrote:

> That looks like more of a feature than a bug fix unless I’m missing
> something?
>
> On Tue, Aug 13, 2019 at 11:58 PM Hyukjin Kwon  wrote:
>
>> Adding Shixiong
>>
>> WDYT?
>>
>> On Wed, Aug 14, 2019 at 2:30 PM Terry Kim wrote:
>>
>>> Can the following be included?
>>>
>>> [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch
>>> in EpochTracker (to support Python UDFs)
>>> <https://github.com/apache/spark/pull/24946>
>>>
>>> Thanks,
>>> Terry
>>>
>>> On Tue, Aug 13, 2019 at 10:24 PM Wenchen Fan 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Wed, Aug 14, 2019 at 12:52 PM Holden Karau 
>>>> wrote:
>>>>
>>>>> +1
>>>>> Does anyone have any critical fixes they’d like to see in 2.4.4?
>>>>>
>>>>> On Tue, Aug 13, 2019 at 5:22 PM Sean Owen  wrote:
>>>>>
>>>>>> Seems fine to me if there are enough valuable fixes to justify another
>>>>>> release. If there are any other important fixes imminent, it's fine to
>>>>>> wait for those.
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 13, 2019 at 6:16 PM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi, All.
>>>>>> >
>>>>>> > Spark 2.4.3 was released three months ago (8th May).
>>>>>> > As of today (13th August), there are 112 commits (75 JIRAs) in
>>>>>> `branch-2.4` since 2.4.3.
>>>>>> >
>>>>>> > It would be great if we can have Spark 2.4.4.
>>>>>> > Shall we start `2.4.4 RC1` next Monday (19th August)?
>>>>>> >
>>>>>> > Last time, there was a request for K8s issue and now I'm waiting
>>>>>> for SPARK-27900.
>>>>>> > Please let me know if there is another issue.
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Dongjoon.
>>>>>>
>>>>>> -
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>
>>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Release Apache Spark 2.4.4

2019-08-13 Thread Dongjoon Hyun
Hi, All.

Spark 2.4.3 was released three months ago (8th May).
As of today (13th August), there are 112 commits (75 JIRAs) in `branch-2.4`
since 2.4.3.

It would be great if we can have Spark 2.4.4.
Shall we start `2.4.4 RC1` next Monday (19th August)?

Last time, there was a request regarding a K8s issue, and now I'm waiting for
SPARK-27900.
Please let me know if there is another issue.

Thanks,
Dongjoon.


Re: Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-16 Thread Dongjoon Hyun
Thank you for volunteering as the 2.3.4 release manager, Kazuaki!
It's great to see a new release manager in advance. :D

Thank you for the reply, Stavros.
In addition to that issue, I'm also monitoring some other K8s issues and
PRs.
But I'm not sure we can have those, because some PRs seem to fail at
building consensus (even for 3.0.0).
In any case, could you ping the reviewers once more on the PRs you
have concerns about?
If they are merged into `branch-2.4`, they will be in Apache Spark 2.4.4, of course.

Bests,
Dongjoon.


On Tue, Jul 16, 2019 at 4:00 AM Kazuaki Ishizaki 
wrote:

> Thank you Dongjoon for being a release manager.
>
> If the assumed dates are ok, I would like to volunteer as the 2.3.4
> release manager.
>
> Best Regards,
> Kazuaki Ishizaki,
>
>
>
> From:Dongjoon Hyun 
> To:dev , "user @spark" <
> user@spark.apache.org>, Apache Spark PMC 
> Date:2019/07/13 07:18
> Subject:[EXTERNAL] Re: Release Apache Spark 2.4.4 before 3.0.0
> --
>
>
>
> Thank you, Jacek.
>
> BTW, I added `@private` since we need PMC's help to make an Apache Spark
> release.
>
> Can I get more feedbacks from the other PMC members?
>
> Please let me know if you have any concerns (e.g. Release date or Release
> manager?)
>
> As one of the community members, I assumed the followings (if we are on
> schedule).
>
> - 2.4.4 at the end of July
> - 2.3.4 at the end of August (since 2.3.0 was released at the end of
> February 2018)
> - 3.0.0 (possibly September?)
> - 3.1.0 (January 2020?)
>
> Bests,
> Dongjoon.
>
>
> On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski <*ja...@japila.pl*
> > wrote:
> Hi,
>
> Thanks Dongjoon Hyun for stepping up as a release manager!
> Much appreciated.
>
> If there's a volunteer to cut a release, I'm always to support it.
>
> In addition, the more frequent releases the better for end users so they
> have a choice to upgrade and have all the latest fixes or wait. It's their
> call not ours (when we'd keep them waiting).
>
> My big 2 yes'es for the release!
>
> Jacek
>
>
> On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun, <*dongjoon.h...@gmail.com*
> > wrote:
> Hi, All.
>
> Spark 2.4.3 was released two months ago (8th May).
>
> As of today (9th July), there exist 45 fixes in `branch-2.4` including the
> following correctness or blocker issues.
>
> - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
> decimals not fitting in long
> - SPARK-26045 Error in the spark 2.4 release package with the
> spark-avro_2.11 dependency
> - SPARK-27798 from_avro can modify variables in other rows in local
> mode
> - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
> - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries
> - SPARK-28308 CalendarInterval sub-second part should be padded before
> parsing
>
> It would be great if we can have Spark 2.4.4 before we are going to get
> busier for 3.0.0.
> If it's okay, I'd like to volunteer for an 2.4.4 release manager to roll
> it next Monday. (15th July).
> How do you think about this?
>
> Bests,
> Dongjoon.
>
>


Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-15 Thread Dongjoon Hyun
Hi, Apache Spark PMC members.

Can we cut Apache Spark 2.4.4 next Monday (22nd July)?

Bests,
Dongjoon.


On Fri, Jul 12, 2019 at 3:18 PM Dongjoon Hyun 
wrote:

> Thank you, Jacek.
>
> BTW, I added `@private` since we need PMC's help to make an Apache Spark
> release.
>
> Can I get more feedbacks from the other PMC members?
>
> Please let me know if you have any concerns (e.g. Release date or Release
> manager?)
>
> As one of the community members, I assumed the followings (if we are on
> schedule).
>
> - 2.4.4 at the end of July
> - 2.3.4 at the end of August (since 2.3.0 was released at the end of
> February 2018)
> - 3.0.0 (possibly September?)
> - 3.1.0 (January 2020?)
>
> Bests,
> Dongjoon.
>
>
> On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski  wrote:
>
>> Hi,
>>
>> Thanks Dongjoon Hyun for stepping up as a release manager!
>> Much appreciated.
>>
>> If there's a volunteer to cut a release, I'm always to support it.
>>
>> In addition, the more frequent releases the better for end users so they
>> have a choice to upgrade and have all the latest fixes or wait. It's their
>> call not ours (when we'd keep them waiting).
>>
>> My big 2 yes'es for the release!
>>
>> Jacek
>>
>>
>> On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun,  wrote:
>>
>>> Hi, All.
>>>
>>> Spark 2.4.3 was released two months ago (8th May).
>>>
>>> As of today (9th July), there exist 45 fixes in `branch-2.4` including
>>> the following correctness or blocker issues.
>>>
>>> - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
>>> decimals not fitting in long
>>> - SPARK-26045 Error in the spark 2.4 release package with the
>>> spark-avro_2.11 dependency
>>> - SPARK-27798 from_avro can modify variables in other rows in local
>>> mode
>>> - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
>>> - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist
>>> entries
>>> - SPARK-28308 CalendarInterval sub-second part should be padded
>>> before parsing
>>>
>>> It would be great if we can have Spark 2.4.4 before we are going to get
>>> busier for 3.0.0.
>>> If it's okay, I'd like to volunteer for an 2.4.4 release manager to roll
>>> it next Monday. (15th July).
>>> How do you think about this?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>


Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-12 Thread Dongjoon Hyun
Thank you, Jacek.

BTW, I added `@private` since we need PMC's help to make an Apache Spark
release.

Can I get more feedback from the other PMC members?

Please let me know if you have any concerns (e.g. Release date or Release
manager?)

As one of the community members, I assumed the following (if we are on
schedule).

- 2.4.4 at the end of July
- 2.3.4 at the end of August (since 2.3.0 was released at the end of
February 2018)
- 3.0.0 (possibly September?)
- 3.1.0 (January 2020?)

Bests,
Dongjoon.


On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski  wrote:

> Hi,
>
> Thanks Dongjoon Hyun for stepping up as a release manager!
> Much appreciated.
>
> If there's a volunteer to cut a release, I'm always to support it.
>
> In addition, the more frequent releases the better for end users so they
> have a choice to upgrade and have all the latest fixes or wait. It's their
> call not ours (when we'd keep them waiting).
>
> My big 2 yes'es for the release!
>
> Jacek
>
>
> On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun,  wrote:
>
>> Hi, All.
>>
>> Spark 2.4.3 was released two months ago (8th May).
>>
>> As of today (9th July), there exist 45 fixes in `branch-2.4` including
>> the following correctness or blocker issues.
>>
>> - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
>> decimals not fitting in long
>> - SPARK-26045 Error in the spark 2.4 release package with the
>> spark-avro_2.11 dependency
>> - SPARK-27798 from_avro can modify variables in other rows in local
>> mode
>> - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
>> - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries
>> - SPARK-28308 CalendarInterval sub-second part should be padded
>> before parsing
>>
>> It would be great if we can have Spark 2.4.4 before we are going to get
>> busier for 3.0.0.
>> If it's okay, I'd like to volunteer for an 2.4.4 release manager to roll
>> it next Monday. (15th July).
>> How do you think about this?
>>
>> Bests,
>> Dongjoon.
>>
>


Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-11 Thread Dongjoon Hyun
Additionally, one more correctness patch landed yesterday.

- SPARK-28015 Check stringToDate() consumes entire input for the yyyy
and yyyy-[m]m formats

Bests,
Dongjoon.


On Tue, Jul 9, 2019 at 10:11 AM Dongjoon Hyun 
wrote:

> Thank you for the reply, Sean. Sure. 2.4.x should be a LTS version.
>
> The main reason of 2.4.4 release (before 3.0.0) is to have a better basis
> for comparison to 3.0.0.
> For example, SPARK-27798 had an old bug, but its correctness issue is only
> exposed at Spark 2.4.3.
> It would be great if we can have a better basis.
>
> Bests,
> Dongjoon.
>
>
> On Tue, Jul 9, 2019 at 9:52 AM Sean Owen  wrote:
>
>> We will certainly want a 2.4.4 release eventually. In fact I'd expect
>> 2.4.x gets maintained for longer than the usual 18 months, as it's the
>> last 2.x branch.
>> It doesn't need to happen before 3.0, but could. Usually maintenance
>> releases happen 3-4 months apart and the last one was 2 months ago. If
>> these are significant issues, sure. It'll probably be August before
>> it's out anyway.
>>
>> On Tue, Jul 9, 2019 at 11:15 AM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > Spark 2.4.3 was released two months ago (8th May).
>> >
>> > As of today (9th July), there exist 45 fixes in `branch-2.4` including
>> the following correctness or blocker issues.
>> >
>> > - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
>> decimals not fitting in long
>> > - SPARK-26045 Error in the spark 2.4 release package with the
>> spark-avro_2.11 dependency
>> > - SPARK-27798 from_avro can modify variables in other rows in local
>> mode
>> > - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
>> > - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist
>> entries
>> > - SPARK-28308 CalendarInterval sub-second part should be padded
>> before parsing
>> >
>> > It would be great if we can have Spark 2.4.4 before we are going to get
>> busier for 3.0.0.
>> > If it's okay, I'd like to volunteer for an 2.4.4 release manager to
>> roll it next Monday. (15th July).
>> > How do you think about this?
>> >
>> > Bests,
>> > Dongjoon.
>>
>


Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-09 Thread Dongjoon Hyun
Thank you for the reply, Sean. Sure, 2.4.x should be an LTS version.

The main reason for a 2.4.4 release (before 3.0.0) is to have a better basis
for comparison with 3.0.0.
For example, SPARK-27798 is an old bug, but its correctness issue is only
exposed in Spark 2.4.3.
It would be great if we could have that better basis.

Bests,
Dongjoon.


On Tue, Jul 9, 2019 at 9:52 AM Sean Owen  wrote:

> We will certainly want a 2.4.4 release eventually. In fact I'd expect
> 2.4.x gets maintained for longer than the usual 18 months, as it's the
> last 2.x branch.
> It doesn't need to happen before 3.0, but could. Usually maintenance
> releases happen 3-4 months apart and the last one was 2 months ago. If
> these are significant issues, sure. It'll probably be August before
> it's out anyway.
>
> On Tue, Jul 9, 2019 at 11:15 AM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > Spark 2.4.3 was released two months ago (8th May).
> >
> > As of today (9th July), there exist 45 fixes in `branch-2.4` including
> the following correctness or blocker issues.
> >
> > - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
> decimals not fitting in long
> > - SPARK-26045 Error in the spark 2.4 release package with the
> spark-avro_2.11 dependency
> > - SPARK-27798 from_avro can modify variables in other rows in local
> mode
> > - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
> > - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist
> entries
> > - SPARK-28308 CalendarInterval sub-second part should be padded
> before parsing
> >
> > It would be great if we can have Spark 2.4.4 before we are going to get
> busier for 3.0.0.
> > If it's okay, I'd like to volunteer for an 2.4.4 release manager to roll
> it next Monday. (15th July).
> > How do you think about this?
> >
> > Bests,
> > Dongjoon.
>


Release Apache Spark 2.4.4 before 3.0.0

2019-07-09 Thread Dongjoon Hyun
Hi, All.

Spark 2.4.3 was released two months ago (8th May).

As of today (9th July), there exist 45 fixes in `branch-2.4` including the
following correctness or blocker issues.

- SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
decimals not fitting in long
- SPARK-26045 Error in the spark 2.4 release package with the
spark-avro_2.11 dependency
- SPARK-27798 from_avro can modify variables in other rows in local mode
- SPARK-27907 HiveUDAF should return NULL in case of 0 rows
- SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries
- SPARK-28308 CalendarInterval sub-second part should be padded before
parsing

It would be great if we could have Spark 2.4.4 before we get
busier with 3.0.0.
If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll it
next Monday (15th July).
What do you think about this?

Bests,
Dongjoon.


Re: Exposing JIRA issue types at GitHub PRs

2019-06-17 Thread Dongjoon Hyun
Thank you, Hyukjin!

On Sun, Jun 16, 2019 at 4:12 PM Hyukjin Kwon  wrote:

> Labels look good and useful.
>
> On Sat, 15 Jun 2019, 02:36 Dongjoon Hyun,  wrote:
>
>> Now, you can see the exposed component labels (ordered by the number of
>> PRs) here and click the component to search.
>>
>> https://github.com/apache/spark/labels?sort=count-desc
>>
>> Dongjoon.
>>
>>
>> On Fri, Jun 14, 2019 at 1:15 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> JIRA and PR is ready for reviews.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-28051 (Exposing JIRA issue
>>> component types at GitHub PRs)
>>> https://github.com/apache/spark/pull/24871
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Thu, Jun 13, 2019 at 10:48 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Thank you for the feedbacks and requirements, Hyukjin, Reynold, Marco.
>>>>
>>>> Sure, we can do whatever we want.
>>>>
>>>> I'll wait for more feedbacks and proceed to the next steps.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido 
>>>> wrote:
>>>>
>>>>> Hi Dongjoon,
>>>>> Thanks for the proposal! I like the idea. Maybe we can extend it to
>>>>> component too and to some jira labels such as correctness which may be
>>>>> worth to highlight in PRs too. My only concern is that in many cases JIRAs
>>>>> are created not very carefully so they may be incorrect at the moment of
>>>>> the pr creation and it may be updated later: so keeping them in sync may 
>>>>> be
>>>>> an extra effort..
>>>>>
>>>>> On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:
>>>>>
>>>>>> Seems like a good idea. Can we test this with a component first?
>>>>>>
>>>>>> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> Since we use both Apache JIRA and GitHub actively for Apache Spark
>>>>>>> contributions, we have lots of JIRAs and PRs consequently. One specific
>>>>>>> thing I've been longing to see is `Jira Issue Type` in GitHub.
>>>>>>>
>>>>>>> How about exposing JIRA issue types at GitHub PRs as GitHub
>>>>>>> `Labels`? There are two main benefits:
>>>>>>> 1. It helps the communication between the contributors and reviewers
>>>>>>> with more information.
>>>>>>> (In some cases, some people only visit GitHub to see the PR and
>>>>>>> commits)
>>>>>>> 2. `Labels` is searchable. We don't need to visit Apache Jira to
>>>>>>> search PRs to see a specific type.
>>>>>>> (For example, the reviewers can see and review 'BUG' PRs first
>>>>>>> by using `is:open is:pr label:BUG`.)
>>>>>>>
>>>>>>> Of course, this can be done automatically without human
>>>>>>> intervention. Since we already have GitHub Jenkins job to access
>>>>>>> JIRA/GitHub, that job can add the labels from the beginning. If needed, 
>>>>>>> I
>>>>>>> can volunteer to update the script.
>>>>>>>
>>>>>>> To show the demo, I labeled several PRs manually. You can see the
>>>>>>> result right now in Apache Spark PR page.
>>>>>>>
>>>>>>>   - https://github.com/apache/spark/pulls
>>>>>>>
>>>>>>> If you're surprised due to those manual activities, I want to
>>>>>>> apologize for that. I hope we can take advantage of the existing GitHub
>>>>>>> features to serve Apache Spark community in a way better than yesterday.
>>>>>>>
>>>>>>> How do you think about this specific suggestion?
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon
>>>>>>>
>>>>>>> PS. I saw that `Request Review` and `Assign` features are already
>>>>>>> used for some purposes, but these feature are out of the scope in this
>>>>>>> email.
>>>>>>>
>>>>>>


Re: Exposing JIRA issue types at GitHub PRs

2019-06-14 Thread Dongjoon Hyun
Now you can see the exposed component labels (ordered by the number of
PRs) here, and you can click a component to search.

https://github.com/apache/spark/labels?sort=count-desc

Dongjoon.


On Fri, Jun 14, 2019 at 1:15 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> JIRA and PR is ready for reviews.
>
> https://issues.apache.org/jira/browse/SPARK-28051 (Exposing JIRA issue
> component types at GitHub PRs)
> https://github.com/apache/spark/pull/24871
>
> Bests,
> Dongjoon.
>
>
> On Thu, Jun 13, 2019 at 10:48 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for the feedbacks and requirements, Hyukjin, Reynold, Marco.
>>
>> Sure, we can do whatever we want.
>>
>> I'll wait for more feedbacks and proceed to the next steps.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido 
>> wrote:
>>
>>> Hi Dongjoon,
>>> Thanks for the proposal! I like the idea. Maybe we can extend it to
>>> component too and to some jira labels such as correctness which may be
>>> worth to highlight in PRs too. My only concern is that in many cases JIRAs
>>> are created not very carefully so they may be incorrect at the moment of
>>> the pr creation and it may be updated later: so keeping them in sync may be
>>> an extra effort..
>>>
>>> On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:
>>>
>>>> Seems like a good idea. Can we test this with a component first?
>>>>
>>>> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> Since we use both Apache JIRA and GitHub actively for Apache Spark
>>>>> contributions, we have lots of JIRAs and PRs consequently. One specific
>>>>> thing I've been longing to see is `Jira Issue Type` in GitHub.
>>>>>
>>>>> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
>>>>> There are two main benefits:
>>>>> 1. It helps the communication between the contributors and reviewers
>>>>> with more information.
>>>>> (In some cases, some people only visit GitHub to see the PR and
>>>>> commits)
>>>>> 2. `Labels` is searchable. We don't need to visit Apache Jira to
>>>>> search PRs to see a specific type.
>>>>> (For example, the reviewers can see and review 'BUG' PRs first by
>>>>> using `is:open is:pr label:BUG`.)
>>>>>
>>>>> Of course, this can be done automatically without human intervention.
>>>>> Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
>>>>> can add the labels from the beginning. If needed, I can volunteer to 
>>>>> update
>>>>> the script.
>>>>>
>>>>> To show the demo, I labeled several PRs manually. You can see the
>>>>> result right now in Apache Spark PR page.
>>>>>
>>>>>   - https://github.com/apache/spark/pulls
>>>>>
>>>>> If you're surprised due to those manual activities, I want to
>>>>> apologize for that. I hope we can take advantage of the existing GitHub
>>>>> features to serve Apache Spark community in a way better than yesterday.
>>>>>
>>>>> How do you think about this specific suggestion?
>>>>>
>>>>> Bests,
>>>>> Dongjoon
>>>>>
>>>>> PS. I saw that `Request Review` and `Assign` features are already used
>>>>> for some purposes, but these feature are out of the scope in this email.
>>>>>
>>>>


Re: Exposing JIRA issue types at GitHub PRs

2019-06-14 Thread Dongjoon Hyun
Hi, All.

The JIRA and PR are ready for review.

https://issues.apache.org/jira/browse/SPARK-28051 (Exposing JIRA issue
component types at GitHub PRs)
https://github.com/apache/spark/pull/24871

Bests,
Dongjoon.


On Thu, Jun 13, 2019 at 10:48 AM Dongjoon Hyun 
wrote:

> Thank you for the feedbacks and requirements, Hyukjin, Reynold, Marco.
>
> Sure, we can do whatever we want.
>
> I'll wait for more feedbacks and proceed to the next steps.
>
> Bests,
> Dongjoon.
>
>
> On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido 
> wrote:
>
>> Hi Dongjoon,
>> Thanks for the proposal! I like the idea. Maybe we can extend it to
>> component too and to some jira labels such as correctness which may be
>> worth to highlight in PRs too. My only concern is that in many cases JIRAs
>> are created not very carefully so they may be incorrect at the moment of
>> the pr creation and it may be updated later: so keeping them in sync may be
>> an extra effort..
>>
>> On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:
>>
>>> Seems like a good idea. Can we test this with a component first?
>>>
>>> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> Since we use both Apache JIRA and GitHub actively for Apache Spark
>>>> contributions, we have lots of JIRAs and PRs consequently. One specific
>>>> thing I've been longing to see is `Jira Issue Type` in GitHub.
>>>>
>>>> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
>>>> There are two main benefits:
>>>> 1. It helps the communication between the contributors and reviewers
>>>> with more information.
>>>> (In some cases, some people only visit GitHub to see the PR and
>>>> commits)
>>>> 2. `Labels` is searchable. We don't need to visit Apache Jira to search
>>>> PRs to see a specific type.
>>>> (For example, the reviewers can see and review 'BUG' PRs first by
>>>> using `is:open is:pr label:BUG`.)
>>>>
>>>> Of course, this can be done automatically without human intervention.
>>>> Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
>>>> can add the labels from the beginning. If needed, I can volunteer to update
>>>> the script.
>>>>
>>>> To show the demo, I labeled several PRs manually. You can see the
>>>> result right now in Apache Spark PR page.
>>>>
>>>>   - https://github.com/apache/spark/pulls
>>>>
>>>> If you're surprised due to those manual activities, I want to apologize
>>>> for that. I hope we can take advantage of the existing GitHub features to
>>>> serve Apache Spark community in a way better than yesterday.
>>>>
>>>> How do you think about this specific suggestion?
>>>>
>>>> Bests,
>>>> Dongjoon
>>>>
>>>> PS. I saw that `Request Review` and `Assign` features are already used
>>>> for some purposes, but these feature are out of the scope in this email.
>>>>
>>>


Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Dongjoon Hyun
Thank you for the feedback and requirements, Hyukjin, Reynold, and Marco.

Sure, we can do whatever we want.

I'll wait for more feedback and then proceed to the next steps.

Bests,
Dongjoon.


On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido  wrote:

> Hi Dongjoon,
> Thanks for the proposal! I like the idea. Maybe we can extend it to
> component too and to some jira labels such as correctness which may be
> worth to highlight in PRs too. My only concern is that in many cases JIRAs
> are created not very carefully so they may be incorrect at the moment of
> the pr creation and it may be updated later: so keeping them in sync may be
> an extra effort..
>
> On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:
>
>> Seems like a good idea. Can we test this with a component first?
>>
>> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Since we use both Apache JIRA and GitHub actively for Apache Spark
>>> contributions, we have lots of JIRAs and PRs consequently. One specific
>>> thing I've been longing to see is `Jira Issue Type` in GitHub.
>>>
>>> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
>>> There are two main benefits:
>>> 1. It helps the communication between the contributors and reviewers
>>> with more information.
>>> (In some cases, some people only visit GitHub to see the PR and
>>> commits)
>>> 2. `Labels` is searchable. We don't need to visit Apache Jira to search
>>> PRs to see a specific type.
>>> (For example, the reviewers can see and review 'BUG' PRs first by
>>> using `is:open is:pr label:BUG`.)
>>>
>>> Of course, this can be done automatically without human intervention.
>>> Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
>>> can add the labels from the beginning. If needed, I can volunteer to update
>>> the script.
>>>
>>> To show the demo, I labeled several PRs manually. You can see the result
>>> right now in Apache Spark PR page.
>>>
>>>   - https://github.com/apache/spark/pulls
>>>
>>> If you're surprised due to those manual activities, I want to apologize
>>> for that. I hope we can take advantage of the existing GitHub features to
>>> serve Apache Spark community in a way better than yesterday.
>>>
>>> How do you think about this specific suggestion?
>>>
>>> Bests,
>>> Dongjoon
>>>
>>> PS. I saw that `Request Review` and `Assign` features are already used
>>> for some purposes, but these feature are out of the scope in this email.
>>>
>>


Exposing JIRA issue types at GitHub PRs

2019-06-12 Thread Dongjoon Hyun
Hi, All.

Since we use both Apache JIRA and GitHub actively for Apache Spark
contributions, we consequently have lots of JIRAs and PRs. One specific
thing I've been longing to see is the `Jira Issue Type` in GitHub.

How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`? There
are two main benefits:
1. It helps communication between contributors and reviewers by carrying
more information.
(In some cases, people only visit GitHub to see the PR and the commits.)
2. `Labels` are searchable. We don't need to visit Apache JIRA to search PRs
for a specific type.
(For example, reviewers can see and review 'BUG' PRs first by using
`is:open is:pr label:BUG`.)

Of course, this can be done automatically without human intervention. Since
we already have a GitHub/Jenkins job that accesses JIRA and GitHub, that job
can add the labels from the beginning. If needed, I can volunteer to update
the script.
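
For illustration only, here is a rough sketch of what such a labeling step
could look like. Everything in it is an assumption for the sake of the
example: it uses the JDK 11 `java.net.http` client from Scala, a hypothetical
GITHUB_TOKEN environment variable, naive regex-based JSON handling, and a
hard-coded JIRA key and PR number; the real Jenkins job would of course use
its existing tooling and proper JSON parsing.

```
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

val client = HttpClient.newHttpClient()

// 1. Look up the issue type of a JIRA ticket (example key) via the public REST API.
val jiraKey = "SPARK-28051"
val issueReq = HttpRequest.newBuilder(
    URI.create(s"https://issues.apache.org/jira/rest/api/2/issue/$jiraKey?fields=issuetype"))
  .GET().build()
val issueJson = client.send(issueReq, HttpResponse.BodyHandlers.ofString()).body()

// Naive extraction of the issue type name; a JSON library would be used in practice.
val issueType = """"issuetype".*?"name"\s*:\s*"([^"]+)"""".r
  .findFirstMatchIn(issueJson).map(_.group(1)).getOrElse("Unknown")

// 2. Add it as a label on the corresponding GitHub PR (hypothetical PR number).
val prNumber = 24871
val labelReq = HttpRequest.newBuilder(
    URI.create(s"https://api.github.com/repos/apache/spark/issues/$prNumber/labels"))
  .header("Authorization", s"token ${sys.env("GITHUB_TOKEN")}")
  .POST(HttpRequest.BodyPublishers.ofString(s"""{"labels": ["${issueType.toUpperCase}"]}"""))
  .build()
client.send(labelReq, HttpResponse.BodyHandlers.ofString())
```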

To show the demo, I labeled several PRs manually. You can see the result
right now in Apache Spark PR page.

  - https://github.com/apache/spark/pulls

If you were surprised by those manual activities, I apologize for that. I hope
we can take advantage of the existing GitHub features to serve the Apache
Spark community better than yesterday.

What do you think about this specific suggestion?

Bests,
Dongjoon

PS. I saw that the `Request Review` and `Assign` features are already used for
some purposes, but those features are out of the scope of this email.


[ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-14 Thread Dongjoon Hyun
We are happy to announce the availability of Spark 2.2.3!

Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2
maintenance branch of Spark. We strongly recommend all 2.2.x users to
upgrade to this stable release.

To download Spark 2.2.3, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-2-3.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.

Bests,
Dongjoon.


[VOTE][RESULT] Spark 2.2.3 (RC1)

2019-01-11 Thread Dongjoon Hyun
Hi, All.

The vote passes. Thanks to all who helped with this release 2.2.3 (the
final 2.2.x)!
I'll follow up later with a release announcement once everything is
published.

+1 (* = binding):

DB Tsai*
Wenchen Fan*
Dongjoon Hyun
Denny Lee
Sean Owen*
Hyukjin Kwon
John Zhuge

+0: None

-1: None

Bests,
Dongjoon.


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Dongjoon Hyun
Finally, thank you all. Especially, thanks to the release manager, Wenchen!

Bests,
Dongjoon.


On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan  wrote:

> + user list
>
> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan  wrote:
>
>> resend
>>
>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan  wrote:
>>
>>>
>>>
>>> -- Forwarded message -
>>> From: Wenchen Fan 
>>> Date: Thu, Nov 8, 2018 at 10:55 PM
>>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>>> To: Spark dev list 
>>>
>>>
>>> Hi all,
>>>
>>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release
>>> adds Barrier Execution Mode for better integration with deep learning
>>> frameworks, introduces 30+ built-in and higher-order functions to deal with
>>> complex data type easier, improves the K8s integration, along with
>>> experimental Scala 2.12 support. Other major updates include the built-in
>>> Avro data source, Image data source, flexible streaming sinks, elimination
>>> of the 2GB block size limitation during transfer, Pandas UDF improvements.
>>> In addition, this release continues to focus on usability, stability, and
>>> polish while resolving around 1100 tickets.
>>>
>>> We'd like to thank our contributors and users for their contributions
>>> and early feedback to this release. This release would not have been
>>> possible without you.
>>>
>>> To download Spark 2.4.0, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> PS: If you see any issues with the release notes, webpage or published
>>> artifacts, please contact me directly off-list.
>>>
>>


Re: Metastore problem on Spark2.3 with Hive3.0

2018-09-17 Thread Dongjoon Hyun
Hi, Jerry.

There is a JIRA issue for that,
https://issues.apache.org/jira/browse/SPARK-24360 .

So far, Hive 3.1.0 Metastore support is in progress, targeting Apache Spark
2.5.0. You can track that issue there.

Bests,
Dongjoon.


On Mon, Sep 17, 2018 at 7:01 PM 白也诗无敌 <445484...@qq.com> wrote:

> Hi, guys
>   I am using Spark 2.3 and I have hit a metastore problem.
>   It looks like a compatibility issue, because Spark 2.3 still uses
> the hive-metastore-1.2.1-spark2.
>   Is there any solution?
>   The Hive metastore version is 3.0 and the stacktrace is below:
>
> org.apache.thrift.TApplicationException: Required field 'filesAdded' is
> unset! Struct:InsertEventRequestData(filesAdded:null)
> at
> org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_fire_listener_event(ThriftHiveMetastore.java:4182)
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.fire_listener_event(ThriftHiveMetastore.java:4169)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.fireListenerEvent(HiveMetaStoreClient.java:1954)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy5.fireListenerEvent(Unknown Source)
> at org.apache.hadoop.hive.ql.metadata.Hive.fireInsertEvent(Hive.java:1947)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1673)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:847)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:757)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
> at
> org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:756)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:829)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:827)
> at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:416)
> at
> org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:403)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
> at org.apache.spark.sql.Dataset.(Dataset.scala:190)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:355)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
> at
> 

Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

2018-03-27 Thread Dongjoon Hyun
You may hit SPARK-23355 (convertMetastore should not ignore table properties).

Since it's a known Spark issue for all Hive tables (Parquet/ORC), could you 
check that too?
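
One quick way to see whether the metastore conversion path is involved (a
diagnostic sketch only, not a fix; the table name is a placeholder) is to turn
the ORC conversion off and re-run the failing read:

```
// Disable the Hive-table-to-file-source conversion and retry the same query.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.sql("SELECT * FROM my_zlib_orc_table LIMIT 10").show()
```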

Bests,
Dongjoon.

On 2018/03/28 01:00:55, Dongjoon Hyun <dongj...@apache.org> wrote: 
> Hi, Eric.
> 
> For me, Spark 2.3 works correctly like the following. Could you give us some 
> reproducible example?
> 
> ```
> scala> sql("set spark.sql.orc.impl=native")
> 
> scala> sql("set spark.sql.orc.compression.codec=zlib")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> 
> scala> spark.range(10).write.orc("/tmp/zlib_test")
> 
> scala> spark.read.orc("/tmp/zlib_test").show
> +---+
> | id|
> +---+
> |  8|
> |  9|
> |  5|
> |  0|
> |  3|
> |  4|
> |  6|
> |  7|
> |  1|
> |  2|
> +---+
> 
> scala> sc.version
> res4: String = 2.3.0
> ```
> 
> Bests,
> Dongjoon.
> 
> 
> On 2018/03/23 15:03:29, Eirik Thorsnes <eirik.thors...@uni.no> wrote: 
> > Hi all,
> > 
> > I'm trying the new ORC native in Spark 2.3
> > (org.apache.spark.sql.execution.datasources.orc).
> > 
> > I've compiled Spark 2.3 from the git branch-2.3 as of March 20th.
> > I also get the same error for the Spark 2.2 from Hortonworks HDP 2.6.4.
> > 
> > *NOTE*: the error only occurs with zlib compression, and I see that with
> > Snappy I get an extra log-line saying "OrcCodecPool: Got brand-new codec
> > SNAPPY". Perhaps zlib codec is never loaded/triggered in the new code?
> > 
> > I can write using the new native codepath without errors, but *reading*
> > zlib-compressed ORC, either the newly written ORC-files *or* older
> > ORC-files written with Spark 2.2/1.6 I get the following exception.
> > 
> > === cut =
> > 2018-03-23 10:36:08,249 INFO FileScanRDD: Reading File path:
> > hdfs://.../year=1999/part-r-0-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc,
> > range: 0-134217728, partition values: [1999]
> > 2018-03-23 10:36:08,326 INFO ReaderImpl: Reading ORC rows from
> > hdfs://.../year=1999/part-r-0-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc
> > with {include: [true, true, true, true, true, true, true, true, true],
> > offset: 0, length: 134217728}
> > 2018-03-23 10:36:08,326 INFO RecordReaderImpl: Reader schema not
> > provided -- using file schema
> > struct<datetime:timestamp,lon:float,lat:float,u10:smallint,v10:smallint,lcc:smallint,mcc:smallint,hcc:smallint>
> > 
> > 2018-03-23 10:36:08,824 ERROR Executor: Exception in task 0.0 in stage
> > 1.0 (TID 1)
> > java.nio.BufferUnderflowException
> > at java.nio.Buffer.nextGetIndex(Buffer.java:500)
> > at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:249)
> > at
> > org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:248)
> > at
> > org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:58)
> > at
> > org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
> > at
> > org.apache.orc.impl.TreeReaderFactory$TimestampTreeReader.nextVector(TreeReaderFactory.java:976)
> > at
> > org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1815)
> > at
> > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
> > at
> > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.scala:186)
> > at
> > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.scala:114)
> > at
> > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> > at
> > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> > at
> > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
> > at
> > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> > at
> > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
> > Source)
> > at
> > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> > Source)
> > at
> > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> > at
> > org.ap

Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

2018-03-27 Thread Dongjoon Hyun
Hi, Eric.

For me, Spark 2.3 works correctly, as shown below. Could you give us a
reproducible example?

```
scala> sql("set spark.sql.orc.impl=native")

scala> sql("set spark.sql.orc.compression.codec=zlib")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.range(10).write.orc("/tmp/zlib_test")

scala> spark.read.orc("/tmp/zlib_test").show
+---+
| id|
+---+
|  8|
|  9|
|  5|
|  0|
|  3|
|  4|
|  6|
|  7|
|  1|
|  2|
+---+

scala> sc.version
res4: String = 2.3.0
```

Bests,
Dongjoon.


On 2018/03/23 15:03:29, Eirik Thorsnes  wrote: 
> Hi all,
> 
> I'm trying the new ORC native in Spark 2.3
> (org.apache.spark.sql.execution.datasources.orc).
> 
> I've compiled Spark 2.3 from the git branch-2.3 as of March 20th.
> I also get the same error for the Spark 2.2 from Hortonworks HDP 2.6.4.
> 
> *NOTE*: the error only occurs with zlib compression, and I see that with
> Snappy I get an extra log-line saying "OrcCodecPool: Got brand-new codec
> SNAPPY". Perhaps zlib codec is never loaded/triggered in the new code?
> 
> I can write using the new native codepath without errors, but *reading*
> zlib-compressed ORC, either the newly written ORC-files *or* older
> ORC-files written with Spark 2.2/1.6 I get the following exception.
> 
> === cut =
> 2018-03-23 10:36:08,249 INFO FileScanRDD: Reading File path:
> hdfs://.../year=1999/part-r-0-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc,
> range: 0-134217728, partition values: [1999]
> 2018-03-23 10:36:08,326 INFO ReaderImpl: Reading ORC rows from
> hdfs://.../year=1999/part-r-0-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc
> with {include: [true, true, true, true, true, true, true, true, true],
> offset: 0, length: 134217728}
> 2018-03-23 10:36:08,326 INFO RecordReaderImpl: Reader schema not
> provided -- using file schema
> struct<datetime:timestamp,lon:float,lat:float,u10:smallint,v10:smallint,lcc:smallint,mcc:smallint,hcc:smallint>
> 
> 2018-03-23 10:36:08,824 ERROR Executor: Exception in task 0.0 in stage
> 1.0 (TID 1)
> java.nio.BufferUnderflowException
> at java.nio.Buffer.nextGetIndex(Buffer.java:500)
> at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:249)
> at
> org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:248)
> at
> org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:58)
> at
> org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
> at
> org.apache.orc.impl.TreeReaderFactory$TimestampTreeReader.nextVector(TreeReaderFactory.java:976)
> at
> org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1815)
> at
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
> at
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.scala:186)
> at
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.scala:114)
> at
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> 

Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-28 Thread Dongjoon Hyun
Hi, Nicolas.

Yes. In Apache Spark 2.3, there are new sub-improvements for SPARK-20901
(Feature parity for ORC with Parquet).
For your questions, the following three are related.

1. spark.sql.orc.impl="native"
Setting this selects the new `native` ORC implementation (based on the
latest ORC 1.4.1). The old one is the `hive` implementation.

2. spark.sql.orc.enableVectorizedReader="true"
With this, the `native` ORC implementation uses the vectorized reader
code path whenever possible.
Please note that vectorization (Parquet/ORC) in Apache Spark is supported
only for simple data types.

3. spark.sql.hive.convertMetastoreOrc=true
With this, like Parquet, Hive ORC tables are converted into the file-based
data source so that the vectorization technique can be used.
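
Putting the three settings above together, here is a minimal sketch (paths and
table names are placeholders) that sets them explicitly and exercises both of
the read paths from your question:

```
// Select the new ORC implementation and its vectorized reader explicitly.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

// 1) File-based read.
val df1 = spark.read.format("orc").load("/path/to/orc")

// 2) Hive table read, converted to the file-based source by the third setting.
val df2 = spark.sql("SELECT * FROM my_orc_table_in_hive")
```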

Bests,
Dongjoon.



On Sun, Jan 28, 2018 at 4:15 AM, Nicolas Paris <nipari...@gmail.com> wrote:

> Hi
>
> Thanks for this work.
>
> Will this affect both:
> 1) spark.read.format("orc").load("...")
> 2) spark.sql("select ... from my_orc_table_in_hive")
>
> ?
>
>
> Le 10 janv. 2018 à 20:14, Dongjoon Hyun écrivait :
> > Hi, All.
> >
> > Vectorized ORC Reader is now supported in Apache Spark 2.3.
> >
> > https://issues.apache.org/jira/browse/SPARK-16060
> >
> > It has been a long journey. From now, Spark can read ORC files faster
> without
> > feature penalty.
> >
> > Thank you for all your support, especially Wenchen Fan.
> >
> > It's done by two commits.
> >
> > [SPARK-16060][SQL] Support Vectorized ORC Reader
> > https://github.com/apache/spark/commit/
> f44ba910f58083458e1133502e193a
> > 9d6f2bf766
> >
> > [SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized
> orc
> > reader
> > https://github.com/apache/spark/commit/
> eaac60a1e20e29084b7151ffca964c
> > faa5ba99d1
> >
> > Please check OrcReadBenchmark for the final speed-up from `Hive built-in
> ORC`
> > to `Native ORC Vectorized`.
> >
> > https://github.com/apache/spark/blob/master/sql/hive/
> src/test/scala/org/
> > apache/spark/sql/hive/orc/OrcReadBenchmark.scala
> >
> > Thank you.
> >
> > Bests,
> > Dongjoon.
>


Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-10 Thread Dongjoon Hyun
Hi, All.

Vectorized ORC Reader is now supported in Apache Spark 2.3.

https://issues.apache.org/jira/browse/SPARK-16060

It has been a long journey. From now on, Spark can read ORC files faster
without a feature penalty.

Thank you for all your support, especially Wenchen Fan.

It's done by two commits.

[SPARK-16060][SQL] Support Vectorized ORC Reader
https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766

[SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc
reader
https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1

Please check OrcReadBenchmark for the final speed-up from `Hive built-in
ORC` to `Native ORC Vectorized`.

https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala

Thank you.

Bests,
Dongjoon.


Apache Spark 2.3 and Apache ORC 1.4 finally

2017-12-05 Thread Dongjoon Hyun
Hi, All.

Today, Apache Spark starts to use Apache ORC 1.4 as a `native` ORC
implementation.

SPARK-20728 Make OrcFileFormat configurable between `sql/hive` and
`sql/core`.
- https://github.com/apache/spark/commit/326f1d6728a7734c228d8bfaa69442a1c7b92e9b
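
As a quick illustration of what this configurability means in practice (a
sketch only; both values are shown, pick one):

```
// Use the new ORC 1.4 based implementation in sql/core ...
spark.conf.set("spark.sql.orc.impl", "native")
// ... or fall back to the previous Hive-based one in sql/hive.
// spark.conf.set("spark.sql.orc.impl", "hive")
```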

Thank you so much for all your support on this!

I'll proceed with more ORC issues in order to build synergy between the two
communities.

Please see https://issues.apache.org/jira/browse/SPARK-20901 for the
updates.

Bests,
Dongjoon.


Re: Spark Project build Issues.(Intellij)

2017-06-28 Thread Dongjoon Hyun
Did you follow the guide in the `IDE Setup` -> `IntelliJ` section of
http://spark.apache.org/developer-tools.html ?

Bests,
Dongjoon.

On Wed, Jun 28, 2017 at 5:13 PM, satyajit vegesna <
satyajit.apas...@gmail.com> wrote:

> Hi All,
>
> When I try to build the Apache Spark source code from
> https://github.com/apache/spark.git, I am getting the errors below:
>
> Error:(9, 14) EventBatch is already defined as object EventBatch
> public class EventBatch extends org.apache.avro.specific.SpecificRecordBase
> implements org.apache.avro.specific.SpecificRecord {
> Error:(9, 14) EventBatch is already defined as class EventBatch
> public class EventBatch extends org.apache.avro.specific.SpecificRecordBase
> implements org.apache.avro.specific.SpecificRecord {
> /Users/svegesna/svegesna/dev/scala/spark/external/flume-
> sink/target/scala-2.11/src_managed/main/compiled_avro/
> org/apache/spark/streaming/flume/sink/SparkFlumeProtocol.java
> Error:(26, 18) SparkFlumeProtocol is already defined as object
> SparkFlumeProtocol
> public interface SparkFlumeProtocol {
> Error:(26, 18) SparkFlumeProtocol is already defined as trait
> SparkFlumeProtocol
> public interface SparkFlumeProtocol {
> /Users/svegesna/svegesna/dev/scala/spark/external/flume-
> sink/target/scala-2.11/src_managed/main/compiled_avro/
> org/apache/spark/streaming/flume/sink/SparkSinkEvent.java
> Error:(9, 14) SparkSinkEvent is already defined as object SparkSinkEvent
> public class SparkSinkEvent extends 
> org.apache.avro.specific.SpecificRecordBase
> implements org.apache.avro.specific.SpecificRecord {
> Error:(9, 14) SparkSinkEvent is already defined as class SparkSinkEvent
> public class SparkSinkEvent extends 
> org.apache.avro.specific.SpecificRecordBase
> implements org.apache.avro.specific.SpecificRecord {
>
> I would like to know if I can successfully build the project, so that I
> can test and debug some of Spark's functionality.
>
> Regards,
> Satyajit.
>


Fwd: Question about SPARK-11374 (skip.header.line.count)

2016-12-08 Thread Dongjoon Hyun
+dev

I forgot to add @user.

Dongjoon.

-- Forwarded message -
From: Dongjoon Hyun <dongj...@apache.org>
Date: Thu, Dec 8, 2016 at 16:00
Subject: Question about SPARK-11374 (skip.header.line.count)
To: <d...@spark.apache.org>


Hi, All.

Could you give me some opinions?

There is an old SPARK issue, SPARK-11374, about removing header lines from a
text file.

Currently, Spark supports removing CSV header lines in the following way.

```
scala> spark.read.option("header","true").csv("/data").show
+---+---+
| c1| c2|
+---+---+
|  1|  a|
|  2|  b|
+---+---+
```

In the SQL world, we could support that in the Hive way, with
`skip.header.line.count`.

```
scala> sql("CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data'
TBLPROPERTIES('skip.header.line.count'='1')")

scala> sql("SELECT * FROM t1").show
+---+-+
| id|value|
+---+-+
|  1|a|
|  2|b|
+---+-+
```



Although I made a PR for this based on the JIRA issue, I want to know whether
this is a really needed feature.

Is it needed for your use cases? Or is it enough for you to remove the headers
in a preprocessing stage, for example as sketched below?
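
(A minimal preprocessing-stage sketch, with a placeholder path; it assumes a
single input text file whose first line, the header, lands in partition 0.)

```
// Drop the header line from the first partition before further parsing.
val raw = spark.sparkContext.textFile("/data/with_header.csv")
val noHeader = raw.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
```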

If this is too old and no longer appropriate these days, I'll close the PR and
JIRA issue as WON'T FIX.

Thank you all in advance!

Bests,
Dongjoon.



-

To unsubscribe e-mail: dev-unsubscr...@spark.apache.org