Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-30 Thread Yuming Wang
+1

On Tue, Apr 30, 2024 at 3:31 PM Ye Xianjin  wrote:

> +1
> Sent from my iPhone
>
> On Apr 30, 2024, at 3:23 PM, DB Tsai  wrote:
>
> 
> +1
>
> On Apr 29, 2024, at 8:01 PM, Wenchen Fan  wrote:
>
> 
> To add more color:
>
> Spark data source tables and Hive SerDe tables are both stored in the Hive
> metastore and keep their data files in the table directory. The only
> difference is that they have different "table providers", which means Spark
> will use different readers/writers. Ideally, the Spark native data source
> reader/writer is faster than the Hive SerDe ones.
>
> What's more, the default format of a Hive SerDe table is text. I don't think
> people want to use text-format tables in production. Most people will add
> `STORED AS parquet` or `USING parquet` explicitly. By setting this config
> to false, we get a more reasonable default behavior: creating Parquet
> tables (or whatever is specified by `spark.sql.sources.default`).
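>
> For illustration, a minimal sketch (table names are hypothetical) of the
> behavior with the config set to false:
>
>   CREATE TABLE t1 (id INT);                 -- now a Spark native table in the
>                                             -- spark.sql.sources.default format
>   CREATE TABLE t2 (id INT) STORED AS orc;   -- still a Hive SerDe table
>   CREATE TABLE t3 (id INT) USING parquet;   -- Spark native table, as before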
>
> On Tue, Apr 30, 2024 at 10:45 AM Wenchen Fan  wrote:
>
>> @Mich Talebzadeh  there seems to be a
>> misunderstanding here. The Spark native data source table is still stored
>> in the Hive metastore, it's just that Spark will use a different (and
>> faster) reader/writer for it. `hive-site.xml` should work as it is today.
>>
>> On Tue, Apr 30, 2024 at 5:23 AM Hyukjin Kwon 
>> wrote:
>>
>>> +1
>>>
>>> It's a legacy conf that we should eventually remove. Spark
>>> should create Spark tables by default, not Hive tables.
>>>
>>> Mich, for your workload, you can simply switch that conf off if it
>>> concerns you. We also enabled ANSI (which you agreed on). It's a bit
>>> awkward to stop midway for this compatibility reason while making
>>> Spark sound. The compatibility has been tested in production for a long
>>> time, so I don't see any particular issue with the compatibility case you
>>> mentioned.
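>>>
>>> For example, a minimal sketch of restoring the old behavior in SQL:
>>>
>>>   SET spark.sql.legacy.createHiveTableByDefault=true;
>>>
>>> (or set the same key in spark-defaults.conf / via --conf at submit time).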
>>>
>>> On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>

 Hi @Wenchen Fan 

 Thanks for your response. I believe we have not had enough time to
 "DISCUSS" this matter.

 Currently, in order to make Spark take advantage of Hive, I create a
 soft link in $SPARK_HOME/conf. FYI, my Spark version is 3.4.0 and Hive is
 3.1.1:

  /opt/spark/conf/hive-site.xml ->
 /data6/hduser/hive-3.1.1/conf/hive-site.xml
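
 For reference, a sketch of the command that creates this link, using the
 paths above:

   ln -s /data6/hduser/hive-3.1.1/conf/hive-site.xml /opt/spark/conf/hive-site.xml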

 This works fine for me in my lab. So in the future, if we opt to set
 "spark.sql.legacy.createHiveTableByDefault" to false, will there no
 longer be a need for this symbolic link?
 On the face of it, this looks fine, but in real life it may require a
 number of changes to old scripts. Hence my concern.
 As a matter of interest, has anyone liaised with the Hive team to ensure
 they have introduced the additional changes you outlined?

 HTH

 Mich Talebzadeh,
 Technologist | Architect | Data Engineer  | Generative AI | FinCrime
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* The information provided is correct to the best of my
 knowledge but of course cannot be guaranteed. It is essential to note
 that, as with any advice, "one test result is worth one-thousand
 expert opinions" (Wernher von Braun).


 On Sun, 28 Apr 2024 at 09:34, Wenchen Fan  wrote:

> @Mich Talebzadeh  thanks for sharing your
> concern!
>
> Note: creating Spark native data source tables is usually Hive
> compatible as well, unless we use features that Hive does not support
> (TIMESTAMP_NTZ, ANSI INTERVAL, etc.). I think it's a better default to
> create a Spark native table in this case, instead of creating a Hive table
> and failing.
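>
> For example (a minimal sketch; the table name is hypothetical), this table
> uses a type Hive cannot read, so only the Spark native reader can handle it:
>
>   CREATE TABLE events (id BIGINT, ts TIMESTAMP_NTZ) USING parquet;
>
> A Parquet table without such types remains Hive compatible.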
>
> On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan  wrote:
>
>> +1 (non-binding)
>>
>> Thanks,
>> Cheng Pan
>>
>> On Sat, Apr 27, 2024 at 9:29 AM Holden Karau 
>> wrote:
>> >
>> > +1
>> >
>> > Twitter: https://twitter.com/holdenkarau
>> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> >
>> >
>> > On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun <
>> dongj...@apache.org> wrote:
>> >> >
>> >> > I'll start with my +1.
>> >> >
>> >> > Dongjoon.
>> >> >
>> >> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
>> >> > > Please vote on SPARK-46122 to set
>> spark.sql.legacy.createHiveTableByDefault
>> >> > > to `false` by default. The technical scope is defined in the
>> 

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Yuming Wang
+1

On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek  wrote:

> Of course, I can't think of a scenario with thousands of tables on a single
> in-memory Spark cluster with an in-memory catalog.
> Thanks for the help!
>
> On Thu, Apr 25, 2024 at 23:56, Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>>
>>
>> Agreed. In scenarios where most of the interactions with the catalog are
>> related to query planning, saving, and metadata management, the choice of
>> catalog implementation may have less impact on query runtime performance.
>> This is because the time spent on metadata operations is generally
>> minimal compared to the time spent on actual data fetching, processing, and
>> computation.
>> However, scalability and reliability become concerns as the size and
>> complexity of the data and query workload grow. While an
>> in-memory catalog may offer excellent performance for smaller workloads,
>> it will face limitations in handling larger-scale deployments with
>> thousands of tables, partitions, and users. Additionally, durability and
>> persistence are crucial considerations, particularly in production
>> environments where data integrity
>> and availability matter. In-memory catalog implementations may lack
>> durability, meaning that metadata changes could be lost in the event of a
>> system failure or restart. Therefore, while in-memory catalog
>> implementations can provide speed and efficiency for certain use cases, we
>> ought to consider the requirements for scalability, reliability, and data
>> durability when choosing a catalog solution for production deployments. In
>> many cases, a combination of in-memory and disk-based catalog solutions may
>> offer the best balance of performance and resilience for demanding
>> large-scale workloads.
>>
>>
>> HTH
>>
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, "one test result is worth one-thousand
>> expert opinions" (Wernher von Braun).
>>
>>
>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek  wrote:
>>
>>> Of course, but it's in memory and not persisted, which is much faster.
>>> And as I said, I believe most of the interaction with it happens during
>>> planning and save, not during the actual query run, and those operations
>>> are short and minimal compared to data fetching and manipulation, so I
>>> don't believe it will have a big impact on query runtime...
>>>
>>> On Thu, Apr 25, 2024 at 17:52, Mich Talebzadeh <
>>> mich.talebza...@gmail.com>:
>>>
 Well, I would be surprised, because the Derby database is single-threaded
 and won't be of much use here.

 Most Hive metastores in the commercial world use PostgreSQL or Oracle as
 the metastore database, since those are battle-proven, replicated, and backed up.

 Mich Talebzadeh,
 Technologist | Architect | Data Engineer  | Generative AI | FinCrime
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* The information provided is correct to the best of my
 knowledge but of course cannot be guaranteed. It is essential to note
 that, as with any advice, "one test result is worth one-thousand
 expert opinions" (Wernher von Braun).


 On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek 
 wrote:

> Yes, an in-memory Hive catalog backed by a local Derby DB.
> And again, I presume that most metadata-related work happens during
> planning, not the actual run, so I don't see why it should strongly affect
> query performance.
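>
> (For context, a minimal sketch of choosing the catalog implementation at
> startup; spark.sql.catalogImplementation is the relevant config:
>
>   spark-shell --conf spark.sql.catalogImplementation=in-memory
>   spark-shell --conf spark.sql.catalogImplementation=hive
>
> The in-memory one is what Spark uses when Hive support is not enabled.)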
>
> Thanks,
>
>
> On Thu, Apr 25, 2024 at 17:29, Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>> With regard to your point below
>>
>> "The thing I'm missing is this: let's say that the output format I
>> choose is delta lake or iceberg or whatever format that uses parquet. 
>> Where
>> does the catalog implementation (which holds metadata afaik, same 
>> metadata
>> that iceberg and delta lake save for their tables about their columns)
>> comes into play and why should it affect performance? "
>>
>> The catalog implementation comes into play regardless of the output
>> format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is
>> 

Re: Please unlock Jira ticket for SPARK-24815, Dynamic resource allocation for structured streaming

2024-02-26 Thread Yuming Wang
Unlocked.

On Tue, Feb 27, 2024 at 11:47 AM Mich Talebzadeh 
wrote:

>
> Hi,
>
> Can a committer please unlock this SPIP? It is for dynamic resource
> allocation for Structured Streaming and has got 6 votes. It was locked
> because of inactivity by GitHub Actions.
>
> [SPARK-24815] Structured Streaming should support dynamic allocation - ASF
> JIRA (apache.org) 
>
> For now I have volunteered to mentor the team until a committer volunteers
> to take it over. This should not be that strenuous, hopefully.
>
> Thanks
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand
> expert opinions" (Wernher von Braun).
>


Re: [VOTE] Release Spark 3.3.4 (RC1)

2023-12-10 Thread Yuming Wang
+1

On Mon, Dec 11, 2023 at 5:55 AM Dongjoon Hyun  wrote:

> +1
>
> Dongjoon
>
> On 2023/12/08 21:41:00 Dongjoon Hyun wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 3.3.4.
> >
> > The vote is open until December 15th 1AM (PST) and passes if a majority
> +1
> > PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.3.4
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see https://spark.apache.org/
> >
> > The tag to be voted on is v3.3.4-rc1 (commit
> > 18db204995b32e87a650f2f09f9bcf047ddafa90)
> > https://github.com/apache/spark/tree/v3.3.4-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> >
> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-bin/
> >
> >
> > Signatures used for Spark RCs can be found in this file:
> >
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> >
> > The staging repository for this release can be found at:
> >
> > https://repository.apache.org/content/repositories/orgapachespark-1451/
> >
> >
> > The documentation corresponding to this release can be found at:
> >
> > https://dist.apache.org/repos/dist/dev/spark/v3.3.4-rc1-docs/
> >
> >
> > The list of bug fixes going into 3.3.4 can be found at the following URL:
> >
> > https://issues.apache.org/jira/projects/SPARK/versions/12353505
> >
> >
> > This release is using the release script of the tag v3.3.4-rc1.
> >
> >
> > FAQ
> >
> >
> > =
> >
> > How can I help test this release?
> >
> > =
> >
> >
> >
> > If you are a Spark user, you can help us test this release by taking
> >
> > an existing Spark workload and running on this release candidate, then
> >
> > reporting any regressions.
> >
> >
> >
> > If you're working in PySpark you can set up a virtual env and install
> >
> > the current RC and see if anything important breaks, in the Java/Scala
> >
> > you can add the staging repository to your projects resolvers and test
> >
> > with the RC (make sure to clean up the artifact cache before/after so
> >
> > you don't end up building with an out of date RC going forward).
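> >
> > For example, a sketch of adding this RC's staging repository in sbt:
> >
> >   resolvers += "Spark 3.3.4 RC1 staging" at
> >     "https://repository.apache.org/content/repositories/orgapachespark-1451/"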
> >
> >
> >
> > ===
> >
> > What should happen to JIRA tickets still targeting 3.3.4?
> >
> > ===
> >
> >
> >
> > The current list of open tickets targeted at 3.3.4 can be found at:
> >
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > Version/s" = 3.3.4
> >
> >
> > Committers should look at those and triage. Extremely important bug
> >
> > fixes, documentation, and API tweaks that impact compatibility should
> >
> > be worked on immediately. Everything else please retarget to an
> >
> > appropriate release.
> >
> >
> >
> > ==
> >
> > But my bug isn't fixed?
> >
> > ==
> >
> >
> >
> > In order to make timely releases, we will typically not hold the
> >
> > release unless the bug in question is a regression from the previous
> >
> > release. That being said, if there is something which is a regression
> >
> > that has not been correctly targeted please ping me or a committer to
> >
> > help target the issue.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-14 Thread Yuming Wang
+1

On Wed, Nov 15, 2023 at 2:44 AM Holden Karau  wrote:

> +1
>
> On Tue, Nov 14, 2023 at 10:21 AM DB Tsai  wrote:
>
>> +1
>>
>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>
>> On Nov 14, 2023, at 10:14 AM, Vakaris Baškirov <
>> vakaris.bashki...@gmail.com> wrote:
>>
>> +1 (non-binding)
>>
>>
>> On Tue, Nov 14, 2023 at 8:03 PM Chao Sun  wrote:
>>
>>> +1
>>>
>>> On Tue, Nov 14, 2023 at 9:52 AM L. C. Hsieh  wrote:
>>> >
>>> > +1
>>> >
>>> > On Tue, Nov 14, 2023 at 9:46 AM Ye Zhou  wrote:
>>> > >
>>> > > +1(Non-binding)
>>> > >
>>> > > On Tue, Nov 14, 2023 at 9:42 AM L. C. Hsieh 
>>> wrote:
>>> > >>
>>> > >> Hi all,
>>> > >>
>>> > >> I’d like to start a vote for SPIP: An Official Kubernetes Operator
>>> for
>>> > >> Apache Spark.
>>> > >>
>>> > >> The proposal is to develop an official Java-based Kubernetes
>>> operator
>>> > >> for Apache Spark to automate the deployment and simplify the
>>> lifecycle
>>> > >> management and orchestration of Spark applications and Spark
>>> clusters
>>> > >> on k8s at prod scale.
>>> > >>
>>> > >> This aims to reduce the learning curve and operation overhead for
>>> > >> Spark users so they can concentrate on core Spark logic.
>>> > >>
>>> > >> Please also refer to:
>>> > >>
>>> > >>- Discussion thread:
>>> > >> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
>>> > >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-45923
>>> > >>- SPIP doc:
>>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>>> > >>
>>> > >>
>>> > >> Please vote on the SPIP for the next 72 hours:
>>> > >>
>>> > >> [ ] +1: Accept the proposal as an official SPIP
>>> > >> [ ] +0
>>> > >> [ ] -1: I don’t think this is a good idea because …
>>> > >>
>>> > >>
>>> > >> Thank you!
>>> > >>
>>> > >> Liang-Chi Hsieh
>>> > >>
>>> > >>
>>> -
>>> > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >>
>>> > >
>>> > >
>>> > > --
>>> > >
>>> > > Zhou, Ye  周晔
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>


Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-09 Thread Yuming Wang
+1

On Fri, Nov 10, 2023 at 10:01 AM Ilan Filonenko  wrote:

> +1
>
> On Thu, Nov 9, 2023 at 7:43 PM Ryan Blue  wrote:
>
>> +1
>>
>> On Thu, Nov 9, 2023 at 4:23 PM Hussein Awala  wrote:
>>
>>> +1 for creating an official Kubernetes operator for Apache Spark
>>>
>>> On Fri, Nov 10, 2023 at 12:38 AM huaxin gao 
>>> wrote:
>>>
 +1

 On Thu, Nov 9, 2023 at 3:14 PM DB Tsai  wrote:

> +1
>
> To be completely transparent, I am employed in the same department as
> Zhou at Apple.
>
> I support this proposal, provided that we witness community adoption
> following the release of the Flink Kubernetes operator, streamlining Flink
> deployment on Kubernetes.
>
> A well-maintained official Spark Kubernetes operator is essential for
> our Spark community as well.
>
> DB Tsai  |  https://www.dbtsai.com/
> 
>  |  PGP 42E5B25A8F7A82C1
>
> On Nov 9, 2023, at 12:05 PM, Zhou Jiang 
> wrote:
>
> Hi Spark community,
> I'm reaching out to initiate a conversation about the possibility of
> developing a Java-based Kubernetes operator for Apache Spark. Following 
> the
> operator pattern (
> https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
> ),
> Spark users may manage applications and related components seamlessly 
> using
> native tools like kubectl. The primary goal is to simplify the Spark user
> experience on Kubernetes, minimizing the learning curve and operational
> complexities and therefore enable users to focus on the Spark application
> development.
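>
> For example (a sketch; the resource kind and file names are hypothetical
> until the CRDs are defined), a user could manage a Spark application like
> any other Kubernetes resource:
>
>   kubectl apply -f spark-pi-app.yaml
>   kubectl get sparkapplications
>   kubectl delete sparkapplication spark-pi
>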
> Although there are several open-source Spark on Kubernetes operators
> available, none of them are officially integrated into the Apache Spark
> project. As a result, these operators may lack active support and
> development for new features. Within this proposal, our aim is to 
> introduce
> a Java-based Spark operator as an integral component of the Apache Spark
> project. This solution has been employed internally at Apple for multiple
> years, operating millions of executors in real production environments. 
> The
> use of Java in this solution is intended to accommodate a wider user and
> contributor audience, especially those who are not familiar with Scala.
> Ideally, this operator should have its dedicated repository, similar
> to Spark Connect Golang or Spark Docker, allowing it to maintain a loose
> connection with the Spark release cycle. This model is also followed by 
> the
> Apache Flink Kubernetes operator.
> We believe that this project holds the potential to evolve into a
> thriving community project over the long run. A comparison can be drawn
> with the Flink Kubernetes Operator: Apple has open-sourced internal Flink
> Kubernetes operator, making it a part of the Apache Flink project (
> https://github.com/apache/flink-kubernetes-operator
> ).
> This move has gained wide industry adoption and contributions from the
> community. In a mere year, the Flink operator has garnered more than 600
> stars and has attracted contributions from over 80 contributors. This
> showcases the level of community interest and collaborative momentum that
> can be achieved in similar scenarios.
> More details can be found at SPIP doc : Spark Kubernetes Operator
> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
> 

Re: Apache Spark 3.4.2 (?)

2023-11-06 Thread Yuming Wang
+1

On Tue, Nov 7, 2023 at 3:55 AM Santosh Pingale
 wrote:

> Makes sense given the nature of those commits.
>
> On Mon, Nov 6, 2023, 7:52 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Apache Spark 3.4.1 tag was created on Jun 19th and `branch-3.4` has 103
>> commits including important security and correctness patches like
>> SPARK-44251, SPARK-44805, and SPARK-44940.
>>
>> https://github.com/apache/spark/releases/tag/v3.4.1
>>
>> $ git log --oneline v3.4.1..HEAD | wc -l
>> 103
>>
>> SPARK-44251 Potential for incorrect results or NPE when full outer
>> USING join has null key value
>> SPARK-44805 Data lost after union using
>> spark.sql.parquet.enableNestedColumnVectorizedReader=true
>> SPARK-44940 Improve performance of JSON parsing when
>> "spark.sql.json.enablePartialResults" is enabled
>>
>> Currently, I'm checking the following open correctness issues. I'd like
>> to propose to release Apache Spark 3.4.2 after resolving them and volunteer
>> as the release manager for Apache Spark 3.4.2. If there are no additional
>> blockers, the first tentative RC1 vote date is November 13th (Monday). If
>> it takes some time to resolve the open correctness issues, we can start the
>> vote after the Thanksgiving holiday.
>>
>> SPARK-44512 dataset.sort.select.write.partitionBy sorts wrong column
>> SPARK-45282 Join loses records for cached datasets
>>
>> WDYT?
>>
>> Dongjoon.
>>
>


Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Yuming Wang
+1.

On Tue, Sep 12, 2023 at 10:57 AM yangjie01 
wrote:

> +1
>
>
>
> *From:* Jia Fan 
> *Date:* Tuesday, September 12, 2023, 10:08
> *To:* Ruifeng Zheng 
> *Cc:* Hyukjin Kwon , Xiao Li ,
> Mridul Muralidharan , Peter Toth ,
> Spark dev list , Yuanjian Li  >
> *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC5)
>
>
>
> +1
>
>
>
> On Tue, Sep 12, 2023 at 08:46, Ruifeng Zheng wrote:
>
> +1
>
>
>
> On Tue, Sep 12, 2023 at 7:24 AM Hyukjin Kwon  wrote:
>
> +1
>
>
>
> On Tue, Sep 12, 2023 at 7:05 AM Xiao Li  wrote:
>
> +1
>
>
>
> Xiao
>
>
>
> On Mon, Sep 11, 2023 at 10:53, Yuanjian Li wrote:
>
> @Peter Toth  I've looked into the details of this
> issue, and it appears that it's neither a regression in version 3.5.0 nor a
> correctness issue. It's a bug related to a new feature. I think we can fix
> this in 3.5.1 and list it as a known issue of the Scala client of Spark
> Connect in 3.5.0.
>
> On Sun, Sep 10, 2023 at 04:12, Mridul Muralidharan wrote:
>
>
>
> +1
>
>
>
> Signatures, digests, etc check out fine.
>
> Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
>
>
>
> Regards,
>
> Mridul
>
>
>
> On Sat, Sep 9, 2023 at 10:02 AM Yuanjian Li 
> wrote:
>
> Please vote on releasing the following candidate(RC5) as Apache Spark
> version 3.5.0.
>
>
>
> The vote is open until 11:59pm Pacific time *Sep 11th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
>
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
>
> The tag to be voted on is v3.5.0-rc5 (commit
> ce5ddad990373636e94071e7cef2f31021add07b):
>
> https://github.com/apache/spark/tree/v3.5.0-rc5
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/
>
>
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
>
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1449
>
>
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-docs/
>
>
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
>
>
> This release is using the release script of the tag v3.5.0-rc5.
>
>
>
> FAQ
>
>
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
>
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your projects resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of date RC going forward).
>
>
>
> ===
>
> What should happen to JIRA tickets still targeting 3.5.0?
>
> ===
>
> The current list of open tickets targeted at 3.5.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.0
>
>
>
> Committers should look at those and triage. Extremely important bug
>
> fixes, documentation, and API tweaks that impact compatibility should
>
> be worked on immediately. Everything else please retarget to an
>
> appropriate release.
>
>
>
> ==
>
> But my bug isn't fixed?
>
> ==
>
> In order to make timely releases, we will typically not hold the
>
> release unless the bug in question is a regression from the previous
>
> release. That being said, if there is something which is a regression
>
> that has not been correctly targeted please ping me or a committer to
>
> help target the issue.
>
>
>
> Thanks,
>
> Yuanjian Li
>
>


Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-07 Thread Yuming Wang
+1.

On Thu, Sep 7, 2023 at 10:33 PM yangjie01 
wrote:

> +1
>
>
>
> *From:* Gengliang Wang 
> *Date:* Thursday, September 7, 2023, 12:53
> *To:* Yuanjian Li 
> *Cc:* Xiao Li , "her...@databricks.com.invalid"
> , Spark dev list 
> *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC4)
>
>
>
> +1
>
>
>
> On Wed, Sep 6, 2023 at 9:46 PM Yuanjian Li  wrote:
>
> +1 (non-binding)
>
> On Wed, Sep 6, 2023 at 15:27, Xiao Li wrote:
>
> +1
>
>
>
> Xiao
>
>
>
> On Wed, Sep 6, 2023 at 22:08, Herman van Hovell wrote:
>
> Tested connect, and everything looks good.
>
>
>
> +1
>
>
>
> On Wed, Sep 6, 2023 at 8:11 AM Yuanjian Li  wrote:
>
> Please vote on releasing the following candidate(RC4) as Apache Spark
> version 3.5.0.
>
>
>
> The vote is open until 11:59pm Pacific time *Sep 8th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
>
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
>
> The tag to be voted on is v3.5.0-rc4 (commit
> c2939589a29dd0d6a2d3d31a8d833877a37ee02a):
>
> https://github.com/apache/spark/tree/v3.5.0-rc4
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-bin/
>
>
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
>
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1448
>
>
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/
>
>
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
>
>
> This release is using the release script of the tag v3.5.0-rc4.
>
>
>
> FAQ
>
>
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
>
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your projects resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of date RC going forward).
>
>
>
> ===
>
> What should happen to JIRA tickets still targeting 3.5.0?
>
> ===
>
> The current list of open tickets targeted at 3.5.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.5.0
>
>
>
> Committers should look at those and triage. Extremely important bug
>
> fixes, documentation, and API tweaks that impact compatibility should
>
> be worked on immediately. Everything else please retarget to an
>
> appropriate release.
>
>
>
> ==
>
> But my bug isn't fixed?
>
> ==
>
> In order to make timely releases, we will typically not hold the
>
> release unless the bug in question is a regression from the previous
>
> release. That being said, if there is something which is a regression
>
> that has not been correctly targeted please ping me or a committer to
>
> help target the issue.
>
>
>
> Thanks,
>
> Yuanjian Li
>
>


Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-30 Thread Yuming Wang
It seems I cannot check the signature:

yumwang@G9L07H60PK Downloads % gpg --keyserver hkps://keys.openpgp.org
--recv-key FC3AE3A7EAA1BAC98770840E7E1ABCC53AAA2216
gpg: key 7E1ABCC53AAA2216: no user ID
gpg: Total number processed: 1
yumwang@G9L07H60PK Downloads % gpg --batch --verify
spark-3.5.0-bin-hadoop3.tgz.asc spark-3.5.0-bin-hadoop3.tgz
gpg: Signature made Tue 8/29 14:46:14 2023 CST
gpg: using RSA key FC3AE3A7EAA1BAC98770840E7E1ABCC53AAA2216
gpg: issuer "liyuanj...@apache.org"
gpg: Can't check signature: No public key
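
(A likely fix, assuming the release manager's key is published in the Spark
KEYS file: import it first and then re-run the verification, e.g.

  curl https://dist.apache.org/repos/dist/dev/spark/KEYS | gpg --import
  gpg --batch --verify spark-3.5.0-bin-hadoop3.tgz.asc spark-3.5.0-bin-hadoop3.tgz
)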



On Thu, Aug 31, 2023 at 11:36 AM Sean Owen  wrote:

> It worked fine after I ran it again; I included "package test" instead of
> "test" (I had previously run "install"). +1
>
> On Wed, Aug 30, 2023 at 6:06 AM yangjie01  wrote:
>
>> Hi, Sean
>>
>>
>>
>> I have performed testing with Java 17 and Scala 2.13 using maven (`mvn
>> clean install` and `mvn package test`), and have not encountered the issue
>> you mentioned.
>>
>>
>>
>> The tests for the connect module depend on the `spark-protobuf` module
>> completing the `package` phase; was that successful? Or could you provide
>> the test command for me to verify?
>>
>>
>>
>> Thanks,
>>
>> Jie Yang
>>
>>
>>
>> *From:* Dipayan Dev 
>> *Date:* Wednesday, August 30, 2023, 17:01
>> *To:* Sean Owen 
>> *Cc:* Yuanjian Li , Spark dev list <
>> dev@spark.apache.org>
>> *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC3)
>>
>>
>>
>> Can we fix this bug in Spark 3.5.0?
>>
>> https://issues.apache.org/jira/browse/SPARK-44884
>> 
>>
>>
>>
>>
>> On Wed, Aug 30, 2023 at 11:51 AM Sean Owen  wrote:
>>
>> It looks good except that I'm getting errors running the Spark Connect
>> tests at the end (Java 17, Scala 2.13) It looks like I missed something
>> necessary to build; is anyone getting this?
>>
>>
>>
>> [ERROR] [Error]
>> /tmp/spark-3.5.0/connector/connect/server/target/generated-test-sources/protobuf/java/org/apache/spark/sql/protobuf/protos/TestProto.java:9:46:
>>  error: package org.sparkproject.spark_protobuf.protobuf does not exist
>>
>>
>>
>> On Tue, Aug 29, 2023 at 11:25 AM Yuanjian Li 
>> wrote:
>>
>> Please vote on releasing the following candidate(RC3) as Apache Spark
>> version 3.5.0.
>>
>>
>>
>> The vote is open until 11:59pm Pacific time *Aug 31st* and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>>
>>
>> [ ] +1 Release this package as Apache Spark 3.5.0
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>> 
>>
>>
>>
>> The tag to be voted on is v3.5.0-rc3 (commit
>> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>>
>> https://github.com/apache/spark/tree/v3.5.0-rc3
>> 
>>
>>
>>
>> The release files, including signatures, digests, etc. can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>> 
>>
>>
>>
>> Signatures used for Spark RCs can be found in this file:
>>
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> 
>>
>>
>>
>> The staging repository for this release can be found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1447
>> 
>>
>>
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>> 
>>
>>
>>
>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>
>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>> 
>>
>>
>>
>> This release is using the release script of the tag v3.5.0-rc3.
>>
>>
>>
>> FAQ
>>
>>
>>
>> =
>>
>> How can I help test this release?
>>
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>>
>> an existing Spark workload and running on this release candidate, then
>>
>> reporting any regressions.
>>
>>
>>
>> If you're working in PySpark you can set up a virtual env and install
>>
>> the current RC and see if anything important breaks, in the 

[ANNOUNCE] Apache Spark 3.3.3 released

2023-08-22 Thread Yuming Wang
We are happy to announce the availability of Apache Spark 3.3.3!

Spark 3.3.3 is a maintenance release containing stability fixes. This
release is based on the branch-3.3 maintenance branch of Spark. We strongly
recommend all 3.3 users to upgrade to this stable release.

To download Spark 3.3.3, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-3-3.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.


[VOTE][RESULT] Release Spark 3.3.3 (RC1)

2023-08-14 Thread Yuming Wang
The vote passes with 7 +1s (4 binding +1s).
Thanks to all who helped with the release!

(* = binding)
+1:

- Yuming Wang *
- Jie Yang
- Dongjoon Hyun *
- Liang-Chi Hsieh *
- Cheng Pan
- Mridul Muralidharan *
- Jia Fan

+0: None

-1: None


Re: [VOTE] Release Apache Spark 3.3.3 (RC1)

2023-08-09 Thread Yuming Wang
+1 myself.

On Tue, Aug 8, 2023 at 12:41 AM Dongjoon Hyun 
wrote:

> Thank you, Yuming.
>
> Dongjoon.
>
> On Mon, Aug 7, 2023 at 9:30 AM yangjie01  wrote:
>
>> HI,Dongjoon and Yuming
>>
>>
>>
>> I submitted a PR a few days ago to try to fix this issue:
>> https://github.com/apache/spark/pull/42167. The reason for the failure
>> is that the branch daily tests and master use the same YAML file.
>>
>>
>>
>> Jie Yang
>>
>>
>>
>> *From:* Dongjoon Hyun 
>> *Date:* Tuesday, August 8, 2023, 00:18
>> *To:* Yuming Wang 
>> *Cc:* dev 
>> *Subject:* Re: [VOTE] Release Apache Spark 3.3.3 (RC1)
>>
>>
>>
>> Hi, Yuming.
>>
>>
>>
>> One of the community GitHub Actions test pipelines is consistently
>> unhealthy due to the Python mypy linter.
>>
>>
>>
>> https://github.com/apache/spark/actions/workflows/build_branch33.yml
>>
>>
>>
>> It seems due to a pipeline difference, since the same Python mypy
>> linter already passes in the commit build.
>>
>>
>>
>> Dongjoon.
>>
>>
>>
>>
>>
>> On Fri, Aug 4, 2023 at 8:09 PM Yuming Wang  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.3.3.
>>
>> The vote is open until 11:59pm Pacific time August 10th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.3
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org
>>
>> The tag to be voted on is v3.3.3-rc1 (commit
>> 8c2b3319c6734250ff9d72f3d7e5cab56b142195):
>> https://github.com/apache/spark/tree/v3.3.3-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.3-rc1-bin
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1445
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.3-rc1-docs
>>
>> The list of bug fixes going into 3.3.3 can be found at the following URL:
>> https://s.apache.org/rjci4
>>
>> This release is using the release script of the tag v3.3.3-rc1.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.3.3?
>> ===
>> The current list of open tickets targeted at 3.3.3 can be found at:
>> https://issues.apache.org/jira/projects/SPARK

Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-08 Thread Yuming Wang
-1. I found a NoClassDefFoundError bug:
https://issues.apache.org/jira/browse/SPARK-44719.

On Mon, Aug 7, 2023 at 11:24 AM yangjie01 
wrote:

>
>
> I submitted a PR last week to try and solve this issue:
> https://github.com/apache/spark/pull/42236.
>
>
>
> *From:* Sean Owen 
> *Date:* Monday, August 7, 2023, 11:05
> *To:* Yuanjian Li 
> *Cc:* Spark dev list 
> *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC1)
>
>
>
>
>
>
> Let's keep testing 3.5.0 of course while that change is going in. (See
> https://github.com/apache/spark/pull/42364#issuecomment-1666878287
> 
> )
>
>
>
> Otherwise testing is pretty much as usual, except I get this test failure
> in Connect, which is new. Anyone else? This is Java 8, Scala 2.13, Debian
> 12.
>
>
>
> - from_protobuf_messageClassName_options *** FAILED ***
>   org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS]
> Could not load Protobuf class with name
> org.apache.spark.connect.proto.StorageLevel.
> org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf
> Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar
> with Protobuf classes needs to be shaded (com.google.protobuf.* -->
> org.sparkproject.spark_protobuf.protobuf.*).
>   at
> org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3554)
>   at
> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:198)
>   at
> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:156)
>   at
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58)
>   at
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57)
>   at
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43)
>   at
> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42)
>   at
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
>   at
> org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:73)
>   at scala.collection.immutable.List.map(List.scala:246)
>
>
>
> On Sat, Aug 5, 2023 at 5:42 PM Sean Owen  wrote:
>
> I'm still testing other combinations, but it looks like tests fail on Java
> 17 after building with Java 8, which should be a normal supported
> configuration.
>
> This is described at https://github.com/apache/spark/pull/41943
> 
> and looks like it is resolved by moving back to Scala 2.13.8 for now.
>
> Unless I'm missing something, we need to fix this for 3.5, or it's not clear
> the build will run on Java 17.
>
>
>
> On Fri, Aug 4, 2023 at 5:45 PM Yuanjian Li  wrote:
>
> Please vote on releasing the following candidate(RC1) as Apache Spark
> version 3.5.0.
>
>
>
> The vote is open until 11:59pm Pacific time *Aug 9th* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
>
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
> 
>
>
>
> The tag to be voted on is v3.5.0-rc1 (commit
> 7e862c01fc9a1d3b47764df8b6a4b5c4cafb0807):
>
> https://github.com/apache/spark/tree/v3.5.0-rc1
> 
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-bin/
> 
>
>
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
>
>
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1444
> 
>
>
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-docs/
> 

Re: Welcome two new Apache Spark committers

2023-08-06 Thread Yuming Wang
Congratulations!

On Mon, Aug 7, 2023 at 11:11 AM Kent Yao  wrote:

> Congrats! Peter and Xiduo!
>
> On Mon, Aug 7, 2023 at 11:01, Cheng Pan wrote:
> >
> > Congratulations! Peter and Xiduo!
> >
> > Thanks,
> > Cheng Pan
> >
> >
> > > On Aug 7, 2023, at 10:58, Gengliang Wang  wrote:
> > >
> > > Congratulations! Peter and Xiduo!
> >
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[VOTE] Release Apache Spark 3.3.3 (RC1)

2023-08-04 Thread Yuming Wang
Please vote on releasing the following candidate as Apache Spark version
3.3.3.

The vote is open until 11:59pm Pacific time August 10th and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.3.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org

The tag to be voted on is v3.3.3-rc1 (commit
8c2b3319c6734250ff9d72f3d7e5cab56b142195):
https://github.com/apache/spark/tree/v3.3.3-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.3-rc1-bin

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1445

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.3-rc1-docs

The list of bug fixes going into 3.3.3 can be found at the following URL:
https://s.apache.org/rjci4

This release is using the release script of the tag v3.3.3-rc1.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).

===
What should happen to JIRA tickets still targeting 3.3.3?
===
The current list of open tickets targeted at 3.3.3 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.3.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: Time for Spark 3.3.3 release?

2023-07-31 Thread Yuming Wang
Thank you. I will prepare 3.3.3-rc1 soon.

On Sun, Jul 30, 2023 at 12:15 AM Dongjoon Hyun 
wrote:

> +1
>
> Thank you for volunteering, Yuming.
>
> Dongjoon
>
>
> On Fri, Jul 28, 2023 at 11:35 AM Yuming Wang  wrote:
>
>> Hi Spark devs,
>>
>> Since Apache Spark 3.3.2 tag creation (Feb 11), 60 patches
>> <https://github.com/apache/spark/compare/v3.3.2...branch-3.3> have
>> arrived at branch-3.3.
>>
>> Shall we make a new release, Apache Spark 3.3.3, as the third release at
>> branch-3.3?
>> I'd like to volunteer as the release manager for Apache Spark 3.3.3.
>>
>>
>>


Time for Spark 3.3.3 release?

2023-07-28 Thread Yuming Wang
Hi Spark devs,

Since Apache Spark 3.3.2 tag creation (Feb 11), 60 patches
 have arrived
at branch-3.3.

Shall we make a new release, Apache Spark 3.3.3, as the third release at
branch-3.3?
I'd like to volunteer as the release manager for Apache Spark 3.3.3.


Re: [VOTE] Release Spark 3.4.1 (RC1)

2023-06-22 Thread Yuming Wang
+1.

On Thu, Jun 22, 2023 at 4:41 PM Jacek Laskowski  wrote:

> +1
>
> Builds and runs fine on Java 17, macOS.
>
> $ ./dev/change-scala-version.sh 2.13
> $ mvn \
> -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano,connect
> \
> -DskipTests \
> clean install
>
> $ python/run-tests --parallelism=1 --testnames 'pyspark.sql.session
> SparkSession.sql'
> ...
Tests passed in 28 seconds
>
> Pozdrawiam,
> Jacek Laskowski
> 
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>
>
> On Tue, Jun 20, 2023 at 4:41 AM Dongjoon Hyun  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.4.1.
>>
>> The vote is open until June 23rd 1AM (PST) and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.4.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v3.4.1-rc1 (commit
>> 6b1ff22dde1ead51cbf370be6e48a802daae58b6)
>> https://github.com/apache/spark/tree/v3.4.1-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1443/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.4.1-rc1-docs/
>>
>> The list of bug fixes going into 3.4.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12352874
>>
>> This release is using the release script of the tag v3.4.1-rc1.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.4.1?
>> ===
>>
>> The current list of open tickets targeted at 3.4.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.4.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>


Re: Apache Spark 4.0.0 Dev Item Planning (SPARK-44111)

2023-06-20 Thread Yuming Wang
Thank you Dongjoon. I'd like to add these items.

*Support for more SQL syntax*
SPARK-31561  Add QUALIFY
clause (see the sketch after this list)
SPARK-24497  Support
recursive SQL
SPARK-32064  Support
temporary table

*Improve Query performance in specific scenarios*
SPARK-8682  Range Join
for Spark SQL. We have a blog in Chinese
 about this optimization.
SPARK-38506  Push
partial aggregation through join
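
For example, a sketch of the QUALIFY syntax as implemented in other engines
(table and column names are hypothetical); it filters on a window function
without needing a subquery:

  SELECT id, ts
  FROM events
  QUALIFY row_number() OVER (PARTITION BY id ORDER BY ts DESC) = 1;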


On Wed, Jun 21, 2023 at 4:42 AM Dongjoon Hyun  wrote:

> Hi, All.
>
> As a continuation of our previous discussion, the official Apache Spark
> 4.0 Plan JIRA was created today in order to collect the community dev items.
> Feel free to add your work items, ideas, suggestions, aspirations and
> interests. We will moderate together.
>
> https://issues.apache.org/jira/browse/SPARK-44111
> Prepare Apache Spark 4.0.0
>
> In addition, we are going to include all left-over items which Apache
> Spark 3.5 cannot include on July 16th (Feature Freeze,
> https://spark.apache.org/versioning-policy.html)
>
>
>
> === PREVIOUS THREADS ===
>
> 2023-05-28 Apache Spark 3.5.0 Expectations (?)
> https://lists.apache.org/thread/3x6dh17bmy20n3frtt3crgxjydnxh2o0
>
> 2023-05-30 Apache Spark 4.0 Timeframe?
> https://lists.apache.org/thread/xhkgj60j361gdpywoxxz7qspp2w80ry6
>
> 2023-06-05 ASF policy violation and Scala version issues
> https://lists.apache.org/thread/k7gr65wt0fwtldc7hp7bd0vkg1k93rrb
>
> 2023-06-12 [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)
> https://lists.apache.org/thread/r0zn6rd8y25yn2dg59ktw3ttrwxzqrfb
>
> 2023-06-16 [VOTE][RESULT] Release Plan for Apache Spark 4.0.0 (June 2024)
> https://lists.apache.org/thread/5vfof0nm82gt5b2k2o0ws944hofz232g
>


Re: Apache Spark 3.4.1 Release?

2023-06-08 Thread Yuming Wang
+1.

On Fri, Jun 9, 2023 at 7:14 AM Chao Sun  wrote:

> +1 too
>
> On Thu, Jun 8, 2023 at 2:34 PM kazuyuki tanimura
>  wrote:
> >
> > +1 (non-binding), Thank you Dongjoon
> >
> > Kazu
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-10 Thread Yuming Wang
+1.

On Tue, Apr 11, 2023 at 12:17 AM Mridul Muralidharan 
wrote:

> +1
>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
>
> Regards,
> Mridul
>
>
> On Mon, Apr 10, 2023 at 10:34 AM huaxin gao 
> wrote:
>
>> +1
>>
>> On Mon, Apr 10, 2023 at 8:17 AM Chao Sun  wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Mon, Apr 10, 2023 at 7:07 AM yangjie01  wrote:
>>>
 +1 (non-binding)



 *From:* Sean Owen 
 *Date:* Monday, April 10, 2023, 21:19
 *To:* Dongjoon Hyun 
 *Cc:* "dev@spark.apache.org" 
 *Subject:* Re: [VOTE] Release Apache Spark 3.2.4 (RC1)



 +1 from me



 On Sun, Apr 9, 2023 at 7:19 PM Dongjoon Hyun 
 wrote:

 I'll start with my +1.

 I verified the checksum, signatures of the artifacts, and
 documentations.
 Also, ran the tests with YARN and K8s modules.

 Dongjoon.

 On 2023/04/09 23:46:10 Dongjoon Hyun wrote:
 > Please vote on releasing the following candidate as Apache Spark
 version
 > 3.2.4.
 >
 > The vote is open until April 13th 1AM (PST) and passes if a majority
 +1 PMC
 > votes are cast, with a minimum of 3 +1 votes.
 >
 > [ ] +1 Release this package as Apache Spark 3.2.4
 > [ ] -1 Do not release this package because ...
 >
 > To learn more about Apache Spark, please see
 https://spark.apache.org/
 
 >
 > The tag to be voted on is v3.2.4-rc1 (commit
 > 0ae10ac18298d1792828f1d59b652ef17462d76e)
 > https://github.com/apache/spark/tree/v3.2.4-rc1
 
 >
 > The release files, including signatures, digests, etc. can be found
 at:
 > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-bin/
 
 >
 > Signatures used for Spark RCs can be found in this file:
 > https://dist.apache.org/repos/dist/dev/spark/KEYS
 
 >
 > The staging repository for this release can be found at:
 >
 https://repository.apache.org/content/repositories/orgapachespark-1442/
 
 >
 > The documentation corresponding to this release can be found at:
 > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-docs/
 
 >
 > The list of bug fixes going into 3.2.4 can be found at the following
 URL:
 > https://issues.apache.org/jira/projects/SPARK/versions/12352607
 
 >
 > This release is using the release script of the tag v3.2.4-rc1.
 >
 > FAQ
 >
 > =
 > How can I help test this release?
 > =
 >
 > If you are a Spark user, you can help us test this release by taking
 > an existing Spark workload and running on this release candidate, then
 > reporting any regressions.
 >
 > If you're working in PySpark you can set up a virtual env and install
 > the current RC and see if anything important breaks, in the Java/Scala
 > you can add the staging repository to your projects resolvers and test
 > with the RC (make sure to clean up the artifact cache before/after so
 > you don't end up building with an out of date RC going forward).
 >
 > ===
 > What should happen to JIRA tickets still targeting 3.2.4?
 > ===
 >
 > The current list of open tickets targeted at 3.2.4 can be found at:
 > https://issues.apache.org/jira/projects/SPARK
 
 and search for "Target
 > Version/s" = 3.2.4
 >
 > Committers should look at those and triage. Extremely important bug
 > fixes, documentation, and API tweaks that impact compatibility should
 > be worked on immediately. Everything else please retarget to an
 > appropriate release.
 >
 > ==
 > But my bug isn't fixed?
 > ==
 >
 > In order to make timely releases, we will 

Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-10 Thread Yuming Wang
+1.

On Tue, Apr 11, 2023 at 9:14 AM Yikun Jiang  wrote:

> +1 (non-binding)
>
> Also ran the docker image related test (signatures/standalone/k8s) with
> rc7: https://github.com/apache/spark-docker/pull/32
>
> Regards,
> Yikun
>
>
> On Tue, Apr 11, 2023 at 4:44 AM Jacek Laskowski  wrote:
>
>> +1
>>
>> * Built fine with Scala 2.13
>> and -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano
>> * Ran some demos on Java 17
>> * Mac mini / Apple M2 Pro / Ventura 13.3.1
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> "The Internals Of" Online Books 
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> 
>>
>>
>> On Sat, Apr 8, 2023 at 1:30 AM Xinrong Meng 
>> wrote:
>>
>>> Please vote on releasing the following candidate(RC7) as Apache Spark
>>> version 3.4.0.
>>>
>>> The vote is open until 11:59pm Pacific time *April 12th* and passes if
>>> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.4.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.4.0-rc7 (commit
>>> 87a5442f7ed96b11051d8a9333476d080054e5a0):
>>> https://github.com/apache/spark/tree/v3.4.0-rc7
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1441
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/
>>>
>>> The list of bug fixes going into 3.4.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>>>
>>> This release is using the release script of the tag v3.4.0-rc7.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.4.0?
>>> ===
>>> The current list of open tickets targeted at 3.4.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.4.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>> Thanks,
>>> Xinrong Meng
>>>
>>


sbt build is broken because repo is not available

2023-04-07 Thread Yuming Wang
Hi all,

sbt build is broken because repo is not available. Please see:
https://github.com/sbt/sbt/issues/7202.


Re: Apache Spark 3.2.4 EOL Release?

2023-04-05 Thread Yuming Wang
+1

On Wed, Apr 5, 2023 at 9:09 AM Xinrong Meng 
wrote:

> +1
>
> Hyukjin Kwon wrote on Tue, Apr 4, 2023 at 5:31 PM:
>
>> +1
>>
>> On Wed, 5 Apr 2023 at 07:31, Mridul Muralidharan 
>> wrote:
>>
>>>
>>> +1
>>> Sounds good to me.
>>>
>>> Thanks,
>>> Mridul
>>>
>>>
>>> On Tue, Apr 4, 2023 at 1:39 PM huaxin gao 
>>> wrote:
>>>
 +1

 On Tue, Apr 4, 2023 at 11:17 AM Chao Sun  wrote:

> +1
>
> On Tue, Apr 4, 2023 at 11:12 AM Holden Karau 
> wrote:
>
>> +1
>>
>> On Tue, Apr 4, 2023 at 11:04 AM L. C. Hsieh 
>> wrote:
>>
>>> +1
>>>
>>> Sounds good and thanks Dongjoon for driving this.
>>>
>>> On 2023/04/04 17:24:54 Dongjoon Hyun wrote:
>>> > Hi, All.
>>> >
>>> > Since Apache Spark 3.2.0 passed RC7 vote on October 12, 2021,
>>> branch-3.2
>>> > has been maintained and served well until now.
>>> >
>>> > - https://github.com/apache/spark/releases/tag/v3.2.0 (tagged on
>>> Oct 6,
>>> > 2021)
>>> > - https://lists.apache.org/thread/jslhkh9sb5czvdsn7nz4t40xoyvznlc7
>>> >
>>> > As of today, branch-3.2 has 62 additional patches after v3.2.3 and
>>> reaches
>>> > the end-of-life this month according to the Apache Spark release
>>> cadence. (
>>> > https://spark.apache.org/versioning-policy.html)
>>> >
>>> > $ git log --oneline v3.2.3..HEAD | wc -l
>>> > 62
>>> >
>>> > With the upcoming Apache Spark 3.4, I hope the users can get a
>>> chance to
>>> > have these last bits of Apache Spark 3.2.x, and I'd like to
>>> propose to have
>>> > Apache Spark 3.2.4 EOL Release next week and volunteer as the
>>> release
>>> > manager. WDTY? Please let me know if you need more patches on
>>> branch-3.2.
>>> >
>>> > Thanks,
>>> > Dongjoon.
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-04-03 Thread Yuming Wang
+1

On Tue, Apr 4, 2023 at 3:46 AM L. C. Hsieh  wrote:

> +1
>
> Thanks Xinrong.
>
> On Mon, Apr 3, 2023 at 12:35 PM Dongjoon Hyun 
> wrote:
> >
> > +1
> >
> > I also verified that RC5 has SBOM artifacts.
> >
> >
> https://repository.apache.org/content/repositories/orgapachespark-1439/org/apache/spark/spark-core_2.12/3.4.0/spark-core_2.12-3.4.0-cyclonedx.json
> >
> https://repository.apache.org/content/repositories/orgapachespark-1439/org/apache/spark/spark-core_2.13/3.4.0/spark-core_2.13-3.4.0-cyclonedx.json
> >
> > Thanks,
> > Dongjoon.
> >
> >
> >
> > On Mon, Apr 3, 2023 at 1:57 AM yangjie01  wrote:
> >>
> >> +1, checked Java 17 + Scala 2.13 + Python 3.10.10.
> >>
> >>
> >>
> >> From: Herman van Hovell 
> >> Date: Friday, March 31, 2023 12:12
> >> To: Sean Owen 
> >> Cc: Xinrong Meng , dev 
> >> Subject: Re: [VOTE] Release Apache Spark 3.4.0 (RC5)
> >>
> >>
> >>
> >> +1
> >>
> >>
> >>
> >> On Thu, Mar 30, 2023 at 11:05 PM Sean Owen  wrote:
> >>
> >> +1 same result from me as last time.
> >>
> >>
> >>
> >> On Thu, Mar 30, 2023 at 3:21 AM Xinrong Meng 
> wrote:
> >>
> >> Please vote on releasing the following candidate(RC5) as Apache Spark
> version 3.4.0.
> >>
> >> The vote is open until 11:59pm Pacific time April 4th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >>
> >> [ ] +1 Release this package as Apache Spark 3.4.0
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >> The tag to be voted on is v3.4.0-rc5 (commit
> f39ad617d32a671e120464e4a75986241d72c487):
> >> https://github.com/apache/spark/tree/v3.4.0-rc5
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1439
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-docs/
> >>
> >> The list of bug fixes going into 3.4.0 can be found at the following
> URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/12351465
> >>
> >> This release is using the release script of the tag v3.4.0-rc5.
> >>
> >>
> >>
> >>
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark you can set up a virtual env and install
> >> the current RC and see if anything important breaks, in the Java/Scala
> >> you can add the staging repository to your projects resolvers and test
> >> with the RC (make sure to clean up the artifact cache before/after so
> >> you don't end up building with an out of date RC going forward).
> >>
> >> ===
> >> What should happen to JIRA tickets still targeting 3.4.0?
> >> ===
> >> The current list of open tickets targeted at 3.4.0 can be found at:
> >> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
> >>
> >> Committers should look at those and triage. Extremely important bug
> >> fixes, documentation, and API tweaks that impact compatibility should
> >> be worked on immediately. Everything else please retarget to an
> >> appropriate release.
> >>
> >> ==
> >> But my bug isn't fixed?
> >> ==
> >> In order to make timely releases, we will typically not hold the
> >> release unless the bug in question is a regression from the previous
> >> release. That being said, if there is something which is a regression
> >> that has not been correctly targeted please ping me or a committer to
> >> help target the issue.
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Xinrong Meng
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: please help why there is big difference in partitionFilter in spark2.4.2 and spark3.1.3.

2023-03-05 Thread Yuming Wang
Hi Liyun,

This is because of this change:
https://issues.apache.org/jira/browse/SPARK-27638.
You can set spark.sql.legacy.typeCoercion.datetimeToString.enabled to true to
restore the old behavior.
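
For example, a minimal way to try this on the query above (a sketch;
spark-sql is just one entry point, and the same --conf works with
spark-submit or spark-shell):

```
# Re-run the query with the legacy date/string coercion enabled, so the
# partition filter compares dt as a string again instead of casting it.
spark-sql \
  --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled=true \
  -e "select * from eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly
      where dt >= date_sub('2023-03-01', 30)"
```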

On Mon, Mar 6, 2023 at 10:27 AM zhangliyun  wrote:

> Hi all
>
>
>   I have a Spark SQL query that runs correctly in Spark 2.4.2, but when I
> upgrade to Spark 3.1.3 it has a problem.
>
>  The SQL:
>
>  ```
>
> select * from eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly
> where dt >= date_sub('${today}',30);
>
>
> ```
>
> It loads the past 30 days of data from the table
> eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly, where
> today='2023-03-01'.
>
>
> In Spark 2, the physical plan shows the partition filter as PartitionFilters:
> [isnotnull(dt#1461), (dt#1461 >= 2023-01-31)]
>
>  +- *(4) FileScan parquet 
> eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly[disputeid#1327,statuswork#1330,opTs#1457,trailSeqno#1459,trailRba#1460,dt#1461,hr#1462]
>  Batched: true, Format: Parquet, Location: 
> PrunedInMemoryFileIndex[gs://pypl-bkt-prd-row-std-gds-non-edw-tables/apps/risk/eds/eds_risk/eds_r...,
>  PartitionCount: 805, PartitionFilters: [isnotnull(dt#1461), (dt#1461 >= 
> 2023-01-31)], PushedFilters: [IsNotNull(disputeid)], ReadSchema: 
> struct
>
>
>
> In Spark 3, the physical plan shows the partition filter as
> [isnotnull(dt#1602),
> (cast(dt#1602 as date) >= 19387)]
> ```
>
> (8) Scan parquet eds_rds.cdh_prpc63cgudba_pp_index_disputecasedetails_hourly
> Output [7]: [disputeid#1468, statuswork#1471, opTs#1598, trailSeqno#1600, 
> trailRba#1601, dt#1602, hr#1603]
> Batched: true
> Location: InMemoryFileIndex 
> [gs://pypl-bkt-prd-row-std-gds-non-edw-tables/apps/risk/eds/eds_risk/eds_rds/cdh/prpc63cgudba_pp_index_disputecasedetails/dt=2023-01-30/hr=00,
>  ... 784 entries]
> PartitionFilters: [isnotnull(dt#1602), (cast(dt#1602 as date) >= 19387)]
> PushedFilters: [IsNotNull(disputeid)]
> ReadSchema: 
> struct
>
>
> ```
>
> I want to ask why there is such a big difference in the partition filter
> between Spark 2 and Spark 3; most of my Spark configuration is the same
> when running the same SQL in both.
>


Re: [VOTE][SPIP] Lazy Materialization for Parquet Read Performance Improvement

2023-02-13 Thread Yuming Wang
+1

On Tue, Feb 14, 2023 at 11:27 AM Prem Sahoo  wrote:

> +1
>
> On Mon, Feb 13, 2023 at 8:13 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Mon, Feb 13, 2023 at 3:49 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> +1 for me
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 13 Feb 2023 at 23:18, huaxin gao  wrote:
>>>
 +1

 On Mon, Feb 13, 2023 at 3:09 PM Dongjoon Hyun 
 wrote:

> +1
>
> Dongjoon
>
> On 2023/02/13 22:52:59 "L. C. Hsieh" wrote:
> > Hi all,
> >
> > I'd like to start the vote for SPIP: Lazy Materialization for Parquet
> > Read Performance Improvement.
> >
> > The high-level summary of the SPIP is that it proposes an improvement
> > to the Parquet reader with lazy materialization, which only materializes
> > (i.e. decompresses, decodes, etc.) necessary values. For Spark-SQL filter
> > operations, evaluating the filters first and lazily materializing only
> > the used values can avoid wasted computation and improve the read
> > performance.
> >
> > References:
> >
> > JIRA ticket https://issues.apache.org/jira/browse/SPARK-42256
> > SPIP doc
> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
> > Discussion thread
> > https://lists.apache.org/thread/5yf2ylqhcv94y03m7gp3mgf3q0fp6gw6
> >
> > Please vote on the SPIP for the next 72 hours:
> >
> > [ ] +1: Accept the proposal as an official SPIP
> > [ ] +0
> > [ ] -1: I don’t think this is a good idea because …
> >
> > Thank you!
> >
> > Liang-Chi Hsieh
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-12 Thread Yuming Wang
+1.

On Mon, Feb 13, 2023 at 11:52 AM yangjie01  wrote:

> +1, Tested 3.3.2-rc1 with Java 17 + Scala 2.13 + Python 3.10; all tests
> passed.
>
>
>
> Yang Jie
>
>
>
> *From:* Yikun Jiang 
> *Date:* Monday, February 13, 2023 11:47
> *To:* Spark dev list 
> *Cc:* "L. C. Hsieh" 
> *Subject:* Re: [VOTE] Release Spark 3.3.2 (RC1)
>
>
>
> +1, Tested 3.3.2-rc1 with spark-docker:
>
> - Downloading rc4 tgz, validate the key.
>
> - Extract bin and build image
>
> - Run K8s IT, standalone test of R/Python/Scala/All image [1]
>
>
>
> [1] https://github.com/apache/spark-docker/pull/29
> 
>
>
>
> Regards,
>
> Yikun
>
>
>
>
>
> On Mon, Feb 13, 2023 at 10:25 AM yangjie01  wrote:
>
> Which Python version do you use for testing? When I use the latest Python
> 3.11, I can reproduce similar test failures (43 tests in the sql module
> fail), but when I use Python 3.10, they succeed.
>
>
>
> YangJie
>
>
>
> *From:* Bjørn Jørgensen 
> *Date:* Monday, February 13, 2023 05:09
> *To:* Sean Owen 
> *Cc:* "L. C. Hsieh" , Spark dev list <
> dev@spark.apache.org>
> *Subject:* Re: [VOTE] Release Spark 3.3.2 (RC1)
>
>
>
Tried it one more time and got the same result.
>
>
>
> On another box with Manjaro
>
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.3.2:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [01:50
> min]
> [INFO] Spark Project Tags . SUCCESS [
> 17.359 s]
> [INFO] Spark Project Sketch ... SUCCESS [
> 12.517 s]
> [INFO] Spark Project Local DB . SUCCESS [
> 14.463 s]
> [INFO] Spark Project Networking ... SUCCESS [01:07
> min]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>  9.013 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
>  8.184 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 10.454 s]
> [INFO] Spark Project Core . SUCCESS [23:58
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [
> 21.218 s]
> [INFO] Spark Project GraphX ... SUCCESS [01:24
> min]
> [INFO] Spark Project Streaming  SUCCESS [04:57
> min]
> [INFO] Spark Project Catalyst . SUCCESS [08:00
> min]
> [INFO] Spark Project SQL .. SUCCESS [
>  01:02 h]
> [INFO] Spark Project ML Library ... SUCCESS [14:38
> min]
> [INFO] Spark Project Tools  SUCCESS [
>  4.394 s]
> [INFO] Spark Project Hive . SUCCESS [53:43
> min]
> [INFO] Spark Project REPL . SUCCESS [01:16
> min]
> [INFO] Spark Project Assembly . SUCCESS [
>  2.186 s]
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [
> 16.150 s]
> [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:34
> min]
> [INFO] Kafka 0.10+ Source for Structured Streaming  SUCCESS [32:55
> min]
> [INFO] Spark Project Examples . SUCCESS [
> 23.800 s]
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [
>  7.301 s]
> [INFO] Spark Avro . SUCCESS [01:19
> min]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time:  03:31 h
> [INFO] Finished at: 2023-02-12T21:54:20+01:00
> [INFO]
> 
> [bjorn@amd7g spark-3.3.2]$  java -version
> openjdk version "17.0.6" 2023-01-17
> OpenJDK Runtime Environment (build 17.0.6+10)
> OpenJDK 64-Bit Server VM (build 17.0.6+10, mixed mode)
>
>
>
>
>
> :)
>
>
>
> So I'm +1
>
>
>
>
>
> søn. 12. feb. 2023 kl. 12:53 skrev Bjørn Jørgensen <
> bjornjorgen...@gmail.com>:
>
> I use ubuntu rolling
>
> $ java -version
> openjdk version "17.0.6" 2023-01-17
> OpenJDK Runtime Environment (build 17.0.6+10-Ubuntu-0ubuntu1)
> OpenJDK 64-Bit Server VM (build 17.0.6+10-Ubuntu-0ubuntu1, mixed mode,
> sharing)
>
>
>
> I have rebooted now and restarted ./build/mvn clean package
>
>
>
>
>
>
>
> søn. 12. feb. 2023 kl. 04:47 skrev Sean Owen :
>
> +1 The tests and all results were the same as ever for me (Java 11, Scala
> 2.13, Ubuntu 22.04)
>
> I also didn't see that issue ... maybe it's somehow locale related, which
> could still be a bug.
>
>
>
> On Sat, Feb 11, 2023 at 8:49 PM L. C. Hsieh  wrote:
>
> Thank you for testing it.
>
> I was going to run it again but still didn't see any errors.
>
> I also checked CI (and looked again now) on branch-3.3 before cutting RC.
>
> BTW, I 

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-01-31 Thread Yuming Wang
+1.

On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura
 wrote:

> Great! Much appreciated, Mitch!
>
> Kazu
>
> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh 
> wrote:
>
> Thanks, Kazu.
>
> I followed that template link and indeed as you pointed out it is a common
> template. If it works then it is what it is.
>
> I will be going through your design proposals and hopefully we can review
> it.
>
> Regards,
>
> Mich
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura 
> wrote:
>
>> Thank you Mich. I followed the instructions at
>> https://spark.apache.org/improvement-proposals.html and used its
>> template.
>> While we are open to revising our design doc, it seems more like you are
>> proposing that the community change the instructions per se?
>>
>> Kazu
>>
>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> Thanks for these proposals. Good suggestions. Is this style of breaking
>> down your approach standard?
>>
>> My view would be that perhaps it makes more sense to follow the
>> industry-established approach of breaking down your technical proposal into:
>>
>>
>>1. Background
>>2. Objective
>>3. Scope
>>4. Constraints
>>5. Assumptions
>>6. Reporting
>>7. Deliverables
>>8. Timelines
>>9. Appendix
>>
>> Your current approach of using the questions below
>>
>> Q1. What are you trying to do? Articulate your objectives using
>> absolutely no jargon. What are you trying to achieve?
>> Q2. What problem is this proposal NOT designed to solve? What issues the
>> suggested proposal is not going to address
>> Q3. How is it done today, and what are the limits of current practice?
>> Q4. What is new in your approach and why do you think it will succeed?
>> Q5. Who cares? If you are successful, what difference will it make? If
>> your proposal succeeds, what tangible benefits will it add?
>> Q6. What are the risks?
>> Q7. How long will it take?
>> Q8. What are the midterm and final “exams” to check for success?
>>
>>
>> may not do justice to your proposal.
>>
>> HTH
>>
>> Mich
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <
>> ktanim...@apple.com.invalid> wrote:
>>
>>> Hi everyone,
>>>
>>> I would like to start a discussion on “Lazy Materialization for Parquet
>>> Read Performance Improvement"
>>>
>>> Chao and I propose a Parquet reader with lazy materialization. For
>>> Spark-SQL filter operations, evaluating the filters first and lazily
>>> materializing only the used values can avoid wasted computation and improve
>>> the read performance.
>>> The current implementation of Spark requires the read values to
>>> materialize (i.e. decompress, decode, etc.) into memory first before
>>> applying the filters, even though the filters may eventually throw away many
>>> values.
>>>
>>> We made our design doc as follows.
>>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
>>> SPIP Doc:
>>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>>
>>> Liang-Chi was kind enough to shepherd this effort.
>>>
>>> Thank you
>>> Kazu
>>>
>>
>>
>


Re: Time for release v3.3.2

2023-01-30 Thread Yuming Wang
+1

On Tue, Jan 31, 2023 at 12:18 PM yangjie01  wrote:

> +1 Thanks Liang-Chi!
>
>
>
> YangJie
>
>
>
> *From:* huaxin gao 
> *Date:* Tuesday, January 31, 2023 10:03
> *To:* Dongjoon Hyun 
> *Cc:* Hyukjin Kwon , Chao Sun ,
> "L. C. Hsieh" , Spark dev list 
> *Subject:* Re: Time for release v3.3.2
>
>
>
> +1 Thanks Liang-Chi!
>
>
>
> On Mon, Jan 30, 2023 at 6:01 PM Dongjoon Hyun 
> wrote:
>
> +1
>
>
>
> Thank you so much, Liang-Chi.
>
> 3.3.2 release will help 3.4.0 release too because they share many bug
> fixes.
>
>
>
> Dongjoon
>
>
>
>
>
> On Mon, Jan 30, 2023 at 5:56 PM Hyukjin Kwon  wrote:
>
> +100!
>
>
>
> On Tue, 31 Jan 2023 at 10:54, Chao Sun  wrote:
>
> +1, thanks Liang-Chi for volunteering!
>
> Chao
>
> On Mon, Jan 30, 2023 at 5:51 PM L. C. Hsieh  wrote:
> >
> > Hi Spark devs,
> >
> > As you know, it has been 4 months since Spark 3.3.1 was released on
> > 2022/10, it seems a good time to think about next maintenance release,
> > i.e. Spark 3.3.2.
> >
> > I'm thinking of the release of Spark 3.3.2 this Feb (2023/02).
> >
> > What do you think?
> >
> > I am willing to volunteer for Spark 3.3.2 if there is consensus about
> > this maintenance release.
> >
> > Thank you.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Spark 3.2.3 (RC1)

2022-11-16 Thread Yuming Wang
+1

On Wed, Nov 16, 2022 at 2:28 PM Yang,Jie(INF)  wrote:

> I switched from Scala 2.13 to Scala 2.12 today. The test is still in progress
> and it has not hung.
>
>
>
> Yang Jie
>
>
>
> *From:* Dongjoon Hyun 
> *Date:* Wednesday, November 16, 2022 01:17
> *To:* "Yang,Jie(INF)" 
> *Cc:* huaxin gao , "L. C. Hsieh" <
> vii...@gmail.com>, Chao Sun , dev <
> dev@spark.apache.org>
> *Subject:* Re: [VOTE] Release Spark 3.2.3 (RC1)
>
>
>
> Did you hit that in Scala 2.12, too?
>
>
>
> Dongjoon.
>
>
>
> On Tue, Nov 15, 2022 at 4:36 AM Yang,Jie(INF)  wrote:
>
> Hi, all
>
>
>
> I tested v3.2.3 with the following command:
>
>
>
> ```
>
> dev/change-scala-version.sh 2.13
>
> build/mvn clean install -Phadoop-3 -Phadoop-cloud -Pmesos -Pyarn
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive
> -Pscala-2.13 -fn
>
> ```
>
>
>
> The testing environment is:
>
>
>
> OS: CentOS 6u3 Final
>
> Java: zulu 11.0.17
>
> Python: 3.9.7
>
> Scala: 2.13
>
>
>
> The above test command has been executed twice, and both times it hung in the
> following stack:
>
>
>
> ```
>
> "ScalaTest-main-running-JoinSuite" #1 prio=5 os_prio=0 cpu=312870.06ms
> elapsed=1552.65s tid=0x7f2ddc02d000 nid=0x7132 waiting on condition
> [0x7f2de3929000]
>
>java.lang.Thread.State: WAITING (parking)
>
>at jdk.internal.misc.Unsafe.park(java.base@11.0.17/Native Method)
>
>- parking to wait for  <0x000790d00050> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>
>at java.util.concurrent.locks.LockSupport.park(java.base@11.0.17
> /LockSupport.java:194)
>
>at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.17
> /AbstractQueuedSynchronizer.java:2081)
>
>at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.17
> /LinkedBlockingQueue.java:433)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:275)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$Lambda$9429/0x000802269840.apply(Unknown
> Source)
>
>at
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:228)
>
>- locked <0x000790d00208> (a java.lang.Object)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:370)
>
>at
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecute(AdaptiveSparkPlanExec.scala:355)
>
>at
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
>
>at
> org.apache.spark.sql.execution.SparkPlan$$Lambda$8573/0x000801f99c40.apply(Unknown
> Source)
>
>at
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
>
>at
> org.apache.spark.sql.execution.SparkPlan$$Lambda$8574/0x000801f9a040.apply(Unknown
> Source)
>
>at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>
>at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
>
>at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
>
>at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:172)
>
>- locked <0x000790d00218> (a
> org.apache.spark.sql.execution.QueryExecution)
>
>at
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:171)
>
>at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3247)
>
>- locked <0x000790d002d8> (a org.apache.spark.sql.Dataset)
>
>at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3245)
>
>at
> org.apache.spark.sql.QueryTest$.$anonfun$getErrorMessageInCheckAnswer$1(QueryTest.scala:265)
>
>at
> org.apache.spark.sql.QueryTest$$$Lambda$8564/0x000801f94440.apply$mcJ$sp(Unknown
> Source)
>
>at
> scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.scala:17)
>
>at
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>
>at
> org.apache.spark.sql.QueryTest$.getErrorMessageInCheckAnswer(QueryTest.scala:265)
>
>at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:242)
>
>at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:151)
>
>at org.apache.spark.sql.JoinSuite.checkAnswer(JoinSuite.scala:58)
>
>at
> org.apache.spark.sql.JoinSuite.$anonfun$new$138(JoinSuite.scala:1062)
>
>at
> org.apache.spark.sql.JoinSuite$$Lambda$2827/0x0008013d5840.apply$mcV$sp(Unknown
> Source)
>
>at
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
>
>at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>
>at 

Re: [VOTE][SPIP] Better Spark UI scalability and Driver stability for large applications

2022-11-16 Thread Yuming Wang
+1, non-binding

On Wed, Nov 16, 2022 at 8:12 PM Yang,Jie(INF)  wrote:

> +1, non-binding
>
>
>
> Yang Jie
>
>
>
> *From:* Mridul Muralidharan 
> *Date:* Wednesday, November 16, 2022 17:35
> *To:* Kent Yao 
> *Cc:* Gengliang Wang , dev 
> *Subject:* Re: [VOTE][SPIP] Better Spark UI scalability and Driver stability
> for large applications
>
>
>
>
>
> +1
>
>
>
> Would be great to see history server performance improvements and lower
> resource utilization at driver !
>
>
>
> Regards,
>
> Mridul
>
>
>
> On Wed, Nov 16, 2022 at 2:38 AM Kent Yao  wrote:
>
> +1, non-binding
>
> Gengliang Wang wrote on Wed, Nov 16, 2022 at 16:36:
> >
> > Hi all,
> >
> > I’d like to start a vote for SPIP: "Better Spark UI scalability and
> Driver stability for large applications"
> >
> > The goal of the SPIP is to improve the Driver's stability by supporting
> storing Spark's UI data on RocksDB. Furthermore, to speed up the read and
> write operations on RocksDB, it introduces a new Protobuf serializer.
> >
> > Please also refer to the following:
> >
> > Previous discussion in the dev mailing list: [DISCUSS] SPIP: Better
> Spark UI scalability and Driver stability for large applications
> > Design Doc: Better Spark UI scalability and Driver stability for large
> applications
> > JIRA: SPARK-41053
> >
> >
> > Please vote on the SPIP for the next 72 hours:
> >
> > [ ] +1: Accept the proposal as an official SPIP
> > [ ] +0
> > [ ] -1: I don’t think this is a good idea because …
> >
> > Kind Regards,
> > Gengliang
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Yuming Wang
We are happy to announce the availability of Apache Spark 3.3.1!

Spark 3.3.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.3 maintenance branch of Spark. We strongly
recommend all 3.3 users to upgrade to this stable release.

To download Spark 3.3.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-3-1.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.


[VOTE][RESULT] Release Spark 3.3.1 (RC4)

2022-10-22 Thread Yuming Wang
The vote passes with 11 +1s (6 binding +1s).
Thanks to all who helped with the release!

(* = binding)
+1:
- Sean Owen (*)
- Yang,Jie
- Dongjoon Hyun (*)
- L. C. Hsieh (*)
- Gengliang Wang (*)
- Thomas graves (*)
- Chao Sun
- Wenchen Fan (*)
- Yikun Jiang
- Cheng Pan
- Yuming Wang

+0: None

-1: None


Re: [VOTE] Release Spark 3.3.1 (RC4)

2022-10-22 Thread Yuming Wang
+1 for me. Passed Delta unit tests
<https://github.com/delta-io/delta/pull/1382>, Iceberg unit tests
<https://github.com/apache/iceberg/pull/5783> and Hudi unit tests
<https://github.com/apache/hudi/pull/6707>.

On Sat, Oct 22, 2022 at 8:30 PM Yuming Wang  wrote:

>
> @Mridul Muralidharan  I can't reproduce this issue.
> This is my github action
> <https://github.com/wangyum/test-spark-3.3.1/blob/main/.github/workflows/blank.yml>
> job.
>
> On Sat, Oct 22, 2022 at 9:00 AM Mridul Muralidharan 
> wrote:
>
>>
>> My desktop is running Ubuntu 22.04.1 LTS, with JAVA_HOME pointing to
>> jdk1.8.0_341
>> I ran build with '-Pyarn -Pmesos -Pkubernetes' profiles [1] and with
>> $HOME/.m2 cleaned up.
>>
>> Regards,
>> Mridul
>>
>> [1] ARGS="-Pyarn -Pmesos -Pkubernetes"; ./build/mvn $ARGS clean &&
>> ./build/mvn -DskipTests $ARGS package 2>&1 | tee build_output.txt  &&
>> ./build/mvn  $ARGS package 2>&1 | tee test_output.txt
>>
>> On Fri, Oct 21, 2022 at 11:17 AM Dongjoon Hyun 
>> wrote:
>>
>>> Could you provide your environment and test profile? Both community CIs
>>> look fine to me.
>>>
>>> GitHub Action:
>>> https://github.com/apache/spark/actions?query=branch%3Abranch-3.3
>>> Apple Silicon Jenkins Farm:
>>> https://apache-spark.s3.fr-par.scw.cloud/BRANCH-3.3.html
>>>
>>> Dongjoon.
>>>
>>>
>>> On Fri, Oct 21, 2022 at 8:48 AM Mridul Muralidharan 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>   I saw a couple of test failures I have not observed before:
>>>>
>>>> a) FsHistoryProviderSuite -  "SPARK-33146: don't let one bad rolling
>>>> log folder prevent loading other applications"
>>>> b) MesosClusterSchedulerSuite - "accept/decline offers with driver
>>>> constraints"
>>>>
>>>> I ended up 'ignore''ing them to make the build pass, but did anything
>>>> change to cause them to fail/be flakey ?
>>>>
>>>> Rest of the validation and build went fine.
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Oct 18, 2022 at 10:28 PM Cheng Pan  wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> - Passed Apache Kyuubi (Incubating) integration tests[1]
>>>>> - Run some jobs on our internal K8s cluster
>>>>>
>>>>> [1] https://github.com/apache/incubator-kyuubi/pull/3507
>>>>>
>>>>> Thanks,
>>>>> Cheng Pan
>>>>>
>>>>> On Wed, Oct 19, 2022 at 9:13 AM Yikun Jiang 
>>>>> wrote:
>>>>> >
>>>>> > +1, also test passed with spark-docker workflow (downloading rc4
>>>>> tgz, extract, build image, run K8s IT)
>>>>> >
>>>>> > [1] https://github.com/Yikun/spark-docker/pull/9
>>>>> >
>>>>> > Regards,
>>>>> > Yikun
>>>>> >
>>>>> > On Wed, Oct 19, 2022 at 8:59 AM Wenchen Fan 
>>>>> wrote:
>>>>> >>
>>>>> >> +1
>>>>> >>
>>>>> >> On Wed, Oct 19, 2022 at 4:59 AM Chao Sun 
>>>>> wrote:
>>>>> >>>
>>>>> >>> +1. Thanks Yuming!
>>>>> >>>
>>>>> >>> Chao
>>>>> >>>
>>>>> >>> On Tue, Oct 18, 2022 at 1:18 PM Thomas graves 
>>>>> wrote:
>>>>> >>> >
>>>>> >>> > +1. Ran internal test suite.
>>>>> >>> >
>>>>> >>> > Tom
>>>>> >>> >
>>>>> >>> > On Sun, Oct 16, 2022 at 9:14 PM Yuming Wang 
>>>>> wrote:
>>>>> >>> > >
>>>>> >>> > > Please vote on releasing the following candidate as Apache
>>>>> Spark version 3.3.1.
>>>>> >>> > >
>>>>> >>> > > The vote is open until 11:59pm Pacific time October 21st and
>>>>> passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>> >>> > >
>>>>> >>> > > [ ] +1 Release this package as Apache Spark 3.3.1

Re: [VOTE] Release Spark 3.3.1 (RC4)

2022-10-22 Thread Yuming Wang
@Mridul Muralidharan  I can't reproduce this issue. This
is my github action
<https://github.com/wangyum/test-spark-3.3.1/blob/main/.github/workflows/blank.yml>
job.

On Sat, Oct 22, 2022 at 9:00 AM Mridul Muralidharan 
wrote:

>
> My desktop is running Ubuntu 22.04.1 LTS, with JAVA_HOME pointing to
> jdk1.8.0_341
> I ran build with '-Pyarn -Pmesos -Pkubernetes' profiles [1] and with
> $HOME/.m2 cleaned up.
>
> Regards,
> Mridul
>
> [1] ARGS="-Pyarn -Pmesos -Pkubernetes"; ./build/mvn $ARGS clean &&
> ./build/mvn -DskipTests $ARGS package 2>&1 | tee build_output.txt  &&
> ./build/mvn  $ARGS package 2>&1 | tee test_output.txt
>
> On Fri, Oct 21, 2022 at 11:17 AM Dongjoon Hyun 
> wrote:
>
>> Could you provide your environment and test profile? Both community CIs
>> look fine to me.
>>
>> GitHub Action:
>> https://github.com/apache/spark/actions?query=branch%3Abranch-3.3
>> Apple Silicon Jenkins Farm:
>> https://apache-spark.s3.fr-par.scw.cloud/BRANCH-3.3.html
>>
>> Dongjoon.
>>
>>
>> On Fri, Oct 21, 2022 at 8:48 AM Mridul Muralidharan 
>> wrote:
>>
>>> Hi,
>>>
>>>   I saw a couple of test failures I have not observed before:
>>>
>>> a) FsHistoryProviderSuite -  "SPARK-33146: don't let one bad rolling log
>>> folder prevent loading other applications"
>>> b) MesosClusterSchedulerSuite - "accept/decline offers with driver
>>> constraints"
>>>
>>> I ended up 'ignore''ing them to make the build pass, but did anything
>>> change to cause them to fail/be flakey ?
>>>
>>> Rest of the validation and build went fine.
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Oct 18, 2022 at 10:28 PM Cheng Pan  wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> - Passed Apache Kyuubi (Incubating) integration tests[1]
>>>> - Run some jobs on our internal K8s cluster
>>>>
>>>> [1] https://github.com/apache/incubator-kyuubi/pull/3507
>>>>
>>>> Thanks,
>>>> Cheng Pan
>>>>
>>>> On Wed, Oct 19, 2022 at 9:13 AM Yikun Jiang 
>>>> wrote:
>>>> >
>>>> > +1, also test passed with spark-docker workflow (downloading rc4 tgz,
>>>> extract, build image, run K8s IT)
>>>> >
>>>> > [1] https://github.com/Yikun/spark-docker/pull/9
>>>> >
>>>> > Regards,
>>>> > Yikun
>>>> >
>>>> > On Wed, Oct 19, 2022 at 8:59 AM Wenchen Fan 
>>>> wrote:
>>>> >>
>>>> >> +1
>>>> >>
>>>> >> On Wed, Oct 19, 2022 at 4:59 AM Chao Sun  wrote:
>>>> >>>
>>>> >>> +1. Thanks Yuming!
>>>> >>>
>>>> >>> Chao
>>>> >>>
>>>> >>> On Tue, Oct 18, 2022 at 1:18 PM Thomas graves 
>>>> wrote:
>>>> >>> >
>>>> >>> > +1. Ran internal test suite.
>>>> >>> >
>>>> >>> > Tom
>>>> >>> >
>>>> >>> > On Sun, Oct 16, 2022 at 9:14 PM Yuming Wang 
>>>> wrote:
>>>> >>> > >
>>>> >>> > > Please vote on releasing the following candidate as Apache
>>>> Spark version 3.3.1.
>>>> >>> > >
>>>> >>> > > The vote is open until 11:59pm Pacific time October 21st and
>>>> passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>> >>> > >
>>>> >>> > > [ ] +1 Release this package as Apache Spark 3.3.1
>>>> >>> > > [ ] -1 Do not release this package because ...
>>>> >>> > >
>>>> >>> > > To learn more about Apache Spark, please see
>>>> https://spark.apache.org
>>>> >>> > >
>>>> >>> > > The tag to be voted on is v3.3.1-rc4 (commit
>>>> fbbcf9434ac070dd4ced4fb9efe32899c6db12a9):
>>>> >>> > > https://github.com/apache/spark/tree/v3.3.1-rc4
>>>> >>> > >
>>>> >>> > > The release files, including signatures, digests, etc. can be
>>>> found at:
>>>> >>> > > https://dist.apache.org/repos/dist

Re: Apache Spark 3.2.3 Release?

2022-10-18 Thread Yuming Wang
+1

On Wed, Oct 19, 2022 at 4:17 AM kazuyuki tanimura
 wrote:

> +1 Thanks Chao!
>
>
> Kazu
>
> On Oct 18, 2022, at 11:48 AM, Gengliang Wang  wrote:
>
> +1. Thanks Chao!
>
> On Tue, Oct 18, 2022 at 11:45 AM huaxin gao 
> wrote:
>
>> +1 Thanks Chao!
>>
>> Huaxin
>>
>> On Tue, Oct 18, 2022 at 11:29 AM Dongjoon Hyun 
>> wrote:
>>
>>> +1
>>>
>>> Thank you for volunteering, Chao!
>>>
>>> Dongjoon.
>>>
>>>
>>> On Tue, Oct 18, 2022 at 9:55 AM Sean Owen  wrote:
>>>
 OK by me, if someone is willing to drive it.

 On Tue, Oct 18, 2022 at 11:47 AM Chao Sun  wrote:

> Hi All,
>
> It's been more than 3 months since 3.2.2 (tagged on Jul 11) was
> released. There are now 66 patches accumulated in branch-3.2, including
> 2 correctness issues.
>
> Is it a good time to start a new release? If there's no objection, I'd
> like to volunteer as the release manager for the 3.2.3 release, and
> start preparing the first RC next week.
>
> # Correctness issues
>
> SPARK-39833: Filtered parquet data frame count() and show() produce
> inconsistent results when spark.sql.parquet.filterPushdown is true
> SPARK-40002: Limit improperly pushed down through window using ntile
> function
>
> Best,
> Chao
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>


[VOTE] Release Spark 3.3.1 (RC4)

2022-10-16 Thread Yuming Wang
Please vote on releasing the following candidate as Apache Spark version 3.3.1.

The vote is open until 11:59pm Pacific time October 21st and passes if
a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org

The tag to be voted on is v3.3.1-rc4 (commit
fbbcf9434ac070dd4ced4fb9efe32899c6db12a9):
https://github.com/apache/spark/tree/v3.3.1-rc4

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-bin

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS
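
A quick signature check might look like this (a sketch; gpg and wget are
assumed to be installed, and spark-3.3.1-bin-hadoop3.tgz is one example
artifact name from the -bin directory listing):

```
# Import the release KEYS, then verify one artifact's detached signature.
wget https://dist.apache.org/repos/dist/dev/spark/KEYS
gpg --import KEYS
wget https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-bin/spark-3.3.1-bin-hadoop3.tgz
wget https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-bin/spark-3.3.1-bin-hadoop3.tgz.asc
gpg --verify spark-3.3.1-bin-hadoop3.tgz.asc spark-3.3.1-bin-hadoop3.tgz
```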

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1430

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-docs

The list of bug fixes going into 3.3.1 can be found at the following URL:
https://s.apache.org/ttgz6

This release is using the release script of the tag v3.3.1-rc4.


FAQ

==
What happened to v3.3.1-rc3?
==
A performance regression (SPARK-40703) was found after tagging
v3.3.1-rc3, which the Iceberg community hopes Spark 3.3.1 could fix.
So we skipped the vote on v3.3.1-rc3.

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).
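
Concretely, a smoke test along these lines should work (a sketch; the
pyspark tarball name under the -bin directory and the coursier CLI for the
JVM side are assumptions; any build tool that accepts an extra resolver
works the same way):

```
# PySpark: install the RC into a throwaway virtual env.
python3 -m venv /tmp/spark-rc-test
source /tmp/spark-rc-test/bin/activate
pip install https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc4-bin/pyspark-3.3.1.tar.gz
python -c "import pyspark; print(pyspark.__version__)"
deactivate

# Java/Scala: resolve the staged artifacts from the staging repository.
cs fetch org.apache.spark:spark-sql_2.12:3.3.1 \
  -r https://repository.apache.org/content/repositories/orgapachespark-1430
```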

===
What should happen to JIRA tickets still targeting 3.3.1?
===
The current list of open tickets targeted at 3.3.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.3.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: Welcome Yikun Jiang as a Spark committer

2022-10-07 Thread Yuming Wang
Congratulations Yikun!

On Sat, Oct 8, 2022 at 12:40 PM Hyukjin Kwon  wrote:

> Hi all,
>
> The Spark PMC recently added Yikun Jiang as a committer on the project.
> Yikun is the major contributor of the infrastructure and GitHub Actions in
> Apache Spark, as well as Kubernetes and PySpark.
> He has put a lot of effort into stabilizing and optimizing the builds
> so we all can work together in Apache Spark more
> efficiently and effectively. He's also driving the SPIP for the Docker
> official image in Apache Spark, for users and developers.
> Please join me in welcoming Yikun!
>
>


Re: [VOTE] Release Spark 3.3.1 (RC2)

2022-10-05 Thread Yuming Wang
Hi All,

Thank you all for testing and voting!

There's a -1 vote here, so I think this RC fails. I will prepare for
RC3 soon.

On Tue, Oct 4, 2022 at 6:34 AM Mridul Muralidharan  wrote:

> +1 from me, with a few comments.
>
> I saw the following failures, are these known issues/flakey tests ?
>
> * PersistenceEngineSuite.ZooKeeperPersistenceEngine
> Looks like a port conflict issue from a quick look into logs (conflict
> with starting admin port at 8080) - is this expected behavior for the test ?
> I worked around it by shutting down the process which was using the port -
> though did not investigate deeply.
>
> * org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite was aborted
> It is expecting these artifacts in $HOME/.m2/repository
>
> 1. tomcat#jasper-compiler;5.5.23!jasper-compiler.jar
> 2. tomcat#jasper-runtime;5.5.23!jasper-runtime.jar
> 3. commons-el#commons-el;1.0!commons-el.jar
> 4. org.apache.hive#hive-exec;2.3.7!hive-exec.jar
>
> I worked around it by adding them locally explicitly - we should probably
> add them as test dependencies?
> Not sure if this changed in this release though (I had cleaned my local
> .m2 recently)
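
One way to pre-fetch those artifacts locally might be (a sketch using the
standard Maven dependency plugin; the coordinates are taken from the list
above):

```
# Pull the test-time artifacts into ~/.m2/repository ahead of the suite.
mvn dependency:get -Dartifact=tomcat:jasper-compiler:5.5.23
mvn dependency:get -Dartifact=tomcat:jasper-runtime:5.5.23
mvn dependency:get -Dartifact=commons-el:commons-el:1.0
mvn dependency:get -Dartifact=org.apache.hive:hive-exec:2.3.7
```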
>
> Other than this, rest looks good to me.
>
> Regards,
> Mridul
>
>
> On Wed, Sep 28, 2022 at 2:56 PM Sean Owen  wrote:
>
>> +1 from me, same result as last RC.
>>
>> On Wed, Sep 28, 2022 at 12:21 AM Yuming Wang  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version 
>>> 3.3.1.
>>>
>>> The vote is open until 11:59pm Pacific time October 3rd and passes if a 
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.3.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org
>>>
>>> The tag to be voted on is v3.3.1-rc2 (commit 
>>> 1d3b8f7cb15283a1e37ecada6d751e17f30647ce):
>>> https://github.com/apache/spark/tree/v3.3.1-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-bin
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1421
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-docs
>>>
>>> The list of bug fixes going into 3.3.1 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12351710
>>>
>>> This release is using the release script of the tag v3.3.1-rc2.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.3.1?
>>> ===
>>> The current list of open tickets targeted at 3.3.1 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>>> Version/s" = 3.3.1
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>>
>>>
>>>


[VOTE] Release Spark 3.3.1 (RC2)

2022-09-27 Thread Yuming Wang
Please vote on releasing the following candidate as Apache Spark version 3.3.1.

The vote is open until 11:59pm Pacific time October 3rd and passes if
a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org

The tag to be voted on is v3.3.1-rc2 (commit
1d3b8f7cb15283a1e37ecada6d751e17f30647ce):
https://github.com/apache/spark/tree/v3.3.1-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-bin

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1421

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-docs

The list of bug fixes going into 3.3.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12351710

This release is using the release script of the tag v3.3.1-rc2.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).

===
What should happen to JIRA tickets still targeting 3.3.1?
===
The current list of open tickets targeted at 3.3.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.3.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: [VOTE] Release Spark 3.3.1 (RC1)

2022-09-23 Thread Yuming Wang
Hi All,

The voting for Spark 3.3.1 RC1 has failed and I will prepare RC2 soon.


On Mon, Sep 19, 2022 at 8:53 AM Dongjoon Hyun 
wrote:

> I also agree with Chao on that issue.
>
> SPARK-39833 landed at 3.3.1 and 3.2.3 to avoid a correctness issue at the
> cost of perf regression.
> Luckily, SPARK-40169 provided a correct fix and removed the main
> workaround code of SPARK-39833 before the official release.
>
> -1 for Apache Spark 3.3.1 RC1.
>
> Dongjoon.
>
>
> On Sun, Sep 18, 2022 at 10:08 AM Chao Sun  wrote:
>
>> It'd be really nice if we can include
>> https://issues.apache.org/jira/browse/SPARK-40169 in this release,
>> since otherwise it'll introduce a perf regression with Parquet column
>> index disabled.
>>
>> On Sat, Sep 17, 2022 at 2:08 PM Sean Owen  wrote:
>> >
>> > +1 LGTM. I tested Scala 2.13 + Java 11 on Ubuntu 22.04. I get the same
>> results as usual.
>> >
>> > On Sat, Sep 17, 2022 at 2:42 AM Yuming Wang  wrote:
>> >>
>> >> Please vote on releasing the following candidate as Apache Spark
>> version 3.3.1.
>> >>
>> >> The vote is open until 11:59pm Pacific time September 22nd and passes
>> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 3.3.1
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see https://spark.apache.org
>> >>
>> >> The tag to be voted on is v3.3.1-rc1 (commit
>> ea1a426a889626f1ee1933e3befaa975a2f0a072):
>> >> https://github.com/apache/spark/tree/v3.3.1-rc1
>> >>
>> >> The release files, including signatures, digests, etc. can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc1-bin
>> >>
>> >> Signatures used for Spark RCs can be found in this file:
>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >>
>> >> The staging repository for this release can be found at:
>> >> https://repository.apache.org/content/repositories/orgapachespark-1418
>> >>
>> >> The documentation corresponding to this release can be found at:
>> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc1-docs
>> >>
>> >> The list of bug fixes going into 3.3.1 can be found at the following
>> URL:
>> >> https://issues.apache.org/jira/projects/SPARK/versions/12351710
>> >>
>> >> This release is using the release script of the tag v3.3.1-rc1.
>> >>
>> >>
>> >> FAQ
>> >>
>> >> =
>> >> How can I help test this release?
>> >> =
>> >> If you are a Spark user, you can help us test this release by taking
>> >> an existing Spark workload and running on this release candidate, then
>> >> reporting any regressions.
>> >>
>> >> If you're working in PySpark you can set up a virtual env and install
>> >> the current RC and see if anything important breaks, in the Java/Scala
>> >> you can add the staging repository to your projects resolvers and test
>> >> with the RC (make sure to clean up the artifact cache before/after so
>> >> you don't end up building with an out of date RC going forward).
>> >>
>> >> ===
>> >> What should happen to JIRA tickets still targeting 3.3.1?
>> >> ===
>> >> The current list of open tickets targeted at 3.3.1 can be found at:
>> >> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.3.1
>> >>
>> >> Committers should look at those and triage. Extremely important bug
>> >> fixes, documentation, and API tweaks that impact compatibility should
>> >> be worked on immediately. Everything else please retarget to an
>> >> appropriate release.
>> >>
>> >> ==
>> >> But my bug isn't fixed?
>> >> ==
>> >> In order to make timely releases, we will typically not hold the
>> >> release unless the bug in question is a regression from the previous
>> >> release. That being said, if there is something which is a regression
>> >> that has not been correctly targeted please ping me or a committer to
>> >> help target the issue.
>> >>
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-18 Thread Yuming Wang
+1.

On Mon, Sep 19, 2022 at 9:44 AM Kent Yao  wrote:

> +1
>
> > Gengliang Wang wrote on Mon, Sep 19, 2022 at 09:23:
> >
> > +1, thanks for the work!
> >
> > On Sun, Sep 18, 2022 at 6:20 PM Hyukjin Kwon 
> wrote:
> >>
> >> +1
> >>
> >> On Mon, 19 Sept 2022 at 09:15, Yikun Jiang  wrote:
> >>>
> >>> Hi, all
> >>>
> >>>
> >>> I would like to start the discussion for supporting Docker Official
> Image for Spark.
> >>>
> >>>
> >>> This SPIP proposes to add a Docker Official Image (DOI) to ensure the
> Spark Docker images meet the quality standards for Docker images, and to
> provide these Docker images for users who want to use Apache Spark via
> Docker images.
> >>>
> >>>
> >>> There are also several Apache projects that release the Docker
> Official Images, such as flink, storm, solr, zookeeper, and httpd (with 50M+
> to 1B+ downloads each). From the huge download statistics, we can see the
> real demand from users, and from the support of other Apache projects, we
> should also be able to do it.
> >>>
> >>>
> >>> After support:
> >>>
> >>> The Dockerfile will still be maintained by the Apache Spark community
> and reviewed by Docker.
> >>>
> >>> The images will be maintained by the Docker community to ensure the
> quality standards for Docker images of the Docker community.
> >>>
> >>>
> >>> It will also reduce the extra Docker image maintenance effort (such
> as frequent rebuilds and image security updates) for the Apache Spark
> community.
> >>>
> >>>
> >>> See more in SPIP DOC:
> https://docs.google.com/document/d/1nN-pKuvt-amUcrkTvYAQ-bJBgtsWb9nAkNoVNRM2S2o
> >>>
> >>>
> >>> cc: Ruifeng (co-author) and Hyukjin (shepherd)
> >>>
> >>>
> >>> Regards,
> >>> Yikun
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[VOTE] Release Spark 3.3.1 (RC1)

2022-09-17 Thread Yuming Wang
Please vote on releasing the following candidate as Apache Spark version
3.3.1.

The vote is open until 11:59pm Pacific time September 22nd and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org

The tag to be voted on is v3.3.1-rc1 (commit
ea1a426a889626f1ee1933e3befaa975a2f0a072):
https://github.com/apache/spark/tree/v3.3.1-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc1-bin

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1418

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc1-docs

The list of bug fixes going into 3.3.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12351710

This release is using the release script of the tag v3.3.1-rc1.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).

===
What should happen to JIRA tickets still targeting 3.3.1?
===
The current list of open tickets targeted at 3.3.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.3.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: Time for Spark 3.3.1 release?

2022-09-13 Thread Yuming Wang
Thank you all.

I will be preparing 3.3.1 RC1 soon.

On Tue, Sep 13, 2022 at 12:09 PM John Zhuge  wrote:

> +1
>
> On Mon, Sep 12, 2022 at 9:08 PM Yang,Jie(INF)  wrote:
>
>> +1
>>
>>
>>
>> Thanks Yuming ~
>>
>>
>>
>> *From:* Hyukjin Kwon 
>> *Date:* Tuesday, September 13, 2022 08:19
>> *To:* Gengliang Wang 
>> *Cc:* "L. C. Hsieh" , Dongjoon Hyun <
>> dongjoon.h...@gmail.com>, Yuming Wang , dev <
>> dev@spark.apache.org>
>> *Subject:* Re: Time for Spark 3.3.1 release?
>>
>>
>>
>> +1
>>
>>
>>
>> On Tue, 13 Sept 2022 at 06:45, Gengliang Wang  wrote:
>>
>> +1.
>>
>> Thank you, Yuming!
>>
>>
>>
>> On Mon, Sep 12, 2022 at 12:10 PM L. C. Hsieh  wrote:
>>
>> +1
>>
>> Thanks Yuming!
>>
>> On Mon, Sep 12, 2022 at 11:50 AM Dongjoon Hyun 
>> wrote:
>> >
>> > +1
>> >
>> > Thanks,
>> > Dongjoon.
>> >
>> > On Mon, Sep 12, 2022 at 6:38 AM Yuming Wang  wrote:
>> >>
>> >> Hi, All.
>> >>
>> >>
>> >>
>> >> Since the Apache Spark 3.3.0 tag creation (Jun 10), 138 new patches
>> including 7 correctness patches arrived at branch-3.3.
>> >>
>> >>
>> >>
>> >> Shall we make a new release, Apache Spark 3.3.1, as the second release
>> on branch-3.3? I'd like to volunteer as the release manager for Apache
>> Spark 3.3.1.
>> >>
>> >>
>> >>
>> >> All changes:
>> >>
>> >> https://github.com/apache/spark/compare/v3.3.0...branch-3.3
>> >>
>> >>
>> >>
>> >> Correctness issues:
>> >>
>> >> SPARK-40149: Propagate metadata columns through Project
>> >>
>> >> SPARK-40002: Don't push down limit through window using ntile
>> >>
>> >> SPARK-39976: ArrayIntersect should handle null in left expression
>> correctly
>> >>
>> >> SPARK-39833: Disable Parquet column index in DSv1 to fix a correctness
>> issue in the case of overlapping partition and data columns
>> >>
>> >> SPARK-39061: Set nullable correctly for Inline output attributes
>> >>
>> >> SPARK-39887: RemoveRedundantAliases should keep aliases that make the
>> output of projection nodes unique
>> >>
>> >> SPARK-38614: Don't push down limit through window that's using
>> percent_rank
>>
>>
>> --
> John Zhuge
>


Time for Spark 3.3.1 release?

2022-09-12 Thread Yuming Wang
Hi, All.



Since the Apache Spark 3.3.0 tag creation (Jun 10), 138 new patches, including 7
correctness patches, have arrived at branch-3.3.



Shall we make a new release, Apache Spark 3.3.1, as the second release on
branch-3.3? I'd like to volunteer as the release manager for Apache Spark
3.3.1.



All changes:

https://github.com/apache/spark/compare/v3.3.0...branch-3.3



Correctness issues:

SPARK-40149: Propagate metadata columns through Project

SPARK-40002: Don't push down limit through window using ntile

SPARK-39976: ArrayIntersect should handle null in left expression correctly

SPARK-39833: Disable Parquet column index in DSv1 to fix a correctness
issue in the case of overlapping partition and data columns

SPARK-39061: Set nullable correctly for Inline output attributes

SPARK-39887: RemoveRedundantAliases should keep aliases that make the
output of projection nodes unique

SPARK-38614: Don't push down limit through window that's using percent_rank


Re: Welcoming three new PMC members

2022-08-10 Thread Yuming Wang
Congratulations!

On Wed, Aug 10, 2022 at 4:35 PM Yikun Jiang  wrote:

> Congratulations!
>
> Regards,
> Yikun
>
>
> On Wed, Aug 10, 2022 at 3:19 PM Maciej  wrote:
>
>> Congratulations!
>>
>> On 8/10/22 08:14, Yi Wu wrote:
>> > Congrats everyone!
>> >
>> >
>> >
>> > On Wed, Aug 10, 2022 at 11:33 AM Yuanjian Li wrote:
>> >
>> > Congrats everyone!
>> >
>> > L. C. Hsieh <vii...@gmail.com> wrote on Tue, Aug 9, 2022 at 19:01:
>> >
>> > Congrats!
>> >
>> > On Tue, Aug 9, 2022 at 5:38 PM Chao Sun wrote:
>> >  >
>> >  > Congrats everyone!
>> >  >
>> >  > > On Tue, Aug 9, 2022 at 5:36 PM Dongjoon Hyun
>> >  > > <dongjoon.h...@gmail.com> wrote:
>> >  > >
>> >  > > Congrat to all!
>> >  > >
>> >  > > Dongjoon.
>> >  > >
>> >  > > On Tue, Aug 9, 2022 at 5:13 PM Takuya UESHIN
>> >  > > <ues...@happy-camper.st> wrote:
>> >  > > >
>> >  > > > Congratulations!
>> >  > > >
>> >  > > > On Tue, Aug 9, 2022 at 4:57 PM Hyukjin Kwon
>> >  > > > <gurwls...@gmail.com> wrote:
>> >  > > >>
>> >  > > >> Congrats everybody!
>> >  > > >>
>> >  > > >> On Wed, 10 Aug 2022 at 05:50, Mridul Muralidharan
>> >  > > >> <mri...@gmail.com> wrote:
>> >  > > >>>
>> >  > > >>>
>> >  > > >>> Congratulations !
>> >  > > >>> Great to have you join the PMC !!
>> >  > > >>>
>> >  > > >>> Regards,
>> >  > > >>> Mridul
>> >  > > >>>
>> >  > >  On Tue, Aug 9, 2022 at 11:57 AM vaquar khan
>> >  > >  <vaquar.k...@gmail.com> wrote:
>> >  > > 
>> >  > >  Congratulations
>> >  > > 
>> >  > >  On Tue, Aug 9, 2022, 11:40 AM Xiao Li
>> >  > >  <gatorsm...@gmail.com> wrote:
>> >  > > >
>> >  > > > Hi all,
>> >  > > >
>> >  > > > The Spark PMC recently voted to add three new PMC
>> > members. Join me in welcoming them to their new roles!
>> >  > > >
>> >  > > > New PMC members: Huaxin Gao, Gengliang Wang and Maxim
>> > Gekk
>> >  > > >
>> >  > > > The Spark PMC
>> >  > > >
>> >  > > >
>> >  > > >
>> >  > > > --
>> >  > > > Takuya UESHIN
>> >  > > >
>> >
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> PGP: A30CEF0C31A501EC
>>
>>


Re: Apache Spark 3.2.2 Release?

2022-07-06 Thread Yuming Wang
+1

On Thu, Jul 7, 2022 at 5:53 AM Maxim Gekk 
wrote:

> +1
>
> On Thu, Jul 7, 2022 at 12:26 AM John Zhuge  wrote:
>
>> +1  Thanks for the effort!
>>
>> On Wed, Jul 6, 2022 at 2:23 PM Bjørn Jørgensen 
>> wrote:
>>
>>> +1
>>>
>>> ons. 6. jul. 2022, 23:05 skrev Hyukjin Kwon :
>>>
 Yeah +1

 On Thu, Jul 7, 2022 at 5:40 AM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> Since the Apache Spark 3.2.1 tag creation (Jan 19), 197 new patches,
> including 11 correctness patches, have arrived at branch-3.2.
>
> Shall we make a new release, Apache Spark 3.2.2, as the third release
> on the 3.2 line? I'd like to volunteer as the release manager for Apache
> Spark 3.2.2. I'm thinking about starting the first RC next week.
>
> $ git log --oneline v3.2.1..HEAD | wc -l
>  197
>
> # Correctness issues
>
> SPARK-38075 Hive script transform with order by and limit will
> return fake rows
> SPARK-38204 All state operators are at a risk of inconsistency
> between state partitioning and operator partitioning
> SPARK-38309 SHS has incorrect percentiles for shuffle read bytes
> and shuffle total blocks metrics
> SPARK-38320 (flat)MapGroupsWithState can timeout groups which just
> received inputs in the same microbatch
> SPARK-38614 After Spark update, df.show() shows incorrect
> F.percent_rank results
> SPARK-38655 OffsetWindowFunctionFrameBase cannot find the offset
> row whose input is not null
> SPARK-38684 Stream-stream outer join has a possible correctness
> issue due to weakly read consistent on outer iterators
> SPARK-39061 Incorrect results or NPE when using Inline function
> against an array of dynamically created structs
> SPARK-39107 Silent change in regexp_replace's handling of empty
> strings
> SPARK-39259 Timestamps returned by now() and equivalent functions
> are not consistent in subqueries
> SPARK-39293 The accumulator of ArrayAggregate should copy the
> intermediate result if string, struct, array, or map
>
> Best,
> Dongjoon.
>
>
> --
>> John Zhuge
>>
>


Re: [PSA] Please rebase and sync your master branch in your forked repository

2022-06-20 Thread Yuming Wang
Thank you Hyukjin.

On Tue, Jun 21, 2022 at 7:46 AM Hyukjin Kwon  wrote:

> Once https://github.com/apache/spark/pull/36922 is merged, your fork's
> master branch must be synced to the latest master branch in Apache Spark.
> Otherwise, builds will not be triggered in your PR.
>
>
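
For anyone unsure how to sync, a sketch assuming the common setup where
"origin" is your fork and "upstream" is the apache/spark repository:

  $ git fetch upstream
  $ git checkout master
  $ git merge --ff-only upstream/master
  $ git push origin master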


Re: [ANNOUNCE] Apache Spark 3.3.0 released

2022-06-17 Thread Yuming Wang
Congrats and thanks.

On Fri, Jun 17, 2022 at 7:46 PM Maxim Gekk
 wrote:

> We are happy to announce the availability of Spark 3.3.0!
>
> Apache Spark 3.3.0 is the fourth release of the 3.x line. With tremendous
> contribution from the open-source community, this release managed to
> resolve in excess of 1,600 Jira tickets.
>
> To download Spark 3.3.0, head over to the download page:
> http://spark.apache.org/downloads.html
>
> Note that you might need to clear your browser cache or use
> `Private`/`Incognito` mode, depending on your browser.
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-3-0.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>


Re: [VOTE][SPIP] Spark Connect

2022-06-13 Thread Yuming Wang
+1.

On Tue, Jun 14, 2022 at 2:20 AM Matei Zaharia 
wrote:

> +1, very excited about this direction.
>
> Matei
>
> On Jun 13, 2022, at 11:07 AM, Herman van Hovell <
> her...@databricks.com.INVALID> wrote:
>
> Let me kick off the voting...
>
> +1
>
> On Mon, Jun 13, 2022 at 2:02 PM Herman van Hovell 
> wrote:
>
>> Hi all,
>>
>> I’d like to start a vote for SPIP: "Spark Connect"
>>
>> The goal of the SPIP is to introduce a DataFrame-based client/server API
>> for Spark.
>>
>> Please also refer to:
>>
>> - Previous discussion in dev mailing list: [DISCUSS] SPIP: Spark Connect
>> - A client and server interface for Apache Spark.
>> 
>> - Design doc: Spark Connect - A client and server interface for Apache
>> Spark.
>> 
>> - JIRA: SPARK-39375 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Kind Regards,
>> Herman
>>
>
>


Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread Yuming Wang
+1 (non-binding)

On Tue, Jun 14, 2022 at 7:41 AM Dongjoon Hyun 
wrote:

> +1
>
> Thanks,
> Dongjoon.
>
> On Mon, Jun 13, 2022 at 3:54 PM Chris Nauroth  wrote:
>
>> +1 (non-binding)
>>
>> I repeated all checks I described for RC5:
>>
>> https://lists.apache.org/thread/ksoxmozgz7q728mnxl6c2z7ncmo87vls
>>
>> Maxim, thank you for your dedication on these release candidates.
>>
>> Chris Nauroth
>>
>>
>> On Mon, Jun 13, 2022 at 3:21 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> +1
>>>
>>> Signatures, digests, etc check out fine.
>>> Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes
>>>
>>> The test "SPARK-33084: Add jar support Ivy URI in SQL" in
>>> sql.SQLQuerySuite fails, but other than that the rest looks good.
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>>
>>> On Mon, Jun 13, 2022 at 4:25 PM Tom Graves 
>>> wrote:
>>>
 +1

 Tom

 On Thursday, June 9, 2022, 11:27:50 PM CDT, Maxim Gekk
  wrote:


 Please vote on releasing the following candidate as
 Apache Spark version 3.3.0.

 The vote is open until 11:59pm Pacific time June 14th and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.3.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v3.3.0-rc6 (commit
 f74867bddfbcdd4d08076db36851e88b15e66556):
 https://github.com/apache/spark/tree/v3.3.0-rc6

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1407

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-docs/

 The list of bug fixes going into 3.3.0 can be found at the following
 URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12350369

 This release is using the release script of the tag v3.3.0-rc6.


 FAQ

 =
 How can I help test this release?
 =
 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running it on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks; in Java/Scala
 you can add the staging repository to your project's resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with an out-of-date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 3.3.0?
 ===
 The current list of open tickets targeted at 3.3.0 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.3.0

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on immediately. Everything else please retarget to an
 appropriate release.

 ==
 But my bug isn't fixed?
 ==
 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from the previous
 release. That being said, if there is something which is a regression
 that has not been correctly targeted please ping me or a committer to
 help target the issue.

 Maxim Gekk

 Software Engineer

 Databricks, Inc.

>>>


Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-06 Thread Yuming Wang
+1. Ran and verified results through internal tests.

On Mon, Jun 6, 2022 at 3:43 PM Dongjoon Hyun 
wrote:

> +1.
>
> I double-checked the following additionally.
>
> - Run unit tests on Apple Silicon with Java 17/Python 3.9.11/R 4.1.2
> - Run unit tests on Linux with Java11/Scala 2.12/2.13
> - K8s integration test (including Volcano batch scheduler) on K8s v1.24
> - Check S3 read/write with spark-shell with Scala 2.13/Java17.
>
> So far, it looks good except for one flaky test from the new `Row-level
> Runtime Filters` feature. Actually, this has been flaky in the previous RCs
> too.
>
> Since the `Row-level Runtime Filters` feature is still disabled by default
> in Apache Spark 3.3.0, I filed it as a non-blocker flaky test bug.
>
> https://issues.apache.org/jira/browse/SPARK-39386
>
> If there is no other report on this test case, this could be my local
> environmental issue.
>
> I'm going to test RC5 more until the deadline (June 8th PST).
>
> Thanks,
> Dongjoon.
>
>
> On Sat, Jun 4, 2022 at 1:33 PM Sean Owen  wrote:
>
>> +1 looks good now on Scala 2.13
>>
>> On Sat, Jun 4, 2022 at 9:51 AM Maxim Gekk
>>  wrote:
>>
>>> Please vote on releasing the following candidate as
>>> Apache Spark version 3.3.0.
>>>
>>> The vote is open until 11:59pm Pacific time June 8th and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.3.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.3.0-rc5 (commit
>>> 7cf29705272ab8e8c70e8885a3664ad8ae3cd5e9):
>>> https://github.com/apache/spark/tree/v3.3.0-rc5
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1406
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-docs/
>>>
>>> The list of bug fixes going into 3.3.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>
>>> This release is using the release script of the tag v3.3.0-rc5.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running it on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks; in Java/Scala
>>> you can add the staging repository to your project's resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.3.0?
>>> ===
>>> The current list of open tickets targeted at 3.3.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.3.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>> Maxim Gekk
>>>
>>> Software Engineer
>>>
>>> Databricks, Inc.
>>>
>>


Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-18 Thread Yuming Wang
-1. There is a regression: https://github.com/apache/spark/pull/36595

On Wed, May 18, 2022 at 4:11 PM Martin Grigorov 
wrote:

> Hi,
>
> [X] +1 Release this package as Apache Spark 3.3.0
>
> Tested:
> - make local distribution from sources (with ./dev/make-distribution.sh
> --tgz --name with-volcano -Pkubernetes,volcano,hadoop-3)
> - create a Docker image (with JDK 11)
> - run Pi example on
> -- local
> -- Kubernetes with default scheduler
> -- Kubernetes with Volcano scheduler
>
> On both x86_64 and aarch64 !
>
> Regards,
> Martin
>
>
> On Mon, May 16, 2022 at 3:44 PM Maxim Gekk
>  wrote:
>
>> Please vote on releasing the following candidate as
>> Apache Spark version 3.3.0.
>>
>> The vote is open until 11:59pm Pacific time May 19th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.3.0-rc2 (commit
>> c8c657b922ac8fd8dcf9553113e11a80079db059):
>> https://github.com/apache/spark/tree/v3.3.0-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1403
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
>>
>> The list of bug fixes going into 3.3.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>
>> This release is using the release script of the tag v3.3.0-rc2.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.3.0?
>> ===
>> The current list of open tickets targeted at 3.3.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.3.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>
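
The Pi runs Martin describes are roughly the sketch below; the API server
address, registry, and image tag are placeholders to fill in for your own
cluster:

  $ ./bin/spark-submit \
      --master k8s://https://<api-server>:6443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.kubernetes.container.image=<registry>/spark:v3.3.0-rc2 \
      local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar 1000

  # For the Volcano run, additionally set the custom scheduler, e.g.:
  #   --conf spark.kubernetes.scheduler.name=volcano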


Re: [VOTE] Spark 3.1.3 RC4

2022-02-14 Thread Yuming Wang
+1 (non-binding).

On Tue, Feb 15, 2022 at 10:22 AM Ruifeng Zheng  wrote:

> +1 (non-binding)
>
> checked the release script issue Dongjoon mentioned:
>
> curl -s
> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/spark-3.1.3-bin-hadoop2.7.tgz
> | tar tz | grep hadoop-common
> spark-3.1.3-bin-hadoop2.7/jars/hadoop-common-2.7.4
>
>
> -- Original Message --
> *From:* "Sean Owen" ;
> *Sent:* Tuesday, February 15, 2022, 10:01 AM
> *To:* "Holden Karau";
> *Cc:* "dev";
> *Subject:* Re: [VOTE] Spark 3.1.3 RC4
>
> Looks good to me, same results as last RC, +1
>
> On Mon, Feb 14, 2022 at 2:55 PM Holden Karau  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.1.3.
>>
>> The vote is open until Feb. 18th at 1 PM Pacific (9 PM GMT) and passes
>> if a majority
>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.1.3
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> There are currently no open issues targeting 3.1.3 in Spark's JIRA
>> https://issues.apache.org/jira/browse
>> (try project = SPARK AND "Target Version/s" = "3.1.3" AND status in
>> (Open, Reopened, "In Progress"))
>> at https://s.apache.org/n79dw
>>
>>
>>
>> The tag to be voted on is v3.1.3-rc4 (commit
>> d1f8a503a26bcfb4e466d9accc5fa241a7933667):
>> https://github.com/apache/spark/tree/v3.1.3-rc4
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at
>> https://repository.apache.org/content/repositories/orgapachespark-1401
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-docs/
>>
>> The list of bug fixes going into 3.1.3 can be found at the following URL:
>> https://s.apache.org/x0q9b
>>
>> This release is using the release script from 3.1.3.
>> The release docker container was rebuilt since the previous version
>> didn't have the necessary components to build the R documentation.
>>
>> FAQ
>>
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.1.3?
>> ===
>>
>> The current list of open tickets targeted at 3.1.3 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.1.3
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something that is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> Note: I added an extra day to the vote since I know some folks are likely
>> busy on the 14th with partner(s).
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread Yuming Wang
Thank you Huaxin.

On Sat, Jan 29, 2022 at 9:08 AM huaxin gao  wrote:

> We are happy to announce the availability of Spark 3.2.1!
>
> Spark 3.2.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.2 maintenance branch of Spark. We strongly
> recommend that all 3.2 users upgrade to this stable release.
>
> To download Spark 3.2.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-2-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Huaxin Gao
>


Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-25 Thread Yuming Wang
+1 (non-binding)

On Tue, Jan 25, 2022 at 12:44 PM Wenchen Fan  wrote:

> +1
>
> On Tue, Jan 25, 2022 at 10:13 AM Ruifeng Zheng 
> wrote:
>
>> +1 (non-binding)
>>
>>
>> -- Original Message --
>> *From:* "Kent Yao" ;
>> *Sent:* Tuesday, January 25, 2022, 10:09 AM
>> *To:* "John Zhuge";
>> *Cc:* "dev";
>> *Subject:* Re: [VOTE] Release Spark 3.2.1 (RC2)
>>
>> +1, non-binding
>>
>> John Zhuge wrote on Tue, Jan 25, 2022 at 06:56:
>>
>>> +1 (non-binding)
>>>
>>> On Mon, Jan 24, 2022 at 2:28 PM Cheng Su  wrote:
>>>
 +1 (non-binding)



 Cheng Su



 *From: *Chao Sun 
 *Date: *Monday, January 24, 2022 at 2:10 PM
 *To: *Michael Heuer 
 *Cc: *dev 
 *Subject: *Re: [VOTE] Release Spark 3.2.1 (RC2)

 +1 (non-binding)



 On Mon, Jan 24, 2022 at 6:32 AM Michael Heuer 
 wrote:

 +1 (non-binding)



michael





 On Jan 24, 2022, at 7:30 AM, Gengliang Wang  wrote:



 +1 (non-binding)



 On Mon, Jan 24, 2022 at 6:26 PM Dongjoon Hyun 
 wrote:

 +1



 Dongjoon.



 On Sat, Jan 22, 2022 at 7:19 AM Mridul Muralidharan 
 wrote:



 +1



 Signatures, digests, etc check out fine.

 Checked out tag and build/tested with -Pyarn -Pmesos -Pkubernetes



 Regards,

 Mridul



 On Fri, Jan 21, 2022 at 9:01 PM Sean Owen  wrote:

 +1 with same result as last time.



 On Thu, Jan 20, 2022 at 9:59 PM huaxin gao 
 wrote:

 Please vote on releasing the following candidate as Apache Spark
 version 3.2.1.

 The vote is open until 8:00pm Pacific time January 25 and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.2.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v3.2.1-rc2 (commit
 4f25b3f71238a00508a356591553f2dfa89f8290):
 https://github.com/apache/spark/tree/v3.2.1-rc2

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1398/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/

 The list of bug fixes going into 3.2.1 can be found at the following URL:
 https://s.apache.org/yu0cy

 This release is using the release script of the tag v3.2.1-rc2.

 FAQ

 =
 How can I help test this release?
 =
 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running it on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks; in Java/Scala
 you can add the staging repository to your project's resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with an out-of-date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 3.2.1?
 ===
 The current list of open tickets targeted at 3.2.1 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.2.1

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on immediately. Everything else please retarget to an
 appropriate release.

 ==
 But my bug isn't fixed?
 ==
 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from the previous
 release. That being said, if there is something which is a regression
 that has not been correctly targeted please ping me or a committer to
 help target the issue.




>>>
>>> --
>>> John Zhuge
>>>
>>


Re: [FYI] Build and run tests on Java 17 for Apache Spark 3.3

2021-11-12 Thread Yuming Wang
Cool, thank you Dongjoon.

On Sat, Nov 13, 2021 at 4:09 AM shane knapp ☠  wrote:

> woot!  nice work everyone!  :)
>
> On Fri, Nov 12, 2021 at 11:37 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Apache Spark community has been working on Java 17 support under the
>> following JIRA.
>>
>> https://issues.apache.org/jira/browse/SPARK-33772
>>
>> As of today, Apache Spark has daily Java 17 test coverage via
>> GitHub Actions jobs for Apache Spark 3.3.
>>
>>
>> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L38-L39
>>
>> Today's successful run is here.
>>
>> https://github.com/apache/spark/actions/runs/1453788012
>>
>> Please note that we are still working on some new Java 17 features like
>>
>> JEP 391: macOS/AArch64 Port
>> https://bugs.openjdk.java.net/browse/JDK-8251280
>>
>> For example, Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 already
>> support Apple Silicon natively, but some 3rd party libraries like
>> RocksDB/LevelDB are not ready yet. Since Mac is one of the popular dev
>> environments, we are going to keep monitoring and improving gradually for
>> Apache Spark 3.3.
>>
>> Please test Java 17 and let us know your feedback.
>>
>> Thanks,
>> Dongjoon.
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
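
A minimal local Java 17 run of the build and tests, as a sketch (the
JAVA_HOME path is a placeholder for wherever your JDK 17 lives):

  $ export JAVA_HOME=/usr/lib/jvm/jdk-17
  $ ./build/sbt -Phive -Pyarn -Pkubernetes clean test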


Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Yuming Wang
Congrats and thanks!

On Tue, Oct 19, 2021 at 10:17 PM Gengliang Wang  wrote:

> Hi all,
>
> Apache Spark 3.2.0 is the third release of the 3.x line. With tremendous
> contribution from the open-source community, this release managed to
> resolve in excess of 1,700 Jira tickets.
>
> We'd like to thank our contributors and users for their contributions and
> early feedback to this release. This release would not have been possible
> without you.
>
> To download Spark 3.2.0, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-2-0.html
>


Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-08 Thread Yuming Wang
+1 (non-binding).

On Fri, Oct 8, 2021 at 1:02 PM Dongjoon Hyun 
wrote:

> +1 for Apache Spark 3.2.0 RC7.
>
> It looks good to me. I tested with EKS 1.21 additionally.
>
> Cheers,
> Dongjoon.
>
>
> On Thu, Oct 7, 2021 at 7:46 PM 郑瑞峰  wrote:
>
>> +1 (non-binding)
>>
>>
>> -- Original Message --
>> *From:* "Sean Owen" ;
>> *Sent:* Thursday, October 7, 2021, 10:23 PM
>> *To:* "Gengliang Wang";
>> *Cc:* "dev";
>> *Subject:* Re: [VOTE] Release Spark 3.2.0 (RC7)
>>
>> +1 again. Looks good in Scala 2.12, 2.13, and in Java 11.
>> I note that the memory requirements for Java 11 tests seem to need to be
>> increased, but we're handling that separately. It doesn't really affect
>> users.
>>
>> On Wed, Oct 6, 2021 at 11:49 AM Gengliang Wang  wrote:
>>
>>> Please vote on releasing the following candidate as
>>> Apache Spark version 3.2.0.
>>>
>>> The vote is open until 11:59pm Pacific time October 11 and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.2.0-rc7 (commit
>>> 5d45a415f3a29898d92380380cfd82bfc7f579ea):
>>> https://github.com/apache/spark/tree/v3.2.0-rc7
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc7-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1394
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc7-docs/
>>>
>>> The list of bug fixes going into 3.2.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>>
>>> This release is using the release script of the tag v3.2.0-rc7.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running it on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks; in Java/Scala
>>> you can add the staging repository to your project's resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.2.0?
>>> ===
>>> The current list of open tickets targeted at 3.2.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.2.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>


Re: [VOTE] Release Spark 3.2.0 (RC6)

2021-09-28 Thread Yuming Wang
+1 (non-bindng).

On Wed, Sep 29, 2021 at 12:03 AM Michael Heuer  wrote:

> +1 (non-bindng)
>
> Works for us, as with previous RCs.
>
>michael
>
>
> On Sep 28, 2021, at 10:45 AM, Gengliang Wang  wrote:
>
> Starting with my +1(non-binding)
>
> Thanks,
> Gengliang
>
> On Tue, Sep 28, 2021 at 11:45 PM Gengliang Wang  wrote:
>
>> Please vote on releasing the following candidate as
>> Apache Spark version 3.2.0.
>>
>> The vote is open until 11:59pm Pacific time September 30 and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.2.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.2.0-rc6 (commit
>> dde73e2e1c7e55c8e740cb159872e081ddfa7ed6):
>> https://github.com/apache/spark/tree/v3.2.0-rc6
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc6-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1393
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc6-docs/
>>
>> The list of bug fixes going into 3.2.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>
>> This release is using the release script of the tag v3.2.0-rc6.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.2.0?
>> ===
>> The current list of open tickets targeted at 3.2.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.2.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>


Re: [VOTE] Release Spark 3.0.3 (RC1)

2021-06-19 Thread Yuming Wang
+1

Tested a batch of production queries with the Thrift Server.

On Sat, Jun 19, 2021 at 3:04 PM Mridul Muralidharan 
wrote:

>
> +1
>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Pmesos
> -Pkubernetes
>
> Regards,
> Mridul
>
> PS: This might be related to some quirk of my local env - the first test
> run (after clean + package) usually fails for me (typically for Hive
> tests), with a second run succeeding; this is not specific to this RC,
> though.
>
> On Fri, Jun 18, 2021 at 6:14 PM Liang-Chi Hsieh  wrote:
>
>> +1. Docs looks good. Binary looks good.
>>
>> Ran simple test and some tpcds queries.
>>
>> Thanks for working on this!
>>
>>
>> wuyi wrote
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 3.0.3.
>> >
>> > The vote is open until Jun 21st 3AM (PST) and passes if a majority +1
>> PMC
>> > votes are cast, with
>> > a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.0.3
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see https://spark.apache.org/
>> >
>> > The tag to be voted on is v3.0.3-rc1 (commit
>> > 65ac1e75dc468f53fc778cd2ce1ba3f21067aab8):
>> > https://github.com/apache/spark/tree/v3.0.3-rc1
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1386/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-docs/
>> >
>> > The list of bug fixes going into 3.0.3 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12349723
>> >
>> > This release is using the release script of the tag v3.0.3-rc1.
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running it on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks; in Java/Scala
>> > you can add the staging repository to your project's resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.0.3?
>> > ===
>> >
>> > The current list of open tickets targeted at 3.0.3 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> > Version/s" = 3.0.3
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>>
>>


Re: [VOTE] Release Spark 3.1.2 (RC1)

2021-05-26 Thread Yuming Wang
+1 (non-binding)

On Wed, May 26, 2021 at 11:27 PM Maxim Gekk 
wrote:

> +1 (non-binding)
>
> On Mon, May 24, 2021 at 9:14 AM Dongjoon Hyun 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.1.2.
>>
>> The vote is open until May 27th 1AM (PST) and passes if a majority +1 PMC
>> votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.1.2
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v3.1.2-rc1 (commit
>> de351e30a90dd988b133b3d00fa6218bfcaba8b8):
>> https://github.com/apache/spark/tree/v3.1.2-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.2-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1384/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.2-rc1-docs/
>>
>> The list of bug fixes going into 3.1.2 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12349602
>>
>> This release is using the release script of the tag v3.1.2-rc1.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.1.2?
>> ===
>>
>> The current list of open tickets targeted at 3.1.2 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.1.2
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>


Re: [ANNOUNCE] Apache Spark 2.4.8 released

2021-05-18 Thread Yuming Wang
Great work, Liang-Chi!

On Tue, May 18, 2021 at 3:57 PM Jungtaek Lim 
wrote:

> Thanks for the huge efforts on driving the release!
>
> On Tue, May 18, 2021 at 4:53 PM Wenchen Fan  wrote:
>
>> Thank you, Liang-Chi!
>>
>> On Tue, May 18, 2021 at 1:32 PM Dongjoon Hyun 
>> wrote:
>>
>>> Finally! Thank you, Liang-Chi.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Mon, May 17, 2021 at 10:14 PM Takeshi Yamamuro 
>>> wrote:
>>>
 Thank you for the release work, Liang-Chi~

 On Tue, May 18, 2021 at 2:11 PM Hyukjin Kwon 
 wrote:

> Yay!
>
> On Tue, May 18, 2021 at 12:57 PM, Liang-Chi Hsieh wrote:
>
>> We are happy to announce the availability of Spark 2.4.8!
>>
>> Spark 2.4.8 is a maintenance release containing stability, correctness,
>> and security fixes. This release is based on the branch-2.4 maintenance
>> branch of Spark. We strongly recommend that all 2.4 users upgrade to this
>> stable release.
>>
>> To download Spark 2.4.8, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> Note that you might need to clear your browser cache or use
>> `Private`/`Incognito` mode, depending on your browser.
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2-4-8.html
>>
>> We would like to acknowledge all community members for contributing
>> to this
>> release. This release would not have been possible without you.
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>>
>>

 --
 ---
 Takeshi Yamamuro

>>>


Re: Apache Spark 3.1.2 Release?

2021-05-17 Thread Yuming Wang
+1.

On Tue, May 18, 2021 at 9:06 AM Hyukjin Kwon  wrote:

> +1, thanks for driving this
>
> On Tue, 18 May 2021, 09:33 Holden Karau,  wrote:
>
>> +1 and thanks for volunteering to be the RM :)
>>
>> On Mon, May 17, 2021 at 4:09 PM Takeshi Yamamuro 
>> wrote:
>>
>>> Thank you, Dongjoon~ sgtm, too.
>>>
>>> On Tue, May 18, 2021 at 7:34 AM Cheng Su  wrote:
>>>
 +1 for a new release, thanks Dongjoon!

 Cheng Su

 On 5/17/21, 2:44 PM, "Liang-Chi Hsieh"  wrote:

 +1 sounds good. Thanks Dongjoon for volunteering on this!


 Liang-Chi


 > Dongjoon Hyun wrote
 > Hi, All.
 >
 > Since the Apache Spark 3.1.1 tag creation (Feb 21),
 > 172 new patches, including 9 correctness patches and 4 K8s patches,
 > have arrived at branch-3.1.
 >
 > Shall we make a new release, Apache Spark 3.1.2, as the second release
 > on the 3.1 line?
 > I'd like to volunteer for the release manager for Apache Spark
 3.1.2.
 > I'm thinking about starting the first RC next week.
 >
 > $ git log --oneline v3.1.1..HEAD | wc -l
 >  172
 >
 > # Known correctness issues
 > SPARK-34534 New protocol FetchShuffleBlocks in
 OneForOneBlockFetcher
 > lead to data loss or correctness
 > SPARK-34545 PySpark Python UDF return inconsistent results
 when
 > applying 2 UDFs with different return type to 2 columns together
 > SPARK-34681 Full outer shuffled hash join when building left
 side
 > produces wrong result
 > SPARK-34719 fail if the view query has duplicated column names
 > SPARK-34794 Nested higher-order functions broken in DSL
 > SPARK-34829 transform_values return identical values when
 it's used
 > with udf that returns reference type
 > SPARK-34833 Apply right-padding correctly for correlated
 subqueries
 > SPARK-35381 Fix lambda variable name issues in nested
 DataFrame
 > functions in R APIs
 > SPARK-35382 Fix lambda variable name issues in nested
 DataFrame
 > functions in Python APIs
 >
 > # Notable K8s patches since K8s GA
 > SPARK-34674 Close SparkContext after the Main method has finished
 > SPARK-34948 Add ownerReference to executor configmap to fix leakages
 > SPARK-34820 Add apt-update before gnupg install
 > SPARK-34361 In case of downscaling avoid killing of executors already
 > known by the scheduler backend in the pod allocator
 >
 > Bests,
 > Dongjoon.





 --
 Sent from:
 http://apache-spark-developers-list.1001551.n3.nabble.com/





>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: beeline spark thrift server issue

2021-05-13 Thread Yuming Wang
Unable to access log(https://pastebin.com/G5Mwaw7E).

On Thu, May 13, 2021 at 11:23 PM Suryansh Agnihotri <
sagnihotri2...@gmail.com> wrote:

> Hi
> I was trying to access Spark SQL through JDBC but am facing an error. I am
> trying to run beeline:
>
> ! /usr/lib/spark/bin/beeline -u
> 'jdbc:hive2://host:10016/default;transportMode=binary'  -e '' 2>&1| awk
> '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL' -e 'Error: Could
> not open'
>
> Error: Could not open client transport with JDBC Uri:
> host:10016/default;transportMode=binary: java.net.ConnectException:
> Connection refused (Connection refused) (state=08S01,code=0)
>
> hive.server2.thrift.port=10016 and mode is binary.
> I verified the process is running on this port.
> I checked the spark thrift server logs https://pastebin.com/G5Mwaw7E
> It says "java.lang.RuntimeException: Unable to instantiate
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient"
> (pasted the logs in above link).
>
> *I am using Spark 3.1.0, Hive 3.1.2, and Hadoop 3.1.2.*
> Following this guide
>
> https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore
>
> I set "spark.sql.hive.metastore.version" to 3.1.2 and set jars to point
> to the Hive metastore 3.1.2 jars but am still getting the same error.
>
> From the logs I also suspected a version mismatch in datanucleus-core
> between Hive and Spark, but both use the same version.
> https://github.com/apache/hive/blob/branch-3.1/pom.xml#L129
> https://github.com/apache/spark/blob/branch-3.1/pom.xml#L184 (edited)
>
> Is this a known issue, and how should it be fixed? Let me know if anything
> else is required.
> Thanks
>
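
For reference, the metastore override described in that guide is normally
passed as configuration when launching the Thrift Server; a sketch (the
`maven` value downloads the Hive 3.1.2 jars from the central repository,
or the jars option can instead point at a local Hive installation):

  $ $SPARK_HOME/sbin/start-thriftserver.sh \
      --conf spark.sql.hive.metastore.version=3.1.2 \
      --conf spark.sql.hive.metastore.jars=maven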


Re: [DISCUSS] Add error IDs

2021-04-16 Thread Yuming Wang
+1 for this proposal.

On Fri, Apr 16, 2021 at 5:15 AM Karen  wrote:

> We could leave space in the numbering system, but a more flexible method
> may be to have the severity as a field associated with the error class -
> the same way we would associate error ID with SQLSTATE, or with whether an
> error is user-facing or internal. As you noted, I don't believe there is a
> standard framework for hints/warnings in Spark today. I propose that we
> leave out severity as a field until there is sufficient demand. We will
> leave room in the format for other fields.
>
> On Thu, Apr 15, 2021 at 3:18 AM Steve Loughran 
> wrote:
>
>>
>> Machine readable logs are always good, especially if you can read the
>> entire logs into an SQL query.
>>
>> It might be good to use some specific differentiation between
>> hint/warn/fatal errors in the numbering so that any automated analysis of
>> the logs can identify the class of an error even if it's an error that
>> isn't actually recognised. See the VMS docs for an example of this; the
>> scheme in Windows is apparently based on their work:
>> https://www.stsci.edu/ftp/documents/system-docs/vms-guide/html/VUG_19.html
>> Even if things are only errors for now, leaving room in the format for
>> other levels is wise.
>>
>> The trend in cloud infras is always to have some string "NoSuchBucket"
>> which is (a) guaranteed to be maintained over time and (b) searchable in
>> Google.
>>
>> (That said, AWS has every service not just making up its own values but
>> not even returning consistent responses for the same problem. S3
>> throttling: 503. DynamoDB: 500 plus one of two different messages. See
>> com.amazonaws.retry.RetryUtils for the details.)
>>
>> On Wed, 14 Apr 2021 at 20:04, Karen  wrote:
>>
>>> Hi all,
>>>
>>> We would like to kick off a discussion on adding error IDs to Spark.
>>>
>>> Proposal:
>>>
>>> Add error IDs to provide a language-agnostic, locale-agnostic, specific,
>>> and succinct answer for which class the problem falls under. When partnered
>>> with a text-based error class (eg. 12345 TABLE_OR_VIEW_NOT_FOUND), error
>>> IDs can provide meaningful categorization. They are useful for all Spark
>>> personas: from users, to support engineers, to developers.
>>>
>>> Add SQLSTATEs. As discussed in #32013, SQLSTATEs
>>> are portable error codes that are part of the ANSI/ISO SQL-99 standard,
>>> and
>>> especially useful for JDBC/ODBC users. They are not mutually exclusive with
>>> adding product-specific error IDs, which can be more specific; for example,
>>> MySQL uses an N-1 mapping from error IDs to SQLSTATEs:
>>> https://dev.mysql.com/doc/refman/8.0/en/error-message-elements.html.
>>>
>>> Uniquely link error IDs to error messages (1-1). This simplifies the
>>> auditing process and ensures that we uphold quality standards, as outlined
>>> in SPIP: Standardize Error Message in Spark (
>>> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit
>>> ).
>>>
>>> Requirements:
>>>
>>> Changes are backwards compatible; developers should still be able to
>>> throw exceptions in the existing style (eg. throw new
>>> AnalysisException(“Arbitrary error message.”)). Adding error IDs will be a
>>> gradual process, as there are thousands of exceptions thrown across the
>>> code base.
>>>
>>> Optional:
>>>
>>> Label errors as user-facing or internal. Internal errors should be
>>> logged, and end-users should be aware that they likely cannot fix the error
>>> themselves.
>>>
>>> End result:
>>>
>>> Before:
>>>
>>> AnalysisException: Cannot find column ‘fakeColumn’; line 1 pos 14;
>>>
>>> After:
>>>
>>> AnalysisException: SPK-12345 COLUMN_NOT_FOUND: Cannot find column
>>> ‘fakeColumn’; line 1 pos 14; (SQLSTATE 42704)
>>>
>>> Please let us know what you think about this proposal! We’d love to hear
>>> what you think.
>>>
>>> Best,
>>>
>>> Karen Feng
>>>
>>


Re: [DISCUSS] Build error message guideline

2021-04-14 Thread Yuming Wang
+1 LGTM.

On Thu, Apr 15, 2021 at 1:50 AM Karen  wrote:

> That makes sense to me - given that an assert failure throws an
> AssertionError, I would say that the same guidelines should apply to
> asserts.
>
> On Tue, Apr 13, 2021 at 7:41 PM Yuming Wang  wrote:
>
>> Do we have plans to apply these guidelines to assert? For example:
>>
>>
>> https://github.com/apache/spark/blob/5b478416f8e3fe2f015af1b6c8faa7fe9f15c05d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L136-L138
>>
>> https://github.com/apache/spark/blob/053dd858d38e6107bc71e0aa3a4954291b74f8c8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourcePartitioning.scala#L41
>>
>> On Wed, Apr 14, 2021 at 9:27 AM Hyukjin Kwon  wrote:
>>
>>> I would just go ahead and create a PR for that. Nothing written there
>>> looks unreasonable.
>>> But probably it should be best to wait a couple of days to make sure
>>> people are happy with it.
>>>
>>> On Wed, Apr 14, 2021 at 6:38 AM, Karen wrote:
>>>
>>>> If the proposed guidelines look good, it would be useful to share these
>>>> guidelines with the wider community. A good landing page for contributors
>>>> could be https://spark.apache.org/contributing.html. What do you think?
>>>>
>>>> Thank you,
>>>>
>>>> Karen Feng
>>>>
>>>> On Wed, Apr 7, 2021 at 8:19 PM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> LGTM (I took a look, and had some offline discussions w/ some
>>>>> corrections before it came out)
>>>>>
>>>>> On Thu, Apr 8, 2021 at 5:28 AM, Karen wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> As discussed in SPIP: Standardize Exception Messages in Spark (
>>>>>> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing),
>>>>>> improving error message quality in Apache Spark involves establishing an
>>>>>> error message guideline for developers. Error message style guidelines 
>>>>>> are
>>>>>> common practice across open-source projects, for example PostgreSQL (
>>>>>> https://www.postgresql.org/docs/current/error-style-guide.html).
>>>>>>
>>>>>> To move towards the goal of improving error message quality, we would
>>>>>> like to start building an error message guideline. We have attached a 
>>>>>> rough
>>>>>> draft to kick off this discussion:
>>>>>> https://docs.google.com/document/d/12k4zmaKmmdm6Pk63HS0N1zN1QT-6TihkWaa5CkLmsn8/edit?usp=sharing
>>>>>> .
>>>>>>
>>>>>> Please let us know what you think should be in the guideline! We look
>>>>>> forward to building this as a community.
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Karen Feng
>>>>>>
>>>>>


Re: [DISCUSS] Build error message guideline

2021-04-13 Thread Yuming Wang
Do we have plans to apply these guidelines to assert? For example:

https://github.com/apache/spark/blob/5b478416f8e3fe2f015af1b6c8faa7fe9f15c05d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L136-L138
https://github.com/apache/spark/blob/053dd858d38e6107bc71e0aa3a4954291b74f8c8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourcePartitioning.scala#L41
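
For illustration only (paraphrased; not the actual code behind those links),
applying the guideline to an assert would mean giving the failure message the
same structure and quality as a thrown exception:

  // Scala assert: the second argument becomes the AssertionError message.
  // requestedSchema and tableFieldNames are hypothetical names.
  assert(requestedSchema.length == tableFieldNames.length,
    s"The requested schema has ${requestedSchema.length} columns, but the " +
      s"table schema has ${tableFieldNames.length} columns. They must match.")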

On Wed, Apr 14, 2021 at 9:27 AM Hyukjin Kwon  wrote:

> I would just go ahead and create a PR for that. Nothing written there
> looks unreasonable.
> But it would probably be best to wait a couple of days to make sure
> people are happy with it.
>
> On Wed, Apr 14, 2021 at 6:38 AM, Karen wrote:
>
>> If the proposed guidelines look good, it would be useful to share these
>> guidelines with the wider community. A good landing page for contributors
>> could be https://spark.apache.org/contributing.html. What do you think?
>>
>> Thank you,
>>
>> Karen Feng
>>
>> On Wed, Apr 7, 2021 at 8:19 PM Hyukjin Kwon  wrote:
>>
>>> LGTM (I took a look, and had some offline discussions w/ some
>>> corrections before it came out)
>>>
>>> On Thu, Apr 8, 2021 at 5:28 AM, Karen wrote:
>>>
 Hi all,

 As discussed in SPIP: Standardize Exception Messages in Spark (
 https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing),
 improving error message quality in Apache Spark involves establishing an
 error message guideline for developers. Error message style guidelines are
 common practice across open-source projects, for example PostgreSQL (
 https://www.postgresql.org/docs/current/error-style-guide.html).

 To move towards the goal of improving error message quality, we would
 like to start building an error message guideline. We have attached a rough
 draft to kick off this discussion:
 https://docs.google.com/document/d/12k4zmaKmmdm6Pk63HS0N1zN1QT-6TihkWaa5CkLmsn8/edit?usp=sharing
 .

 Please let us know what you think should be in the guideline! We look
 forward to building this as a community.

 Thank you,

 Karen Feng

>>>


Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Yuming Wang
Congrats!

On Sat, Mar 27, 2021 at 7:13 AM Takeshi Yamamuro 
wrote:

> Congrats, all~
>
> On Sat, Mar 27, 2021 at 7:46 AM Jungtaek Lim 
> wrote:
>
>> Congrats all!
>>
>> On Sat, Mar 27, 2021 at 6:56 AM, Liang-Chi Hsieh wrote:
>>
>>> Congrats! Welcome!
>>>
>>>
>>> Matei Zaharia wrote
>>> > Hi all,
>>> >
>>> > The Spark PMC recently voted to add several new committers. Please
>>> join me
>>> > in welcoming them to their new role! Our new committers are:
>>> >
>>> > - Maciej Szymkiewicz (contributor to PySpark)
>>> > - Max Gekk (contributor to Spark SQL)
>>> > - Kent Yao (contributor to Spark SQL)
>>> > - Attila Zsolt Piros (contributor to decommissioning and Spark on
>>> > Kubernetes)
>>> > - Yi Wu (contributor to Spark Core and SQL)
>>> > - Gabor Somogyi (contributor to Streaming and security)
>>> >
>>> > All six of them contributed to Spark 3.1 and we’re very excited to have
>>> > them join as committers.
>>> >
>>> > Matei and the Spark PMC
>>> > -
>>> > To unsubscribe e-mail:
>>>
>>> > dev-unsubscribe@.apache
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-02 Thread Yuming Wang
Great work, Hyukjin!

On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon  wrote:

> We are excited to announce Spark 3.1.1 today.
>
> Apache Spark 3.1.1 is the second release of the 3.x line. This release adds
> Python type annotations and Python dependency management support as part
> of Project Zen.
> Other major updates include improved ANSI SQL compliance support, history
> server support
> in structured streaming, the general availability (GA) of Kubernetes and
> node decommissioning
> in Kubernetes and Standalone. In addition, this release continues to focus
> on usability, stability,
> and polish while resolving around 1500 tickets.
>
> We'd like to thank our contributors and users for their contributions and
> early feedback to
> this release. This release would not have been possible without you.
>
> To download Spark 3.1.1, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-1-1.html
>
>


Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-22 Thread Yuming Wang
+1. @Sean Owen I do not have this issue:

[info] SparkSQLEnvSuite:
19:45:15.430 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to
load native-hadoop library for your platform... using builtin-java
classes where applicable
19:45:56.366 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of
name hive.stats.jdbc.timeout does not exist
19:45:56.367 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of
name hive.stats.retries.wait does not exist
19:45:59.395 WARN org.apache.hadoop.hive.metastore.ObjectStore:
Version information not found in metastore.
hive.metastore.schema.verification is not enabled so recording the
schema version 2.3.0
19:45:59.395 WARN org.apache.hadoop.hive.metastore.ObjectStore:
setMetaStoreSchemaVersion called but recording version is disabled:
version = 2.3.0, comment = Set by MetaStore root@10.169.161.219
19:45:59.411 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed
to get database default, returning NoSuchObjectException
[info] - SPARK-29604 external listeners should be initialized with
Spark classloader (45 seconds, 249 milliseconds)
19:46:00.067 WARN org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite:

= POSSIBLE THREAD LEAK IN SUITE
o.a.s.sql.hive.thriftserver.SparkSQLEnvSuite, thread names:
rpc-boss-3-1, derby.rawStoreDaemon,
com.google.common.base.internal.Finalizer, Keep-Alive-Timer, Timer-3,
BoneCP-keep-alive-scheduler, shuffle-boss-6-1,
BoneCP-pool-watch-thread =
[info] ScalaTest
[info] Run completed in 46 seconds, 676 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.


On Tue, Feb 23, 2021 at 9:38 AM Sean Owen  wrote:

> +1 LGTM, same results as last time. Does anyone see the error below? It is
> probably env-specific as the Jenkins jobs don't hit this. Just checking.
>
>  SPARK-29604 external listeners should be initialized with Spark
> classloader *** FAILED ***
>   java.lang.RuntimeException: [download failed:
> tomcat#jasper-compiler;5.5.23!jasper-compiler.jar, download failed:
> tomcat#jasper-runtime;5.5.23!jasper-runtime.jar, download failed:
> commons-el#commons-el;1.0!commons-el.jar, download failed:
> org.apache.hive#hive-exec;2.3.7!hive-exec.jar]
>   at
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1420)
>   at
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122)
>   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
>   at
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122)
>   at
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:64)
>   at
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:63)
>   at
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:439)
>   at
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:352)
>   at
> org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:71)
>   at
> org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:70)
>
> On Mon, Feb 22, 2021 at 12:57 AM Hyukjin Kwon  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.1.1.
>>
>> The vote is open until February 24th 11PM PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.1.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.1.1-rc3 (commit
>> 1d550c4e90275ab418b9161925049239227f3dc9):
>> https://github.com/apache/spark/tree/v3.1.1-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> 
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1367
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-docs/
>>
>> The list of bug fixes going into 3.1.1 can be found at the following URL:
>> https://s.apache.org/41kf2
>>
>> This release is using the release script of the tag v3.1.1-rc3.
>>
>> FAQ
>>
>> ===
>> What happened to 3.1.0?
>> ===
>>
>> There was a technical issue during Apache Spark 3.1.0 preparation, and it
>> was discussed and decided to skip 3.1.0.
>> Please see
>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
>> more details.
>>
>> 

Re: Apache Spark 3.0.2 Release ?

2021-02-12 Thread Yuming Wang
+1.

On Sat, Feb 13, 2021 at 10:38 AM Takeshi Yamamuro 
wrote:

> +1, too. Thanks, Dongjoon!
>
> On 2021/02/13 at 11:07, Xiao Li wrote:
>
> 
> +1
>
> Happy Lunar New Year!
>
> Xiao
>
> On Fri, Feb 12, 2021 at 5:33 PM Hyukjin Kwon  wrote:
>
>> Yeah, +1 too
>>
>> On Sat, Feb 13, 2021 at 4:49 AM, Dongjoon Hyun wrote:
>>
>>> Thank you, Sean!
>>>
>>> On Fri, Feb 12, 2021 at 11:41 AM Sean Owen  wrote:
>>>
 Sounds like a fine time to me, sure.

 On Fri, Feb 12, 2021 at 1:39 PM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> As of today, `branch-3.0` has 307 patches (including 25 correctness
> patches) since v3.0.1 tag (released on September 8th, 2020).
>
> Since we stabilized branch-3.0 during 3.1.x preparation so far,
> it would be great if we start to release Apache Spark 3.0.2 next week.
> And, I'd like to volunteer for Apache Spark 3.0.2 release manager.
>
> What do you think about the Apache Spark 3.0.2 release?
>
> Bests,
> Dongjoon.
>
>
> --
> SPARK-31511 Make BytesToBytesMap iterator() thread-safe
> SPARK-32635 When pyspark.sql.functions.lit() function is used with
> dataframe cache, it returns wrong result
> SPARK-32753 Deduplicating and repartitioning the same column create
> duplicate rows with AQE
> SPARK-32764 compare of -0.0 < 0.0 return true
> SPARK-32840 Invalid interval value can happen to be just adhesive with
> the unit
> SPARK-32908 percentile_approx() returns incorrect results
> SPARK-33019 Use
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
> SPARK-33183 Bug in optimizer rule EliminateSorts
> SPARK-33260 SortExec produces incorrect results if sortOrder is a
> Stream
> SPARK-33290 REFRESH TABLE should invalidate cache even though the
> table itself may not be cached
> SPARK-33358 Spark SQL CLI command processing loop can't exit while one
> comand fail
> SPARK-33404 "date_trunc" expression returns incorrect results
> SPARK-33435 DSv2: REFRESH TABLE should invalidate caches
> SPARK-33591 NULL is recognized as the "null" string in partition specs
> SPARK-33593 Vector reader got incorrect data with binary partition
> value
> SPARK-33726 Duplicate field names causes wrong answers during
> aggregation
> SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
> SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
> SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
> SPARK-34187 Use available offset range obtained during polling when
> checking offset validation
> SPARK-34212 For parquet table, after changing the precision and scale
> of decimal type in hive, spark reads incorrect value
> SPARK-34213 LOAD DATA doesn't refresh v1 table cache
> SPARK-34229 Avro should read decimal values with the file schema
> SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table cache
>

>
> --
>
>


Re: [VOTE] Release Spark 3.1.1 (RC2)

2021-02-09 Thread Yuming Wang
+1. Tested a batch of queries with YARN client mode.

On Tue, Feb 9, 2021 at 2:57 PM 郑瑞峰  wrote:

> +1 (non-binding)
>
> Thank you, Hyukjin
>
>
> -- Original Message --
> *From:* "Gengliang Wang";
> *Sent:* Tuesday, February 9, 2021 1:50 PM
> *To:* "Sean Owen";
> *Cc:* "Hyukjin Kwon"; "Yuming Wang"; "dev";
> *Subject:* Re: [VOTE] Release Spark 3.1.1 (RC2)
>
> +1
>
> On Tue, Feb 9, 2021 at 1:39 PM Sean Owen  wrote:
>
>> Same result as last time for me, +1. Tested with Java 11.
>> I fixed the two issues without assignee; one was WontFix though.
>>
>> On Mon, Feb 8, 2021 at 7:43 PM Hyukjin Kwon  wrote:
>>
>>> Let's set the assignees properly then. Shouldn't be a problem for the
>>> release.
>>>
>>> On Tue, 9 Feb 2021, 10:40 Yuming Wang,  wrote:
>>>
>>>>
>>>> Many tickets do not have the correct assignee:
>>>>
>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20in%20(3.1.0%2C%203.1.1)%20AND%20(assignee%20is%20EMPTY%20or%20assignee%20%3D%20apachespark)
>>>>
>>>>
>>>> On Tue, Feb 9, 2021 at 9:05 AM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> +1 (binding) from myself too.
>>>>>
>>>>> On Tue, Feb 9, 2021 at 9:28 AM, Kent Yao wrote:
>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> *Kent Yao *
>>>>>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>>>>>> *a spark enthusiast*
>>>>>> *kyuubi <https://github.com/yaooqinn/kyuubi> is a unified
>>>>>> multi-tenant JDBC interface for large-scale data processing and 
>>>>>> analytics,
>>>>>> built on top of Apache Spark <http://spark.apache.org/>.*
>>>>>> *spark-authorizer <https://github.com/yaooqinn/spark-authorizer> A
>>>>>> Spark SQL extension which provides SQL Standard Authorization for 
>>>>>> **Apache
>>>>>> Spark <http://spark.apache.org/>.*
>>>>>> *spark-postgres <https://github.com/yaooqinn/spark-postgres> A
>>>>>> library for reading data from and transferring data to Postgres / 
>>>>>> Greenplum
>>>>>> with Spark SQL and DataFrames, 10~100x faster.*
>>>>>> *spark-func-extras <https://github.com/yaooqinn/spark-func-extras> A
>>>>>> library that brings excellent and useful functions from various modern
>>>>>> database management systems to Apache Spark <http://spark.apache.org/>.*
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 02/9/2021 at 08:24, Hyukjin Kwon wrote:
>>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>> version 3.1.1.
>>>>>>
>>>>>> The vote is open until February 15th 5PM PST and passes if a majority
>>>>>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>
>>>>>> Note that the vote is open for 7 days this time because it is a
>>>>>> holiday season in several countries, including South Korea (where I
>>>>>> live) and China, and I would like to make sure people do not miss it.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 3.1.1
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>
>>>>>> The tag to be voted on is v3.1.1-rc2 (commit
>>>>>> cf0115ac2d60070399af481b14566f33d22ec45e):
>>>>>> https://github.com/apache/spark/tree/v3.1.1-rc2
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-bin/
>>>>>>
>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1365

Re: [VOTE] Release Spark 3.1.1 (RC2)

2021-02-08 Thread Yuming Wang
Many tickets do not have the correct assignee:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20in%20(3.1.0%2C%203.1.1)%20AND%20(assignee%20is%20EMPTY%20or%20assignee%20%3D%20apachespark)


On Tue, Feb 9, 2021 at 9:05 AM Hyukjin Kwon  wrote:

> +1 (binding) from myself too.
>
> On Tue, Feb 9, 2021 at 9:28 AM, Kent Yao wrote:
>
>>
>> +1
>>
>> *Kent Yao *
>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>> *a spark enthusiast*
>> *kyuubi is a
>> unified multi-tenant JDBC interface for large-scale data processing and
>> analytics, built on top of Apache Spark .*
>> *spark-authorizer A Spark
>> SQL extension which provides SQL Standard Authorization for **Apache
>> Spark .*
>> *spark-postgres  A library
>> for reading data from and transferring data to Postgres / Greenplum with
>> Spark SQL and DataFrames, 10~100x faster.*
>> *spark-func-extras A
>> library that brings excellent and useful functions from various modern
>> database management systems to Apache Spark .*
>>
>>
>>
>> On 02/9/2021 at 08:24, Hyukjin Kwon wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.1.1.
>>
>> The vote is open until February 15th 5PM PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> Note that the vote is open for 7 days this time because it is a holiday
>> season in several countries, including South Korea (where I live) and
>> China, and I would like to make sure people do not miss it.
>>
>> [ ] +1 Release this package as Apache Spark 3.1.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.1.1-rc2 (commit
>> cf0115ac2d60070399af481b14566f33d22ec45e):
>> https://github.com/apache/spark/tree/v3.1.1-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> 
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1365
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-docs/
>>
>> The list of bug fixes going into 3.1.1 can be found at the following URL:
>> https://s.apache.org/41kf2
>>
>> This release is using the release script of the tag v3.1.1-rc2.
>>
>> FAQ
>>
>> ===
>> What happened to 3.1.0?
>> ===
>>
>> There was a technical issue during Apache Spark 3.1.0 preparation, and it
>> was discussed and decided to skip 3.1.0.
>> Please see
>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
>> more details.
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC via "pip install
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-bin/pyspark-3.1.1.tar.gz
>> "
>> and see if anything important breaks.
>> In the Java/Scala, you can add the staging repository to your projects
>> resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.1.1?
>> ===
>>
>> The current list of open tickets targeted at 3.1.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.1.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>>


Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-18 Thread Yuming Wang
+1.

On Tue, Jan 19, 2021 at 7:54 AM Hyukjin Kwon  wrote:

> I forgot to say :). I'll start with my +1.
>
> On Mon, 18 Jan 2021, 21:06 Hyukjin Kwon,  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.1.1.
>>
>> The vote is open until January 22nd 4PM PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.1.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.1.1-rc1 (commit
>> 53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d):
>> https://github.com/apache/spark/tree/v3.1.1-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1364
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-docs/
>>
>> The list of bug fixes going into 3.1.1 can be found at the following URL:
>> https://s.apache.org/41kf2
>>
>> This release is using the release script of the tag v3.1.1-rc1.
>>
>> FAQ
>>
>> ===
>> What happened to 3.1.0?
>> ===
>>
>> There was a technical issue during Apache Spark 3.1.0 preparation, and it
>> was discussed and decided to skip 3.1.0.
>> Please see
>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
>> more details.
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC via "pip install
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/pyspark-3.1.1.tar.gz
>> "
>> and see if anything important breaks.
>> In the Java/Scala, you can add the staging repository to your projects
>> resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.1.1?
>> ===
>>
>> The current list of open tickets targeted at 3.1.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.1.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>


Re: [build system] WE'RE LIVE!

2020-12-01 Thread Yuming Wang
Thank you, Shane.

On Wed, Dec 2, 2020 at 8:55 AM shane knapp ☠  wrote:

> https://amplab.cs.berkeley.edu/jenkins/
>
> i cleared the build queue, so you'll need to retrigger your PRs.  there
> will be occasional downtime over the next few days and weeks as we uncover
> system-level errors and more reimaging happens...  but for now, we're
> building.
>
> a big thanks goes out to jon for his work on the project!  we couldn't
> have done it w/o him.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: jenkins downtime tomorrow evening/weekend

2020-11-24 Thread Yuming Wang
Hi Shane,

Did you set "export LANG=en_US.UTF-8"? Some tests seem to have failed because
of this issue:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131631/testReport/

Please see https://issues.apache.org/jira/browse/SPARK-27177 for more
details.
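
A quick way to confirm what the JVM on a worker actually sees (a minimal,
standalone sketch; under an unset or POSIX LANG the default charset typically
reports US-ASCII / ANSI_X3.4-1968 rather than UTF-8):

  import java.nio.charset.Charset

  println(sys.env.getOrElse("LANG", "<unset>"))  // e.g. "POSIX" or unset
  println(Charset.defaultCharset())              // e.g. US-ASCII vs UTF-8
  println(System.getProperty("file.encoding"))   // the encoding the JVM picked up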

On Tue, Nov 24, 2020 at 8:23 AM shane knapp ☠  wrote:

> it seems that the plugin upgrade went as smoothly as it could have...  i
> still have a bunch of stack traces to filter through and see if anything is
> really broken but it's looking pretty good and things are building.
>
> if you see any bad behavior from jenkins, don't hesitate to file a jira
> and ping me here.
>
> also, my backlog of things i need to install will be addressed this week.
> the ansible is coming along nicely!
>
> On Mon, Nov 23, 2020 at 2:11 PM shane knapp ☠  wrote:
>
>> the third most terrifying event in the world, a massive jenkins plugin
>> update is happening in a couple of hours.  i'm going to restart jenkins and
>> start working out any bugs/issues that pop up.
>>
>> this could be short, or quite long.  i'm guessing somewhere in the
>> middle.  no new builds will be kicked off starting now.
>>
>> in parallel, i'm about to start porting my ansible to ubuntu 20 and
>> testing that on two freshly reinstalled workers.  the ultimate goal is to
>> get the PRB running on ubuntu 20...   the sbt tests will also likely be
>> broken as i've never been able to get them working on ubuntu 16, 18 or 20.
>>
>> shane
>>
>> On Sat, Nov 21, 2020 at 4:23 PM shane knapp ☠ 
>> wrote:
>>
>>> somehow that went pretty smoothly, tho i've got a bunch of plugins to
>>> deal with...  we're back up and building w/a shiny new UI.  :)
>>>
>>> On Sat, Nov 21, 2020 at 3:52 PM shane knapp ☠ 
>>> wrote:
>>>
 this is starting now

 On Thu, Nov 19, 2020 at 4:34 PM shane knapp ☠ 
 wrote:

> i'm going to be upgrading jenkins to something more reasonable, and
> there will definitely be some downtime as i get things sorted.
>
> we should be back up and building by monday.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


 --
 Shane Knapp
 Computer Guy / Voice of Reason
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: 回复: [DISCUSS] Apache Spark 3.0.1 Release

2020-08-25 Thread Yuming Wang
Another correctness issue: https://issues.apache.org/jira/browse/SPARK-32659

On Tue, Aug 25, 2020 at 11:25 PM Sean Owen  wrote:

> That isn't a blocker (see comments - not a regression).
> That said I think we have a fix ready to merge now, if there are no
> objections.
>
> On Tue, Aug 25, 2020 at 10:24 AM Dongjoon Hyun 
> wrote:
> >
> > For the correctness blocker, we have the following, Tom.
> >
> > - https://issues.apache.org/jira/browse/SPARK-32614
> > - https://github.com/apache/spark/pull/29516
> >
> > Bests,
> > Dongjoon.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[ANNOUNCE] Announcing Apache Spark 3.0.0-preview2

2019-12-24 Thread Yuming Wang
Hi all,

To enable wide-scale community testing of the upcoming Spark 3.0 release,
the Apache Spark community has posted a new preview release of Spark 3.0.
This preview is *not a stable release in terms of either API or
functionality*, but it is meant to give the community early access to try
the code that will become Spark 3.0. If you would like to test the release,
please download it, and send feedback using either the mailing lists or JIRA.

There are a lot of exciting new features added to Spark 3.0, including
Dynamic Partition Pruning, Adaptive Query Execution, Accelerator-aware
Scheduling, Data Source API with Catalog Supports, Vectorization in SparkR,
support of Hadoop 3/JDK 11/Scala 2.12, and many more. For a full list of
major features and changes in Spark 3.0.0-preview2, please check these threads:
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-3-0-preview-release-feature-list-and-major-changes-td28050.html
and
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-3-0-preview-release-2-td28491.html

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 3.0.0-preview2, head over to the download page:
https://archive.apache.org/dist/spark/spark-3.0.0-preview2

Happy Holidays.

Yuming


Re: Fw:Re: [VOTE][SPARK-29018][SPIP]:Build spark thrift server based on protocol v11

2019-12-23 Thread Yuming Wang
I'm +1 for this SPIP, for two reasons:

1. The current thriftserver has some issues that are not easy to solve,
such as SPARK-28636.
2. The gap between the ORC version Spark uses and the one the built-in
Hive uses is getting bigger and bigger. We can't ensure that there will
be no compatibility issues in the future. If the thriftserver does not depend
on Hive, it will be much easier to upgrade the built-in Hive in the future.

On Sat, Dec 21, 2019 at 9:28 PM angers.zhu  wrote:

> Hi all,
>
> I have completed a design doc about how to use and configure this new thrift
> server, and some design details about the changes and impersonation.
>
> Hope for your suggestions and ideas.
>
> SPIP DOC :
> https://docs.google.com/document/d/1ug4K5e2okF5Q2Pzi3qJiUILwwqkn0fVQaQ-Q95HEcJQ/edit#heading=h.x97c6tj78zo0
> Design DOC :
> https://docs.google.com/document/d/1UKE9QTtHqSZBq0V_vEn54PlWaWPiRAKf_JvcT0skaSo/edit#heading=h.q1ed5q1ldh14
> Thrift server about configurations
> https://docs.google.com/document/d/1uI35qJmQO4FKE6pr0h3zetZqww-uI8QsQjxaYY_qb1s/edit?usp=drive_web=110963191229426834922
>
> Best Regards
> angers.zhu
> angers@gmail.com
>
> 
> Signature customized by NetEase Mail Master
>
> - Forwarded Message -
> From: angers.zhu  
> Date: 12/18/2019 22:29
> To: dev-ow...@spark.apache.org 
> 
> Subject: Re: [VOTE][SPARK-29018][SPIP]:Build spark thrift server based on
> protocol v11
>
> Add spark-dev group access privilege to google.
> angers.zhu
> angers@gmail.com
>
> 
> 签名由 网易邮箱大师  定制
>
> On 12/18/2019 at 22:02, Sandeep Katta wrote:
>
> I couldn't access the doc, please give permission to the spark-dev group
>
> On Wed, 18 Dec 2019 at 18:05, angers.zhu  wrote:
>
>> With the development of Spark and Hive,in current sql/hive-thriftserver
>> module,
>>
>> we need to do a lot of work to solve code conflicts for different
>> built-in hive versions.
>>
>> It's annoying, unending work under the current approach. And these issues have
>> limited
>>
>> our ability and convenience to develop new features for Spark’s thrift
>> server.
>>
>> We propose to implement a new thrift server and JDBC driver based on
>> Hive's latest v11 TCLIService.thrift protocol. Finally, the new thrift
>> server will have the features below:
>>
>> 1. Build a new module, spark-service, as Spark's thrift server
>> 2. Avoid the amount of reflection and inherited code the current
>>    `hive-thriftserver` module needs
>> 3. Support all functions the current `sql/hive-thriftserver` supports
>> 4. Use code maintained entirely by Spark itself, with no dependency on Hive
>> 5. Support existing functionality in Spark's own way, not limited by
>>    Hive's code
>> 6. Support running with or without a Hive metastore
>> 7. Support user impersonation through multi-tenant, split Hive
>>    authentication and DFS authentication
>> 8. Support session hooks with Spark's own code
>> 9. Add a new JDBC driver, spark-jdbc, with Spark's own connection URL
>>    "jdbc:spark::/"
>> 10. Support both hive-jdbc and spark-jdbc clients, so we can support
>>     most clients and BI platforms
>>
>>
>>
>> https://issues.apache.org/jira/browse/SPARK-29018
>>
>> Google Doc:
>> https://docs.google.com/document/d/1ug4K5e2okF5Q2Pzi3qJiUILwwqkn0fVQaQ-Q95HEcJQ/edit#
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> I'll start with my +1
>> angers.zhu
>> angers@gmail.com
>>
>> 
>> Signature customized by NetEase Mail Master
>>
>>


[VOTE][RESULT] SPARK 3.0.0-preview2 (RC2)

2019-12-22 Thread Yuming Wang
Hi, All.

The vote passes. Thanks to all who helped with this release 3.0.0-preview2!
I'll follow up later with a release announcement once everything is
published.

+1 (* = binding):
- Sean Owen *
- Dongjoon Hyun *
- Takeshi Yamamuro *
- Wenchen Fan *

+0: None

-1: None




Regards,
Yuming


Re: [VOTE] SPARK 3.0.0-preview2 (RC2)

2019-12-16 Thread Yuming Wang
Please go to td28549 to vote; this voting link is incorrect.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] SPARK 3.0.0-preview2 (RC2)

2019-12-16 Thread Yuming Wang
Please vote on releasing the following candidate as Apache Spark
version 3.0.0-preview2.

The vote is open until December 20 PST and passes if a majority +1 PMC
votes are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.0-preview2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.0.0-preview2-rc2 (commit
bcadd5c3096109878fe26fb0d57a9b7d6fdaa257):
https://github.com/apache/spark/tree/v3.0.0-preview2-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview2-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1338/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview2-rc2-docs/

The list of bug fixes going into 3.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12339177

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).
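
(As an illustrative sbt snippet for this, using the staging URL above; the
module and version are examples of RC coordinates, not a recommendation:)

  resolvers += "Spark 3.0.0-preview2 RC2 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1338/"
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0-preview2"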

===
What should happen to JIRA tickets still targeting 3.0.0?
===

The current list of open tickets targeted at 3.0.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.0.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.



Re: Packages to release in 3.0.0-preview

2019-10-29 Thread Yuming Wang
Thank you Dongjoon. Please check out the latest code from test-spark-jdk11
<https://github.com/wangyum/test-spark-jdk11>. It works with JDK 1.8.
One workaround is to install the Spark packages to the local Maven repository
using the hadoop-3.2 profile and JDK 1.8.


On Mon, Oct 28, 2019 at 5:03 AM Dongjoon Hyun 
wrote:

> Hi, Yuming.
>
> Is the project working correctly on JDK8 with you?
>
> When I simply cloned your repo and did `mvn clean package` on
> JDK 1.8.0_232, it seems not to pass the UTs.
>
> I also tried to rerun after ignoring two ORC table test like the
> followings, but the UT is failing.
>
> ~/A/test-spark-jdk11:master$ git diff | grep 'ORC table'
> -  test("Datasource ORC table") {
> +  ignore("Datasource ORC table") {
> -  test("Hive ORC table") {
> +  ignore("Hive ORC table") {
>
> ~/A/test-spark-jdk11:master$ mvn clean package
> ...
> - Hive ORC table !!! IGNORED !!!
> Run completed in 36 seconds, 999 milliseconds.
> Total number of tests run: 2
> Suites: completed 3, aborted 0
> Tests: succeeded 1, failed 1, canceled 0, ignored 2, pending 0
> *** 1 TEST FAILED ***
>
> ~/A/test-spark-jdk11:master$ java -version
> openjdk version "1.8.0_232"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_232-b09)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.232-b09, mixed mode)
>
>
> Bests,
> Dongjoon.
>
> On Sun, Oct 27, 2019 at 1:38 PM Dongjoon Hyun 
> wrote:
>
>> It seems not a Hadoop issue, doesn't it?
>>
>> What Yuming pointed seems to be `Hive 2.3.6` profile implementation issue
>> which is enabled only when `Hadoop 3.2`.
>>
>> From my side, I'm +1 for publishing jars which depends on `Hadoop 3.2.0 /
>> Hive 2.3.6` jars to Maven since Apache Spark 3.0.0.
>>
>> For the others, I'd like to mention that this implies the followings, too.
>>
>> 1. We are not going to use Hive 1.2.1 library. Only Hadoop-2.7 profile
>> tarball distribution will use Hive 1.2.1.
>> 2. Although we depends on Hadoop 3.2.0, Hadoop 3.2.1 changes their Guava
>> library version significantly.
>>> So, it requires some attention in Apache Spark. Otherwise, we may
>>> hit some issues on the Hadoop 3.2.1+ runtime later.
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Sun, Oct 27, 2019 at 7:31 AM Sean Owen  wrote:
>>
>>> Is the Spark artifact actually any different between those builds? I
>>> thought it just affected what else was included in the binary tarball.
>>> If it matters, yes I'd publish a "Hadoop 3" version to Maven. (Scala
>>> 2.12 is the only supported Scala version).
>>>
>>> On Sun, Oct 27, 2019 at 4:35 AM Yuming Wang  wrote:
>>> >
>>> > Do we need to publish the Scala 2.12 + hadoop 3.2 jar packages to the
>>> Maven repository? Otherwise it will throw a NoSuchMethodError on Java 11.
>>> > Here is an example:
>>> >
>>> https://github.com/wangyum/test-spark-jdk11/blob/master/src/test/scala/test/spark/HiveTableSuite.scala#L34-L38
>>> >
>>> https://github.com/wangyum/test-spark-jdk11/commit/927ce7d3766881fba98f2434055fa3a1d1544ad2/checks?check_suite_id=283076578
>>> >
>>> >
>>> > On Sat, Oct 26, 2019 at 10:41 AM Takeshi Yamamuro <
>>> linguin@gmail.com> wrote:
>>> >>
>>> >> Thanks for that work!
>>> >>
>>> >> > I don't think JDK 11 is a separate release (by design). We build
>>> >> > everything targeting JDK 8 and it should work on JDK 11 too.
>>> >> +1. a single package working on both jvms looks nice.
>>> >>
>>> >>
>>> >> On Sat, Oct 26, 2019 at 4:18 AM Sean Owen  wrote:
>>> >>>
>>> >>> I don't think JDK 11 is a separate release (by design). We build
>>> >>> everything targeting JDK 8 and it should work on JDK 11 too.
>>> >>>
>>> >>> So, just two releases, but, frankly I think we soon need to stop
>>> >>> multiple releases for multiple Hadoop versions, and stick to Hadoop
>>> 3.
>>> >>> I think it's fine to try to release for Hadoop 2 as the support still
>>> >>> exists, and because the difference happens to be larger due to the
>>> >>> different Hive dependency.
>>> >>>
>>> >>> On Fri, Oct 25, 2019 at 2:08 PM Xingbo Jiang 
>>> wrote:
>>> >>> >
>>> >>> > Hi all,
>>> >>> >
>>> >>> > I would like to bring out a discussion on how many packages shall
>>> be released in 3.0.0-preview, the ones I can think of now:
>>> >>> >
>>> >>> > * scala 2.12 + hadoop 2.7
>>> >>> > * scala 2.12 + hadoop 3.2
>>> >>> > * scala 2.12 + hadoop 3.2 + JDK 11
>>> >>> >
>>> >>> > Do you have other combinations to add to the above list?
>>> >>> >
>>> >>> > Cheers,
>>> >>> >
>>> >>> > Xingbo
>>> >>>
>>> >>> -
>>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>>
>>> >>
>>> >>
>>> >> --
>>> >> ---
>>> >> Takeshi Yamamuro
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: Packages to release in 3.0.0-preview

2019-10-27 Thread Yuming Wang
Do we need to publish the Scala 2.12 + hadoop 3.2 jar packages to the Maven
repository? Otherwise it will throw a NoSuchMethodError on Java 11.
Here is an example:
https://github.com/wangyum/test-spark-jdk11/blob/master/src/test/scala/test/spark/HiveTableSuite.scala#L34-L38
https://github.com/wangyum/test-spark-jdk11/commit/927ce7d3766881fba98f2434055fa3a1d1544ad2/checks?check_suite_id=283076578


On Sat, Oct 26, 2019 at 10:41 AM Takeshi Yamamuro 
wrote:

> Thanks for that work!
>
> > I don't think JDK 11 is a separate release (by design). We build
> > everything targeting JDK 8 and it should work on JDK 11 too.
> +1. a single package working on both jvms looks nice.
>
>
> On Sat, Oct 26, 2019 at 4:18 AM Sean Owen  wrote:
>
>> I don't think JDK 11 is a separate release (by design). We build
>> everything targeting JDK 8 and it should work on JDK 11 too.
>>
>> So, just two releases, but, frankly I think we soon need to stop
>> multiple releases for multiple Hadoop versions, and stick to Hadoop 3.
>> I think it's fine to try to release for Hadoop 2 as the support still
>> exists, and because the difference happens to be larger due to the
>> different Hive dependency.
>>
>> On Fri, Oct 25, 2019 at 2:08 PM Xingbo Jiang 
>> wrote:
>> >
>> > Hi all,
>> >
>> > I would like to bring out a discussion on how many packages shall be
>> released in 3.0.0-preview, the ones I can think of now:
>> >
>> > * scala 2.12 + hadoop 2.7
>> > * scala 2.12 + hadoop 3.2
>> > * scala 2.12 + hadoop 3.2 + JDK 11
>> >
>> > Do you have other combinations to add to the above list?
>> >
>> > Cheers,
>> >
>> > Xingbo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>


jenkins locale issue

2019-03-11 Thread Yuming Wang
Why is the Jenkins locale set as follows?

LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"


Hadoop will throw InvalidPathException since HADOOP-12045. For more details,
please see HADOOP-16180.

My question is: could we set the system locale to UTF-8 to work around this bug?

Thanks!


[DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Yuming Wang
Dear Spark Developers and Users,



Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2 to 2.3.4
to solve some critical issues, such as supporting Hadoop 3.x and fixing
some ORC and Parquet issues. This is the list:

*Hive issues*:

[SPARK-26332][HIVE-10790] Spark sql write orc table on viewFS throws exception

[SPARK-25193][HIVE-12505] insert overwrite doesn't throw exception when drop
old data fails

[SPARK-26437][HIVE-13083] Decimal data becomes bigint to query, unable to query

[SPARK-25919][HIVE-11771] Date value corrupts when tables are "ParquetHiveSerDe"
formatted and target table is Partitioned

[SPARK-12014][HIVE-11100] Spark SQL query containing semicolon is broken in
Beeline



*Spark issues*:

[SPARK-23534] Spark run on Hadoop 3.0.0

[SPARK-20202] Remove references to org.spark-project.hive

[SPARK-18673] Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

[SPARK-24766] CreateHiveTableAsSelect and InsertIntoHiveDir won't generate
decimal column stats in parquet





Since the code for the *hive-thriftserver* module has changed too much for
this upgrade, I split it into two PRs for easy review.

The first PR does not contain
the changes to hive-thriftserver. Please ignore the failed tests in
hive-thriftserver.

The second PR contains the complete
changes.



I have created a Spark distribution for Apache Hadoop 2.7; you can
download it via Google Drive or Baidu Pan.

Please help review and test. Thanks.


run docker-integration-tests in jenkins

2018-09-23 Thread Yuming Wang
Hi Shane,

Can we run docker-integration-tests in Jenkins? See the discussion here.

Looking forward to hearing your feedback,
Thanks.


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-17 Thread yuming wang
Yes, it doesn’t work in 2.1.0 and 2.1.1. I created a PR for this:
https://github.com/apache/spark/pull/19259.


> On Sep 17, 2017 at 16:14, Sean Owen wrote:
> 
> So, didn't work in 2.1.0 or 2.1.1? If it's not a regression and not critical, 
> it shouldn't block a release. It seems like this can only affect Docker 
> and/or Oracle JDBC? Well, if we need to roll another release anyway, seems OK.
> 
> On Sun, Sep 17, 2017 at 6:06 AM Xiao Li  > wrote:
> This is a bug introduced in 2.1. It works fine in 2.0
> 
> 2017-09-16 16:15 GMT-07:00 Holden Karau  >:
> Ok :) Was this working in 2.1.1?
> 
> On Sat, Sep 16, 2017 at 3:59 PM Xiao Li  > wrote:
> Still -1
> 
> Unable to pass the tests in my local environment. Open a JIRA 
> https://issues.apache.org/jira/browse/SPARK-22041 
> 
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
> 
>   types.apply(9).equals(org.apache.spark.sql.types.DateType) was false 
> (OracleIntegrationSuite.scala:158)
> 
> Xiao
> 
> 
> 2017-09-15 17:35 GMT-07:00 Ryan Blue  >:
> -1 (with my Apache member hat on, non-binding)
> 
> I'll continue discussion in the other thread, but I don't think we should 
> share signing keys.
> 
> On Fri, Sep 15, 2017 at 5:14 PM, Holden Karau  > wrote:
> Indeed it's limited to a people with login permissions on the Jenkins host 
> (and perhaps further limited, I'm not certain). Shane probably knows more 
> about the ACLs, so I'll ask him in the other thread for specifics.
> 
> This is maybe branching a bit from the question of the current RC though, so 
> I'd suggest we continue this discussion on the thread Sean Owen made.
> 
> On Fri, Sep 15, 2017 at 4:04 PM Ryan Blue  > wrote:
> I'm not familiar with the release procedure, can you send a link to this 
> Jenkins job? Can anyone run this job, or is it limited to committers?
> 
> rb
> 
> On Fri, Sep 15, 2017 at 12:28 PM, Holden Karau  > wrote:
> That's a good question, I built the release candidate however the Jenkins 
> scripts don't take a parameter for configuring who signs them rather it 
> always signs them with Patrick's key. You can see this from previous releases 
> which were managed by other folks but still signed by Patrick.
> 
> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue  > wrote:
> The signature is valid, but why was the release signed with Patrick Wendell's 
> private key? Did Patrick build the release candidate?
> 
> rb
> 
> On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee  > wrote:
> +1 (non-binding)
> 
> On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung  > wrote:
> +1 tested SparkR package on Windows, r-hub, Ubuntu.
> 
> _
> From: Sean Owen >
> Sent: Thursday, September 14, 2017 3:12 PM
> Subject: Re: [VOTE] Spark 2.1.2 (RC1)
> To: Holden Karau >, 
> >
> 
> 
> 
> +1
> Very nice. The sigs and hashes look fine, it builds fine for me on Debian 
> Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes tests. 
> 
> Yes as you say, no outstanding issues except for this which doesn't look 
> critical, as it's not a regression.
> 
> SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs
> 
> 
> On Thu, Sep 14, 2017 at 7:47 PM Holden Karau  > wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 2.1.2. The vote is open until Friday September 22nd at 18:00 PST and passes 
> if a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 2.1.2
> [ ] -1 Do not release this package because ...
> 
> 
> To learn more about Apache Spark, please see https://spark.apache.org/ 
> 
> 
> The tag to be voted on is v2.1.2-rc1
> (6f470323a0363656999dd36cb33f528afe627c12)
> 
> List of JIRA tickets resolved in this release can be found with this filter. 
> 
> 
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/ 
> 

Is there something wrong with jenkins?

2017-06-26 Thread Yuming Wang
Hi All,

Is there something wrong with jenkins?


# To activate this environment, use:
# $ source activate /tmp/tmp.tWAUGnH6wZ/3.5
#
# To deactivate this environment, use:
# $ source deactivate
#
discarding /home/anaconda/bin from PATH
prepending /tmp/tmp.tWAUGnH6wZ/3.5/bin to PATH
Fetching package metadata: ..SSL verification error: hostname
'conda.binstar.org' doesn't match either of 'anaconda.com',
'anacondacloud.com', 'anacondacloud.org', 'binstar.org', 'wakari.io'
.SSL verification error: hostname 'conda.binstar.org' doesn't match
either of 'anaconda.com', 'anacondacloud.com', 'anacondacloud.org',
'binstar.org', 'wakari.io'
...
Solving package specifications: .
Error:  Package missing in current linux-64 channels:
  - pyarrow 0.4|0.4.0*

You can search for this package on anaconda.org with

anaconda search -t conda pyarrow 0.4|0.4.0*

You may need to install the anaconda-client command line client with

conda install anaconda-client
Cleaning up temporary directory - /tmp/tmp.tWAUGnH6wZ
[error] running
/home/jenkins/workspace/SparkPullRequestBuilder/dev/run-pip-tests ;
received return code 1
Attempting to post to Github...
 > Post successful.
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
Test FAILed.
Refer to this link for build results (access rights to CI server
needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78627/
Test FAILed.
Finished: FAILURE


more logs: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78627/console



Thanks!


Remove old Hive support

2017-04-01 Thread Yuming Wang
Do we have a plan to remove old Hive support?


Re: Request for comments: Java 7 removal

2017-02-14 Thread Yuming Wang
There is a way for only Spark to use Java 8 while Hadoop still uses Java 7
(see the attached screenshot, spark-conf.jpg).
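
For reference, a minimal sketch of that kind of setup on YARN (the JDK path
is an example, and this assumes a Java 8 JDK is installed on every node;
these properties point just the Spark containers at Java 8 while the cluster
default stays on Java 7):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.yarn.appMasterEnv.JAVA_HOME", "/usr/java/jdk1.8.0_121") // application master
    .set("spark.executorEnv.JAVA_HOME", "/usr/java/jdk1.8.0_121")      // executors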




By the way, I have a way to install any Spark version on CM5.4 - CM5.7 via a
custom CSD and a custom Spark parcel.

On Wed, Feb 15, 2017 at 6:46 AM, Koert Kuipers  wrote:

> what about the conversation about dropping scala 2.10?
>
> On Fri, Feb 10, 2017 at 11:47 AM, Sean Owen  wrote:
>
>> As you have seen, there's a WIP PR to implement removal of Java 7
>> support: https://github.com/apache/spark/pull/16871
>>
>> I have heard several +1s at
>> https://issues.apache.org/jira/browse/SPARK-19493
>> but am asking for concerns too, now that there's
>> a concrete change to review.
>>
>> If this goes in for 2.2 it can be followed by more extensive update of
>> the Java code to take advantage of Java 8; this is more or less the
>> baseline change.
>>
>> We also just removed Hadoop 2.5 support. I know there was talk about
>> removing Python 2.6. I have no opinion on that myself, but, might be time
>> to revive that conversation too.
>>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-16 Thread Yuming Wang
I hope https://github.com/apache/spark/pull/16252 can be merged before the
2.1.0 release. It's a fix for the case where a broadcast cannot fit in memory.

On Sat, Dec 17, 2016 at 10:23 AM, Joseph Bradley 
wrote:

> +1
>
> On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
>> +1
>>
>> On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li  wrote:
>>
>>> +1
>>>
>>> Xiao Li
>>>
>>> 2016-12-16 12:19 GMT-08:00 Felix Cheung :
>>>
 For R we have a license field in the DESCRIPTION, and this is standard
 practice (and requirement) for R packages.

 https://cran.r-project.org/doc/manuals/R-exts.html#Licensing

 --
 *From:* Sean Owen 
 *Sent:* Friday, December 16, 2016 9:57:15 AM
 *To:* Reynold Xin; dev@spark.apache.org
 *Subject:* Re: [VOTE] Apache Spark 2.1.0 (RC5)

 (If you have a template for these emails, maybe update it to use https
 links. They work for apache.org domains. After all we are asking
 people to verify the integrity of release artifacts, so it might as well be
 secure.)

 (Also the new archives use .tar.gz instead of .tgz like the others. No
 big deal, my OCD eye just noticed it.)

 I don't see an Apache license / notice for the Pyspark or SparkR
 artifacts. It would be good practice to include this in a convenience
 binary. I'm not sure if it's strictly mandatory, but something to adjust in
 any event. I think that's all there is to do for SparkR. For Pyspark, which
 packages a bunch of dependencies, it does include the licenses (good) but I
 think it should include the NOTICE file.

 This is the first time I recall getting 0 test failures off the bat!
 I'm using Java 8 / Ubuntu 16 and yarn/hive/hadoop-2.7 profiles.

 I think I'd +1 this therefore unless someone knows that the license
 issue above is real and a blocker.

 On Fri, Dec 16, 2016 at 5:17 AM Reynold Xin 
 wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT
> and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.0-rc5
> (cd0a08361e2526519e7c131c42116bf56fa62c76)
>
> List of JIRA tickets resolved are:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1223/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.1.0?*
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should be
> worked on immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>
> *What happened to RC3/RC4?*
>
> They had issues with the release packaging and as a result were skipped.
>
>
>>>
>>
>>
>> --
>>
>> Herman van Hövell
>>
>> Software Engineer
>>
>> Databricks Inc.
>>
>> hvanhov...@databricks.com
>>
>> +31 6 420 590 27
>>
>> databricks.com
>>
>>
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>


Apply patch to 1.2.1.spark2

2016-07-27 Thread yuming wang
Hi,

Is there a way to apply this patch to 1.2.1.spark2?


ml.feature.Word2Vec.transform() very slow issue

2015-11-09 Thread Yuming Wang
Hi



I found org.apache.spark.ml.feature.Word2Vec.transform() very slow.

I think we should not read the broadcast variable for every sentence, so I
fixed it on my fork.



https://github.com/979969786/spark/commit/a9f894df3671bb8df2f342de1820dab3185598f3
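
The change reduces to a generic broadcast-hoisting pattern (a sketch with
illustrative names, not the actual Word2Vec internals): dereference the
broadcast once per partition instead of once per record.

  val bcVocab = sc.broadcast(Map("spark" -> 1.0f))  // sc: SparkContext
  val rdd = sc.parallelize(Seq("spark", "hive"))

  // Slow: bcVocab.value is re-read inside the per-record closure.
  val slow = rdd.map(word => bcVocab.value.getOrElse(word, 0.0f))

  // Fast: read the broadcast once per partition, then reuse the local reference.
  val fast = rdd.mapPartitions { iter =>
    val vocab = bcVocab.value
    iter.map(word => vocab.getOrElse(word, 0.0f))
  }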



I tested both versions on the same data. The original version took *5
minutes*, and my version took just *22 seconds*.




If I'm right, I will open a pull request.



Thanks