Re: A scene with unstable Spark performance

2022-05-17 Thread Bowen Song
Hi,

Spark dynamic resource allocation cannot solve my problem, because the 
resources of the production environment are limited. Under that constraint, I 
want to reserve resources so that the tasks of each group of jobs can still be 
scheduled in time.

Thank you,
Bowen Song


From: Qian SUN 
Sent: Wednesday, May 18, 2022 9:32
To: Bowen Song 
Cc: user.spark 
Subject: Re: A scene with unstable Spark performance

Hi. I think you need Spark dynamic resource allocation. Please refer to 
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation.
And if you use Spark SQL, AQE may also help:
https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution
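
For context, a minimal sketch of turning both on when building the session
(the executor counts are illustrative assumptions, and shuffle tracking is only
needed where there is no external shuffle service, e.g. on Kubernetes):

import org.apache.spark.sql.SparkSession

// Sketch: enable dynamic resource allocation plus adaptive query execution.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")   // illustrative floor
  .config("spark.dynamicAllocation.maxExecutors", "20")  // illustrative ceiling
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .config("spark.sql.adaptive.enabled", "true")          // AQE for Spark SQL
  .getOrCreate()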

Bowen Song <bowen.s...@kyligence.io> wrote on Tue, May 17, 2022 at 22:33:

Hi all,



I find Spark performance unstable in this scenario: we divided our jobs into 
two groups according to completion time. One group of jobs has an execution 
time of less than 10s, and the other group has an execution time of 10s to 
300s. The reason for the difference is that the latter group scans more files, 
so it produces more tasks. When the two groups of jobs were submitted to Spark 
together, I found that, due to resource competition, the slower jobs made the 
originally fast jobs take longer to return results, which shows up as unstable 
Spark performance. The problem I want to solve is: can we reserve a certain 
amount of resources for each of the two groups, so that the fast jobs are 
scheduled in time, while the slow jobs are not starved because all resources 
are allocated to the fast jobs?



In this context, I need to group Spark jobs so that tasks from each group are 
scheduled on that group's reserved resources. At the beginning of each 
scheduling round, a group's own tasks would be scheduled first; only when the 
group has no tasks left to schedule would its resources be lent to other 
groups, to avoid leaving resources idle.



For resource utilization and to avoid the overhead of managing multiple 
clusters, I would like the jobs to share one Spark cluster rather than create a 
private cluster per group.



I've read the code of the Spark Fair Scheduler, and the implementation doesn't 
seem to meet the need to reserve resources for different groups of jobs.



Is there a workaround that solves this problem with the Spark Fair Scheduler? 
If not, would you consider adding a mechanism like capacity scheduling?
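
For reference, the closest existing mechanism is Fair Scheduler pools with 
minShare and weight, although minShare is a soft scheduling preference rather 
than a hard reservation. A minimal sketch, assuming an existing SparkSession 
`spark`; the pool names, minShare values, and the runFastJob/runSlowJob helpers 
are illustrative assumptions:

// Assumes spark.scheduler.mode=FAIR and spark.scheduler.allocation.file pointing
// at a fairscheduler.xml that declares, e.g., pools "fast" (minShare=4) and
// "slow" (minShare=8).
val sc = spark.sparkContext

sc.setLocalProperty("spark.scheduler.pool", "fast") // jobs from this thread go to the "fast" pool
runFastJob(spark)                                   // hypothetical helper running the short jobs

sc.setLocalProperty("spark.scheduler.pool", "slow") // switch this thread to the "slow" pool
runSlowJob(spark)                                   // hypothetical helper running the long jobs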



Thank you,

Bowen Song


--
Best!
Qian SUN


Unable to create view due to up cast error when migrating from Hive to Spark

2022-05-17 Thread beliefer
During the migration from Hive to Spark, there was a problem when a view 
created in Hive was used in Spark SQL.
The original Hive SQL is shown below:


CREATE VIEW myView AS
SELECT
  CASE WHEN age > 12 THEN CAST(gender * 0.3 - 0.1 AS double) END AS TT,
  gender,
  age
FROM myTable;

Users query the view with Spark SQL, but encounter an up-cast error. The error 
message is as follows:

Cannot up cast TT from decimal(13, 1) to double.

The type path of the target object is:



You can either add an explicit cast to the input data or choose a higher 
precision type of the field in the target object

How should we solve this problem?
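
One possible workaround, following the hint in the error message, is to 
recreate the view with the cast applied to the whole CASE expression so that 
TT is typed as double from the start. A sketch only, assuming it is acceptable 
to replace the view and that TT is meant to be double:

// Sketch: cast the whole CASE expression instead of only the THEN branch.
spark.sql("""
  CREATE OR REPLACE VIEW myView AS
  SELECT
    CAST(CASE WHEN age > 12 THEN gender * 0.3 - 0.1 END AS double) AS TT,
    gender,
    age
  FROM myTable
""")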

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
Yes, it should be possible. Any interest in working on this together? We need
more hands to add more features here :)

On Tue, May 17, 2022 at 2:06 PM Holden Karau  wrote:

> Could we make it do the same sort of history server fallback approach?
>
> On Tue, May 17, 2022 at 10:41 PM bo yang  wrote:
>
>> It is like Web Application Proxy in YARN (
>> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html),
>> to provide easy access for Spark UI when the Spark application is running.
>>
>> When running Spark on Kubernetes with S3, there is no YARN. The reverse
>> proxy here is to behave like that Web Application Proxy. It will
>> simplify settings to access Spark UI on Kubernetes.
>>
>>
>> On Mon, May 16, 2022 at 11:46 PM wilson  wrote:
>>
>>> what's the advantage of using reverse proxy for spark UI?
>>>
>>> Thanks
>>>
>>> On Tue, May 17, 2022 at 1:47 PM bo yang  wrote:
>>>
 Hi Spark Folks,

 I built a web reverse proxy to access Spark UI on Kubernetes (working
 together with
 https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
 share here in case other people have similar need.

 The reverse proxy code is here:
 https://github.com/datapunchorg/spark-ui-reverse-proxy

 Let me know if anyone wants to use or would like to contribute.

 Thanks,
 Bo

 --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Data Engineering Track at ApacheCon (October 3-6, New Orleans) - CFP ends 23/05

2022-05-17 Thread Pasha Finkelshtein
Hi Ismaël,

Looks like I do: https://github.com/JetBrains/kotlin-spark-api :)

Regards,
Pasha


Wed, May 18, 2022, 01:18 Ismaël Mejía:

> Hello Pasha,
>
> This is not only for Apache project maintainers, if you contribute or
> maintain other tool that integrates with an existing Apache project to
> do or improve common Data Engineering tasks it can definitely fit.
>
> Regards,
> Ismaël
>
> On Tue, May 17, 2022 at 11:23 PM Pasha Finkelshtein
>  wrote:
> >
> > Hi Ismaël,
> >
> > Thank you, it's interesting. Is this message relevant only to
> maintainers/contributors of top-level Apache projects or works for other
> maintainers of Apache-licensed software too?
> >
> > Regards,
> > Pasha
> >
> > Wed, May 18, 2022, 00:05 Ismaël Mejía:
> >>
> >> Hello,
> >>
> >> ApacheCon North America is back in person this year in October.
> >> https://apachecon.com/acna2022/
> >>
> >> Together with Jarek Potiuk, we are organizing for the first time a Data
> >> Engineering Track as part of ApacheCon.
> >>
> >> You might be wondering why a different track if we already have the Big
> Data
> >> track. Simple, this new track covers the ‘other’ open-source projects
> we use to
> >> clean data, orchestrate workloads, do observability, visualization,
> governance,
> >> data lineage and many other tasks that are part of data engineering and
> that are
> >> usually not covered by the data processing / database tracks.
> >>
> >> If you are curious you can find more details here:
> >> https://s.apache.org/apacheconna-2022-dataeng-track
> >>
> >> So why are you getting this message? Well it could be that (1) you are
> >> already a contributor to a project in the data engineering space and you
> >> might be interested in sending your proposal, or (2) you are interested
> in
> >> integrations of these tools with your existing data tools/projects.
> >>
> >> If you are interested you can submit a proposal using the CfP link
> below.
> >> Don’t forget to choose the Data Engineering Track.
> >> https://apachecon.com/acna2022/cfp.html
> >>
> >> The Call for Presentations (CfP) closes in less than one week on May
> 23th,
> >> 2022.
> >>
> >> We are looking forward to receiving your submissions and hopefully
> seeing you in
> >> New Orleans in October.
> >>
> >> Thanks,
> >> Ismaël and Jarek
> >>
> >> ps. Excuses if you already received this email by a different channel/ML
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>


Re: Data Engineering Track at ApacheCon (October 3-6, New Orleans) - CFP ends 23/05

2022-05-17 Thread Ismaël Mejía
Hello Pasha,

This is not only for Apache project maintainers; if you contribute to or
maintain another tool that integrates with an existing Apache project to
do or improve common data engineering tasks, it can definitely fit.

Regards,
Ismaël

On Tue, May 17, 2022 at 11:23 PM Pasha Finkelshtein
 wrote:
>
> Hi Ismaël,
>
> Thank you, it's interesting. Is this message relevant only to 
> maintainers/contributors of top-level Apache projects or works for other 
> maintainers of Apache-licensed software too?
>
> Regards,
> Pasha
>
> Wed, May 18, 2022, 00:05 Ismaël Mejía:
>>
>> Hello,
>>
>> ApacheCon North America is back in person this year in October.
>> https://apachecon.com/acna2022/
>>
>> Together with Jarek Potiuk, we are organizing for the first time a Data
>> Engineering Track as part of ApacheCon.
>>
>> You might be wondering why a different track if we already have the Big Data
>> track. Simple, this new track covers the ‘other’ open-source projects we use 
>> to
>> clean data, orchestrate workloads, do observability, visualization, 
>> governance,
>> data lineage and many other tasks that are part of data engineering and that 
>> are
>> usually not covered by the data processing / database tracks.
>>
>> If you are curious you can find more details here:
>> https://s.apache.org/apacheconna-2022-dataeng-track
>>
>> So why are you getting this message? Well it could be that (1) you are
>> already a contributor to a project in the data engineering space and you
>> might be interested in sending your proposal, or (2) you are interested in
>> integrations of these tools with your existing data tools/projects.
>>
>> If you are interested you can submit a proposal using the CfP link below.
>> Don’t forget to choose the Data Engineering Track.
>> https://apachecon.com/acna2022/cfp.html
>>
>> The Call for Presentations (CfP) closes in less than one week on May 23th,
>> 2022.
>>
>> We are looking forward to receiving your submissions and hopefully seeing 
>> you in
>> New Orleans in October.
>>
>> Thanks,
>> Ismaël and Jarek
>>
>> ps. Excuses if you already received this email by a different channel/ML
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Data Engineering Track at ApacheCon (October 3-6, New Orleans) - CFP ends 23/05

2022-05-17 Thread Pasha Finkelshtein
Hi Ismaël,

Thank you, it's interesting. Is this message relevant only to
maintainers/contributors of top-level Apache projects, or does it also apply to
other maintainers of Apache-licensed software?

Regards,
Pasha

Wed, May 18, 2022, 00:05 Ismaël Mejía:

> Hello,
>
> ApacheCon North America is back in person this year in October.
> https://apachecon.com/acna2022/
>
> Together with Jarek Potiuk, we are organizing for the first time a Data
> Engineering Track as part of ApacheCon.
>
> You might be wondering why a different track if we already have the Big
> Data
> track. Simple, this new track covers the ‘other’ open-source projects we
> use to
> clean data, orchestrate workloads, do observability, visualization,
> governance,
> data lineage and many other tasks that are part of data engineering and
> that are
> usually not covered by the data processing / database tracks.
>
> If you are curious you can find more details here:
> https://s.apache.org/apacheconna-2022-dataeng-track
>
> So why are you getting this message? Well it could be that (1) you are
> already a contributor to a project in the data engineering space and you
> might be interested in sending your proposal, or (2) you are interested in
> integrations of these tools with your existing data tools/projects.
>
> If you are interested you can submit a proposal using the CfP link below.
> Don’t forget to choose the Data Engineering Track.
> https://apachecon.com/acna2022/cfp.html
>
> The Call for Presentations (CfP) closes in less than one week on May 23th,
> 2022.
>
> We are looking forward to receiving your submissions and hopefully seeing
> you in
> New Orleans in October.
>
> Thanks,
> Ismaël and Jarek
>
> ps. Excuses if you already received this email by a different channel/ML
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread Holden Karau
Could we make it do the same sort of history server fallback approach?

On Tue, May 17, 2022 at 10:41 PM bo yang  wrote:

> It is like Web Application Proxy in YARN (
> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html),
> to provide easy access for Spark UI when the Spark application is running.
>
> When running Spark on Kubernetes with S3, there is no YARN. The reverse
> proxy here is to behave like that Web Application Proxy. It will
> simplify settings to access Spark UI on Kubernetes.
>
>
> On Mon, May 16, 2022 at 11:46 PM wilson  wrote:
>
>> what's the advantage of using reverse proxy for spark UI?
>>
>> Thanks
>>
>> On Tue, May 17, 2022 at 1:47 PM bo yang  wrote:
>>
>>> Hi Spark Folks,
>>>
>>> I built a web reverse proxy to access Spark UI on Kubernetes (working
>>> together with
>>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
>>> share here in case other people have similar need.
>>>
>>> The reverse proxy code is here:
>>> https://github.com/datapunchorg/spark-ui-reverse-proxy
>>>
>>> Let me know if anyone wants to use or would like to contribute.
>>>
>>> Thanks,
>>> Bo
>>>
>>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Data Engineering Track at ApacheCon (October 3-6, New Orleans) - CFP ends 23/05

2022-05-17 Thread Ismaël Mejía
Hello,

ApacheCon North America is back in person this year in October.
https://apachecon.com/acna2022/

Together with Jarek Potiuk, we are organizing for the first time a Data
Engineering Track as part of ApacheCon.

You might be wondering why a different track if we already have the Big Data
track. Simple, this new track covers the ‘other’ open-source projects we use to
clean data, orchestrate workloads, do observability, visualization, governance,
data lineage and many other tasks that are part of data engineering and that are
usually not covered by the data processing / database tracks.

If you are curious you can find more details here:
https://s.apache.org/apacheconna-2022-dataeng-track

So why are you getting this message? Well it could be that (1) you are
already a contributor to a project in the data engineering space and you
might be interested in sending your proposal, or (2) you are interested in
integrations of these tools with your existing data tools/projects.

If you are interested you can submit a proposal using the CfP link below.
Don’t forget to choose the Data Engineering Track.
https://apachecon.com/acna2022/cfp.html

The Call for Presentations (CfP) closes in less than one week, on May 23rd,
2022.

We are looking forward to receiving your submissions and hopefully seeing you in
New Orleans in October.

Thanks,
Ismaël and Jarek

ps. Apologies if you already received this email via a different channel/ML

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
It is like the Web Application Proxy in YARN (
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html),
providing easy access to the Spark UI while the Spark application is running.

When running Spark on Kubernetes with S3, there is no YARN. The reverse
proxy here behaves like that Web Application Proxy and simplifies the setup
needed to access the Spark UI on Kubernetes.


On Mon, May 16, 2022 at 11:46 PM wilson  wrote:

> what's the advantage of using reverse proxy for spark UI?
>
> Thanks
>
> On Tue, May 17, 2022 at 1:47 PM bo yang  wrote:
>
>> Hi Spark Folks,
>>
>> I built a web reverse proxy to access Spark UI on Kubernetes (working
>> together with
>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
>> share here in case other people have similar need.
>>
>> The reverse proxy code is here:
>> https://github.com/datapunchorg/spark-ui-reverse-proxy
>>
>> Let me know if anyone wants to use or would like to contribute.
>>
>> Thanks,
>> Bo
>>
>>


Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
Thanks Holden :)

On Mon, May 16, 2022 at 11:12 PM Holden Karau  wrote:

> Oh that’s rad 
>
> On Tue, May 17, 2022 at 7:47 AM bo yang  wrote:
>
>> Hi Spark Folks,
>>
>> I built a web reverse proxy to access Spark UI on Kubernetes (working
>> together with
>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to
>> share here in case other people have similar need.
>>
>> The reverse proxy code is here:
>> https://github.com/datapunchorg/spark-ui-reverse-proxy
>>
>> Let me know if anyone wants to use or would like to contribute.
>>
>> Thanks,
>> Bo
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Re: Re: Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread Hyukjin Kwon
There might be other blockers. Let's wait and see.

On Tue, May 17, 2022 at 8:59 PM beliefer  wrote:

> OK. let it into 3.3.1
>
>
> On 2022-05-17 18:59:13, "Hyukjin Kwon" wrote:
>
> I think most users won't be affected since aggregate pushdown is disabled
> by default.
>
> On Tue, 17 May 2022 at 19:53, beliefer  wrote:
>
>> If we not contains https://github.com/apache/spark/pull/36556, we will
>> break change when we merge it into 3.3.1
>>
>> At 2022-05-17 18:26:12, "Hyukjin Kwon"  wrote:
>>
>> We need add https://github.com/apache/spark/pull/36556 to RC2.
>>
>> We will likely have to change the version being added if RC2 passes.
>> Since this is a new API/improvement, I would prefer to not block the
>> release by that.
>>
>> On Tue, 17 May 2022 at 19:19, beliefer  wrote:
>>
>>> We need add https://github.com/apache/spark/pull/36556 to RC2.
>>>
>>>
>>> On 2022-05-17 17:37:13, "Hyukjin Kwon" wrote:
>>>
>>> That seems like a test-only issue. I made a quick followup at
>>> https://github.com/apache/spark/pull/36576.
>>>
>>> On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:
>>>
 I'm still seeing failures related to the function registry, like:

 ExpressionsSchemaSuite:
 - Check schemas for expression examples *** FAILED ***
   396 did not equal 398 Expected 396 blocks in result file but got 398.
 Try regenerating the result files. (ExpressionsSchemaSuite.scala:161)

 - SPARK-14415: All functions should have own descriptions *** FAILED ***
   "Function: bloom_filter_aggClass:
 org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage:
 N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in
 the result) (QueryTest.scala:54)

 There seems to be consistently a difference of 2 in the list of
 expected functions and actual. I haven't looked closely, don't know this
 code. I'm on Ubuntu 22.04. Anyone else seeing something like this?
 Wondering if it's something weird to do with case sensitivity, hidden files
 lurking somewhere, etc.

 I suspect it's not a 'real' error as the Linux-based testers work fine,
 but I also can't think of why this is failing.



 On Mon, May 16, 2022 at 7:44 AM Maxim Gekk
  wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time May 19th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc2 (commit
> c8c657b922ac8fd8dcf9553113e11a80079db059):
> https://github.com/apache/spark/tree/v3.3.0-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1403
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following
> URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc2.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> 

Re:Re: Re: Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread beliefer
OK, let it go into 3.3.1.




On 2022-05-17 18:59:13, "Hyukjin Kwon" wrote:

I think most users won't be affected since aggregate pushdown is disabled by 
default.


On Tue, 17 May 2022 at 19:53, beliefer  wrote:


If we not contains https://github.com/apache/spark/pull/36556, we will break 
change when we merge it into 3.3.1

At 2022-05-17 18:26:12, "Hyukjin Kwon"  wrote:

We need add https://github.com/apache/spark/pull/36556 to RC2.

We will likely have to change the version being added if RC2 passes.
Since this is a new API/improvement, I would prefer to not block the release by 
that.



On Tue, 17 May 2022 at 19:19, beliefer  wrote:

We need add https://github.com/apache/spark/pull/36556 to RC2.




On 2022-05-17 17:37:13, "Hyukjin Kwon" wrote:

That seems like a test-only issue. I made a quick followup at 
https://github.com/apache/spark/pull/36576.


On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:

I'm still seeing failures related to the function registry, like:


ExpressionsSchemaSuite:
- Check schemas for expression examples *** FAILED ***
  396 did not equal 398 Expected 396 blocks in result file but got 398. Try 
regenerating the result files. (ExpressionsSchemaSuite.scala:161)

- SPARK-14415: All functions should have own descriptions *** FAILED ***
  "Function: bloom_filter_aggClass: 
org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage: 
N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in the 
result) (QueryTest.scala:54)



There seems to be consistently a difference of 2 in the list of expected 
functions and actual. I haven't looked closely, don't know this code. I'm on 
Ubuntu 22.04. Anyone else seeing something like this? Wondering if it's 
something weird to do with case sensitivity, hidden files lurking somewhere, 
etc.


I suspect it's not a 'real' error as the Linux-based testers work fine, but I 
also can't think of why this is failing.






On Mon, May 16, 2022 at 7:44 AM Maxim Gekk  
wrote:

Please vote on releasing the following candidate as Apache Spark version 3.3.0.



The vote is open until 11:59pm Pacific time May 19th and passes if a majority 
+1 PMC votes are cast, with a minimum of 3 +1 votes.



[ ] +1 Release this package as Apache Spark 3.3.0

[ ] -1 Do not release this package because ...



To learn more about Apache Spark, please see http://spark.apache.org/



The tag to be voted on is v3.3.0-rc2 (commit 
c8c657b922ac8fd8dcf9553113e11a80079db059):

https://github.com/apache/spark/tree/v3.3.0-rc2



The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/



Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS



The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1403



The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/



The list of bug fixes going into 3.3.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12350369


This release is using the release script of the tag v3.3.0-rc2.





FAQ



=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.



If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your projects resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with a out of date RC going forward).
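
For example, with sbt the staging repository can be added as an extra resolver
(a sketch; the artifact version is assumed to be 3.3.0 as staged for this RC):

// build.sbt sketch: resolve the RC artifacts from the staging repository listed above.
resolvers += "Apache Spark 3.3.0 RC2 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1403"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0"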



===

What should happen to JIRA tickets still targeting 3.3.0?

===

The current list of open tickets targeted at 3.3.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.3.0



Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.



==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.


Maxim Gekk


Software Engineer

Databricks, Inc.





 





 

Re: Re: Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread Hyukjin Kwon
And it seems like it won't break anything, because adding a new method won't
break binary compatibility.

On Tue, 17 May 2022 at 19:59, Hyukjin Kwon  wrote:

> I think most users won't be affected since aggregate pushdown is disabled
> by default.
>
> On Tue, 17 May 2022 at 19:53, beliefer  wrote:
>
>> If we not contains https://github.com/apache/spark/pull/36556, we will
>> break change when we merge it into 3.3.1
>>
>> At 2022-05-17 18:26:12, "Hyukjin Kwon"  wrote:
>>
>> We need add https://github.com/apache/spark/pull/36556 to RC2.
>>
>> We will likely have to change the version being added if RC2 passes.
>> Since this is a new API/improvement, I would prefer to not block the
>> release by that.
>>
>> On Tue, 17 May 2022 at 19:19, beliefer  wrote:
>>
>>> We need add https://github.com/apache/spark/pull/36556 to RC2.
>>>
>>>
>>> On 2022-05-17 17:37:13, "Hyukjin Kwon" wrote:
>>>
>>> That seems like a test-only issue. I made a quick followup at
>>> https://github.com/apache/spark/pull/36576.
>>>
>>> On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:
>>>
 I'm still seeing failures related to the function registry, like:

 ExpressionsSchemaSuite:
 - Check schemas for expression examples *** FAILED ***
   396 did not equal 398 Expected 396 blocks in result file but got 398.
 Try regenerating the result files. (ExpressionsSchemaSuite.scala:161)

 - SPARK-14415: All functions should have own descriptions *** FAILED ***
   "Function: bloom_filter_aggClass:
 org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage:
 N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in
 the result) (QueryTest.scala:54)

 There seems to be consistently a difference of 2 in the list of
 expected functions and actual. I haven't looked closely, don't know this
 code. I'm on Ubuntu 22.04. Anyone else seeing something like this?
 Wondering if it's something weird to do with case sensitivity, hidden files
 lurking somewhere, etc.

 I suspect it's not a 'real' error as the Linux-based testers work fine,
 but I also can't think of why this is failing.



 On Mon, May 16, 2022 at 7:44 AM Maxim Gekk
  wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time May 19th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc2 (commit
> c8c657b922ac8fd8dcf9553113e11a80079db059):
> https://github.com/apache/spark/tree/v3.3.0-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1403
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following
> URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc2.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't 

Re: Re: Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread Hyukjin Kwon
I think most users won't be affected since aggregate pushdown is disabled
by default.

On Tue, 17 May 2022 at 19:53, beliefer  wrote:

> If we not contains https://github.com/apache/spark/pull/36556, we will
> break change when we merge it into 3.3.1
>
> At 2022-05-17 18:26:12, "Hyukjin Kwon"  wrote:
>
> We need add https://github.com/apache/spark/pull/36556 to RC2.
>
> We will likely have to change the version being added if RC2 passes.
> Since this is a new API/improvement, I would prefer to not block the
> release by that.
>
> On Tue, 17 May 2022 at 19:19, beliefer  wrote:
>
>> We need add https://github.com/apache/spark/pull/36556 to RC2.
>>
>>
>> On 2022-05-17 17:37:13, "Hyukjin Kwon" wrote:
>>
>> That seems like a test-only issue. I made a quick followup at
>> https://github.com/apache/spark/pull/36576.
>>
>> On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:
>>
>>> I'm still seeing failures related to the function registry, like:
>>>
>>> ExpressionsSchemaSuite:
>>> - Check schemas for expression examples *** FAILED ***
>>>   396 did not equal 398 Expected 396 blocks in result file but got 398.
>>> Try regenerating the result files. (ExpressionsSchemaSuite.scala:161)
>>>
>>> - SPARK-14415: All functions should have own descriptions *** FAILED ***
>>>   "Function: bloom_filter_aggClass:
>>> org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage:
>>> N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in
>>> the result) (QueryTest.scala:54)
>>>
>>> There seems to be consistently a difference of 2 in the list of expected
>>> functions and actual. I haven't looked closely, don't know this code. I'm
>>> on Ubuntu 22.04. Anyone else seeing something like this? Wondering if it's
>>> something weird to do with case sensitivity, hidden files lurking
>>> somewhere, etc.
>>>
>>> I suspect it's not a 'real' error as the Linux-based testers work fine,
>>> but I also can't think of why this is failing.
>>>
>>>
>>>
>>> On Mon, May 16, 2022 at 7:44 AM Maxim Gekk
>>>  wrote:
>>>
 Please vote on releasing the following candidate as
 Apache Spark version 3.3.0.

 The vote is open until 11:59pm Pacific time May 19th and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.3.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v3.3.0-rc2 (commit
 c8c657b922ac8fd8dcf9553113e11a80079db059):
 https://github.com/apache/spark/tree/v3.3.0-rc2

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1403

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/

 The list of bug fixes going into 3.3.0 can be found at the following
 URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12350369

 This release is using the release script of the tag v3.3.0-rc2.


 FAQ

 =
 How can I help test this release?
 =
 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks, in the Java/Scala
 you can add the staging repository to your projects resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with a out of date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 3.3.0?
 ===
 The current list of open tickets targeted at 3.3.0 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.3.0

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on immediately. Everything else please retarget to an
 appropriate release.

 ==
 But my bug isn't fixed?
 ==
 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from the previous
 release. That being said, if there is something which is a regression
 that has not 

Re:Re: Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread beliefer
If we do not include https://github.com/apache/spark/pull/36556, we will 
introduce a breaking change when we merge it into 3.3.1.

At 2022-05-17 18:26:12, "Hyukjin Kwon"  wrote:

We need add https://github.com/apache/spark/pull/36556 to RC2.

We will likely have to change the version being added if RC2 passes.
Since this is a new API/improvement, I would prefer to not block the release by 
that.



On Tue, 17 May 2022 at 19:19, beliefer  wrote:

We need add https://github.com/apache/spark/pull/36556 to RC2.




On 2022-05-17 17:37:13, "Hyukjin Kwon" wrote:

That seems like a test-only issue. I made a quick followup at 
https://github.com/apache/spark/pull/36576.


On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:

I'm still seeing failures related to the function registry, like:


ExpressionsSchemaSuite:
- Check schemas for expression examples *** FAILED ***
  396 did not equal 398 Expected 396 blocks in result file but got 398. Try 
regenerating the result files. (ExpressionsSchemaSuite.scala:161)

- SPARK-14415: All functions should have own descriptions *** FAILED ***
  "Function: bloom_filter_aggClass: 
org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage: 
N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in the 
result) (QueryTest.scala:54)



There seems to be consistently a difference of 2 in the list of expected 
functions and actual. I haven't looked closely, don't know this code. I'm on 
Ubuntu 22.04. Anyone else seeing something like this? Wondering if it's 
something weird to do with case sensitivity, hidden files lurking somewhere, 
etc.


I suspect it's not a 'real' error as the Linux-based testers work fine, but I 
also can't think of why this is failing.






On Mon, May 16, 2022 at 7:44 AM Maxim Gekk  
wrote:

Please vote on releasing the following candidate as Apache Spark version 3.3.0.



The vote is open until 11:59pm Pacific time May 19th and passes if a majority 
+1 PMC votes are cast, with a minimum of 3 +1 votes.



[ ] +1 Release this package as Apache Spark 3.3.0

[ ] -1 Do not release this package because ...



To learn more about Apache Spark, please see http://spark.apache.org/



The tag to be voted on is v3.3.0-rc2 (commit 
c8c657b922ac8fd8dcf9553113e11a80079db059):

https://github.com/apache/spark/tree/v3.3.0-rc2



The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/



Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS



The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1403



The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/



The list of bug fixes going into 3.3.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12350369


This release is using the release script of the tag v3.3.0-rc2.





FAQ



=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.



If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your projects resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with a out of date RC going forward).



===

What should happen to JIRA tickets still targeting 3.3.0?

===

The current list of open tickets targeted at 3.3.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.3.0



Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.



==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.


Maxim Gekk


Software Engineer

Databricks, Inc.





 

Re: Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread Hyukjin Kwon
We need add https://github.com/apache/spark/pull/36556 to RC2.

We will likely have to change the version being added if RC2 passes.
Since this is a new API/improvement, I would prefer to not block the
release by that.

On Tue, 17 May 2022 at 19:19, beliefer  wrote:

> We need add https://github.com/apache/spark/pull/36556 to RC2.
>
>
> On 2022-05-17 17:37:13, "Hyukjin Kwon" wrote:
>
> That seems like a test-only issue. I made a quick followup at
> https://github.com/apache/spark/pull/36576.
>
> On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:
>
>> I'm still seeing failures related to the function registry, like:
>>
>> ExpressionsSchemaSuite:
>> - Check schemas for expression examples *** FAILED ***
>>   396 did not equal 398 Expected 396 blocks in result file but got 398.
>> Try regenerating the result files. (ExpressionsSchemaSuite.scala:161)
>>
>> - SPARK-14415: All functions should have own descriptions *** FAILED ***
>>   "Function: bloom_filter_aggClass:
>> org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage:
>> N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in
>> the result) (QueryTest.scala:54)
>>
>> There seems to be consistently a difference of 2 in the list of expected
>> functions and actual. I haven't looked closely, don't know this code. I'm
>> on Ubuntu 22.04. Anyone else seeing something like this? Wondering if it's
>> something weird to do with case sensitivity, hidden files lurking
>> somewhere, etc.
>>
>> I suspect it's not a 'real' error as the Linux-based testers work fine,
>> but I also can't think of why this is failing.
>>
>>
>>
>> On Mon, May 16, 2022 at 7:44 AM Maxim Gekk
>>  wrote:
>>
>>> Please vote on releasing the following candidate as
>>> Apache Spark version 3.3.0.
>>>
>>> The vote is open until 11:59pm Pacific time May 19th and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.3.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.3.0-rc2 (commit
>>> c8c657b922ac8fd8dcf9553113e11a80079db059):
>>> https://github.com/apache/spark/tree/v3.3.0-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1403
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
>>>
>>> The list of bug fixes going into 3.3.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>
>>> This release is using the release script of the tag v3.3.0-rc2.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.3.0?
>>> ===
>>> The current list of open tickets targeted at 3.3.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.3.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>> Maxim Gekk
>>>
>>> Software Engineer
>>>
>>> Databricks, Inc.
>>>
>>
>
>
>


Re:Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread beliefer
We need to add https://github.com/apache/spark/pull/36556 to RC2.




On 2022-05-17 17:37:13, "Hyukjin Kwon" wrote:

That seems like a test-only issue. I made a quick followup at 
https://github.com/apache/spark/pull/36576.


On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:

I'm still seeing failures related to the function registry, like:


ExpressionsSchemaSuite:
- Check schemas for expression examples *** FAILED ***
  396 did not equal 398 Expected 396 blocks in result file but got 398. Try 
regenerating the result files. (ExpressionsSchemaSuite.scala:161)

- SPARK-14415: All functions should have own descriptions *** FAILED ***
  "Function: bloom_filter_aggClass: 
org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage: 
N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in the 
result) (QueryTest.scala:54)



There seems to be consistently a difference of 2 in the list of expected 
functions and actual. I haven't looked closely, don't know this code. I'm on 
Ubuntu 22.04. Anyone else seeing something like this? Wondering if it's 
something weird to do with case sensitivity, hidden files lurking somewhere, 
etc.


I suspect it's not a 'real' error as the Linux-based testers work fine, but I 
also can't think of why this is failing.






On Mon, May 16, 2022 at 7:44 AM Maxim Gekk  
wrote:

Please vote on releasing the following candidate as Apache Spark version 3.3.0.



The vote is open until 11:59pm Pacific time May 19th and passes if a majority 
+1 PMC votes are cast, with a minimum of 3 +1 votes.



[ ] +1 Release this package as Apache Spark 3.3.0

[ ] -1 Do not release this package because ...



To learn more about Apache Spark, please see http://spark.apache.org/



The tag to be voted on is v3.3.0-rc2 (commit 
c8c657b922ac8fd8dcf9553113e11a80079db059):

https://github.com/apache/spark/tree/v3.3.0-rc2



The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/



Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS



The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1403



The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/



The list of bug fixes going into 3.3.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12350369


This release is using the release script of the tag v3.3.0-rc2.





FAQ



=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.



If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your projects resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with a out of date RC going forward).



===

What should happen to JIRA tickets still targeting 3.3.0?

===

The current list of open tickets targeted at 3.3.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.3.0



Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.



==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.


Maxim Gekk


Software Engineer

Databricks, Inc.

Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread Hyukjin Kwon
That seems like a test-only issue. I made a quick followup at
https://github.com/apache/spark/pull/36576.

On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:

> I'm still seeing failures related to the function registry, like:
>
> ExpressionsSchemaSuite:
> - Check schemas for expression examples *** FAILED ***
>   396 did not equal 398 Expected 396 blocks in result file but got 398.
> Try regenerating the result files. (ExpressionsSchemaSuite.scala:161)
>
> - SPARK-14415: All functions should have own descriptions *** FAILED ***
>   "Function: bloom_filter_aggClass:
> org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage:
> N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in
> the result) (QueryTest.scala:54)
>
> There seems to be consistently a difference of 2 in the list of expected
> functions and actual. I haven't looked closely, don't know this code. I'm
> on Ubuntu 22.04. Anyone else seeing something like this? Wondering if it's
> something weird to do with case sensitivity, hidden files lurking
> somewhere, etc.
>
> I suspect it's not a 'real' error as the Linux-based testers work fine,
> but I also can't think of why this is failing.
>
>
>
> On Mon, May 16, 2022 at 7:44 AM Maxim Gekk
>  wrote:
>
>> Please vote on releasing the following candidate as
>> Apache Spark version 3.3.0.
>>
>> The vote is open until 11:59pm Pacific time May 19th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.3.0-rc2 (commit
>> c8c657b922ac8fd8dcf9553113e11a80079db059):
>> https://github.com/apache/spark/tree/v3.3.0-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1403
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
>>
>> The list of bug fixes going into 3.3.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>
>> This release is using the release script of the tag v3.3.0-rc2.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.3.0?
>> ===
>> The current list of open tickets targeted at 3.3.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.3.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>


Unable to create view due to up cast error when migrating from Hive to Spark

2022-05-17 Thread beliefer
During the migration from Hive to Spark, there was a problem with the SQL used 
to create views in Hive: SQL that legally creates a view in Hive raises an 
error when executed in Spark SQL.

The SQL is as follows:

CREATE VIEW myView AS
SELECT
  CASE WHEN age > 12 THEN CAST(gender * 0.3 - 0.1 AS double) END AS TT,
  gender,
  age
FROM myTable;

The error message is as follows:

Cannot up cast TT from decimal(13, 1) to double.

The type path of the target object is:



You can either add an explicit cast to the input data or choose a higher 
precision type of the field in the target object



How should we solve this problem?



Re: Introducing "Pandas API on Spark" component in JIRA, and use "PS" PR title component

2022-05-17 Thread Maciej
Sounds good!

+1

On 5/17/22 06:08, Yikun Jiang wrote:
> It's a pretty good idea, +1.
> 
> To be clear in Github:
> 
> - For each PR Title: [SPARK-XXX][PYTHON][PS] The Pandas on spark pr title
> (*still keep [PYTHON]* and [PS] new added)
> 
> - For PR label: new added: `PANDAS API ON Spark`, still keep: `PYTHON`,
> `CORE`
> (*still keep `PYTHON`, `CORE`* and `PANDAS API ON SPARK` new added)
> https://github.com/apache/spark/pull/36574
> 
> 
> Right?
> 
> Regards,
> Yikun
> 
> 
> On Tue, May 17, 2022 at 11:26 AM Hyukjin Kwon wrote:
> 
> Hi all,
> 
> What about we introduce a component in JIRA "Pandas API on Spark",
> and use "PS"  (pandas-on-Spark) in PR titles? We already use "ps" in
> many places when we: import pyspark.pandas as ps.
> This is similar to "Structured Streaming" in JIRA, and "SS" in PR title.
> 
> I think it'd be easier to track the changes here with that.
> Currently it's a bit difficult to identify it from pure PySpark changes.
> 


-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC




Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread Holden Karau
Oh that’s rad 

On Tue, May 17, 2022 at 7:47 AM bo yang  wrote:

> Hi Spark Folks,
>
> I built a web reverse proxy to access Spark UI on Kubernetes (working
> together with https://github.com/GoogleCloudPlatform/spark-on-k8s-operator).
> Want to share here in case other people have similar need.
>
> The reverse proxy code is here:
> https://github.com/datapunchorg/spark-ui-reverse-proxy
>
> Let me know if anyone wants to use or would like to contribute.
>
> Thanks,
> Bo
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau