Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Hyukjin Kwon
- builds

FYI, the cc to Spark dev was dropped during the discussion. If you haven't
subscribed to builds@a.g, you have seen only part of the discussion.
Please subscribe to the bui...@apache.org mailing list to participate
further in the discussion.


On Thu, Apr 8, 2021 at 1:50 PM Wenchen Fan wrote:

> > for example, having sub-groups where each group shares the resources -
> currently one GitHub organisation shares all resources across the projects.
>
> That's a good idea. We do need to thank GitHub for giving free resources to
> ASF projects, but it would be better if we could make it a business: allow
> individual projects to sign deals with GitHub for dedicated resources.
> It's a bit wasteful to ask every project to set up its own dev ops;
> using GitHub Actions is more convenient. Maybe we should raise this with GitHub?
>
> On Wed, Apr 7, 2021 at 9:31 PM Hyukjin Kwon  wrote:
>
>> Thanks Martin for your feedback.
>>
>> > What was your reason to migrate from Apache Jenkins to GitHub Actions?
>>
>> I am sure there were more reasons for migrating from AmpLab Jenkins
>> to GitHub Actions, but as far
>> as I can remember:
>> - To reduce the maintenance cost of machines
>> - The Jenkins machines became unstable and slow, causing CI jobs to fail
>> or be very flaky.
>> - Difficulty managing the installed libraries.
>> - Intermittent unknown issues in the machines
>>
>> Yes, one option might be to migrate again to another CI service.
>> However, other projects would very likely suffer the
>> same problem. In addition, migrating a large project is not
>> easy work.
>>
>> I would like to know the feasibility of having more resources in GitHub
>> Actions, or, for example, having sub-groups where
>> each group shares the resources - currently one GitHub organisation
>> shares all resources across the projects.
>>
>>
>> On Wed, Apr 7, 2021 at 10:04 PM Martin Grigorov wrote:
>>
>>>
>>>
>>> On Wed, Apr 7, 2021 at 3:41 PM Hyukjin Kwon  wrote:
>>>
 Hi Greg,

 I raised this thread to figure out a way that we can work together to
 resolve this issue, gather feedback, and to understand how other
 projects
 work around it.
 Several projects I observed, as far as I can tell, have made enough
 efforts
 to save the resources in GitHub Actions but still suffer from the lack
 of
 resources.

>>>
>>> And it will get even worse because:
>>> 1) more and more Apache projects migrate from TravisCI to GitHub Actions
>>> (GA)
>>> 2) new projects join ASF and many of them already use GA
>>>
>>>
>>> What was your reason to migrate from Apache Jenkins to GitHub Actions?
>>> If you want dedicated resources, then you will need to manage the CI
>>> yourself.
>>> You could use Apache Jenkins/Buildbot with dedicated agents for your
>>> project.
>>> Or you could set up your own CI infrastructure with Jenkins, DroneIO,
>>> ConcourseCI, ...
>>>
>>> Yet another option is to move to CircleCI or Cirrus. They are similar to
>>> TravisCI / GA and less crowded (for now).
>>>
>>> Martin
>>>
>>> I appreciate the resources provided to us but that does not resolve the
 issue of the development being slowed down.


 On Wed, Apr 7, 2021 at 5:52 PM Greg Stein wrote:

 > On Wed, Apr 7, 2021 at 12:25 AM Hyukjin Kwon 
 wrote:
 >
 >> Hi all,
 >>
 >> I am an Apache Spark PMC,
 >
 >
 > You are a member of the Apache Spark PMC. You are *not* a PMC. Please
 stop
 > with that terminology. The Foundation has about 200 PMCs, and you are
 a
 > member of one of them. You are NOT a "PMC" .. you're a person. A PMC
 is a
 > construct of the Foundation.
 >
 > >...
 >
 >> I am aware of the limited GitHub Actions resources that are shared
 >> across all projects in ASF,
 >> and many projects suffer from it. This issue significantly slows
 down the
 >> development cycle of
 >>  other projects, at least Apache Spark.
 >>
 >
 > And the Foundation gets those build minutes for GitHub Actions
 provided to
 > us from GitHub and Microsoft, and we are thankful that they provide
 them to
 > the Foundation. Maybe it isn't all the build minutes that every group
 > wants, but that is what we have. So it is incumbent upon all of us to
 > figure out how to build more, with fewer minutes.
 >
 > Say "thank you" to GitHub, please.
 >
 > Regards,
 > -g
 >
 >

>>>


Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Wenchen Fan
> for example, having sub-groups where each group shares the resources -
currently one GitHub organisation shares all resources across the projects.

That's a good idea. We do need to thank GitHub for giving free resources to
ASF projects, but it would be better if we could make it a business: allow
individual projects to sign deals with GitHub for dedicated resources.
It's a bit wasteful to ask every project to set up its own dev ops;
using GitHub Actions is more convenient. Maybe we should raise this with GitHub?

On Wed, Apr 7, 2021 at 9:31 PM Hyukjin Kwon  wrote:

> Thanks Martin for your feedback.
>
> > What was your reason to migrate from Apache Jenkins to GitHub Actions?
>
> I am sure there were more reasons for migrating from AmpLab Jenkins
> to GitHub Actions, but as far as
> I can remember:
> - To reduce the maintenance cost of machines
> - The Jenkins machines became unstable and slow, causing CI jobs to fail or
> be very flaky.
> - Difficulty managing the installed libraries.
> - Intermittent unknown issues in the machines
>
> Yes, one option might be to migrate again to another CI service.
> However, other projects would very likely suffer the
> same problem. In addition, migrating a large project is not
> easy work.
>
> I would like to know the feasibility of having more resources in GitHub
> Actions, or, for example, having sub-groups where
> each group shares the resources - currently one GitHub organisation shares
> all resources across the projects.
>
>
> On Wed, Apr 7, 2021 at 10:04 PM Martin Grigorov wrote:
>
>>
>>
>> On Wed, Apr 7, 2021 at 3:41 PM Hyukjin Kwon  wrote:
>>
>>> Hi Greg,
>>>
>>> I raised this thread to figure out a way that we can work together to
>>> resolve this issue, gather feedback, and to understand how other projects
>>> work around it.
>>> Several projects I observed, as far as I can tell, have made enough
>>> efforts
>>> to save the resources in GitHub Actions but still suffer from the lack of
>>> resources.
>>>
>>
>> And it will get even worse because:
>> 1) more and more Apache projects migrate from TravisCI to GitHub Actions
>> (GA)
>> 2) new projects join ASF and many of them already use GA
>>
>>
>> What was your reason to migrate from Apache Jenkins to GitHub Actions?
>> If you want dedicated resources, then you will need to manage the CI
>> yourself.
>> You could use Apache Jenkins/Buildbot with dedicated agents for your
>> project.
>> Or you could set up your own CI infrastructure with Jenkins, DroneIO,
>> ConcourseCI, ...
>>
>> Yet another option is to move to CircleCI or Cirrus. They are similar to
>> TravisCI / GA and less crowded (for now).
>>
>> Martin
>>
>> I appreciate the resources provided to us but that does not resolve the
>>> issue of the development being slowed down.
>>>
>>>
>>> On Wed, Apr 7, 2021 at 5:52 PM Greg Stein wrote:
>>>
>>> > On Wed, Apr 7, 2021 at 12:25 AM Hyukjin Kwon 
>>> wrote:
>>> >
>>> >> Hi all,
>>> >>
>>> >> I am an Apache Spark PMC,
>>> >
>>> >
>>> > You are a member of the Apache Spark PMC. You are *not* a PMC. Please
>>> stop
>>> > with that terminology. The Foundation has about 200 PMCs, and you are a
>>> > member of one of them. You are NOT a "PMC" .. you're a person. A PMC
>>> is a
>>> > construct of the Foundation.
>>> >
>>> > >...
>>> >
>>> >> I am aware of the limited GitHub Actions resources that are shared
>>> >> across all projects in ASF,
>>> >> and many projects suffer from it. This issue significantly slows down
>>> the
>>> >> development cycle of
>>> >>  other projects, at least Apache Spark.
>>> >>
>>> >
>>> > And the Foundation gets those build minutes for GitHub Actions
>>> provided to
>>> > us from GitHub and Microsoft, and we are thankful that they provide
>>> them to
>>> > the Foundation. Maybe it isn't all the build minutes that every group
>>> > wants, but that is what we have. So it is incumbent upon all of us to
>>> > figure out how to build more, with fewer minutes.
>>> >
>>> > Say "thank you" to GitHub, please.
>>> >
>>> > Regards,
>>> > -g
>>> >
>>> >
>>>
>>


Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Liang-Chi Hsieh


I'm working on the fix for master. I think the fix is the same for 2.4.

Okay. So I think we are in favor of RC2, and RC1 is dropped. I will get
the fix merged first and then prepare RC2.

Thank you.

Liang-Chi


Mridul Muralidharan wrote
> Do we have a fix for this in 3.x/master which can be backported without too
> much surrounding change?
> Given we are expecting 2.4.8 to probably be the last release for 2.4, if we
> can fix it, that would be great.
> 
> Regards,
> Mridul








Re: [DISCUSS] Build error message guideline

2021-04-07 Thread Hyukjin Kwon
LGTM (I took a look, and had some offline discussions w/ some corrections
before it came out)

On Thu, Apr 8, 2021 at 5:28 AM Karen wrote:

> Hi all,
>
> As discussed in SPIP: Standardize Exception Messages in Spark (
> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing),
> improving error message quality in Apache Spark involves establishing an
> error message guideline for developers. Error message style guidelines are
> common practice across open-source projects, for example PostgreSQL (
> https://www.postgresql.org/docs/current/error-style-guide.html).
>
> To move towards the goal of improving error message quality, we would like
> to start building an error message guideline. We have attached a rough
> draft to kick off this discussion:
> https://docs.google.com/document/d/12k4zmaKmmdm6Pk63HS0N1zN1QT-6TihkWaa5CkLmsn8/edit?usp=sharing
> .
>
> Please let us know what you think should be in the guideline! We look
> forward to building this as a community.
>
> Thank you,
>
> Karen Feng
>


Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Mridul Muralidharan
Do we have a fix for this in 3.x/master which can be backported without too
much surrounding change ?
Given we are expecting 2.4.7 to probably be the last release for 2.4, if we
can fix it, that would be great.

Regards,
Mridul

On Wed, Apr 7, 2021 at 9:31 PM Liang-Chi Hsieh  wrote:

> Thanks for voting.
>
> After I started running the release script to cut RC1 a while ago, I found a
> nested column pruning bug, SPARK-34963, and unfortunately it exists in 2.4.7
> too. As RC1 is already cut, I will continue this vote.
>
> The bug looks like a corner case to me, and it has not been reported yet even
> though we have supported nested column pruning since 2.4. So maybe it is okay
> not to fix it in 2.4?
>
>
>
>
> cloud0fan wrote
> > +1
> >
> > On Thu, Apr 8, 2021 at 9:24 AM Sean Owen wrote:
> >
> >> Looks good to me testing on Java 8, Hadoop 2.7, Ubuntu, with about all
> >> profiles enabled.
> >> I still get an odd failure in the Hive versions suite, but I keep seeing
> >> that in my env and think it's something odd about my setup.
> >> +1
> >>
>
>
>
>
>
>
>


Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Liang-Chi Hsieh
Thanks for voting.

After I started running the release script to cut RC1 a while ago, I found a
nested column pruning bug, SPARK-34963, and unfortunately it exists in 2.4.7
too. As RC1 is already cut, I will continue this vote.

The bug looks like a corner case to me, and it has not been reported yet even
though we have supported nested column pruning since 2.4. So maybe it is okay
not to fix it in 2.4?
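
For readers less familiar with the feature in question, here is a minimal
illustration of nested column pruning (the path and schema are hypothetical;
this is not the SPARK-34963 reproduction):

// Nested schema pruning reads only the struct fields a query touches.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
// Suppose the Parquet schema is: s struct<a: int, b: string>
val df = spark.read.parquet("/data/nested")
df.select($"s.a").explain()
// With pruning, the scan's ReadSchema shows struct<a:int> rather than the
// full struct, so field b is never read from disk.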




cloud0fan wrote
> +1
> 
> On Thu, Apr 8, 2021 at 9:24 AM Sean Owen wrote:
> 
>> Looks good to me testing on Java 8, Hadoop 2.7, Ubuntu, with about all
>> profiles enabled.
>> I still get an odd failure in the Hive versions suite, but I keep seeing
>> that in my env and think it's something odd about my setup.
>> +1
>>








Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Takeshi Yamamuro
Thanks for driving this, Liang-Chi~
IIUC there is no critical issue in the SQL part, so it looks fine.

+1 (non-binding)

On Thu, Apr 8, 2021 at 11:20 AM Wenchen Fan  wrote:

> +1
>
> On Thu, Apr 8, 2021 at 9:24 AM Sean Owen  wrote:
>
>> Looks good to me testing on Java 8, Hadoop 2.7, Ubuntu, with about all
>> profiles enabled.
>> I still get an odd failure in the Hive versions suite, but I keep seeing
>> that in my env and think it's something odd about my setup.
>> +1
>>
>

-- 
---
Takeshi Yamamuro


Re: Big Broadcast Hash Join with Dynamic Partition Pruning gives wrong results

2021-04-07 Thread Wenchen Fan
Hi Tomas, thanks for reporting this bug!

Is it possible to share your dataset so that other people can reproduce and
debug it?

On Thu, Apr 8, 2021 at 7:52 AM Tomas Bartalos 
wrote:

> When I try to do a Broadcast Hash Join on a bigger table (6 Mil rows), I get
> an incorrect result of 0 rows.
>
> val rightDF = spark.read.format("parquet").load("table-a")
> val leftDF =  spark.read.format("parquet").load("table-b")
>   //needed to activate dynamic pruning subquery
>   .where('part_ts === 20210304000L)
>
> // leftDF has 7 Mil rows ~ 120 MB
> val join = broadcast(leftDF).join(rightDF,
>   $"match_part_id" === $"part_id" && $"match_id" === $"id"
> )
> join.count
>
> res1: Long = 0
>
> I think it's connected with Dynamic Partition Pruning of the rightDF,
> which is happening according to the plan:
>
> PartitionFilters: [isnotnull(part_id#477L), 
> dynamicpruningexpression(part_id#477L IN dynamicpruning#534)]
>
> = Subqueries =
>
> Subquery:1 Hosting operator id = 6 Hosting Expression = part_id#477L IN 
> dynamicpruning#534
> ReusedExchange (11)
>
>
> (11) ReusedExchange [Reuses operator id: 5]
> Output [4]: [match_part_id#487L, match_id#488L, UK#489, part_ts#490L]
>
> *Removing the broadcast hint OR shrinking the broadcasted table corrects
> the result*:
>
> val rightDF = spark.read.format("parquet").load("table-a")
> val leftDF =  spark.read.format("parquet").load("table-b")
>   //needed to activate dynamic pruning subquery
>   .where('part_ts === 20210304000L)
>  // shrinks the broadcasted table to 18K rows
>  .where('match_id === 33358792)
>
> // leftDF has 18K rows
> val join = broadcast(leftDF).join(rightDF,
>   $"match_part_id" === $"part_id" && $"match_id" === $"id"
> )
> join.count
>
> res2: Long = 379701
>
> I would expect the broadcast to fail, but would never expect to get
> incorrect results without an exception. What do you think?
>
>
> BR,
>
> Tomas
>


Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Wenchen Fan
+1

On Thu, Apr 8, 2021 at 9:24 AM Sean Owen  wrote:

> Looks good to me testing on Java 8, Hadoop 2.7, Ubuntu, with about all
> profiles enabled.
> I still get an odd failure in the Hive versions suite, but I keep seeing
> that in my env and think it's something odd about my setup.
> +1
>


Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Sean Owen
Looks good to me testing on Java 8, Hadoop 2.7, Ubuntu, with about all
profiles enabled.
I still get an odd failure in the Hive versions suite, but I keep seeing
that in my env and think it's something odd about my setup.
+1


Big Broadcast Hash Join with Dynamic Partition Pruning gives wrong results

2021-04-07 Thread Tomas Bartalos
When I try to do a Broadcast Hash Join on a bigger table (6 Mil rows), I get
an incorrect result of 0 rows.

val rightDF = spark.read.format("parquet").load("table-a")
val leftDF =  spark.read.format("parquet").load("table-b")
  //needed to activate dynamic pruning subquery
  .where('part_ts === 20210304000L)

// leftDF has 7 Mil rows ~ 120 MB
val join = broadcast(leftDF).join(rightDF,
  $"match_part_id" === $"part_id" && $"match_id" === $"id"
)
join.count

res1: Long = 0

I think it's connected with Dynamic Partition Pruning of the rightDF, which
is happening according to the plan:

PartitionFilters: [isnotnull(part_id#477L),
dynamicpruningexpression(part_id#477L IN dynamicpruning#534)]

= Subqueries =

Subquery:1 Hosting operator id = 6 Hosting Expression = part_id#477L
IN dynamicpruning#534
ReusedExchange (11)


(11) ReusedExchange [Reuses operator id: 5]
Output [4]: [match_part_id#487L, match_id#488L, UK#489, part_ts#490L]

*Removing the broadcast hint OR shrinking the broadcasted table corrects
the result*:

val rightDF = spark.read.format("parquet").load("table-a")
val leftDF =  spark.read.format("parquet").load("table-b")
  //needed to activate dynamic pruning subquery
  .where('part_ts === 20210304000L)
 // shrinks the broadcasted table to 18K rows
 .where('match_id === 33358792)

// leftDF has 18K rows
val join = broadcast(leftDF).join(rightDF,
  $"match_part_id" === $"part_id" && $"match_id" === $"id"
)
join.count

res2: Long = 379701

I would expect the broadcast to fail, but would never expect to get
incorrect results without an exception. What do you think?
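
One way to confirm that dynamic partition pruning, rather than the broadcast
itself, is at fault is to toggle it off and re-run the count. A minimal
sketch, assuming Spark 3.x where this flag exists:

// Disable dynamic partition pruning and re-run the join. If the count
// becomes correct, DPP on the pruned side is the likely culprit.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")
val joinNoDpp = broadcast(leftDF).join(rightDF,
  $"match_part_id" === $"part_id" && $"match_id" === $"id"
)
joinNoDpp.count  // expected: the same count as without the broadcast hint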


BR,

Tomas


Re: [Discuss][SPIP] DataSource V2 SQL push down

2021-04-07 Thread huaxin gao
Hi Chang,

Thanks for working on this.

Could you please explain how your proposal can be extended to the
file-based data sources? Since at least half of the Spark community is
using file-based data sources, I think any design should consider the
file-based data sources as well. I work on both sql-based and file-based
data sources, and I understand that they are very different. It’s
challenging to have a design to work for both, but the current filter push
down and column pruning have been designed nicely to fit both sides. I
think we should follow the same approach to make Aggregate push down work
for both too.

I am currently collaborating with the Apple Spark team and Facebook Spark
team to push down Aggregate to file-based data sources. We are doing some
ongoing work right now to push down Max/Min/Count to parquet and later to
ORC to utilize the statistics information there (
https://github.com/apache/spark/pull/32049). Please correct me if I am
wrong: it seems to me that your proposal doesn't consider file-based data
sources at all and will stop us from continuing our work.
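
For context, the statistics mentioned above live in the Parquet footer, so
Max/Min/Count can in principle be answered without scanning any data pages.
A rough sketch of reading them directly (the path and column are
hypothetical; assumes the org.apache.parquet parquet-hadoop library):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

val reader = ParquetFileReader.open(HadoopInputFile.fromPath(
  new Path("/data/table-a/part-00000.parquet"), new Configuration()))
val blocks = reader.getFooter.getBlocks.asScala
val rowCount = blocks.map(_.getRowCount).sum  // answers COUNT(*)
// Per-row-group statistics of the first column; Min/Max fold over row groups.
val firstColStats = blocks.map(_.getColumns.asScala.head.getStatistics)
reader.close()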

Shall we schedule a meeting to discuss this?

Thanks,

Huaxin



On Wed, Apr 7, 2021 at 1:32 AM Chang Chen  wrote:

> hi huaxin
>
> please review https://github.com/apache/spark/pull/32061
>
> As for adding a *trait PrunedFilteredAggregateScan* for V1 JDBC: I deleted
> the trait, since the V1 DataSource needn't support aggregation push down.
>
> On Mon, Apr 5, 2021 at 10:02 PM Chang Chen wrote:
>
>> Hi huaxin
>>
>> What I am concerned about is abstraction
>>
>>1. How to extend sources.Aggregation. Because Catalyst Expressions
>>are recursive, it is very bad to define a new hierarchy; I think ScanBuilder
>>must convert pushed expressions to its own format.
>>2. The optimization rule is also an extension point; I didn't see any
>>consideration of join push down. I also think
>>SupportsPushDownRequiredColumns and SupportsPushDownFilters are
>>problematic.
>>
>> Obviously, File Based Source and SQL Based Source are quite different in
>> push down capabilities. I am not sure they can be consolidated into one API.
>>
>> I will push my PR tomorrow, and after that, could we schedule a meeting
>> to discuss the API?
>>
>> On Mon, Apr 5, 2021 at 2:24 AM huaxin gao wrote:
>>
>>> Hello Chang,
>>>
>>> Thanks for proposing the SPIP and initiating the discussion. However, I
>>> think the problem with your proposal is that you haven’t taken into
>>> consideration file-based data sources such as parquet, ORC, etc. As far as
>>> I know, most of the Spark users have file-based data sources.  As a matter
>>> of fact, I have customers waiting for Aggregate push down for Parquet.
>>> That’s the reason I have my current implementation, which has a unified
>>> Aggregate push down approach for both the file-based data sources and JDBC.
>>>
>>> I discussed with several members of the Spark community recently, and we
>>> have agreed to break down the Aggregate push down work into the following
>>> steps:
>>>
>>> 1. Implement Max, Min and Count push down in Parquet
>>> 2. Add a new physical plan rewrite rule to remove partial aggregate. We
>>> can optimize one more step to remove ShuffleExchange if the group by column
>>> and partition col are the same.
>>> 3. Implement Max, Min and Count push down in JDBC
>>> 4. Implement Sum and Avg push down in JDBC
>>>
>>>
>>> I plan to implement Aggregate push down for Parquet first for now. The
>>> reasons are:
>>>
>>> 1. It’s relatively easier to implement Parquet Aggregate push down than
>>> JDBC:
>>>    1. Only need to implement Max, Min and Count.
>>>    2. No need to deal with the differences between Spark and other
>>>    databases. For example, aggregating decimal values has different
>>>    behaviours between database implementations.
>>>
>>> The main point is that we want to keep the PR minimal and support the
>>> basic infrastructure for Aggregate push down first. Actually, the PR for
>>> implementing Parquet Aggregate push down is already very big. We don’t want
>>> to have a huge PR to solve all the problems. It’s too hard to review.
>>>
>>>
>>> 2. I think it’s too early to implement the JDBC Aggregate push down for
>>> now. Underneath, V2 DS JDBC still calls the V1 DS JDBC path. If we
>>> implement JDBC Aggregate push down now, we still need to add a *trait
>>> PrunedFilteredAggregateScan* for V1 JDBC. One of the major
>>> motivations for having V2 DS is that we want to improve the
>>> flexibility of implementing new operator push down by avoiding adding a new
>>> push down trait. If we still add a new pushdown trait in V1 DS JDBC, I feel
>>> we are defeating the purpose of having DS V2. So I want to wait until we
>>> fully migrate to DS V2 JDBC, and then implement Aggregate push down for
>>> JDBC.
>>>
>>>
>>> I have submitted Parquet Aggregate push down PR. Here is the link:
>>> https://github.com/apache/spark/pull/32049

[DISCUSS] Build error message guideline

2021-04-07 Thread Karen
Hi all,

As discussed in SPIP: Standardize Exception Messages in Spark (
https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing),
improving error message quality in Apache Spark involves establishing an
error message guideline for developers. Error message style guidelines are
common practice across open-source projects, for example PostgreSQL (
https://www.postgresql.org/docs/current/error-style-guide.html).
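
For instance, here is the kind of rewrite such guidelines typically encourage
(an illustration only, not an example taken from the draft): name what
failed, why, and what the user can do about it.

// Illustration: a vague message vs. one that names the offending value and
// the accepted alternatives ("mode" is a hypothetical parameter here).
def validateSaveMode(mode: String): Unit = {
  val supported = Seq("Append", "Overwrite", "ErrorIfExists", "Ignore")
  if (!supported.contains(mode)) {
    // Vague: throw new IllegalArgumentException("Unsupported save mode")
    // Clearer:
    throw new IllegalArgumentException(
      s"Save mode '$mode' is not supported for this data source. " +
        s"Supported modes are: ${supported.mkString(", ")}.")
  }
}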

To move towards the goal of improving error message quality, we would like
to start building an error message guideline. We have attached a rough
draft to kick off this discussion:
https://docs.google.com/document/d/12k4zmaKmmdm6Pk63HS0N1zN1QT-6TihkWaa5CkLmsn8/edit?usp=sharing
.

Please let us know what you think should be in the guideline! We look
forward to building this as a community.

Thank you,

Karen Feng


please read: current state and the future of the apache spark build system

2021-04-07 Thread shane knapp ☠
this will be a relatively big update, as there are many many moving pieces
with short, medium and long term goals.

TLDR1:   we're shutting jenkins down at the end of 2021.

TLDR2:  i know we're way behind on pretty much everything.  most of the
hardware is at or beyond EOL, and random systemic build failures (like
k8s/minikube) are randomly popping up.  i've had to restrict access due to
new campus policies, and i will be dealing with that shortly and only for a
few contributors.

long term (until EOY):
* decide what the future of spark builds and releases will look like
  - do we need jenkins?
  - if we do, who's responsible for hosting + ops?
* we will permanently shut down amplab jenkins by the end of 2021
  - uc berkeley has funded this for over 10 years, and both the funds and
staff (only me, for 7 years) are going away.  i'm staying at cal, but have
a much different job now.  :)

medium term (in 6 months):
* prepare jenkins worker ansible configs and stick in the spark repo
  - nothing fancy, but enough to config ubuntu workers
  - could be used to create docker containers for testing in
THE CLOUD
* train up brian shiratsuki (cced) to help w/ops tasks and upgrades over
the next ~6m
* get to all of the python version, library installation, etc etc jira
requests

short term (weeks):
* debug and figure out why minikube/k8s broke
  - https://issues.apache.org/jira/browse/SPARK-34738
  - i really could use some help here...
* bring up additional workers
  - finish hardware/system level repairs on the bare metal
  - see above, re k8s jira
* stabilize cluster
  - recent jenkins LTS upgrade broke the web GUI
  - finish deploying monitoring/alerting
  - this hardware is OLD and literally falling over, so we have lots of
random disk and ram failures.  it's literally whack-a-mole and each trip to
the colo to repair literally takes a full day

i'm only able to spend a few hours a week on the build system, so expect
random downtime, reboots, restarts, and testing.  we're testing new nodes
as we deploy, and hoping to fix anything before releasing them into the
wild, but some things might be flaky.

but the biggest question is what you all need w/regards to build
infrastructure...  and who's going to be responsible for it.

thanks for reading!  :)

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Hyukjin Kwon
Thanks Martin for your feedback.

> What was your reason to migrate from Apache Jenkins to GitHub Actions?

I am sure there were more reasons for migrating from AmpLab Jenkins
to GitHub Actions, but as far as I
can remember:
- To reduce the maintenance cost of machines
- The Jenkins machines became unstable and slow, causing CI jobs to fail or
be very flaky.
- Difficulty managing the installed libraries.
- Intermittent unknown issues in the machines

Yes, one option might be to migrate again to another CI service.
However, other projects would very likely suffer the
same problem. In addition, migrating a large project is not
easy work.

I would like to know the feasibility of having more resources in GitHub
Actions, or, for example, having sub-groups where
each group shares the resources - currently one GitHub organisation shares
all resources across the projects.


On Wed, Apr 7, 2021 at 10:04 PM Martin Grigorov wrote:

>
>
> On Wed, Apr 7, 2021 at 3:41 PM Hyukjin Kwon  wrote:
>
>> Hi Greg,
>>
>> I raised this thread to figure out a way that we can work together to
>> resolve this issue, gather feedback, and to understand how other projects
>> work around it.
>> Several projects I observed, as far as I can tell, have made enough
>> efforts
>> to save the resources in GitHub Actions but still suffer from the lack of
>> resources.
>>
>
> And it will get even worse because:
> 1) more and more Apache projects migrate from TravisCI to GitHub Actions
> (GA)
> 2) new projects join ASF and many of them already use GA
>
>
> What was your reason to migrate from Apache Jenkins to GitHub Actions?
> If you want dedicated resources, then you will need to manage the CI
> yourself.
> You could use Apache Jenkins/Buildbot with dedicated agents for your
> project.
> Or you could set up your own CI infrastructure with Jenkins, DroneIO,
> ConcourseCI, ...
>
> Yet another option is to move to CircleCI or Cirrus. They are similar to
> TravisCI / GA and less crowded (for now).
>
> Martin
>
> I appreciate the resources provided to us but that does not resolve the
>> issue of the development being slowed down.
>>
>>
>> On Wed, Apr 7, 2021 at 5:52 PM Greg Stein wrote:
>>
>> > On Wed, Apr 7, 2021 at 12:25 AM Hyukjin Kwon 
>> wrote:
>> >
>> >> Hi all,
>> >>
>> >> I am an Apache Spark PMC,
>> >
>> >
>> > You are a member of the Apache Spark PMC. You are *not* a PMC. Please
>> stop
>> > with that terminology. The Foundation has about 200 PMCs, and you are a
>> > member of one of them. You are NOT a "PMC" .. you're a person. A PMC is
>> a
>> > construct of the Foundation.
>> >
>> > >...
>> >
>> >> I am aware of the limited GitHub Actions resources that are shared
>> >> across all projects in ASF,
>> >> and many projects suffer from it. This issue significantly slows down
>> the
>> >> development cycle of
>> >>  other projects, at least Apache Spark.
>> >>
>> >
>> > And the Foundation gets those build minutes for GitHub Actions provided
>> to
>> > us from GitHub and Microsoft, and we are thankful that they provide
>> them to
>> > the Foundation. Maybe it isn't all the build minutes that every group
>> > wants, but that is what we have. So it is incumbent upon all of us to
>> > figure out how to build more, with fewer minutes.
>> >
>> > Say "thank you" to GitHub, please.
>> >
>> > Regards,
>> > -g
>> >
>> >
>>
>


Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Martin Grigorov
On Wed, Apr 7, 2021 at 3:41 PM Hyukjin Kwon  wrote:

> Hi Greg,
>
> I raised this thread to figure out a way that we can work together to
> resolve this issue, gather feedback, and to understand how other projects
> work around it.
> Several projects I observed, as far as I can tell, have made enough efforts
> to save the resources in GitHub Actions but still suffer from the lack of
> resources.
>

And it will get even worse because:
1) more and more Apache projects migrate from TravisCI to GitHub Actions
(GA)
2) new projects join ASF and many of them already use GA


What was your reason to migrate from Apache Jenkins to GitHub Actions?
If you want dedicated resources, then you will need to manage the CI
yourself.
You could use Apache Jenkins/Buildbot with dedicated agents for your
project.
Or you could set up your own CI infrastructure with Jenkins, DroneIO,
ConcourseCI, ...

Yet another option is to move to CircleCI or Cirrus. They are similar to
TravisCI / GA and less crowded (for now).

Martin

I appreciate the resources provided to us but that does not resolve the
> issue of the development being slowed down.
>
>
> On Wed, Apr 7, 2021 at 5:52 PM Greg Stein wrote:
>
> > On Wed, Apr 7, 2021 at 12:25 AM Hyukjin Kwon 
> wrote:
> >
> >> Hi all,
> >>
> >> I am an Apache Spark PMC,
> >
> >
> > You are a member of the Apache Spark PMC. You are *not* a PMC. Please
> stop
> > with that terminology. The Foundation has about 200 PMCs, and you are a
> > member of one of them. You are NOT a "PMC" .. you're a person. A PMC is a
> > construct of the Foundation.
> >
> > >...
> >
> >> I am aware of the limited GitHub Actions resources that are shared
> >> across all projects in ASF,
> >> and many projects suffer from it. This issue significantly slows down
> the
> >> development cycle of
> >>  other projects, at least Apache Spark.
> >>
> >
> > And the Foundation gets those build minutes for GitHub Actions provided
> to
> > us from GitHub and Microsoft, and we are thankful that they provide them
> to
> > the Foundation. Maybe it isn't all the build minutes that every group
> > wants, but that is what we have. So it is incumbent upon all of us to
> > figure out how to build more, with fewer minutes.
> >
> > Say "thank you" to GitHub, please.
> >
> > Regards,
> > -g
> >
> >
>


Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Hyukjin Kwon
Hi Greg,

I raised this thread to figure out a way that we can work together to
resolve this issue, gather feedback, and to understand how other projects
work around it.
Several projects I observed, as far as I can tell, have made enough efforts
to save the resources in GitHub Actions but still suffer from the lack of
resources.
I appreciate the resources provided to us but that does not resolve the
issue of the development being slowed down.


On Wed, Apr 7, 2021 at 5:52 PM Greg Stein wrote:

> On Wed, Apr 7, 2021 at 12:25 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I am an Apache Spark PMC,
>
>
> You are a member of the Apache Spark PMC. You are *not* a PMC. Please stop
> with that terminology. The Foundation has about 200 PMCs, and you are a
> member of one of them. You are NOT a "PMC" .. you're a person. A PMC is a
> construct of the Foundation.
>
> >...
>
>> I am aware of the limited GitHub Actions resources that are shared
>> across all projects in ASF,
>> and many projects suffer from it. This issue significantly slows down the
>> development cycle of
>>  other projects, at least Apache Spark.
>>
>
> And the Foundation gets those build minutes for GitHub Actions provided to
> us from GitHub and Microsoft, and we are thankful that they provide them to
> the Foundation. Maybe it isn't all the build minutes that every group
> wants, but that is what we have. So it is incumbent upon all of us to
> figure out how to build more, with fewer minutes.
>
> Say "thank you" to GitHub, please.
>
> Regards,
> -g
>
>


Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Greg Stein
On Wed, Apr 7, 2021 at 12:25 AM Hyukjin Kwon  wrote:

> Hi all,
>
> I am an Apache Spark PMC,


You are a member of the Apache Spark PMC. You are *not* a PMC. Please stop
with that terminology. The Foundation has about 200 PMCs, and you are a
member of one of them. You are NOT a "PMC" .. you're a person. A PMC is a
construct of the Foundation.

>...

> I am aware of the limited GitHub Actions resources that are shared
> across all projects in ASF,
> and many projects suffer from it. This issue significantly slows down the
> development cycle of
>  other projects, at least Apache Spark.
>

And the Foundation gets those build minutes for GitHub Actions provided to
us from GitHub and Microsoft, and we are thankful that they provide them to
the Foundation. Maybe it isn't all the build minutes that every group
wants, but that is what we have. So it is incumbent upon all of us to
figure out how to build more, with fewer minutes.

Say "thank you" to GitHub, please.

Regards,
-g


Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Jarek Potiuk
Just a comment here, as I also commented in the ticket.

The document
https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
gives a complete overview of where GitHub Actions stand for ASF
projects.

And we have some nice experience in Apache Airflow, likely to be shared soon,
with running our own self-hosted runners. More in this comment:
https://issues.apache.org/jira/browse/INFRA-21646?focusedCommentId=17316108&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17316108



J.




On Wed, Apr 7, 2021 at 7:24 AM Hyukjin Kwon  wrote:

> Hi all,
>
> I am an Apache Spark PMC, and would like to know the future plan about
> GitHub Actions in ASF.
> Please also see the INFRA ticket I filed:
> https://issues.apache.org/jira/browse/INFRA-21646.
>
> I am aware of the limited GitHub Actions resources that are shared
> across all projects in ASF,
> and many projects suffer from it. This issue significantly slows down the
> development cycle of
>  other projects, at least Apache Spark.
>
> How do we plan to increase the resources in GitHub Actions, and what are
> the blockers? I would appreciate any input and thoughts on this.
>
> Thank you so much.
>
> CC'ing Spark @dev  for more visibility. Please take
> it out if considered inappropriate.
>


-- 
+48 660 796 129


Re: [Discuss][SPIP] DataSource V2 SQL push down

2021-04-07 Thread Chang Chen
hi huaxin

please review https://github.com/apache/spark/pull/32061

As for adding a *trait PrunedFilteredAggregateScan* for V1 JDBC: I deleted
the trait, since the V1 DataSource needn't support aggregation push down.
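
For readers following along, here is a rough sketch of the kind of DS V2
mix-in being debated in this thread. The names and shapes are invented for
illustration, modeled on the existing SupportsPushDownFilters trait; they
match neither PR exactly.

// Hypothetical stand-in for the pushed-down representation
// (sources.Aggregation in the PRs under discussion).
case class Aggregation(
    aggregateExpressions: Seq[String],
    groupByColumns: Seq[String])

// Hypothetical mix-in: a ScanBuilder opts in to aggregate push down by
// implementing this, returning true only if it can fully evaluate the
// aggregation itself (e.g. Max/Min/Count from Parquet footer statistics,
// or an aggregate folded into a JDBC query).
trait SupportsPushDownAggregation {
  def pushAggregation(aggregation: Aggregation): Boolean
}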

On Mon, Apr 5, 2021 at 10:02 PM Chang Chen wrote:

> Hi huaxin
>
> What I am concerned about is abstraction
>
>1. How to extend sources.Aggregation. Because Catalyst Expressions
>are recursive, it is very bad to define a new hierarchy; I think ScanBuilder
>must convert pushed expressions to its own format.
>2. The optimization rule is also an extension point; I didn't see any
>consideration of join push down. I also think
>SupportsPushDownRequiredColumns and SupportsPushDownFilters are
>problematic.
>
> Obviously, File Based Source and SQL Based Source are quite different in
> push down capabilities. I am not sure they can be consolidated into one API.
>
> I will push my PR tomorrow, and after that, could we schedule a meeting to
> discuss the API?
>
> On Mon, Apr 5, 2021 at 2:24 AM huaxin gao wrote:
>
>> Hello Chang,
>>
>> Thanks for proposing the SPIP and initiating the discussion. However, I
>> think the problem with your proposal is that you haven’t taken into
>> consideration file-based data sources such as parquet, ORC, etc. As far as
>> I know, most of the Spark users have file-based data sources.  As a matter
>> of fact, I have customers waiting for Aggregate push down for Parquet.
>> That’s the reason I have my current implementation, which has a unified
>> Aggregate push down approach for both the file-based data sources and JDBC.
>>
>> I discussed with several members of the Spark community recently, and we
>> have agreed to break down the Aggregate push down work into the following
>> steps:
>>
>> 1. Implement Max, Min and Count push down in Parquet
>> 2. Add a new physical plan rewrite rule to remove partial aggregate. We
>> can optimize one more step to remove ShuffleExchange if the group by column
>> and partition col are the same.
>> 3. Implement Max, Min and Count push down in JDBC
>> 4. Implement Sum and Avg push down in JDBC
>>
>>
>> I plan to implement Aggregate push down for Parquet first for now. The
>> reasons are:
>>
>> 1. It’s relatively easier to implement Parquet Aggregate push down than
>> JDBC:
>>    1. Only need to implement Max, Min and Count.
>>    2. No need to deal with the differences between Spark and other databases.
>>    For example, aggregating decimal values has different behaviours
>>    between database implementations.
>>
>> The main point is that we want to keep the PR minimal and support the
>> basic infrastructure for Aggregate push down first. Actually, the PR for
>> implementing Parquet Aggregate push down is already very big. We don’t want
>> to have a huge PR to solve all the problems. It’s too hard to review.
>>
>>
>> 2. I think it’s too early to implement the JDBC Aggregate push down for
>> now. Underneath, V2 DS JDBC still calls the V1 DS JDBC path. If we
>> implement JDBC Aggregate push down now, we still need to add a *trait
>> PrunedFilteredAggregateScan* for V1 JDBC. One of the major
>> motivations for having V2 DS is that we want to improve the
>> flexibility of implementing new operator push down by avoiding adding a new
>> push down trait. If we still add a new pushdown trait in V1 DS JDBC, I feel
>> we are defeating the purpose of having DS V2. So I want to wait until we
>> fully migrate to DS V2 JDBC, and then implement Aggregate push down for
>> JDBC.
>>
>>
>> I have submitted Parquet Aggregate push down PR. Here is the link:
>>
>> https://github.com/apache/spark/pull/32049
>>
>>
>> Thanks,
>>
>> Huaxin
>>
>>
>> On Fri, Apr 2, 2021 at 1:04 AM Chang Chen  wrote:
>>
>>> The link is broken. I posted a PDF version.
>>>
>>> On Fri, Apr 2, 2021 at 3:57 PM Chang Chen wrote:
>>>
 Hi All

 We would like to post a SPIP of DataSource V2 SQL Push Down in Spark.
 Here is the document link:


 https://olapio.atlassian.net/wiki/spaces/TeamCX/pages/2667315361/Discuss+SQL+Data+Source+V2+SQL+Push+Down?atlOrigin=eyJpIjoiOTI5NGYzYWMzMWYwNDliOWIwM2ZkODllODk4Njk2NzEiLCJwIjoiYyJ9

 This SPIP aims to make pushdown more extendable.

 I would like to thank huaxin gao, my prototype is based on her PR. I
 will submit a PR ASAP

 Thanks

 Chang.

>>>


[VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Liang-Chi Hsieh


Please vote on releasing the following candidate as Apache Spark version
2.4.8.

The vote is open until Apr 10th at 9 AM PST and passes if a majority of +1
PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.8
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

There are currently no issues targeting 2.4.8 (try project = SPARK AND
"Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In Progress"))

The tag to be voted on is v2.4.8-rc1 (commit
53d37e4e17254c4cfee1abcb60c36f865b255046):
https://github.com/apache/spark/tree/v2.4.8-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1368/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc1-docs/

The list of bug fixes going into 2.4.8 can be found at the following URL:
https://s.apache.org/spark-v2.4.8-rc1

This release is using the release script of the tag v2.4.8-rc1.

FAQ


=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
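
For the Java/Scala route, a minimal sbt sketch (the staging repository URL is
the one listed above; the spark-sql dependency line is illustrative):

// build.sbt: resolve the 2.4.8 RC1 artifacts from the staging repository.
resolvers += "Spark 2.4.8 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1368/"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.8"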

===
What should happen to JIRA tickets still targeting 2.4.8?
===

The current list of open tickets targeted at 2.4.8 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.8

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.



