ASF board report draft for August

2019-08-12 Thread Matei Zaharia
Hi all,

It’s time to submit our quarterly report to the ASF board again this Wednesday. 
Here is my draft about what’s new — feel free to suggest changes.



Apache Spark is a fast and general engine for large-scale data processing. It
offers high-level APIs in Java, Scala, Python and R as well as a rich set of
libraries including stream processing, machine learning, and graph analytics.

Project status:

- Discussions about our next feature release, which will likely be Spark 3.0,
  are continuing on the dev and user mailing lists. Some key questions include
  whether to remove various deprecated APIs, and which minimum versions of
  Java, Python, Scala, etc to support. There are also a number of new features
  targeting this release. We encourage everyone in the community to give
  feedback on these discussions through our mailing lists or issue tracker.

- We announced a plan to stop supporting Python 2 in our next major release,
  as many other projects in the Python ecosystem are now dropping support
  (https://spark.apache.org/news/plan-for-dropping-python-2-support.html).

- We added three new PMC members to the project in May: Takuya Ueshin,
  Jerry Shao and Hyukjin Kwon.

- There is an ongoing discussion on our dev list about whether to consider
  adding project committers who do not contribute to the code or docs in the
  project, and what the criteria might be for those. (Note that the project does
  solicit committers who only work on docs, and has also added committers
  who work on other tasks, like maintaining our build infrastructure).

Trademarks:

- We are continuing engagement with various organizations.

Latest releases:

- May 8th, 2019: Spark 2.4.3
- April 23rd, 2019: Spark 2.4.2
- March 31st, 2019: Spark 2.4.1
- Feb 15th, 2019: Spark 2.3.3

Committers and PMC:

- The latest committer was added on Jan 29th, 2019 (Jose Torres).
- The latest PMC members were added on May 21st, 2019 (Jerry Shao,
  Takuya Ueshin and Hyukjin Kwon).



Re: displaying "Test build" in PR

2019-08-12 Thread Shane Knapp
When you create a PR, the Jenkins pull request builder job polls every ~5
minutes and will trigger builds based on PR creation, approval to test, code
updates, etc.
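If you want to check sooner than the polling cycle, a minimal sketch like the
one below could query GitHub's commit-status API for your PR's head commit and
see whether any CI status has been posted yet. This is only an illustration:
the repository name and commit SHA are placeholders, and it assumes the pull
request builder reports its results as GitHub commit statuses.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object CheckBuildStatus {
  def main(args: Array[String]): Unit = {
    val repo = "apache/spark"                        // placeholder repository
    val sha  = args.headOption.getOrElse("HEAD_SHA") // placeholder: your PR's head commit
    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"https://api.github.com/repos/$repo/commits/$sha/statuses"))
      .header("Accept", "application/vnd.github+json")
      .build()
    val body = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
      .body()
    // An empty JSON array means no CI status has been posted for this commit
    // yet, i.e. the pull request builder has not picked up the PR.
    println(if (body.trim == "[]") "no build status reported yet" else body)
  }
}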

On Mon, Aug 12, 2019 at 11:25 AM Younggyu Chun wrote:

> Hi All,
>
> I have a quick question about PRs. Once I create a PR, I'm not able to see
> whether the "Test build" is being processed; I can only see it a few minutes
> or hours later. Is it possible to see whether the "Test build" is being
> processed right after the PR is created?
>
> Thank you,
> Younggyu Chun
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


displaying "Test build" in PR

2019-08-12 Thread Younggyu Chun
Hi All,

I have a quick question about PRs. Once I create a PR, I'm not able to see
whether the "Test build" is being processed; I can only see it a few minutes
or hours later. Is it possible to see whether the "Test build" is being
processed right after the PR is created?

Thank you,
Younggyu Chun


Re: [SPARK-23207] Repro

2019-08-12 Thread Yuanjian Li
Hi Tyson,

Thanks for the report!
I reproduced this locally based on your code, with some changes that keep only
the job producing the wrong answer. The code is below:

import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).cache.repartition(239).map { x =>
  if (TaskContext.get.attemptNumber == 0 &&
      TaskContext.get.partitionId < 1 &&
      TaskContext.get.stageAttemptNumber == 0) {
    throw new Exception("pkill -f -n java".!!)
  }
  x
}
val r2 = df.distinct.count()

I think the reason for the wrong answer is that, in CachedRDDBuilder, we fail
to propagate the `isOrderSensitive` characteristic to the newly created
MapPartitionsRDD. JIRA created:
https://issues.apache.org/jira/browse/SPARK-28699.
The fix will be based on Wenchen's work in SPARK-23243. Currently, we make the
job fail when we find an indeterminate stage retry. Feel free to review.
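
For readers following along, here is a minimal, Spark-free sketch of why
recomputing only part of an order-indeterminate stage can change the answer.
This is my own illustration, not code from Spark or from the fix; it assumes a
simplified round-robin repartitioning model in which the partition a row lands
in depends on the order rows arrive in, so a retried task can place rows
differently than the original attempt.

import scala.util.Random

object IndeterminateRetrySketch {
  // Round-robin assignment: a row's partition depends on its position in the
  // input order, not on the row's value.
  def roundRobin(rows: Seq[Int], numPartitions: Int): Map[Int, Seq[Int]] =
    rows.zipWithIndex
      .groupBy { case (_, i) => i % numPartitions }
      .map { case (p, xs) => p -> xs.map(_._1) }

  def main(args: Array[String]): Unit = {
    val rows = (0 until 12).toSeq
    val firstAttempt = roundRobin(rows, 3)                 // original task order
    val retryAttempt = roundRobin(Random.shuffle(rows), 3) // recomputed in a different order

    // Suppose only partition 0 was lost and recomputed: combining the retried
    // partition 0 with the surviving partitions 1 and 2 from the first attempt
    // can miss some rows and double-count others.
    val observed = retryAttempt(0) ++ firstAttempt(1) ++ firstAttempt(2)
    println(s"distinct rows after partial retry: ${observed.distinct.size} (expected ${rows.size})")
  }
}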

Support for rerunning the indeterminate stage in Spark will be done after
SPARK-25341. If you need indeterminate-stage handling after a cache operation
right now, you can test on this branch.

Best,
Yuanjian

Wenchen Fan wrote on Mon, Aug 12, 2019 at 8:19 PM:

> Hi Tyson,
>
> Thanks for reporting it! I quickly checked the related scheduler code but
> can't find an obvious place that can go wrong with cached RDD.
>
> Sean said that he can't reproduce it, but the second job fails. This is
> actually expected. We need a lot more changes to completely fix this
> problem, so currently the fix is to fail the job if the scheduler needs to
> retry an indeterminate shuffle map stage.
>
> It would be great to know if we can reproduce this bug with the master
> branch.
>
> Thanks,
> Wenchen
>
> On Sun, Aug 11, 2019 at 7:22 AM Xiao Li  wrote:
>
>> Hi, Tyson,
>>
>> Could you open a new JIRA with correctness label? SPARK-23207 might not
>> cover all the scenarios, especially when you are using cache.
>>
>> Cheers,
>>
>> Xiao
>>
>> On Fri, Aug 9, 2019 at 9:26 AM  wrote:
>>
>>> Hi Sean,
>>>
>>> To finish the job, I did need to set spark.stage.maxConsecutiveAttempts
>>> to a large number e.g., 100; a suggestion from Jiang Xingbo.
>>>
>>> I haven't seen any recent movement/PRs on this issue, but I'll see if we
>>> can repro with a more recent version of Spark.
>>>
>>> Best regards,
>>> Tyson
>>>
>>> -Original Message-
>>> From: Sean Owen 
>>> Sent: Friday, August 9, 2019 7:49 AM
>>> To: tcon...@gmail.com
>>> Cc: dev 
>>> Subject: Re: [SPARK-23207] Repro
>>>
>>> Interesting but I'd put this on the JIRA, and also test vs master first.
>>> It's entirely possible this is something else that was subsequently fixed,
>>> and maybe even backported for 2.4.4.
>>> (I can't quite reproduce it - just makes the second job fail, which is
>>> also puzzling)
>>>
>>> On Fri, Aug 9, 2019 at 8:11 AM  wrote:
>>> >
>>> > Hi,
>>> >
>>> >
>>> >
>>> > We are able to reproduce this bug in Spark 2.4 using the following
>>> program:
>>> >
>>> >
>>> >
>>> > import scala.sys.process._
>>> >
>>> > import org.apache.spark.TaskContext
>>> >
>>> >
>>> >
>>> > val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000,
>>> > x)}.repartition(20)
>>> >
>>> > res.distinct.count
>>> >
>>> >
>>> >
>>> > // kill an executor in the stage that performs repartition(239)
>>> >
>>> > val df = res.repartition(113).cache.repartition(239).map { x =>
>>> >
>>> >   if (TaskContext.get.attemptNumber == 0 &&
>>> > TaskContext.get.partitionId < 1) {
>>> >
>>> > throw new Exception("pkill -f java".!!)
>>> >
>>> >   }
>>> >
>>> >   x
>>> >
>>> > }
>>> >
>>> > df.distinct.count()
>>> >
>>> >
>>> >
>>> > The first df.distinct.count correctly produces 1
>>> >
>>> > The second df.distinct.count incorrectly produces 9769
>>> >
>>> >
>>> >
>>> > If the cache step is removed then the bug does not reproduce.
>>> >
>>> >
>>> >
>>> > Best regards,
>>> >
>>> > Tyson
>>> >
>>> >
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>


Re: [SPARK-23207] Repro

2019-08-12 Thread Wenchen Fan
Hi Tyson,

Thanks for reporting it! I quickly checked the related scheduler code but
can't find an obvious place that can go wrong with cached RDD.

Sean said that he can't reproduce it, but the second job fails. This is
actually expected. We need a lot more changes to completely fix this
problem, so currently the fix is to fail the job if the scheduler needs to
retry an indeterminate shuffle map stage.

It would be great to know if we can reproduce this bug with the master
branch.

Thanks,
Wenchen

On Sun, Aug 11, 2019 at 7:22 AM Xiao Li  wrote:

> Hi, Tyson,
>
> Could you open a new JIRA with correctness label? SPARK-23207 might not
> cover all the scenarios, especially when you are using cache.
>
> Cheers,
>
> Xiao
>
> On Fri, Aug 9, 2019 at 9:26 AM  wrote:
>
>> Hi Sean,
>>
>> To finish the job, I did need to set spark.stage.maxConsecutiveAttempts
>> to a large number e.g., 100; a suggestion from Jiang Xingbo.
>>
>> I haven't seen any recent movement/PRs on this issue, but I'll see if we
>> can repro with a more recent version of Spark.
>>
>> Best regards,
>> Tyson
>>
>> -Original Message-
>> From: Sean Owen 
>> Sent: Friday, August 9, 2019 7:49 AM
>> To: tcon...@gmail.com
>> Cc: dev 
>> Subject: Re: [SPARK-23207] Repro
>>
>> Interesting but I'd put this on the JIRA, and also test vs master first.
>> It's entirely possible this is something else that was subsequently fixed,
>> and maybe even backported for 2.4.4.
>> (I can't quite reproduce it - just makes the second job fail, which is
>> also puzzling)
>>
>> On Fri, Aug 9, 2019 at 8:11 AM  wrote:
>> >
>> > Hi,
>> >
>> >
>> >
>> > We are able to reproduce this bug in Spark 2.4 using the following
>> program:
>> >
>> >
>> >
>> > import scala.sys.process._
>> >
>> > import org.apache.spark.TaskContext
>> >
>> >
>> >
>> > val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000,
>> > x)}.repartition(20)
>> >
>> > res.distinct.count
>> >
>> >
>> >
>> > // kill an executor in the stage that performs repartition(239)
>> >
>> > val df = res.repartition(113).cache.repartition(239).map { x =>
>> >
>> >   if (TaskContext.get.attemptNumber == 0 &&
>> > TaskContext.get.partitionId < 1) {
>> >
>> > throw new Exception("pkill -f java".!!)
>> >
>> >   }
>> >
>> >   x
>> >
>> > }
>> >
>> > df.distinct.count()
>> >
>> >
>> >
>> > The first df.distinct.count correctly produces 1
>> >
>> > The second df.distinct.count incorrectly produces 9769
>> >
>> >
>> >
>> > If the cache step is removed then the bug does not reproduce.
>> >
>> >
>> >
>> > Best regards,
>> >
>> > Tyson
>> >
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>