Re: Spark JIRA tags clarification and management

2018-09-06 Thread Reynold Xin
Yup I sometimes use it. I think quite a few others do. It might've been
called out in the contributor guide too.


On Thu, Sep 6, 2018 at 8:54 PM Sean Owen  wrote:

> I believe 'starter' is still the standard tag for simple issues for
> newcomers.
>
> On Thu, Sep 6, 2018 at 8:46 PM Hyukjin Kwon  wrote:
>
>> Does anyone know if we still use the starter or newbie tags as well?
>>
>


Re: Spark JIRA tags clarification and management

2018-09-06 Thread Sean Owen
I believe 'starter' is still the standard tag for simple issues for
newcomers.

On Thu, Sep 6, 2018 at 8:46 PM Hyukjin Kwon  wrote:

> Does anyone know if we still use the starter or newbie tags as well?
>


Re: Spark JIRA tags clarification and management

2018-09-06 Thread Hyukjin Kwon
Does anyone know if we still use the starter or newbie tags as well?

On Tue, Sep 4, 2018 at 10:00 PM, Kazuaki Ishizaki wrote:

> Of course, we would like to eliminate all of the following tags
>
> "flanky" or "flankytest"
>
> Kazuaki Ishizaki
>
>
>
> From: Hyukjin Kwon 
> To: dev 
> Cc: Xiao Li , Wenchen Fan <
> cloud0...@gmail.com>
> Date: 2018/09/04 14:20
> Subject: Re: Spark JIRA tags clarification and management
> --
>
>
>
> Thanks, Reynold.
>
> +Adding Xiao and Wenchen, who I've often seen using tags.
>
> Would you have some tags you think we should document more?
>
> On Tue, Sep 4, 2018 at 9:27 AM, Reynold Xin <*r...@databricks.com*> wrote:
> The most common ones we do are:
>
> releasenotes
>
> correctness
>
>
>
> On Mon, Sep 3, 2018 at 6:23 PM Hyukjin Kwon <*gurwls...@gmail.com*
> > wrote:
> Thanks, Felix and Reynold. Would you guys mind if I ask this of anyone who
> uses the tags frequently? Frankly, I don't use the tags often ..
>
> On Tue, Sep 4, 2018 at 2:04 AM, Felix Cheung <*felixcheun...@hotmail.com*> wrote:
> +1 good idea.
> There are a few for organizing but some also are critical to the release
> process, like rel note. Would be good to clarify.
>
> --
>
> *From:* Reynold Xin <*r...@databricks.com* >
> *Sent:* Sunday, September 2, 2018 11:50 PM
> *To:* Hyukjin Kwon
> *Cc:* dev
> *Subject:* Re: Spark JIRA tags clarification and management
>
> It would be great to document the common ones.
>
> On Sun, Sep 2, 2018 at 11:49 PM Hyukjin Kwon <*gurwls...@gmail.com*
> > wrote:
> Hi all,
>
> I lately noticed that tags are often used to classify JIRAs. I was thinking we
> had better explicitly document which tags are used and explain what each tag
> means. For instance, we documented "Contributing to JIRA Maintenance" at
> *https://spark.apache.org/contributing.html*
> before (thanks, Sean Owen) -
> this helps me a lot in managing JIRAs, and it gives me, at least, a good
> standard to act on.
>
> It doesn't necessarily mean we should clarify everything but it might be
> good to document tags used often.
>
> We can leave this within the committers' scope as well, if that's preferred - I
> don't have a strong opinion on this. My point is, can we clarify this in
> the contributing guide so that we can reduce the maintenance cost?
>
>
>


RE: [VOTE] SPARK 2.3.2 (RC5)

2018-09-06 Thread Sharanabasappa G Keriwaddi
Hi –

Are there any blocking issues open for 2.3.2?

2.3.1 had a few critical issues; I feel it would be better to publish 2.3.2 with 
all of those critical bug fixes.


Thanks and Regards
Sharan



From: Saisai Shao [mailto:sai.sai.s...@gmail.com]
Sent: 07 September 2018 08:30
To: 441586683 <441586...@qq.com>
Cc: dev 
Subject: Re: [VOTE] SPARK 2.3.2 (RC5)

Hi,

The PMC members asked me to hold off a bit while they're dealing with some other 
things. Please wait a bit.

Thanks
Saisai


On Thu, Sep 6, 2018 at 4:27 PM, zzc <441586...@qq.com> wrote:
Hi Saisai:
  Spark 2.4 was cut; is there any new progress on 2.3.2?



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org


Re: [VOTE] SPARK 2.3.2 (RC5)

2018-09-06 Thread Saisai Shao
Hi,

The PMC members asked me to hold off a bit while they're dealing with some other
things. Please wait a bit.

Thanks
Saisai


On Thu, Sep 6, 2018 at 4:27 PM, zzc <441586...@qq.com> wrote:

> Hi Saisai:
>   Spark 2.4 was cut; is there any new progress on 2.3.2?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: time for Apache Spark 3.0?

2018-09-06 Thread Matei Zaharia
Yes, you can start with Unstable and move to Evolving and Stable when needed. 
We’ve definitely had experimental features that changed across maintenance 
releases when they were well-isolated. If your change risks breaking stuff in 
stable components of Spark though, then it probably won’t be suitable for that.

> On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
> 
> I meant flexibility beyond the point releases. I think what Reynold was 
> suggesting was getting v2 code out more often than the point releases every 6 
> months. An Evolving API can change in point releases, but maybe we should 
> move v2 to Unstable so it can change more often? I don't really see another 
> way to get changes out more often.
> 
> On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra  wrote:
> Yes, that is why we have these annotations in the code and the corresponding 
> labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
> 
> As long as it is properly annotated, we can change or even eliminate an API 
> method before the next major release. And frankly, we shouldn't be 
> contemplating bringing in the DS v2 API (and, I'd argue, any new API) without 
> such an annotation. There is just too much risk of not getting everything 
> right before we see the results of the new API being more widely used, and 
> too much cost in maintaining until the next major release something that we 
> come to regret for us to create new API in a fully frozen state.
>  
> 
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue  wrote:
> It would be great to get more features out incrementally. For experimental 
> features, do we have more relaxed constraints?
> 
> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
> +1 on 3.0
> 
> Dsv2 stable can still evolve across major releases. DataFrame, Dataset, 
> dsv1 and a lot of other major features all were developed throughout the 1.x 
> and 2.x lines.
> 
> I do want to explore ways for us to get dsv2 incremental changes out there 
> more frequently, to get feedback. Maybe that means we apply additive changes 
> to 2.4.x; maybe that means making another 2.5 release sooner. I will start a 
> separate thread about it.
> 
> 
> 
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on timing? 
> 6 months?) but simply next. Do you mean you'd prefer that change to happen 
> before 3.x? if it's a significant change, seems reasonable for a major 
> version bump rather than minor. Is the concern that tying it to 3.0 means you 
> have to take a major version update to get it?
> 
> I generally support moving on to 3.x so we can also jettison a lot of older 
> dependencies, code, fix some long standing issues, etc.
> 
> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
> 
> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue  wrote:
> My concern is that the v2 data source API is still evolving and not very 
> close to stable. I had hoped to have stabilized the API and behaviors for a 
> 3.0 release. But we could also wait on that for a 4.0 release, depending on 
> when we think that will be.
> 
> Unless there is a pressing need to move to 3.0 for some other area, I think 
> it would be better for the v2 sources to have a 2.5 release.
> 
> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
> Yesterday, the 2.4 branch was created. Based on the above discussion, I think 
> we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
I meant flexibility beyond the point releases. I think what Reynold was
suggesting was getting v2 code out more often than the point releases every
6 months. An Evolving API can change in point releases, but maybe we should
move v2 to Unstable so it can change more often? I don't really see another
way to get changes out more often.

On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
wrote:

> Yes, that is why we have these annotations in the code and the
> corresponding labels appearing in the API documentation:
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
>
> As long as it is properly annotated, we can change or even eliminate an
> API method before the next major release. And frankly, we shouldn't be
> contemplating bringing in the DS v2 API (and, I'd argue, *any* new API)
> without such an annotation. There is just too much risk of not getting
> everything right before we see the results of the new API being more widely
> used, and too much cost in maintaining until the next major release
> something that we come to regret for us to create new API in a fully frozen
> state.
>
>
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue 
> wrote:
>
>> It would be great to get more features out incrementally. For
>> experimental features, do we have more relaxed constraints?
>>
>> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
>>
>>> +1 on 3.0
>>>
>>> Dsv2 stable can still evolve across major releases. DataFrame,
>>> Dataset, dsv1 and a lot of other major features all were developed
>>> throughout the 1.x and 2.x lines.
>>>
>>> I do want to explore ways for us to get dsv2 incremental changes out
>>> there more frequently, to get feedback. Maybe that means we apply additive
>>> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
>>> will start a separate thread about it.
>>>
>>>
>>>
>>> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
>>>
 I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
 timing? 6 months?) but simply next. Do you mean you'd prefer that change to
 happen before 3.x? if it's a significant change, seems reasonable for a
 major version bump rather than minor. Is the concern that tying it to 3.0
 means you have to take a major version update to get it?

 I generally support moving on to 3.x so we can also jettison a lot of
 older dependencies, code, fix some long standing issues, etc.

 (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)

 On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
 wrote:

> My concern is that the v2 data source API is still evolving and not
> very close to stable. I had hoped to have stabilized the API and behaviors
> for a 3.0 release. But we could also wait on that for a 4.0 release,
> depending on when we think that will be.
>
> Unless there is a pressing need to move to 3.0 for some other area, I
> think it would be better for the v2 sources to have a 2.5 release.
>
> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>
>> Yesterday, the 2.4 branch was created. Based on the above discussion,
>> I think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: time for Apache Spark 3.0?

2018-09-06 Thread Mark Hamstra
Yes, that is why we have these annotations in the code and the
corresponding labels appearing in the API documentation:
https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java

As long as it is properly annotated, we can change or even eliminate an API
method before the next major release. And frankly, we shouldn't be
contemplating bringing in the DS v2 API (and, I'd argue, *any* new API)
without such an annotation. There is just too much risk of not getting
everything right before we see the results of the new API being more widely
used, and too much cost in maintaining until the next major release
something that we come to regret for us to create new API in a fully frozen
state.
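
(For readers less familiar with these annotations, here is a minimal sketch of how
one is applied; the MyDataSourceV2 trait is a made-up placeholder of mine, and only
the InterfaceStability annotation itself comes from the file linked above.)

import org.apache.spark.annotation.InterfaceStability

// Hypothetical new API surface: the Unstable annotation marks it as free to
// change in any release until it is promoted to Evolving and then Stable.
@InterfaceStability.Unstable
trait MyDataSourceV2 {
  def createReader(options: Map[String, String]): AnyRef
}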


On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue  wrote:

> It would be great to get more features out incrementally. For experimental
> features, do we have more relaxed constraints?
>
> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
>
>> +1 on 3.0
>>
>> Dsv2 stable can still evolve across major releases. DataFrame,
>> Dataset, dsv1 and a lot of other major features all were developed
>> throughout the 1.x and 2.x lines.
>>
>> I do want to explore ways for us to get dsv2 incremental changes out
>> there more frequently, to get feedback. Maybe that means we apply additive
>> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
>> will start a separate thread about it.
>>
>>
>>
>> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
>>
>>> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
>>> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
>>> happen before 3.x? if it's a significant change, seems reasonable for a
>>> major version bump rather than minor. Is the concern that tying it to 3.0
>>> means you have to take a major version update to get it?
>>>
>>> I generally support moving on to 3.x so we can also jettison a lot of
>>> older dependencies, code, fix some long standing issues, etc.
>>>
>>> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>>>
>>> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
>>> wrote:
>>>
 My concern is that the v2 data source API is still evolving and not
 very close to stable. I had hoped to have stabilized the API and behaviors
 for a 3.0 release. But we could also wait on that for a 4.0 release,
 depending on when we think that will be.

 Unless there is a pressing need to move to 3.0 for some other area, I
 think it would be better for the v2 sources to have a 2.5 release.

 On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:

> Yesterday, the 2.4 branch was created. Based on the above discussion,
> I think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Branch 2.4 is cut

2018-09-06 Thread Dongjoon Hyun
Great for branch cut and Scala 2.12 build.

We also need to add `branch-2.4` to our Jenkins dashboard to prevent any
regression.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

Bests,
Dongjoon.


On Thu, Sep 6, 2018 at 6:56 AM Wenchen Fan  wrote:

> Good news! I'll try and update you later. Thanks!
>
> On Thu, Sep 6, 2018 at 9:44 PM Sean Owen  wrote:
>
>> BTW it does appear the Scala 2.12 build works now:
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/229/
>>
>> Let's try also producing a 2.12 build with this release. The machinery
>> should be there in the release scripts, but let me know if something fails
>> while running the release for 2.12.
>>
>> On Thu, Sep 6, 2018 at 12:32 AM Wenchen Fan  wrote:
>>
>>> Hi all,
>>>
>>> I've cut the branch-2.4 since all the major blockers are resolved. If no
>>> objections I'll shortly followup with an RC to get the QA started in
>>> parallel.
>>>
>>> Committers, please only merge PRs to branch-2.4 that are bug fixes,
>>> performance regression fixes, document changes, or test suites changes.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>


Re: data source api v2 refactoring

2018-09-06 Thread Ryan Blue
Wenchen,

I'm not really sure what you're proposing here. What is a `LogicalWrite`?
Is it something that mirrors the read side in your PR?

I think that I agree that if we have a Write independent of the Table that
carries the commit and abort methods, then we can create it directly
without a WriteConfig. So I tentatively agree with what you propose,
assuming that I understand it correctly.

rb
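
For concreteness, here is a rough, compilable rendering of the two API shapes
Wenchen sketches below; the type names and the Seq[String] used for deleteExprs
are placeholders of mine, not the actual proposal.

// Placeholder for whatever carries the commit/abort logic on the write path.
trait LogicalWrite {
  def commit(): Unit
  def abort(): Unit
}

// Shape 1: an intermediate WriteConfig sits between the table and the write.
trait WriteConfig
trait TableWithWriteConfig {
  def newAppendWriteConfig(): WriteConfig
  def newDeleteWriteConfig(deleteExprs: Seq[String]): WriteConfig
  def newLogicalWrite(config: WriteConfig): LogicalWrite
}

// Shape 2: the table hands back a LogicalWrite directly.
trait TableDirect {
  def newAppendWrite(): LogicalWrite
  def newDeleteWrite(deleteExprs: Seq[String]): LogicalWrite
}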

On Tue, Sep 4, 2018 at 8:42 PM Wenchen Fan  wrote:

> I'm switching to another Gmail account; let's see if it still gets
> dropped this time.
>
> Hi Ryan,
>
> I'm thinking about the write path and feel the abstraction should be the
> same.
>
> We still have logical and physical writes, and the table can create
> different logical writes based on how to write, e.g. append, delete,
> replaceWhere, etc.
>
> One thing I'm not sure about is the WriteConfig. With the WriteConfig, the
> API would look like
> trait Table {
>   WriteConfig newAppendWriteConfig();
>
>   WriteConfig newDeleteWriteConfig(deleteExprs);
>
>   LogicalWrite newLogicalWrite(writeConfig);
> }
>
> Without WriteConfig, the API looks like
> trait Table {
>   LogicalWrite newAppendWrite();
>
>   LogicalWrite newDeleteWrite(deleteExprs);
> }
>
>
> It looks to me that the API is simpler without WriteConfig, what do you
> think?
>
> Thanks,
> Wenchen
>
> On Wed, Sep 5, 2018 at 4:24 AM Ryan Blue 
> wrote:
>
>> Latest from Wenchen in case it was dropped.
>>
>> -- Forwarded message -
>> From: Wenchen Fan 
>> Date: Mon, Sep 3, 2018 at 6:16 AM
>> Subject: Re: data source api v2 refactoring
>> To: 
>> Cc: Ryan Blue , Reynold Xin , <
>> dev@spark.apache.org>
>>
>>
>> Hi Mridul,
>>
>> I'm not sure what's going on, my email was CC'ed to the dev list.
>>
>>
>> Hi Ryan,
>>
>> The logical and physical scan idea sounds good. To add more color
>> to Jungtaek's question, both micro-batch and continuous mode have
>> the logical and physical scan, but there is a difference: for micro-batch
>> mode, a physical scan outputs data for one epoch, but it's not true for
>> continuous mode.
>>
>> I'm not sure if it's necessary to include streaming epoch in the API
>> abstraction, for features like metrics reporting.
>>
>> On Sun, Sep 2, 2018 at 12:31 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Is it only me, or are others getting Wenchen’s mails? (Obviously
>>> Ryan did :-) )
>>> I did not see it in the mail thread I received or in archives ... [1]
>>> Wondering which other senders were getting dropped (if so).
>>>
>>> Regards
>>> Mridul
>>>
>>> [1]
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/data-source-api-v2-refactoring-td24848.html
>>>
>>>
>>> On Sat, Sep 1, 2018 at 8:58 PM Ryan Blue 
>>> wrote:
>>>
 Thanks for clarifying, Wenchen. I think that's what I expected.

 As for the abstraction, here's the way that I think about it: there are
 two important parts of a scan: the definition of what will be read, and
 task sets that actually perform the read. In batch, there's one definition
 of the scan and one task set so it makes sense that there's one scan object
 that encapsulates both of these concepts. For streaming, we need to
 separate the two into the definition of what will be read (the stream or
 streaming read) and the task sets that are run (scans). That way, the
 streaming read behaves like a factory for scans, producing scans that
 handle the data either in micro-batches or using continuous tasks.

 To address Jungtaek's question, I think that this does work with
 continuous. In continuous mode, the query operators keep running and send
 data to one another directly. The API still needs a streaming read layer
 because it may still produce more than one continuous scan. That would
 happen when the underlying source changes and Spark needs to reconfigure. I
 think the example here is when partitioning in a Kafka topic changes and
 Spark needs to re-map Kafka partitions to continuous tasks.

 rb

 On Fri, Aug 31, 2018 at 5:12 PM Wenchen Fan 
 wrote:

> Hi Ryan,
>
> Sorry, I may have used the wrong wording. The pushdown is done with ScanConfig,
> which is not table/stream/scan, but something between them. The table
> creates ScanConfigBuilder, and table creates stream/scan with ScanConfig.
> For streaming source, stream is the one to take care of the pushdown
> result. For batch source, it's the scan.
>
> It's a little tricky because stream is an abstraction for streaming
> source only. Better ideas are welcome!
>

> On Sat, Sep 1, 2018 at 7:26 AM Ryan Blue  wrote:
>
>> Thanks, Reynold!
>>
>> I think your API sketch looks great. I appreciate having the Table
>> level in the abstraction to plug into as well. I think this makes it 
>> clear
>> what everything does, particularly having the Stream level that 
>> represents
>> a configured (by ScanConfig) stream

Re: time for Apache Spark 3.0?

2018-09-06 Thread sadhen
I’d like to see an independent Spark Catalyst, without Spark Core and Hadoop 
dependencies, in Spark 3.0.


I created Enzyme (a Spark SQL-compatible SQL engine that depends on Spark 
Catalyst) at Wacai for performance reasons in a non-distributed scenario.


Enzyme is a simplified version of Spark SQL, similar to liancheng’s toy 
project https://github.com/liancheng/spear, but it aims to keep compatibility with 
Spark SQL and DataFrame, with Hive UDF support.


The implementation of Enzyme is a shameless mimic of existing code from Spark SQL. 
Besides that, I tuned it for better performance and lower memory and CPU usage.


We mainly use Enzyme to offer SQL as a DSL in our internal product for data 
analysts. People from other companies in China are interested in using Enzyme 
for ML serving. My colleagues are trying to use Enzyme in Flink Streaming 
because we can reuse our existing Hive UDFs with Enzyme.


This is my reason for making Spark Catalyst independent. We will open-source 
Enzyme in several months.


Spark Catalyst is awesome. Personally, I hope it goes beyond Spark and finally 
becomes a great alternative to Calcite.




Best Regards,
Darcy Shen


Original message
From: Xiao Li gatorsm...@gmail.com
To: vaquar khan vaquar.k...@gmail.com
Cc: Reynold Xin r...@databricks.com; Mridul Muralidharan mri...@gmail.com; Mark 
Hamstra m...@clearstorydata.com; 银狐 andyye...@gmail.com; 
user@spark.apache.org; ...@spark.apache.org
Sent: 2018-09-06 (Thu) 23:59
Subject: Re: time for Apache Spark 3.0?


Yesterday, the 2.4 branch was created. Based on the above discussion, I think 
we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?


Thanks,


Xiao


On Sat, Jun 16, 2018 at 10:21 AM, vaquar khan vaquar.k...@gmail.com wrote:

+1 for 2.4 next, followed by 3.0.  

Where can we get the Apache Spark road map for 2.4, 2.5 and 3.0?
Is it possible to share a proposed specification for future releases, the same as for past 
releases (https://spark.apache.org/releases/spark-release-2-3-0.html)?


Regards,
Viquar khan


On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan vaquar.k...@gmail.com wrote:

Please ignore the link in my last email (YouTube); not sure how it got added.
Apologies, not sure how to delete it.




On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan vaquar.k...@gmail.com wrote:

+1


https://www.youtube.com/watch?v=-ik7aJ5U6kg



Regards,
Vaquar khan


On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin r...@databricks.com wrote:

Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.




On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan mri...@gmail.com wrote:

I agree, I don't see a pressing need for a major version bump as well.
 
 
 Regards,
 Mridul
 On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra m...@clearstorydata.com wrote:
 
  Changing major version numbers is not about new features or a vague notion 
that it is time to do something that will be seen to be a significant release. 
It is about breaking stable public APIs.
 
  I still remain unconvinced that the next version can't be 2.4.0.
 
  On Fri, Jun 15, 2018 at 1:34 AM Andy andyye...@gmail.com wrote:
 
  Dear all:
 
  It has been 2 months since this topic was proposed. Any progress now? 2018 
is about half over.
 
  I agree that the new version should bring some exciting new features. How 
about this one:
 
  6. ML/DL framework to be integrated as core component and feature. (Such as 
Angel / BigDL / ……)
 
  3.0 is a very important version for a good open source project. It would be 
better to shed the historical burden and focus on new areas. Spark has 
been widely used all over the world as a successful big data framework, and it 
can be even better than that.
 
  Andy
 
 
  On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin r...@databricks.com wrote:
 
  There was a discussion thread on scala-contributors about Apache Spark not 
yet supporting Scala 2.12, and that got me to think perhaps it is about time 
for Spark to work towards the 3.0 release. By the time it comes out, it will be 
more than 2 years since Spark 2.0.
 
  For contributors less familiar with Spark’s history, I want to give more 
context on Spark releases:
 
  1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If we 
were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0 in 2018.
 
  2. Spark’s versioning policy promises that Spark does not break stable APIs 
in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a 
necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to 3.0).
 
  3. That said, a major version isn’t necessarily the playground for disruptive 
API changes to make it painful for users to update. The main purpose of a major 
release is an opportunity to fix things that are broken in the current API and 
remove certain deprecated APIs.
 
  4. Spark as a project has a culture of evolving architecture and developing 
major new features incrementally, so major releases are not the only time for 
exciting new features. For example, the bulk of the work in the move towards 
the Data

Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
It would be great to get more features out incrementally. For experimental
features, do we have more relaxed constraints?

On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:

> +1 on 3.0
>
> Dsv2 stable can still evolve across major releases. DataFrame, Dataset,
> dsv1 and a lot of other major features all were developed throughout the
> 1.x and 2.x lines.
>
> I do want to explore ways for us to get dsv2 incremental changes out there
> more frequently, to get feedback. Maybe that means we apply additive
> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
> will start a separate thread about it.
>
>
>
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
>
>> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
>> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
>> happen before 3.x? if it's a significant change, seems reasonable for a
>> major version bump rather than minor. Is the concern that tying it to 3.0
>> means you have to take a major version update to get it?
>>
>> I generally support moving on to 3.x so we can also jettison a lot of
>> older dependencies, code, fix some long standing issues, etc.
>>
>> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>>
>> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
>> wrote:
>>
>>> My concern is that the v2 data source API is still evolving and not very
>>> close to stable. I had hoped to have stabilized the API and behaviors for a
>>> 3.0 release. But we could also wait on that for a 4.0 release, depending on
>>> when we think that will be.
>>>
>>> Unless there is a pressing need to move to 3.0 for some other area, I
>>> think it would be better for the v2 sources to have a 2.5 release.
>>>
>>> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>>>
 Yesterday, the 2.4 branch was created. Based on the above discussion, I
 think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?



-- 
Ryan Blue
Software Engineer
Netflix


Re: time for Apache Spark 3.0?

2018-09-06 Thread Reynold Xin
I definitely agree we shouldn't make dsv2 stable in the next release.

On Thu, Sep 6, 2018 at 9:48 AM Ryan Blue  wrote:

> I definitely support moving to 3.0 to remove deprecations and update
> dependencies.
>
> For the v2 work, we know that there will be major API changes and
> standardization of behavior from the new logical plans going into the next
> release. I think it is a safe bet that this isn’t going to be completely
> done for the next release, so it will still be experimental or unstable for
> 3.0. I also expect that there will be some things that we want to
> deprecate. Ideally, that deprecation could happen before a major release so
> we can remove it.
>
> I don’t have a problem releasing 3.0 with an unstable v2 API or targeting
> 4.0 to remove behavior and APIs replaced by v2. But, I want to make sure we
> consider it when deciding what the next release should be.
>
> It is probably better to release 3.0 now because it isn’t clear when the
> v2 API will become stable. And if we choose to release 3.0 next, we should
> *not* aim to stabilize v2 for that release. Not that we shouldn’t try to
> make it stable as soon as possible, I just think that it is unlikely to
> happen in time and we should not rush to claim it is stable.
>
> rb
>
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
>
>> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
>> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
>> happen before 3.x? if it's a significant change, seems reasonable for a
>> major version bump rather than minor. Is the concern that tying it to 3.0
>> means you have to take a major version update to get it?
>>
>> I generally support moving on to 3.x so we can also jettison a lot of
>> older dependencies, code, fix some long standing issues, etc.
>>
>> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>>
>> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
>> wrote:
>>
>>> My concern is that the v2 data source API is still evolving and not very
>>> close to stable. I had hoped to have stabilized the API and behaviors for a
>>> 3.0 release. But we could also wait on that for a 4.0 release, depending on
>>> when we think that will be.
>>>
>>> Unless there is a pressing need to move to 3.0 for some other area, I
>>> think it would be better for the v2 sources to have a 2.5 release.
>>>
>>> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>>>
 Yesterday, the 2.4 branch was created. Based on the above discussion, I
 think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?


>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
I definitely support moving to 3.0 to remove deprecations and update
dependencies.

For the v2 work, we know that there will be major API changes and
standardization of behavior from the new logical plans going into the next
release. I think it is a safe bet that this isn’t going to be completely
done for the next release, so it will still be experimental or unstable for
3.0. I also expect that there will be some things that we want to
deprecate. Ideally, that deprecation could happen before a major release so
we can remove it.

I don’t have a problem releasing 3.0 with an unstable v2 API or targeting
4.0 to remove behavior and APIs replaced by v2. But, I want to make sure we
consider it when deciding what the next release should be.

It is probably better to release 3.0 now because it isn’t clear when the v2
API will become stable. And if we choose to release 3.0 next, we should
*not* aim to stabilize v2 for that release. Not that we shouldn’t try to
make it stable as soon as possible, I just think that it is unlikely to
happen in time and we should not rush to claim it is stable.

rb

On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:

> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
> happen before 3.x? if it's a significant change, seems reasonable for a
> major version bump rather than minor. Is the concern that tying it to 3.0
> means you have to take a major version update to get it?
>
> I generally support moving on to 3.x so we can also jettison a lot of
> older dependencies, code, fix some long standing issues, etc.
>
> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>
> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
> wrote:
>
>> My concern is that the v2 data source API is still evolving and not very
>> close to stable. I had hoped to have stabilized the API and behaviors for a
>> 3.0 release. But we could also wait on that for a 4.0 release, depending on
>> when we think that will be.
>>
>> Unless there is a pressing need to move to 3.0 for some other area, I
>> think it would be better for the v2 sources to have a 2.5 release.
>>
>> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>>
>>> Yesterday, the 2.4 branch was created. Based on the above discussion, I
>>> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: time for Apache Spark 3.0?

2018-09-06 Thread Reynold Xin
+1 on 3.0

Dsv2 stable can still evolve across major releases. DataFrame, Dataset,
dsv1 and a lot of other major features all were developed throughout the
1.x and 2.x lines.

I do want to explore ways for us to get dsv2 incremental changes out there
more frequently, to get feedback. Maybe that means we apply additive
changes to 2.4.x; maybe that means making another 2.5 release sooner. I
will start a separate thread about it.



On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:

> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
> happen before 3.x? if it's a significant change, seems reasonable for a
> major version bump rather than minor. Is the concern that tying it to 3.0
> means you have to take a major version update to get it?
>
> I generally support moving on to 3.x so we can also jettison a lot of
> older dependencies, code, fix some long standing issues, etc.
>
> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>
> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
> wrote:
>
>> My concern is that the v2 data source API is still evolving and not very
>> close to stable. I had hoped to have stabilized the API and behaviors for a
>> 3.0 release. But we could also wait on that for a 4.0 release, depending on
>> when we think that will be.
>>
>> Unless there is a pressing need to move to 3.0 for some other area, I
>> think it would be better for the v2 sources to have a 2.5 release.
>>
>> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>>
>>> Yesterday, the 2.4 branch was created. Based on the above discussion, I
>>> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>>>
>>>


Re: time for Apache Spark 3.0?

2018-09-06 Thread Sean Owen
I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
timing? 6 months?) but simply next. Do you mean you'd prefer that change to
happen before 3.x? if it's a significant change, seems reasonable for a
major version bump rather than minor. Is the concern that tying it to 3.0
means you have to take a major version update to get it?

I generally support moving on to 3.x so we can also jettison a lot of older
dependencies, code, fix some long standing issues, etc.

(BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)

On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue  wrote:

> My concern is that the v2 data source API is still evolving and not very
> close to stable. I had hoped to have stabilized the API and behaviors for a
> 3.0 release. But we could also wait on that for a 4.0 release, depending on
> when we think that will be.
>
> Unless there is a pressing need to move to 3.0 for some other area, I
> think it would be better for the v2 sources to have a 2.5 release.
>
> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>
>> Yesterday, the 2.4 branch was created. Based on the above discussion, I
>> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>>
>>


Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
My concern is that the v2 data source API is still evolving and not very
close to stable. I had hoped to have stabilized the API and behaviors for a
3.0 release. But we could also wait on that for a 4.0 release, depending on
when we think that will be.

Unless there is a pressing need to move to 3.0 for some other area, I think
it would be better for the v2 sources to have a 2.5 release.

On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:

> Yesterday, the 2.4 branch was created. Based on the above discussion, I
> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>
> Thanks,
>
> Xiao
>
On Sat, Jun 16, 2018 at 10:21 AM, vaquar khan  wrote:
>
>> +1  for 2.4 next, followed by 3.0.
>>
>> Where can we get the Apache Spark road map for 2.4, 2.5 and 3.0?
>> Is it possible to share a proposed specification for future releases, the same
>> as for past releases (
>> https://spark.apache.org/releases/spark-release-2-3-0.html)
>> Regards,
>> Viquar khan
>>
>> On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan 
>> wrote:
>>
>>> Please ignore the link in my last email (YouTube); not sure how it got added.
>>> Apologies, not sure how to delete it.
>>>
>>>
>>> On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan 
>>> wrote:
>>>
 +1

 https://www.youtube.com/watch?v=-ik7aJ5U6kg

 Regards,
 Vaquar khan

 On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin 
 wrote:

> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>
>
> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
> wrote:
>
>> I agree, I don't see a pressing need for a major version bump as well.
>>
>>
>> Regards,
>> Mridul
>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <
>> m...@clearstorydata.com> wrote:
>> >
>> > Changing major version numbers is not about new features or a vague
>> notion that it is time to do something that will be seen to be a
>> significant release. It is about breaking stable public APIs.
>> >
>> > I still remain unconvinced that the next version can't be 2.4.0.
>> >
>> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
>> >>
>> >> Dear all:
>> >>
>> >> It has been 2 months since this topic was proposed. Any
>> progress now? 2018 is about half over.
>> >>
>> >> I agree that the new version should bring some exciting new
>> features. How about this one:
>> >>
>> >> 6. ML/DL framework to be integrated as core component and feature.
>> (Such as Angel / BigDL / ……)
>> >>
>> >> 3.0 is a very important version for a good open source project.
>> It should be better to drift away the historical burden and focus in new
>> area. Spark has been widely used all over the world as a successful big
>> data framework. And it can be better than that.
>> >>
>> >> Andy
>> >>
>> >>
>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin 
>> wrote:
>> >>>
>> >>> There was a discussion thread on scala-contributors about Apache
>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it 
>> is
>> about time for Spark to work towards the 3.0 release. By the time it 
>> comes
>> out, it will be more than 2 years since Spark 2.0.
>> >>>
>> >>> For contributors less familiar with Spark’s history, I want to
>> give more context on Spark releases:
>> >>>
>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>> Spark 3.0 in 2018.
>> >>>
>> >>> 2. Spark’s versioning policy promises that Spark does not break
>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 
>> to
>> 2.0, 2.x to 3.0).
>> >>>
>> >>> 3. That said, a major version isn’t necessarily the playground
>> for disruptive API changes to make it painful for users to update. The 
>> main
>> purpose of a major release is an opportunity to fix things that are 
>> broken
>> in the current API and remove certain deprecated APIs.
>> >>>
>> >>> 4. Spark as a project has a culture of evolving architecture and
>> developing major new features incrementally, so major releases are not 
>> the
>> only time for exciting new features. For example, the bulk of the work in
>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>> Processing was introduced in Spark 2.3. Both were feature releases rather
>> than major releases.
>> >>>
>> >>>
>> >>> You can find more background in the thread discussing Spark 2.0:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>> >>>
>> >>>
>> >>> The primary motivating factor IMO for a major version bump is to
>> support Scala 2.12, which requires minor API breaking change

Re: time for Apache Spark 3.0?

2018-09-06 Thread Xiao Li
Yesterday, the 2.4 branch was created. Based on the above discussion, I
think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?

Thanks,

Xiao

On Sat, Jun 16, 2018 at 10:21 AM, vaquar khan  wrote:

> +1  for 2.4 next, followed by 3.0.
>
> Where can we get the Apache Spark road map for 2.4, 2.5 and 3.0?
> Is it possible to share a proposed specification for future releases, the same
> as for past releases (https://spark.apache.org/releases/spark-release-2-3-0.html
> )
> Regards,
> Viquar khan
>
> On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan 
> wrote:
>
>> Please ignore the link in my last email (YouTube); not sure how it got added.
>> Apologies, not sure how to delete it.
>>
>>
>> On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan 
>> wrote:
>>
>>> +1
>>>
>>> https://www.youtube.com/watch?v=-ik7aJ5U6kg
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin 
>>> wrote:
>>>
 Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.


 On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
 wrote:

> I agree, I don't see a pressing need for a major version bump as well.
>
>
> Regards,
> Mridul
> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
> wrote:
> >
> > Changing major version numbers is not about new features or a vague
> notion that it is time to do something that will be seen to be a
> significant release. It is about breaking stable public APIs.
> >
> > I still remain unconvinced that the next version can't be 2.4.0.
> >
> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
> >>
> >> Dear all:
> >>
> >> It has been 2 months since this topic was proposed. Any progress
> now? 2018 is about half over.
> >>
> >> I agree that the new version should bring some exciting new
> features. How about this one:
> >>
> >> 6. ML/DL framework to be integrated as core component and feature.
> (Such as Angel / BigDL / ……)
> >>
> >> 3.0 is a very important version for a good open source project. It
> should be better to drift away the historical burden and focus in new 
> area.
> Spark has been widely used all over the world as a successful big data
> framework. And it can be better than that.
> >>
> >> Andy
> >>
> >>
> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin 
> wrote:
> >>>
> >>> There was a discussion thread on scala-contributors about Apache
> Spark not yet supporting Scala 2.12, and that got me to think perhaps it 
> is
> about time for Spark to work towards the 3.0 release. By the time it comes
> out, it will be more than 2 years since Spark 2.0.
> >>>
> >>> For contributors less familiar with Spark’s history, I want to
> give more context on Spark releases:
> >>>
> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
> Spark 3.0 in 2018.
> >>>
> >>> 2. Spark’s versioning policy promises that Spark does not break
> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
> 2.0, 2.x to 3.0).
> >>>
> >>> 3. That said, a major version isn’t necessarily the playground for
> disruptive API changes to make it painful for users to update. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs.
> >>>
> >>> 4. Spark as a project has a culture of evolving architecture and
> developing major new features incrementally, so major releases are not the
> only time for exciting new features. For example, the bulk of the work in
> the move towards the DataFrame API was done in Spark 1.3, and Continuous
> Processing was introduced in Spark 2.3. Both were feature releases rather
> than major releases.
> >>>
> >>>
> >>> You can find more background in the thread discussing Spark 2.0:
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
> >>>
> >>>
> >>> The primary motivating factor IMO for a major version bump is to
> support Scala 2.12, which requires minor API breaking changes to Spark’s
> APIs. Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for 
> consideration):
> >>>
> >>> 1. Support Scala 2.12.
> >>>
> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated
> in Spark 2.x.
> >>>
> >>> 3. Shade all dependencies.
> >>>
> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>>>

Re: python test infrastructure

2018-09-06 Thread Imran Rashid
>
On Wed, Sep 5, 2018 at 11:59 PM Hyukjin Kwon  wrote:

> >

> > > 1. all of the output in target/test-reports & python/unit-tests.log
should be included in the jenkins archived artifacts.

> >

> > Hmmm, I thought they are already archived (
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95734/artifact/target/unit-tests.log
).

> > FWIW, unit-tests.log is pretty messy and it is shown when specific
tests are broken currently.

Ah, I guess I was looking in the wrong place for the unit-tests.log.  Agree
it's messy; could we do something like adding the headers in SparkFunSuite?

And there is still the target/test-reports output that is not getting
archived.

> > 2. That test output needs to be separated by python executable.  It
seems to me that right now if you run python/run-tests with multiple
python-executables, you get separate test output (because each output file
includes a timestamp), but you can't tell which python version was used.

> >

> > It wouldn't be difficult. I can make the changes if they are necessary;
however, I still think it's rather minor since logs are shown when some
tests are broken.

I think it's useful even when things are successful.  I use builds on
jenkins all the time so I can compare my runs with a known successful run.
It would be great if on jenkins I could find test-reports for the exact
python version I am testing against locally.

> > 3. the test output should be incorporated into jenkins test output, so
its easier to see which test is failing, which tests are run, test trends,
etc.  Along with the above, that means the tests should be prefixed (or
something) with the python executable in the reports so you can track test
results for each executable.  (it seems this was done at one point by
SPARK-11295, but for whatever reason, doesn't seem to work anymore.)

> >

> > Yea, I have taken a look for organising logs stuff before (for instance
https://github.com/apache/spark/pull/21107) but not for this idea itself. I
agree with this idea in general.

https://issues.apache.org/jira/browse/SPARK-25359


> On Thu, Sep 6, 2018 at 5:41 AM, Imran Rashid wrote:
>
>> one more: seems like python/run-tests should have an option at least to
>> not bail at the first failure:
>> https://github.com/apache/spark/blob/master/python/run-tests.py#L113-L132
>>
>> this is particularly annoying with flaky tests -- since the rest of the
>> tests aren't run, you don't know whether you *only* had a failure in that
>> flaky test, or if there was some other real failure as well.
>>
>> On Wed, Sep 5, 2018 at 1:31 PM Imran Rashid  wrote:
>>
>>> Hi all,
>>>
>>> More pyspark noob questions from me.  I find it really hard to figure
>>> out what versions of python I should be testing and what is tested
>>> upstream.  While I'd like to just know the answers to those questions, more
>>> importantly I'd like to make sure that info is visible somewhere so all
>>> devs can figure it out themselves.  I think we should have:
>>>
>>> 1. all of the output in target/test-reports & python/unit-tests.log
>>> should be included in the jenkins archived artifacts.
>>>
>>> 2. That test output needs to be separated by python executable.  It
>>> seems to me that right now if you run python/run-tests with multiple
>>> python-executables, you get separate test output (because each output file
>>> includes a timestamp), but you can't tell which python version was used.
>>>
>>> 3. the test output should be incorporated into jenkins test output, so
>>> its easier to see which test is failing, which tests are run, test trends,
>>> etc.  Along with the above, that means the tests should be prefixed (or
>>> something) with the python executable in the reports so you can track test
>>> results for each executable.  (it seems this was done at one point by
>>> SPARK-11295, but for whatever reason, doesn't seem to work anymore.)
>>>
>>> if we had these features as part of the regular testing infrastructure,
>>> I think it would make it easier for everyone to understand what was
>>> happening in the current pyspark tests and to compare their own local tests
>>> with them.
>>>
>>> thoughts?  is this covered somewhere that I don't know about?
>>>
>>> thanks,
>>> Imran
>>>
>>


Re: Branch 2.4 is cut

2018-09-06 Thread Wenchen Fan
Good news! I'll try and update you later. Thanks!

On Thu, Sep 6, 2018 at 9:44 PM Sean Owen  wrote:

> BTW it does appear the Scala 2.12 build works now:
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/229/
>
> Let's try also producing a 2.12 build with this release. The machinery
> should be there in the release scripts, but let me know if something fails
> while running the release for 2.12.
>
> On Thu, Sep 6, 2018 at 12:32 AM Wenchen Fan  wrote:
>
>> Hi all,
>>
>> I've cut the branch-2.4 since all the major blockers are resolved. If no
>> objections I'll shortly followup with an RC to get the QA started in
>> parallel.
>>
>> Committers, please only merge PRs to branch-2.4 that are bug fixes,
>> performance regression fixes, document changes, or test suites changes.
>>
>> Thanks,
>> Wenchen
>>
>


Re: Datasource v2 Select Into support

2018-09-06 Thread Wenchen Fan
Data source v2 catalog support (table/view) is still in progress. There are
several threads on the dev list discussing it; please join the discussion
if you are interested. Thanks for trying!

On Thu, Sep 6, 2018 at 7:23 PM Ross Lawley  wrote:

> Hi,
>
> I hope this is the correct mailing list. I've been adding v2 support to the
> MongoDB Spark connector using Spark 2.3.1.  I've noticed one of my tests
> pass when using the original DefaultSource but errors with my v2
> implementation:
>
> The code I'm running is:
> val df = spark.loadDS[Character]()
> df.createOrReplaceTempView("people")
> spark.sql("INSERT INTO table people SELECT 'Mort', 1000")
>
> The error I see is:
> unresolved operator 'InsertIntoTable DataSourceV2Relation [name#0, age#1],
> MongoDataSourceReader ...
> 'InsertIntoTable DataSourceV2Relation [name#0, age#1],
> MongoDataSourceReader 
> +- Project [Mort AS Mort#7, 1000 AS 1000#8]
>+- OneRowRelation
>
> My DefaultSource V2 implementation extends DataSourceV2 with ReadSupport
> with ReadSupportWithSchema with WriteSupport
>
> I'm wondering if there is something I'm not implementing, or if there is a
> bug in my implementation or its an issue with Spark?
>
> Any pointers would be great,
>
> Ross
>


Re: Branch 2.4 is cut

2018-09-06 Thread Sean Owen
BTW it does appear the Scala 2.12 build works now:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/229/

Let's try also producing a 2.12 build with this release. The machinery
should be there in the release scripts, but let me know if something fails
while running the release for 2.12.

On Thu, Sep 6, 2018 at 12:32 AM Wenchen Fan  wrote:

> Hi all,
>
> I've cut the branch-2.4 since all the major blockers are resolved. If no
> objections I'll shortly followup with an RC to get the QA started in
> parallel.
>
> Committers, please only merge PRs to branch-2.4 that are bug fixes,
> performance regression fixes, document changes, or test suites changes.
>
> Thanks,
> Wenchen
>


Pool Information Details cannot be accessed from HistoryServer UI

2018-09-06 Thread Sandeep Katta
[image: image.png]

But from the HistoryServer, the same application page throws an
exception: *Unknown
pool*

The code which throws the exception:


// For now, pool information is only accessible in live UIs
val pool = parent.sc.flatMap(_.getPoolForName(poolName)).getOrElse {
  throw new IllegalArgumentException(s"Unknown pool: $poolName")
}

As per the code, it is clear the HistoryServer does not have a SparkContext, so it
can’t get the pool details.

*Do you think this should be supported in the HistoryServer, or is this
valid behaviour?*


Regards

Sandeep Katta


Datasource v2 Select Into support

2018-09-06 Thread Ross Lawley
Hi,

I hope this is the correct mailing list. I've been adding v2 support to the
MongoDB Spark connector using Spark 2.3.1.  I've noticed one of my tests
pass when using the original DefaultSource but errors with my v2
implementation:

The code I'm running is:
val df = spark.loadDS[Character]()
df.createOrReplaceTempView("people")
spark.sql("INSERT INTO table people SELECT 'Mort', 1000")

The error I see is:
unresolved operator 'InsertIntoTable DataSourceV2Relation [name#0, age#1],
MongoDataSourceReader ...
'InsertIntoTable DataSourceV2Relation [name#0, age#1],
MongoDataSourceReader 
+- Project [Mort AS Mort#7, 1000 AS 1000#8]
   +- OneRowRelation

My DefaultSource V2 implementation extends DataSourceV2 with ReadSupport
with ReadSupportWithSchema with WriteSupport

I'm wondering if there is something I'm not implementing, or if there is a
bug in my implementation or its an issue with Spark?

Any pointers would be great,

Ross


How to parallelize JDBC Read in Spark

2018-09-06 Thread Chetan Khatri
Hello Dev Users,

I am struggling to parallelize a JDBC read in Spark. It is using only 1-2 tasks
to read the data, and taking a very long time.

Ex.

val invoiceLineItemDF = spark.read.jdbc(
  url = t360jdbcURL,
  table = invoiceLineItemQuery,
  columnName = "INVOICE_LINE_ITEM_ID",
  lowerBound = 1L,
  upperBound = 100L,
  numPartitions = 200,
  connectionProperties = connectionProperties
)
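
(For context on how these options behave: Spark turns columnName, lowerBound,
upperBound and numPartitions into range predicates on the partition column, so
if the real INVOICE_LINE_ITEM_ID values mostly fall outside 1..100, nearly every
row lands in a single partition. Below is a minimal, hedged sketch of deriving
the bounds from the database first; the table name invoice_line_item, the JDBC
URL and the empty Properties are stand-in assumptions, not the code above.)

import java.util.Properties
import org.apache.spark.sql.SparkSession

object JdbcBoundsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-bounds-sketch").getOrCreate()

    // Stand-ins for the values used in the snippet above (assumptions).
    val t360jdbcURL = "jdbc:postgresql://dbhost:5432/t360"
    val connectionProperties = new Properties()

    // Push MIN/MAX down to the database so only one tiny row comes back.
    val bounds = spark.read.jdbc(
      t360jdbcURL,
      "(SELECT MIN(INVOICE_LINE_ITEM_ID) AS lo, MAX(INVOICE_LINE_ITEM_ID) AS hi FROM invoice_line_item) b",
      connectionProperties).head()
    val lo = bounds.getAs[Number](0).longValue()
    val hi = bounds.getAs[Number](1).longValue()

    // With bounds that match the real id range, the 200 partitions each get a
    // slice of rows instead of almost everything falling into one stride.
    val invoiceLineItemDF = spark.read.jdbc(
      url = t360jdbcURL,
      table = "invoice_line_item",
      columnName = "INVOICE_LINE_ITEM_ID",
      lowerBound = lo,
      upperBound = hi,
      numPartitions = 200,
      connectionProperties = connectionProperties)

    println(invoiceLineItemDF.rdd.getNumPartitions)
    spark.stop()
  }
}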


Thanks


Re: Branch 2.4 is cut

2018-09-06 Thread Hyukjin Kwon
Thanks, Wenchen.

On Thu, Sep 6, 2018 at 3:32 PM, Wenchen Fan wrote:

> Hi all,
>
> I've cut the branch-2.4 since all the major blockers are resolved. If no
> objections I'll shortly followup with an RC to get the QA started in
> parallel.
>
> Committers, please only merge PRs to branch-2.4 that are bug fixes,
> performance regression fixes, document changes, or test suites changes.
>
> Thanks,
> Wenchen
>


Re: [VOTE] SPARK 2.3.2 (RC5)

2018-09-06 Thread zzc
Hi Saisai:
  Spark 2.4 was cut; is there any new progress on 2.3.2?



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Branch 2.4 is cut

2018-09-06 Thread Wenchen Fan
Hi all,

I've cut the branch-2.4 since all the major blockers are resolved. If no
objections I'll shortly followup with an RC to get the QA started in
parallel.

Committers, please only merge PRs to branch-2.4 that are bug fixes,
performance regression fixes, document changes, or test suites changes.

Thanks,
Wenchen