Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-25 Thread
I'm sorry to post -1 on this, since there is a non-trivial correctness
issue that I believe we should fix in 2.3.

TL;DR of the issue: a certain pattern of shuffle+repartition in a query
may produce wrong results if a downstream stage fails and triggers a retry
of the repartition. The root cause is that the current implementation of
`repartition()` doesn't generate deterministic output. The JIRA ticket:
https://issues.apache.org/jira/browse/SPARK-23207

This is NOT a regression, but since it's a non-trivial correctness issue,
we'd better ship the patch along with 2.3.
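
To make the pattern concrete, here is a minimal sketch of the kind of job
that can hit it (paths and numbers are made up; this is not the
reproduction from the JIRA):

import org.apache.spark.sql.SparkSession

// Sketch only: a shuffle followed by repartition(). Because repartition()
// round-robins rows non-deterministically, a retry of only some downstream
// tasks can see a different row-to-partition assignment than the original
// run, so records can be duplicated or lost in the final output.
object RepartitionRetrySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("repartition-retry-sketch").getOrCreate()

    val df = spark.read.parquet("/path/to/input")   // hypothetical input with a "key" column
    val shuffled = df.groupBy("key").count()        // stage 1: shuffle
    val repartitioned = shuffled.repartition(200)   // stage 2: round-robin repartition
    repartitioned.write.parquet("/path/to/output")  // a fetch failure + retry downstream
                                                    // is where the wrong result can appear
    spark.stop()
  }
}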

2018-01-24 11:42 GMT-08:00 Marcelo Vanzin :

> Given that the bugs I was worried about have been dealt with, I'm
> upgrading to +1.
>
> On Mon, Jan 22, 2018 at 5:09 PM, Marcelo Vanzin 
> wrote:
> > +0
> >
> > Signatures check out. Code compiles, although I see the errors in [1]
> > when untarring the source archive; perhaps we should add "use GNU tar"
> > to the RM checklist?
> >
> > Also ran our internal tests and they seem happy.
> >
> > My concern is the list of open bugs targeted at 2.3.0 (ignoring the
> > documentation ones). It is not long, but it seems some of those need
> > to be looked at. It would be nice for the committers who are involved
> > in those bugs to take a look.
> >
> > [1] https://superuser.com/questions/318809/linux-os-x-tar-incompatibility-tarballs-created-on-os-x-give-errors-when-unt
> >
> >
> > On Mon, Jan 22, 2018 at 1:36 PM, Sameer Agarwal 
> wrote:
> >> Please vote on releasing the following candidate as Apache Spark version
> >> 2.3.0. The vote is open until Friday, January 26, 2018 at 8:00:00 am UTC and
> >> passes if a majority of at least 3 PMC +1 votes are cast.
> >>
> >>
> >> [ ] +1 Release this package as Apache Spark 2.3.0
> >>
> >> [ ] -1 Do not release this package because ...
> >>
> >>
> >> To learn more about Apache Spark, please see https://spark.apache.org/
> >>
> >> The tag to be voted on is v2.3.0-rc2:
> >> https://github.com/apache/spark/tree/v2.3.0-rc2
> >> (489ecb0ef23e5d9b705e5e5bae4fa3d871bdac91)
> >>
> >> List of JIRA tickets resolved in this release can be found here:
> >> https://issues.apache.org/jira/projects/SPARK/versions/12339551
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-bin/
> >>
> >> Release artifacts are signed with the following key:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1262/
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-docs/_site/index.html
> >>
> >>
> >> FAQ
> >>
> >> ===
> >> What are the unresolved issues targeted for 2.3.0?
> >> ===
> >>
> >> Please see https://s.apache.org/oXKi. At the time of writing, there are
> >> currently no known release blockers.
> >>
> >> =
> >> How can I help test this release?
> >> =
> >>
> >> If you are a Spark user, you can help us test this release by taking an
> >> existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark, you can set up a virtual env, install the
> >> current RC, and see if anything important breaks. In Java/Scala, you can
> >> add the staging repository to your project's resolvers and test with the RC
> >> (make sure to clean up the artifact cache before/after so you don't end up
> >> building with an out-of-date RC going forward).
> >>
> >> ===
> >> What should happen to JIRA tickets still targeting 2.3.0?
> >> ===
> >>
> >> Committers should look at those and triage. Extremely important bug fixes,
> >> documentation, and API tweaks that impact compatibility should be worked on
> >> immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
> >> appropriate.
> >>
> >> ===
> >> Why is my bug not fixed?
> >> ===
> >>
> >> In order to make timely releases, we will typically not hold the release
> >> unless the bug in question is a regression from 2.2.0. That being said, if
> >> there is something which is a regression from 2.2.0 and has not been
> >> correctly targeted please ping me or a committer to help target the issue
> >> (you can see the open issues listed as impacting Spark 2.3.0 at
> >> https://s.apache.org/WmoI).
> >>
> >>
> >> Regards,
> >> Sameer
> >
> >
> >
> > --
> > Marcelo
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
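
As a concrete illustration of the "add the staging repository to your
project's resolvers" step quoted above, a minimal sbt sketch; the staging
URL is the one from the RC e-mail, while the artifact coordinates and
version string are assumptions:

// build.sbt fragment (sketch): point the resolver at the RC staging repository.
// The "2.3.0" version string and the spark-sql dependency are assumed, not
// taken from the RC announcement.
resolvers += "Apache Spark 2.3.0 RC2 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1262/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"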


Re: Kubernetes backend and docker images

2018-01-05 Thread
Agreed, it would be nice to have this simplification, and users can still
create their custom images by copying/modifying the default one.
Thanks for bringing this up, Marcelo!

2018-01-05 17:06 GMT-08:00 Marcelo Vanzin :

> Hey all, especially those working on the k8s stuff.
>
> Currently we have 3 docker images that need to be built and provided
> by the user when starting a Spark app: driver, executor, and init
> container.
>
> When the initial review went by, I asked why we need 3, and I was
> told that's because they have different entry points. That never
> really convinced me, but well, everybody wanted to get things in to
> get the ball rolling.
>
> But I still think that's not the best way to go. I did some pretty
> simple hacking and got things to work with a single image:
>
> https://github.com/vanzin/spark/commit/k8s-img
>
> Is there a reason why that approach would not work? You could still
> create separate images for driver and executor if wanted, but there's
> no reason I can see why we should need 3 images for the simple case.
>
> Note that the code there can still be cleaned up, and I don't love the
> idea of using env variables to propagate arguments to the container,
> but that works for now.
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread
+1


On Thursday, September 7, 2017 at 12:04 PM, Reynold Xin wrote:

> +1 as well
>
> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust 
> wrote:
>
>> +1
>>
>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thanks for making the updates reflected in the current PR. It would be
>>> great to see the doc updated before it is finally published though.
>>>
>>> Right now it feels like this SPIP is focused more on getting the basics
>>> right for what many datasources are already doing in API V1 combined with
>>> other private APIs, vs pushing forward state of the art for performance.
>>>
>>> I think that’s the right approach for this SPIP. We can add the support
>>> you’re talking about later with a more specific plan that doesn’t block
>>> fixing the problems that this addresses.
>>> ​
>>>
>>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
>>> hvanhov...@databricks.com> wrote:
>>>
 +1 (binding)

 I personally believe that there is quite a big difference between
 having a generic data source interface with a low surface area and pushing
 down a significant part of query processing into a datasource. The latter
 has a much wider surface area and will require us to stabilize most of
 the internal catalyst API's which will be a significant burden on the
 community to maintain and has the potential to slow development velocity
 significantly. If you want to write such integrations then you should be
 prepared to work with catalyst internals and own up to the fact that things
 might change across minor versions (and in some cases even maintenance
 releases). If you are willing to go down that road, then your best bet is
 to use the already existing spark session extensions which will allow you
 to write such integrations and can be used as an `escape hatch`.


 On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash 
 wrote:

> +0 (non-binding)
>
> I think there are benefits to unifying all the Spark-internal
> datasources into a common public API for sure.  It will serve as a forcing
> function to ensure that those internal datasources aren't advantaged vs
> datasources developed externally as plugins to Spark, and that all Spark
> features are available to all datasources.
>
> But I also think this read-path proposal avoids the more difficult
> questions around how to continue pushing datasource performance forwards.
> James Baker (my colleague) had a number of questions about advanced
> pushdowns (combined sorting and filtering), and Reynold also noted that
> pushdown of aggregates and joins is desirable on longer timeframes as
> well.  The Spark community saw similar requests: aggregate pushdown in
> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
> in SPARK-12449.  Clearly a number of people are interested in this kind of
> performance work for datasources.
>
> To leave enough space for datasource developers to continue
> experimenting with advanced interactions between Spark and their
> datasources, I'd propose we leave some sort of escape valve that enables
> these datasources to keep pushing the boundaries without forking Spark.
> Possibly that looks like an additional unsupported/unstable interface that
> pushes down an entire (unstable API) logical plan, which is expected to
> break API on every release.  (Spark attempts this full-plan pushdown, and
> if that fails Spark ignores it and continues on with the rest of the V2 API
> for compatibility.)  Or maybe it looks like something else that we don't
> know of yet.  Possibly this falls outside of the desired goals for the V2
> API and instead should be a separate SPIP.
>
> If we had a plan for this kind of escape valve for advanced datasource
> developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
> focused more on getting the basics right for what many datasources are
> already doing in API V1 combined with other private APIs, vs pushing
> forward state of the art for performance.
>
> Andrew
>
> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
> suresh.thalam...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>>
>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan  wrote:
>>
>> Hi all,
>>
>> In the previous discussion, we decided to split the read and write
>> path of data source v2 into 2 SPIPs, and I'm sending this email to call a
>> vote for Data Source V2 read path only.
>>
>> The full document of the Data Source API V2 is:
>>
>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>
>> The ready-for-review PR that implements the basic 
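
For reference, here is a minimal sketch of the session-extensions "escape
hatch" Herman mentions above, wired up with a hypothetical no-op optimizer
rule standing in for a real integration (anything real would be touching
catalyst internals, as he notes):

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical no-op rule: a real integration would transform the plan here.
case class NoOpRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

object ExtensionsSketch {
  def main(args: Array[String]): Unit = {
    // Register the rule through SparkSessionExtensions when building the session.
    val withRule: SparkSessionExtensions => Unit =
      ext => ext.injectOptimizerRule(session => NoOpRule(session))

    val spark = SparkSession.builder()
      .appName("extensions-sketch")
      .withExtensions(withRule)
      .getOrCreate()

    spark.stop()
  }
}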

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread
Congrats!

On Monday, August 28, 2017 at 7:11 PM, Takeshi Yamamuro wrote:

> Congrats!
>
> On Tue, Aug 29, 2017 at 11:04 AM, zhichao  wrote:
>
>> Congratulations, Jerry!
>>
>> On Tue, Aug 29, 2017 at 9:57 AM, Weiqing Yang 
>> wrote:
>>
>>> Congratulations, Jerry!
>>>
>>> On Mon, Aug 28, 2017 at 6:44 PM, Yanbo Liang  wrote:
>>>
 Congratulations, Jerry.

 On Tue, Aug 29, 2017 at 9:42 AM, John Deng  wrote:

>
> Congratulations, Jerry !
>
> On 8/29/2017 09:28, Matei Zaharia  wrote:
>
> Hi everyone,
>
> The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai 
> has been contributing to many areas of the project for a long time, so 
> it’s great to see him join. Join me in thanking and congratulating him!
>
>
> Matei
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-28 Thread
+1 (Non-binding)

On Monday, August 28, 2017 at 5:38 PM, Xiao Li wrote:

> +1
>
> 2017-08-28 12:45 GMT-07:00 Cody Koeninger :
>
>> Just wanted to point out that because the jira isn't labeled SPIP, it
>> won't have shown up linked from
>>
>> http://spark.apache.org/improvement-proposals.html
>>
>> On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan  wrote:
>> > Hi all,
>> >
>> > It has been almost 2 weeks since I proposed the data source V2 for
>> > discussion, and we already got some feedback on the JIRA ticket and the
>> > prototype PR, so I'd like to call for a vote.
>> >
>> > The full document of the Data Source API V2 is:
>> >
>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>> >
>> > Note that this vote should focus on the high-level design/framework, not
>> > specific APIs, as we can always change/improve specific APIs during
>> > development.
>> >
>> > The vote will be up for the next 72 hours. Please reply with your vote:
>> >
>> > +1: Yeah, let's go forward and implement the SPIP.
>> > +0: Don't really care.
>> > -1: I don't think this is a good idea because of the following technical
>> > reasons.
>> >
>> > Thanks!
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread
+1 (non-binding)

On Thursday, August 17, 2017 at 9:05 PM, Wenchen Fan wrote:

> adding my own +1 (binding)
>
> On Thu, Aug 17, 2017 at 9:02 PM, Wenchen Fan  wrote:
>
>> Hi all,
>>
>> Following the SPIP process, I'm putting this SPIP up for a vote.
>>
>> The current data source API doesn't work well because of some limitations:
>> no partitioning/bucketing support, no columnar reads, and difficulty
>> supporting more operator push down.
>>
>> I'm proposing a Data Source API V2 to address these problems, please read
>> the full document at
>> https://issues.apache.org/jira/secure/attachment/12882332/SPIP%20Data%20Source%20API%20V2.pdf
>>
>> Since this SPIP is mostly about APIs, I also created a prototype and put
>> java docs on these interfaces, so that it's easier to review these
>> interfaces and discuss: https://github.com/cloud-fan/spark/pull/10/files
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following
>> technical reasons.
>>
>> Thanks!
>>
>
>
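
To make the "operator push down" limitation above concrete, here is a purely
illustrative Scala sketch, not the API proposed in the document, of how
optional mix-in traits can let a source advertise pushdown capabilities:

// Illustrative only: all names here are hypothetical, not the proposed V2 API.
trait DataReader {
  def schema: Seq[String]
  def scan(): Iterator[Map[String, Any]]
}

// Optional capability: the source prunes columns itself (columnar-friendly).
trait SupportsColumnPruning { self: DataReader =>
  def pruneColumns(required: Seq[String]): Unit
}

// Optional capability: the source evaluates simple predicates itself and
// returns the ones it could not handle, which the engine then evaluates.
trait SupportsFilterPushdown { self: DataReader =>
  def pushFilters(filters: Seq[String]): Seq[String]
}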


Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-07 Thread
Congratulations, Hyukjin and Sameer!

2017-08-07 23:57 GMT+08:00 :

> Congrats!
>
> On 2017/08/08 at 0:55, Bai, Dave  wrote:
>
> > Congrats, leveled up!=)
> >
> >> On 8/7/17, 10:53 AM, "Matei Zaharia"  wrote:
> >>
> >> Hi everyone,
> >>
> >> The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal as
> >> committers. Join me in congratulating both of them and thanking them for
> >> their contributions to the project!
> >>
> >> Matei
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [SQL] Syntax "case when" isn't supported in JOIN

2017-07-17 Thread
FYI, there has been a related discussion here:
https://github.com/apache/spark/pull/15417#discussion_r85295977
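
For reference, here is a minimal DataFrame-API sketch of the workaround
discussed further down in this thread (materializing the non-deterministic
expression in a projection before the join); the table and column names
follow the SQL example below, and everything else is assumed:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, rand}

// Sketch: evaluate the non-deterministic salt in a Project *before* the join,
// so the join condition itself only references plain columns. The salt
// expression is simplified relative to the SQL in the thread.
object JoinSkewWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-skew-workaround").getOrCreate()

    val tbl1 = spark.table("tbl1") // assumed columns: col1, col2
    val tbl2 = spark.table("tbl2") // assumed column:  col3

    // Step 1: replace NULL join keys with a random salt in a projection.
    val salted = tbl1.withColumn(
      "join_key",
      coalesce(col("col2"), (rand(9) * 1000).cast("string")))

    // Step 2: join on the precomputed, now-deterministic column.
    val result = salted
      .join(tbl2, salted("join_key") === tbl2("col3"), "left_outer")
      .select(salted("col1"))

    result.show()
    spark.stop()
  }
}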

2017-07-17 15:44 GMT+08:00 Chang Chen :

> Hi All
>
> I don't understand the difference between the semantics; I found Spark
> does the same thing for non-deterministic GroupBy. From a Map-Reduce point of
> view, Join is also GroupBy in essence.
>
> @Liang Chi Hsieh
> 
>
> In which situations will the semantics be changed?
>
> Thanks
> Chang
>
> On Mon, Jul 17, 2017 at 3:29 PM, Liang-Chi Hsieh  wrote:
>
>>
>> Thinking about it more, I think it changes the semantics only under certain
>> scenarios.
>>
>> For the example SQL query shown in the previous discussion, the semantics
>> look the same.
>>
>>
>> Xiao Li wrote
>> > If the join condition is non-deterministic, pushing it down to the
>> > underlying project will change the semantics. Thus, we are unable to do it
>> > in PullOutNondeterministic. Users can do it manually if they do not care
>> > about the semantics difference.
>> >
>> > Thanks,
>> >
>> > Xiao
>> >
>> >
>> >
>> > 2017-07-16 20:07 GMT-07:00 Chang Chen :
>> >
>> >> It is tedious since we have lots of Hive SQL being migrated to Spark.
>> >> And this workaround is equivalent to inserting a Project between the Join
>> >> operator and its child.
>> >>
>> >> Why not do it in PullOutNondeterministic?
>> >>
>> >> Thanks
>> >> Chang
>> >>
>> >>
>> >> On Fri, Jul 14, 2017 at 5:29 PM, Liang-Chi Hsieh  wrote:
>> >>
>> >>>
>> >>> A possible workaround is to add the rand column into tbl1 with a
>> >>> projection
>> >>> before the join.
>> >>>
>> >>> SELECT a.col1
>> >>> FROM (
>> >>>   SELECT col1,
>> >>>     CASE
>> >>>       WHEN col2 IS NULL
>> >>>         THEN cast(rand(9)*1000 - 99 as string)
>> >>>       ELSE
>> >>>         col2
>> >>>     END AS col2
>> >>>   FROM tbl1) a
>> >>> LEFT OUTER JOIN tbl2 b
>> >>> ON a.col2 = b.col3;
>> >>>
>> >>>
>> >>>
>> >>> Chang Chen wrote
>> >>> > Hi Wenchen
>> >>> >
>> >>> > Yes. We also found this error is caused by rand(). However, this is a
>> >>> > classic way to solve data skew in Hive. Is there any equivalent way in
>> >>> > Spark?
>> >>> >
>> >>> > Thanks
>> >>> > Chang
>> >>> >
>> >>> > On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan  wrote:
>> >>> >
>> >>> >> It’s not about case when, but about rand(). Non-deterministic
>> >>> >> expressions are not allowed in join conditions.
>> >>> >>
>> >>> >> > On 13 Jul 2017, at 6:43 PM, wangshuang  wrote:
>> >>> >> >
>> >>> >> > I'm trying to execute Hive SQL on Spark SQL (also on the Spark
>> >>> >> > Thrift Server). To mitigate data skew, we use "case when" to handle
>> >>> >> > nulls. A simple example SQL query follows:
>> >>> >> >
>> >>> >> >
>> >>> >> > SELECT a.col1
>> >>> >> > FROM tbl1 a
>> >>> >> > LEFT OUTER JOIN tbl2 b
>> >>> >> > ON
>> >>> >> > * CASE
>> >>> >> >   WHEN a.col2 IS NULL
>> >>> >> >   THEN cast(rand(9)*1000 - 99 as string)
>> >>> >> >   ELSE
>> >>> >> >   a.col2 END *
>> >>> >> >   = b.col3;
>> >>> >> >
>> >>> >> >
>> >>> >> > But I get the error:
>> >>> >> >
>> >>> >> > == Physical Plan ==
>> >>> >> > *org.apache.spark.sql.AnalysisException: nondeterministic expressions
>> >>> >> > are only allowed in Project, Filter, Aggregate or Window, found:*
>> >>> >> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS
>> >>> >> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END =
>> >>> >> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>> >>> >> > (c.`cur_flag` = 1))
>> >>> >> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>> >>> >> > (cast(nav_tcd#26 as int) = 9)) && (cur_flag#77 = 1))
>> >>> >> >   ;;
>> >>> >> > GlobalLimit 10
>> >>> >> > +- LocalLimit 10
>> >>> >> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
>> >>> >> > (cast(19596 as string),cast(20134 as string),cast(10997 as string)) &&
>> >>> >> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END],
>> >>> >> > [date_id#7]
>> >>> >> >  +- Filter (date_id#7 = 2017-07-12)
>> >>> >> > +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> >>> >> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
>> >>> >> > string) ELSE nav_tcdt#25 END = site_categ_id#80) &&
>> >>> >> > (cast(nav_tcd#26 as int) = 9)) &&