Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-25 Thread Kazuaki Ishizaki
Hi Xiao,
Thank you very much for becoming a shepherd.
If you feel the discussion has settled, we would appreciate it if you would 
start a vote.

Regards,
Kazuaki Ishizaki



From:   Xiao Li 
To: Kazuaki Ishizaki 
Cc: dev , Takeshi Yamamuro 

Date:   2018/10/22 16:31
Subject: Re: SPIP: SPARK-25728 Structured Intermediate 
Representation (Tungsten IR) for generating Java code



Hi, Kazuaki, 

Thanks for your great SPIP! I am willing to be the shepherd of this SPIP. 

Cheers,

Xiao


On Mon, Oct 22, 2018 at 12:05 AM Kazuaki Ishizaki  
wrote:
Hi Yamamuro-san,
Thank you for your comments. This SPIP has received several valuable comments 
and feedback on the Google Doc: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing
I hope that this SPIP can go forward based on this feedback.

Based on the SPIP procedure 
(http://spark.apache.org/improvement-proposals.html), may I ask one or more 
PMC members to become a shepherd of this SPIP?
I would appreciate your kindness and cooperation.

Best Regards,
Kazuaki Ishizaki



From:    Takeshi Yamamuro 
To:      Spark dev list 
Cc:      ishiz...@jp.ibm.com
Date:    2018/10/15 12:12
Subject: Re: SPIP: SPARK-25728 Structured Intermediate 
Representation (Tungsten IR) for generating Java code



Hi, ishizaki-san,

Cool activity, I left some comments on the doc.

best,
takeshi


On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki  
wrote:
Hello community,

I am writing this e-mail to start a discussion about adding a structured 
intermediate representation for generating Java code from programs using the 
DataFrame or Dataset API, in addition to the current String-based 
representation.
This addition is based on the discussions in a thread at 
https://github.com/apache/spark/pull/21537#issuecomment-413268196

Please feel free to comment on the JIRA ticket or Google Doc.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728
Google Doc: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing


Looking forward to hearing your feedback.

Best Regards,
Kazuaki Ishizaki


-- 
---
Takeshi Yamamuro




Stream Stream joins with update and complete mode

2018-10-25 Thread sandeep_katta
As per the documentation
(http://spark.apache.org/docs/2.3.2/structured-streaming-programming-guide.html#stream-stream-joins),
only Append output mode is supported:

*As of Spark 2.3, you can use joins only when the query is in Append output
mode. Other output modes are not yet supported.*

But as per the code, no check is done for the output mode:

// the output mode check is missing here (UnsupportedOperationChecker.scala)
case LeftOuter =>
  if (!left.isStreaming && right.isStreaming) {
    throwError("Left outer join with a streaming DataFrame/Dataset " +
      "on the right and a static DataFrame/Dataset on the left is not supported")
  } else if (left.isStreaming && right.isStreaming) {
    val watermarkInJoinKeys =
      StreamingJoinHelper.isWatermarkInJoinKeys(subPlan)

    val hasValidWatermarkRange =
      StreamingJoinHelper.getStateValueWatermark(
        left.outputSet, right.outputSet, condition, Some(100)).isDefined

    if (!watermarkInJoinKeys && !hasValidWatermarkRange) {
      throwError("Stream-stream outer join between two streaming DataFrame/Datasets " +
        "is not supported without a watermark in the join keys, or a watermark on " +
        "the nullable side and an appropriate range condition")
    }
  }
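
For reference, a guard tied to the output mode might look roughly like the
sketch below. This is only an illustration of the idea, not the actual Spark
code; `outputMode` and `throwError` are assumed to be in scope, as they are in
UnsupportedOperationChecker.checkForStreaming, and the exact wording and
placement would be decided in the PR.

// Hypothetical check, not present in the current code: reject stream-stream
// joins unless the query runs in Append output mode, matching the docs.
if (left.isStreaming && right.isStreaming &&
    outputMode != org.apache.spark.sql.streaming.OutputMode.Append()) {
  throwError("Stream-stream joins are only supported in Append output mode")
}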

If the documentation is correct, then I can raise the PR to fix the code





-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[DISCUSS] Support decimals with negative scale in decimal operation

2018-10-25 Thread Marco Gaido
Hi all,

a bit more than one month ago, I sent a proposal for properly handling
decimals with negative scales in our operations. This is a long-standing
problem in our codebase, as we derived our rules from Hive and SQL Server,
where negative scales are forbidden, while in Spark they are not.
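
For context, a negative scale means the stored digits are multiplied by a
positive power of ten (value = unscaledValue * 10^(-scale)). A minimal
JVM-level sketch, not Spark-specific:

// java.math.BigDecimal accepts negative scales directly:
val d = new java.math.BigDecimal(java.math.BigInteger.valueOf(123), -2)
println(d)            // prints 1.23E+4, i.e. 12300
println(d.precision)  // 3
println(d.scale)      // -2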

The discussion has been stale for a while now. No more comments on the
design doc:
https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit#heading=h.x7062zmkubwm
.

So I am writing this e-mail in order to check whether there are more
comments on it or we can go ahead with the PR.

Thanks,
Marco


Re: What's a blocker?

2018-10-25 Thread Tom Graves
So, just to clarify a few things in case people didn't read the entire thread 
in the PR: the discussion is about what the criteria for a blocker are, and my 
concerns are really about what people are using as criteria for not marking a 
JIRA as a blocker.

The only thing we have documented about marking a JIRA as a blocker is for 
correctness issues: http://spark.apache.org/contributing.html. And really, I 
think that is about initially marking it as a blocker to bring attention to it. 
The final decision as to whether something is a blocker is up to the PMC, who 
vote on whether a release passes. I think it would be impossible to properly 
define what a blocker is with strict rules.

Personally, from this thread I would like to make sure committers and PMC 
members aren't saying something is a blocker for reasons other than the actual 
impact the JIRA has, and if it's at all in question, it should be brought to 
the PMC's attention for a vote. I agree with others that if it comes up during 
an RC, it should be discussed on the RC thread.

A few specific things that were said that I disagree with are:
  - It's not a blocker because it was also an issue in the last release 
(meaning feature release), i.e. the bug was introduced in 2.2 and now we are 
doing 2.4, so it's automatically not a blocker. This to me is just wrong. Lots 
of things are not found immediately, or aren't reported immediately. Now, I do 
believe the timeframe it's been in there affects the decision on the impact, 
but making the decision on this alone is, to me, too strict.
  - Committers and PMC members should not be saying it's not a blocker because 
they personally or their company doesn't care about this feature or API, or 
state that the Spark project as a whole doesn't care about this feature, unless 
that was specifically voted on at the project level. They need to follow the 
API compatibility we have documented. This is really a broader issue than just 
marking a JIRA; it goes to anything checked in, and perhaps needs to be a 
separate thread.

For the verbiage of what a regression is, it seems like that should be defined 
by our versioning documents. They state what we do in maintenance, feature, and 
major releases (http://spark.apache.org/versioning-policy.html); if it's not 
defined by that, we probably need to clarify. There was a good example we might 
want to clarify about things like Scala or Java compatibility in feature 
releases.

Obviously this is my opinion, and it's here for everyone to discuss and come to 
a consensus on.
Tom
On Wednesday, October 24, 2018, 2:09:49 PM CDT, Sean Owen 
 wrote:  
 
 Shifting this to dev@. See the PR https://github.com/apache/spark/pull/22144 
for more context.
There will be no objective, complete definition of blocker, or even regression 
or correctness issue. Many cases are clear, some are not. We can draw up more 
guidelines, and feel free to open PRs against the 'contributing' doc. But in 
general these are the same consensus-driven decisions we negotiate all the time.
What isn't said that should be is that there is a cost to not releasing. Keep 
in mind we have, also, decided on a 'release train' cadence. That does properly 
change the calculus about what's a blocker; the right decision could change 
within even a week.

I wouldn't mind some verbiage around what a regression is. Since the last minor 
release?
We can VOTE on anything we like, but we already VOTE on the release. Weirdly, 
technically, the release vote criteria is simple majority, FWIW: 
http://www.apache.org/legal/release-policy.html#release-approval 
Yes, actually, it is only the PMC's votes that literally matter. Those votes 
are, surely, based on input from others too. But that is actually working as 
intended.

Let's understand statements like "X is not a blocker" to mean "I don't think 
that X is a blocker". Interpretations not proclamations, backed up by reasons, 
not all of which are appeals to policy and precedent.
I find it hard to argue about these in the abstract, because I believe it's 
already widely agreed, and written down in ASF policy, that nobody makes 
decisions unilaterally. Done, yes. 
Practically speaking, the urgent issue is the 2.4 release. I don't see process 
failures here that need fixing or debate. I do think those outstanding issues 
merit technical discussion. The outcome will be a tradeoff of some subjective 
issues, not read off of a policy sheet, and will entail tradeoffs. Let's speak 
freely about those technical issues and try to find the consensus position.

On Wed, Oct 24, 2018 at 12:21 PM Mark Hamstra  wrote:


Thanks @tgravescs for your latest posts -- they've saved me from posting 
something similar in many respects but more strongly worded.

What is bothering me (not just in the discussion of this PR, but more broadly) 
is that we have individuals making declarative statements about whether 
something can or can't block a release, or that something "is not that 
important to Spark at this point", etc. -- things for which the

Re: What's a blocker?

2018-10-25 Thread Sean Owen
What does "PMC members aren't saying its a block for reasons other then the
actual impact the jira has" mean that isn't already widely agreed? Likewise
"Committers and PMC members should not be saying its not a blocker because
they personally or their company doesn't care about this feature or api".
It sounds like insinuation, and I'd rather make it explicit -- call out the
bad actions -- or keep it to observable technical issues.

Likewise one could say there's a problem just because A thinks X should be
a blocker and B disagrees. I see no bad faith, process problem, or obvious
errors. Do you? I see disagreement, and it's tempting to suspect motives. I
have seen what I think are actual bad-faith decisions in the past in this
project, too. I don't see it here though and want to stick to 'now'.

(Aside: the implication is that those representing vendors are
steam-rolling a release. Actually, the cynical incentives cut the other way
here. Blessing the latest changes as OSS Apache Spark is predominantly
beneficial to users of OSS, not distros. In fact, it forces distros to make
changes. And broadly, vendors have much more accountability for quality of
releases, because they're paid to.)


I'm still not sure what specifically the objection is to what here? I
understand a lot is in flight and nobody agrees with every decision made,
but, what else is new?
Concretely: the release is held again to fix a few issues, in the end. For
the map_filter issue, that seems like the right call, and there are a few
other important issues that could be quickly fixed too. All is well there,
yes?

This has surfaced some implicit reasoning about releases that we could make
explicit, like:

(Sure, if you want to write down things like, release blockers should be
decided in the interests of the project by the PMC, OK)

We have a time-based release schedule, so time matters. There is an
opportunity cost to not releasing. The bar for blockers goes up over time.

Not all regressions are blockers. Would you hold a release over a trivial
regression? but then which must or should block? There's no objective
answer, but a reasonable rule is: non-trivial regressions from minor
release x.y to x.{y+1} block releases. Regressions from x.{y-1} to x.{y+1}
should, but not necessarily, block the release. We try hard to avoid
regressions in x.y.0 releases because these are generally consumed by
aggressive upgraders, on x.{y-1}.z now. If a bug exists in x.{y-1}, they're
not affected or worked around it. The cautious upgrader goes from maybe
x.{y-2}.z to x.y.1 later. They're affected, but not before, maybe, a
maintenance release. A crude argument, and it's not an argument that
regressions are OK. It's an argument that 'old' regressions matter less.
And maybe it's reasonable to draw the "must" vs "should" line between them.



On Thu, Oct 25, 2018 at 8:51 AM Tom Graves  wrote:

> So just to clarify a few things in case people didn't read the entire
> thread in the PR, the discussion is what is the criteria for a blocker and
> really my concerns are what people are using as criteria for not marking a
> jira as a blocker.
>
> The only thing we have documented to mark a jira as a blocker is for
> correctness issues: http://spark.apache.org/contributing.html.  And
> really I think that is initially mark it as a blocker to bring attention to
> it.
> The final decision as to whether something is a blocker is up to the PMC
> who votes on whether a release passes.  I think it would be impossible to
> properly define what a blocker is with strict rules.
>
> Personally from this thread I would like to make sure committers and PMC
> members aren't saying its a block for reasons other then the actual impact
> the jira has and if its at all in question it should be brought to the
> PMC's attention for a vote.  I agree with others that if its during an RC
> it should be talked about on the RC thread.
>
> A few specific things that were said that I disagree with are:
>- its not a blocker because it was also an issue in the last release
> (meaning feature release).  ie the bug was introduced in 2.2 and now we are
> doing 2.4 so its automatically not a blocker.  This to me is just wrong.
> Lots of things are not found immediately, or aren't reported immediately.
>  Now I do believe the timeframe its been in there does affect the decision
> on the impact but just making the decision on this to me is to strict.
>- Committers and PMC members should not be saying its not a blocker
> because they personally or their company doesn't care about this feature or
> api, or state that the Spark project as a whole doesn't care about this
> feature unless that was specifically voted on at the project level. They
> need to follow the api compatibility we have documented. This is really a
> broader issue then just marking a jira, it goes to anything checked in and
> perhaps need to be a separate thread.
>
>
> For the verbiage of what a regression is, it seems like that shou

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-25 Thread Dongjoon Hyun
Thank you for the decision, All.

As of now, to unblock this, it seems that we are trying to remove them from
the function registry.

https://github.com/apache/spark/pull/22821

One problem here is that users can simply recover those functions like this:

scala> 
spark.sessionState.functionRegistry.createOrReplaceTempFunction("map_filter",
x => org.apache.spark.sql.catalyst.expressions.MapFilter(x(0),x(1)))


Technically, the PR looks like a compromise to unblock the release while
still allowing some users to use that feature fully.

At first glance, I thought this was a workaround that ignores the discussion
context. But it sounds like one of the practical options for Apache Spark.
(We had the Spark 2.0 Tech. Preview before.)

I want to finalize the decision on `map_filter` (and related three
functions) issue. Are we good to go with
https://github.com/apache/spark/pull/22821?

Bests,
Dongjoon.

PS. Also, there is a PR to completely remove them, too.
   https://github.com/cloud-fan/spark/pull/11


On Wed, Oct 24, 2018 at 10:14 PM Xiao Li  wrote:

> @Dongjoon Hyun   Thanks! This is a blocking
> ticket. It returns a wrong result due to our undefined behavior. I agree we
> should revert the newly added map-oriented functions. In 3.0 release, we
> need to define the behavior of duplicate keys in the data type MAP and fix
> all the related issues that are confusing to our end users.
>
> Thanks,
>
> Xiao
>
> On Wed, Oct 24, 2018 at 9:54 PM Wenchen Fan  wrote:
>
>> Ah now I see the problem. `map_filter` has a very weird semantic that is
>> neither "earlier entry wins" or "latter entry wins".
>>
>> I've opened https://github.com/apache/spark/pull/22821 , to remove these
>> newly added map-related functions from FunctionRegistry(for 2.4.0), so that
>> they are invisible to end-users, and the weird behavior of Spark map type
>> with duplicated keys are not escalated. We should fix it ASAP in the master
>> branch.
>>
>> If others are OK with it, I'll start a new RC after that PR is merged.
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun 
>> wrote:
>>
>>> For the first question, it's `bin/spark-sql` result. I didn't check STS,
>>> but it will return the same with `bin/spark-sql`.
>>>
>>> > I think map_filter is implemented correctly. map(1,2,1,3) is actually
>>> map(1,2) according to the "earlier entry wins" semantic. I don't think
>>> this will change in 2.4.1.
>>>
>>> For the second one, `map_filter` issue is not about `earlier entry wins`
>>> stuff. Please see the following example.
>>>
>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
>>> map_concat(map(1,2), map(1,3)) m);
>>> {1:3} {1:2}
>>>
>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
>>> map_concat(map(1,2), map(1,3)) m);
>>> {1:3} {1:3}
>>>
>>> spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
>>> map_concat(map(1,2), map(1,3)) m);
>>> {1:3} {}
>>>
>>> In other words, `map_filter` works like `push-downed filter` to the map
>>> in terms of the output result
>>> while users assumed that `map_filter` works on top of the result of `m`.
>>>
>>> This is a function semantic issue.
>>>
>>>
>>> On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan  wrote:
>>>
 > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
 > {1:3}

 Are you running in the thrift-server? Then maybe this is caused by the
 bug in `Dateset.collect` as I mentioned above.

 I think map_filter is implemented correctly. map(1,2,1,3) is actually
 map(1,2) according to the "earlier entry wins" semantic. I don't think
 this will change in 2.4.1.

 On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun 
 wrote:

> Thank you for the follow-ups.
>
> Then, Spark 2.4.1 will return `{1:2}` differently from the followings
> (including Spark/Scala) in the end?
>
> I hoped to fix the `map_filter`, but now Spark looks inconsistent in
> many ways.
>
> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
> +---+
> |map(1, 2, 1, 3)|
> +---+
> |Map(1 -> 3)|
> +---+
>
>
> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> {1:3}
>
>
> hive> select map(1,2,1,3);  // Hive 1.2.2
> OK
> {1:3}
>
>
> presto> SELECT map_concat(map(array[1],array[2]),
> map(array[1],array[3])); // Presto 0.212
>  _col0
> ---
>  {1=3}
>
>
> Bests,
> Dongjoon.
>
>
> On Wed, Oct 24, 2018 at 5:17 PM Wenchen Fan 
> wrote:
>
>> Hi Dongjoon,
>>
>> Thanks for reporting it! This is indeed a bug that needs to be fixed.
>>
>> The problem is not about the function `map_filter`, but about how the
>> map type values are created in Spark, when there are duplicated keys.
>>
>> In programming languages like Java/Scala, when creating map, the
>> later entry wins. e.g. in scala
>> scala> Map(1 -> 2, 1 -> 3)
>> res0:

What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Sean Owen
Forking this thread.

Because we'll have another RC, we could possibly address these two
issues. Only if we have a reliable change of course.

Is it easy enough to propagate the -Pscala-2.12 profile? can't hurt.

And is it reasonable to essentially 'disable'
kubernetes/integration-tests by removing it from the kubernetes
profile? it doesn't mean it goes away, just means it's run manually,
not automatically. Is that actually how it's meant to be used anyway?
in the short term? given the discussion around its requirements and
minikube and all that?

(Actually, this would also 'solve' the Scala 2.12 build problem too)

On Tue, Oct 23, 2018 at 2:45 PM Sean Owen  wrote:
>
> To be clear I'm currently +1 on this release, with much commentary.
>
> OK, the explanation for kubernetes tests makes sense. Yes I think we need to 
> propagate the scala-2.12 build profile to make it work. Go for it, if you 
> have a lead on what the change is.
> This doesn't block the release as it's an issue for tests, and only affects 
> 2.12. However if we had a clean fix for this and there were another RC, I'd 
> include it.
>
> Dongjoon has a good point about the spark-kubernetes-integration-tests 
> artifact. That doesn't sound like it should be published in this way, though, 
> of course, we publish the test artifacts from every module already. This is 
> only a bit odd in being a non-test artifact meant for testing. But it's 
> special testing! So I also don't think that needs to block a release.
>
> This happens because the integration tests module is enabled with the 
> 'kubernetes' profile too, and also this output is copied into the release 
> tarball at kubernetes/integration-tests/tests. Do we need that in a binary 
> release?
>
> If these integration tests are meant to be run ad hoc, manually, not part of 
> a normal test cycle, then I think we can just not enable it with 
> -Pkubernetes. If it is meant to run every time, then it sounds like we need a 
> little extra work shown in recent PRs to make that easier, but then, this 
> test code should just be the 'test' artifact parts of the kubernetes module, 
> no?

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Erik Erlandson
I would be comfortable making the integration testing manual for now. A
JIRA for ironing out how to make it reliable enough to run automatically, as a
goal for 3.0, seems like a good idea.

On Thu, Oct 25, 2018 at 8:11 AM Sean Owen  wrote:

> Forking this thread.
>
> Because we'll have another RC, we could possibly address these two
> issues. Only if we have a reliable change of course.
>
> Is it easy enough to propagate the -Pscala-2.12 profile? can't hurt.
>
> And is it reasonable to essentially 'disable'
> kubernetes/integration-tests by removing it from the kubernetes
> profile? it doesn't mean it goes away, just means it's run manually,
> not automatically. Is that actually how it's meant to be used anyway?
> in the short term? given the discussion around its requirements and
> minikube and all that?
>
> (Actually, this would also 'solve' the Scala 2.12 build problem too)
>
> On Tue, Oct 23, 2018 at 2:45 PM Sean Owen  wrote:
> >
> > To be clear I'm currently +1 on this release, with much commentary.
> >
> > OK, the explanation for kubernetes tests makes sense. Yes I think we
> need to propagate the scala-2.12 build profile to make it work. Go for it,
> if you have a lead on what the change is.
> > This doesn't block the release as it's an issue for tests, and only
> affects 2.12. However if we had a clean fix for this and there were another
> RC, I'd include it.
> >
> > Dongjoon has a good point about the spark-kubernetes-integration-tests
> artifact. That doesn't sound like it should be published in this way,
> though, of course, we publish the test artifacts from every module already.
> This is only a bit odd in being a non-test artifact meant for testing. But
> it's special testing! So I also don't think that needs to block a release.
> >
> > This happens because the integration tests module is enabled with the
> 'kubernetes' profile too, and also this output is copied into the release
> tarball at kubernetes/integration-tests/tests. Do we need that in a binary
> release?
> >
> > If these integration tests are meant to be run ad hoc, manually, not
> part of a normal test cycle, then I think we can just not enable it with
> -Pkubernetes. If it is meant to run every time, then it sounds like we need
> a little extra work shown in recent PRs to make that easier, but then, this
> test code should just be the 'test' artifact parts of the kubernetes
> module, no?
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-25 Thread Wenchen Fan
Personally I don't think it matters. Users can build arbitrary
expressions/plans themselves with the internal API, and we never guarantee the
result.

Removing these functions from the function registry is a small patch and
easy to review, and to me it's better than a 1000+ LOC patch that removes
the whole thing.

Again, I don't have a strong opinion here. I'm OK with removing the entire
thing if a PR is ready and well reviewed.

On Thu, Oct 25, 2018 at 11:00 PM Dongjoon Hyun 
wrote:

> Thank you for the decision, All.
>
> As of now, to unblock this, it seems that we are trying to remove them
> from the function registry.
>
> https://github.com/apache/spark/pull/22821
>
> One problem here is that users can recover those functions like this
> simply.
>
> scala> 
> spark.sessionState.functionRegistry.createOrReplaceTempFunction("map_filter", 
> x => org.apache.spark.sql.catalyst.expressions.MapFilter(x(0),x(1)))
>
>
> Technically, the PR looks like a compromised way to unblock the release
> and to allow some users that feature completely.
>
> At first glance, I thought this is a workaround to ignore the discussion
> context. But, that sounds like one of the practical ways for Apache Spark.
> (We had Spark 2.0 Tech. Preview before.)
>
> I want to finalize the decision on `map_filter` (and related three
> functions) issue. Are we good to go with
> https://github.com/apache/spark/pull/22821?
>
> Bests,
> Dongjoon.
>
> PS. Also, there is a PR to completely remove them, too.
>https://github.com/cloud-fan/spark/pull/11
>
>
> On Wed, Oct 24, 2018 at 10:14 PM Xiao Li  wrote:
>
>> @Dongjoon Hyun   Thanks! This is a blocking
>> ticket. It returns a wrong result due to our undefined behavior. I agree we
>> should revert the newly added map-oriented functions. In 3.0 release, we
>> need to define the behavior of duplicate keys in the data type MAP and fix
>> all the related issues that are confusing to our end users.
>>
>> Thanks,
>>
>> Xiao
>>
>> On Wed, Oct 24, 2018 at 9:54 PM Wenchen Fan  wrote:
>>
>>> Ah now I see the problem. `map_filter` has a very weird semantic that is
>>> neither "earlier entry wins" or "latter entry wins".
>>>
>>> I've opened https://github.com/apache/spark/pull/22821 , to remove
>>> these newly added map-related functions from FunctionRegistry(for 2.4.0),
>>> so that they are invisible to end-users, and the weird behavior of Spark
>>> map type with duplicated keys are not escalated. We should fix it ASAP in
>>> the master branch.
>>>
>>> If others are OK with it, I'll start a new RC after that PR is merged.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Thu, Oct 25, 2018 at 10:32 AM Dongjoon Hyun 
>>> wrote:
>>>
 For the first question, it's `bin/spark-sql` result. I didn't check
 STS, but it will return the same with `bin/spark-sql`.

 > I think map_filter is implemented correctly. map(1,2,1,3) is
 actually map(1,2) according to the "earlier entry wins" semantic. I
 don't think this will change in 2.4.1.

 For the second one, `map_filter` issue is not about `earlier entry
 wins` stuff. Please see the following example.

 spark-sql> SELECT m, map_filter(m, (k,v) -> v=2) c FROM (SELECT
 map_concat(map(1,2), map(1,3)) m);
 {1:3} {1:2}

 spark-sql> SELECT m, map_filter(m, (k,v) -> v=3) c FROM (SELECT
 map_concat(map(1,2), map(1,3)) m);
 {1:3} {1:3}

 spark-sql> SELECT m, map_filter(m, (k,v) -> v=4) c FROM (SELECT
 map_concat(map(1,2), map(1,3)) m);
 {1:3} {}

 In other words, `map_filter` works like `push-downed filter` to the map
 in terms of the output result
 while users assumed that `map_filter` works on top of the result of
 `m`.

 This is a function semantic issue.


 On Wed, Oct 24, 2018 at 6:06 PM Wenchen Fan 
 wrote:

> > spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
> > {1:3}
>
> Are you running in the thrift-server? Then maybe this is caused by the
> bug in `Dateset.collect` as I mentioned above.
>
> I think map_filter is implemented correctly. map(1,2,1,3) is actually
> map(1,2) according to the "earlier entry wins" semantic. I don't
> think this will change in 2.4.1.
>
> On Thu, Oct 25, 2018 at 8:56 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for the follow-ups.
>>
>> Then, Spark 2.4.1 will return `{1:2}` differently from the followings
>> (including Spark/Scala) in the end?
>>
>> I hoped to fix the `map_filter`, but now Spark looks inconsistent in
>> many ways.
>>
>> scala> sql("select map(1,2,1,3)").show // Spark 2.2.2
>> +---+
>> |map(1, 2, 1, 3)|
>> +---+
>> |Map(1 -> 3)|
>> +---+
>>
>>
>> spark-sql> select map(1,2,1,3); // Spark 2.4.0 RC4
>> {1:3}
>>
>>
>> hive> select map(1,2,1,3);  // Hive 1.2.2
>> OK
>> {1:3}
>>
>>
>> presto> SELECT map_concat(map(array[1],ar

Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Stavros Kontopoulos
I will open a jira for the profile propagation issue and have a look to fix
it.

Stavros

On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson  wrote:

>
> I would be comfortable making the integration testing manual for now.  A
> JIRA for ironing out how to make it reliable for automatic as a goal for
> 3.0 seems like a good idea.
>
> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen  wrote:
>
>> Forking this thread.
>>
>> Because we'll have another RC, we could possibly address these two
>> issues. Only if we have a reliable change of course.
>>
>> Is it easy enough to propagate the -Pscala-2.12 profile? can't hurt.
>>
>> And is it reasonable to essentially 'disable'
>> kubernetes/integration-tests by removing it from the kubernetes
>> profile? it doesn't mean it goes away, just means it's run manually,
>> not automatically. Is that actually how it's meant to be used anyway?
>> in the short term? given the discussion around its requirements and
>> minikube and all that?
>>
>> (Actually, this would also 'solve' the Scala 2.12 build problem too)
>>
>> On Tue, Oct 23, 2018 at 2:45 PM Sean Owen  wrote:
>> >
>> > To be clear I'm currently +1 on this release, with much commentary.
>> >
>> > OK, the explanation for kubernetes tests makes sense. Yes I think we
>> need to propagate the scala-2.12 build profile to make it work. Go for it,
>> if you have a lead on what the change is.
>> > This doesn't block the release as it's an issue for tests, and only
>> affects 2.12. However if we had a clean fix for this and there were another
>> RC, I'd include it.
>> >
>> > Dongjoon has a good point about the spark-kubernetes-integration-tests
>> artifact. That doesn't sound like it should be published in this way,
>> though, of course, we publish the test artifacts from every module already.
>> This is only a bit odd in being a non-test artifact meant for testing. But
>> it's special testing! So I also don't think that needs to block a release.
>> >
>> > This happens because the integration tests module is enabled with the
>> 'kubernetes' profile too, and also this output is copied into the release
>> tarball at kubernetes/integration-tests/tests. Do we need that in a
>> binary release?
>> >
>> > If these integration tests are meant to be run ad hoc, manually, not
>> part of a normal test cycle, then I think we can just not enable it with
>> -Pkubernetes. If it is meant to run every time, then it sounds like we need
>> a little extra work shown in recent PRs to make that easier, but then, this
>> test code should just be the 'test' artifact parts of the kubernetes
>> module, no?
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: What's a blocker?

2018-10-25 Thread Erik Erlandson
I'd like to expand a bit on the phrase "opportunity cost" to try and make
it more concrete: delaying a release means that the  community is *not*
receiving various bug fixes (and features).  Just as a particular example,
the wait for 2.3.2 delayed a fix for the Py3.7 iterator breaking change
that was also causing a correctness bug.  It also delays community feedback
from running new releases.  That in and of itself does not give an answer
to block/not-block for any specific case, but it's another way of saying
that blocking a release *prevents* people from getting bug fixes, as well
as potentially fixing bugs.


On Thu, Oct 25, 2018 at 7:39 AM Sean Owen  wrote:

> What does "PMC members aren't saying its a block for reasons other then
> the actual impact the jira has" mean that isn't already widely agreed?
> Likewise "Committers and PMC members should not be saying its not a
> blocker because they personally or their company doesn't care about this
> feature or api". It sounds like insinuation, and I'd rather make it
> explicit -- call out the bad actions -- or keep it to observable technical
> issues.
>
> Likewise one could say there's a problem just because A thinks X should be
> a blocker and B disagrees. I see no bad faith, process problem, or obvious
> errors. Do you? I see disagreement, and it's tempting to suspect motives. I
> have seen what I think are actual bad-faith decisions in the past in this
> project, too. I don't see it here though and want to stick to 'now'.
>
> (Aside: the implication is that those representing vendors are
> steam-rolling a release. Actually, the cynical incentives cut the other way
> here. Blessing the latest changes as OSS Apache Spark is predominantly
> beneficial to users of OSS, not distros. In fact, it forces distros to make
> changes. And broadly, vendors have much more accountability for quality of
> releases, because they're paid to.)
>
>
> I'm still not sure what specifically the objection is to what here? I
> understand a lot is in flight and nobody agrees with every decision made,
> but, what else is new?
> Concretely: the release is held again to fix a few issues, in the end. For
> the map_filter issue, that seems like the right call, and there are a few
> other important issues that could be quickly fixed too. All is well there,
> yes?
>
> This has surfaced some implicit reasoning about releases that we could
> make explicit, like:
>
> (Sure, if you want to write down things like, release blockers should be
> decided in the interests of the project by the PMC, OK)
>
> We have a time-based release schedule, so time matters. There is an
> opportunity cost to not releasing. The bar for blockers goes up over time.
>
> Not all regressions are blockers. Would you hold a release over a trivial
> regression? but then which must or should block? There's no objective
> answer, but a reasonable rule is: non-trivial regressions from minor
> release x.y to x.{y+1} block releases. Regressions from x.{y-1} to x.{y+1}
> should, but not necessarily, block the release. We try hard to avoid
> regressions in x.y.0 releases because these are generally consumed by
> aggressive upgraders, on x.{y-1}.z now. If a bug exists in x.{y-1}, they're
> not affected or worked around it. The cautious upgrader goes from maybe
> x.{y-2}.z to x.y.1 later. They're affected, but not before, maybe, a
> maintenance release. A crude argument, and it's not an argument that
> regressions are OK. It's an argument that 'old' regressions matter less.
> And maybe it's reasonable to draw the "must" vs "should" line between them.
>
>
>
> On Thu, Oct 25, 2018 at 8:51 AM Tom Graves  wrote:
>
>> So just to clarify a few things in case people didn't read the entire
>> thread in the PR, the discussion is what is the criteria for a blocker and
>> really my concerns are what people are using as criteria for not marking a
>> jira as a blocker.
>>
>> The only thing we have documented to mark a jira as a blocker is for
>> correctness issues: http://spark.apache.org/contributing.html.  And
>> really I think that is initially mark it as a blocker to bring attention to
>> it.
>> The final decision as to whether something is a blocker is up to the PMC
>> who votes on whether a release passes.  I think it would be impossible to
>> properly define what a blocker is with strict rules.
>>
>> Personally from this thread I would like to make sure committers and PMC
>> members aren't saying its a block for reasons other then the actual impact
>> the jira has and if its at all in question it should be brought to the
>> PMC's attention for a vote.  I agree with others that if its during an RC
>> it should be talked about on the RC thread.
>>
>> A few specific things that were said that I disagree with are:
>>- its not a blocker because it was also an issue in the last release
>> (meaning feature release).  ie the bug was introduced in 2.2 and now we are
>> doing 2.4 so its automatically not a blocker.  This to me is just wro

Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Stavros Kontopoulos
I agree these tests should be manual for now, but shouldn't they be run
somehow before a release to make sure things are working right?

For the other issue: https://issues.apache.org/jira/browse/SPARK-25835 .


On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> I will open a jira for the profile propagation issue and have a look to
> fix it.
>
> Stavros
>
> On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson 
> wrote:
>
>>
>> I would be comfortable making the integration testing manual for now.  A
>> JIRA for ironing out how to make it reliable for automatic as a goal for
>> 3.0 seems like a good idea.
>>
>> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen  wrote:
>>
>>> Forking this thread.
>>>
>>> Because we'll have another RC, we could possibly address these two
>>> issues. Only if we have a reliable change of course.
>>>
>>> Is it easy enough to propagate the -Pscala-2.12 profile? can't hurt.
>>>
>>> And is it reasonable to essentially 'disable'
>>> kubernetes/integration-tests by removing it from the kubernetes
>>> profile? it doesn't mean it goes away, just means it's run manually,
>>> not automatically. Is that actually how it's meant to be used anyway?
>>> in the short term? given the discussion around its requirements and
>>> minikube and all that?
>>>
>>> (Actually, this would also 'solve' the Scala 2.12 build problem too)
>>>
>>> On Tue, Oct 23, 2018 at 2:45 PM Sean Owen  wrote:
>>> >
>>> > To be clear I'm currently +1 on this release, with much commentary.
>>> >
>>> > OK, the explanation for kubernetes tests makes sense. Yes I think we
>>> need to propagate the scala-2.12 build profile to make it work. Go for it,
>>> if you have a lead on what the change is.
>>> > This doesn't block the release as it's an issue for tests, and only
>>> affects 2.12. However if we had a clean fix for this and there were another
>>> RC, I'd include it.
>>> >
>>> > Dongjoon has a good point about the spark-kubernetes-integration-tests
>>> artifact. That doesn't sound like it should be published in this way,
>>> though, of course, we publish the test artifacts from every module already.
>>> This is only a bit odd in being a non-test artifact meant for testing. But
>>> it's special testing! So I also don't think that needs to block a release.
>>> >
>>> > This happens because the integration tests module is enabled with the
>>> 'kubernetes' profile too, and also this output is copied into the release
>>> tarball at kubernetes/integration-tests/tests. Do we need that in a
>>> binary release?
>>> >
>>> > If these integration tests are meant to be run ad hoc, manually, not
>>> part of a normal test cycle, then I think we can just not enable it with
>>> -Pkubernetes. If it is meant to run every time, then it sounds like we need
>>> a little extra work shown in recent PRs to make that easier, but then, this
>>> test code should just be the 'test' artifact parts of the kubernetes
>>> module, no?
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
>


Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Sean Owen
Yes, I agree, and perhaps you are best placed to do that for 2.4.0 RC5 :)

On Thu, Oct 25, 2018 at 10:41 AM Stavros Kontopoulos
 wrote:
>
> I agree these tests should be manual for now but should be run somehow before 
> a release to make sure things are working right?
>
> For the other issue: https://issues.apache.org/jira/browse/SPARK-25835 .
>
>
> On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos 
>  wrote:
>>
>> I will open a jira for the profile propagation issue and have a look to fix 
>> it.
>>
>> Stavros
>>
>> On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson  wrote:
>>>
>>>
>>> I would be comfortable making the integration testing manual for now.  A 
>>> JIRA for ironing out how to make it reliable for automatic as a goal for 
>>> 3.0 seems like a good idea.
>>>
>>> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen  wrote:

 Forking this thread.

 Because we'll have another RC, we could possibly address these two
 issues. Only if we have a reliable change of course.

 Is it easy enough to propagate the -Pscala-2.12 profile? can't hurt.

 And is it reasonable to essentially 'disable'
 kubernetes/integration-tests by removing it from the kubernetes
 profile? it doesn't mean it goes away, just means it's run manually,
 not automatically. Is that actually how it's meant to be used anyway?
 in the short term? given the discussion around its requirements and
 minikube and all that?

 (Actually, this would also 'solve' the Scala 2.12 build problem too)

 On Tue, Oct 23, 2018 at 2:45 PM Sean Owen  wrote:
 >
 > To be clear I'm currently +1 on this release, with much commentary.
 >
 > OK, the explanation for kubernetes tests makes sense. Yes I think we 
 > need to propagate the scala-2.12 build profile to make it work. Go for 
 > it, if you have a lead on what the change is.
 > This doesn't block the release as it's an issue for tests, and only 
 > affects 2.12. However if we had a clean fix for this and there were 
 > another RC, I'd include it.
 >
 > Dongjoon has a good point about the spark-kubernetes-integration-tests 
 > artifact. That doesn't sound like it should be published in this way, 
 > though, of course, we publish the test artifacts from every module 
 > already. This is only a bit odd in being a non-test artifact meant for 
 > testing. But it's special testing! So I also don't think that needs to 
 > block a release.
 >
 > This happens because the integration tests module is enabled with the 
 > 'kubernetes' profile too, and also this output is copied into the 
 > release tarball at kubernetes/integration-tests/tests. Do we need that 
 > in a binary release?
 >
 > If these integration tests are meant to be run ad hoc, manually, not 
 > part of a normal test cycle, then I think we can just not enable it with 
 > -Pkubernetes. If it is meant to run every time, then it sounds like we 
 > need a little extra work shown in recent PRs to make that easier, but 
 > then, this test code should just be the 'test' artifact parts of the 
 > kubernetes module, no?

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

>>
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Xiao Li
Hopefully, this will not delay RC5. Since this is not a blocker ticket, RC5
will start if all the blocker tickets are resolved.

Thanks,

Xiao

Sean Owen  wrote on Thu, Oct 25, 2018 at 8:44 AM:

> Yes, I agree, and perhaps you are best placed to do that for 2.4.0 RC5 :)
>
> On Thu, Oct 25, 2018 at 10:41 AM Stavros Kontopoulos
>  wrote:
> >
> > I agree these tests should be manual for now but should be run somehow
> before a release to make sure things are working right?
> >
> > For the other issue: https://issues.apache.org/jira/browse/SPARK-25835 .
> >
> >
> > On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
> >>
> >> I will open a jira for the profile propagation issue and have a look to
> fix it.
> >>
> >> Stavros
> >>
> >> On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson 
> wrote:
> >>>
> >>>
> >>> I would be comfortable making the integration testing manual for now.
> A JIRA for ironing out how to make it reliable for automatic as a goal for
> 3.0 seems like a good idea.
> >>>
> >>> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen  wrote:
> 
>  Forking this thread.
> 
>  Because we'll have another RC, we could possibly address these two
>  issues. Only if we have a reliable change of course.
> 
>  Is it easy enough to propagate the -Pscala-2.12 profile? can't hurt.
> 
>  And is it reasonable to essentially 'disable'
>  kubernetes/integration-tests by removing it from the kubernetes
>  profile? it doesn't mean it goes away, just means it's run manually,
>  not automatically. Is that actually how it's meant to be used anyway?
>  in the short term? given the discussion around its requirements and
>  minikube and all that?
> 
>  (Actually, this would also 'solve' the Scala 2.12 build problem too)
> 
>  On Tue, Oct 23, 2018 at 2:45 PM Sean Owen  wrote:
>  >
>  > To be clear I'm currently +1 on this release, with much commentary.
>  >
>  > OK, the explanation for kubernetes tests makes sense. Yes I think
> we need to propagate the scala-2.12 build profile to make it work. Go for
> it, if you have a lead on what the change is.
>  > This doesn't block the release as it's an issue for tests, and only
> affects 2.12. However if we had a clean fix for this and there were another
> RC, I'd include it.
>  >
>  > Dongjoon has a good point about the
> spark-kubernetes-integration-tests artifact. That doesn't sound like it
> should be published in this way, though, of course, we publish the test
> artifacts from every module already. This is only a bit odd in being a
> non-test artifact meant for testing. But it's special testing! So I also
> don't think that needs to block a release.
>  >
>  > This happens because the integration tests module is enabled with
> the 'kubernetes' profile too, and also this output is copied into the
> release tarball at kubernetes/integration-tests/tests. Do we need that in a
> binary release?
>  >
>  > If these integration tests are meant to be run ad hoc, manually,
> not part of a normal test cycle, then I think we can just not enable it
> with -Pkubernetes. If it is meant to run every time, then it sounds like we
> need a little extra work shown in recent PRs to make that easier, but then,
> this test code should just be the 'test' artifact parts of the kubernetes
> module, no?
> 
>  -
>  To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> >>
> >>
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Sean Owen
I think it's worth getting in a change to just not enable this module,
which ought to be entirely safe, and would avoid two of the issues we
identified.
That said, it didn't block RC4, so it need not block RC5.
But it should happen today if we're doing it.
On Thu, Oct 25, 2018 at 10:47 AM Xiao Li  wrote:
>
> Hopefully, this will not delay RC5. Since this is not a blocker ticket, RC5 
> will start if all the blocker tickets are resolved.
>
> Thanks,
>
> Xiao
>
> Sean Owen  wrote on Thu, Oct 25, 2018 at 8:44 AM:
>>
>> Yes, I agree, and perhaps you are best placed to do that for 2.4.0 RC5 :)
>>
>> On Thu, Oct 25, 2018 at 10:41 AM Stavros Kontopoulos
>>  wrote:
>> >
>> > I agree these tests should be manual for now but should be run somehow 
>> > before a release to make sure things are working right?
>> >
>> > For the other issue: https://issues.apache.org/jira/browse/SPARK-25835 .
>> >
>> >
>> > On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos 
>> >  wrote:
>> >>
>> >> I will open a jira for the profile propagation issue and have a look to 
>> >> fix it.
>> >>
>> >> Stavros
>> >>
>> >> On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson  
>> >> wrote:
>> >>>
>> >>>
>> >>> I would be comfortable making the integration testing manual for now.  A 
>> >>> JIRA for ironing out how to make it reliable for automatic as a goal for 
>> >>> 3.0 seems like a good idea.
>> >>>
>> >>> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen  wrote:
>> 
>>  Forking this thread.
>> 
>>  Because we'll have another RC, we could possibly address these two
>>  issues. Only if we have a reliable change of course.
>> 
>>  Is it easy enough to propagate the -Pscala-2.12 profile? can't hurt.
>> 
>>  And is it reasonable to essentially 'disable'
>>  kubernetes/integration-tests by removing it from the kubernetes
>>  profile? it doesn't mean it goes away, just means it's run manually,
>>  not automatically. Is that actually how it's meant to be used anyway?
>>  in the short term? given the discussion around its requirements and
>>  minikube and all that?
>> 
>>  (Actually, this would also 'solve' the Scala 2.12 build problem too)
>> 
>>  On Tue, Oct 23, 2018 at 2:45 PM Sean Owen  wrote:
>>  >
>>  > To be clear I'm currently +1 on this release, with much commentary.
>>  >
>>  > OK, the explanation for kubernetes tests makes sense. Yes I think we 
>>  > need to propagate the scala-2.12 build profile to make it work. Go 
>>  > for it, if you have a lead on what the change is.
>>  > This doesn't block the release as it's an issue for tests, and only 
>>  > affects 2.12. However if we had a clean fix for this and there were 
>>  > another RC, I'd include it.
>>  >
>>  > Dongjoon has a good point about the 
>>  > spark-kubernetes-integration-tests artifact. That doesn't sound like 
>>  > it should be published in this way, though, of course, we publish the 
>>  > test artifacts from every module already. This is only a bit odd in 
>>  > being a non-test artifact meant for testing. But it's special 
>>  > testing! So I also don't think that needs to block a release.
>>  >
>>  > This happens because the integration tests module is enabled with the 
>>  > 'kubernetes' profile too, and also this output is copied into the 
>>  > release tarball at kubernetes/integration-tests/tests. Do we need 
>>  > that in a binary release?
>>  >
>>  > If these integration tests are meant to be run ad hoc, manually, not 
>>  > part of a normal test cycle, then I think we can just not enable it 
>>  > with -Pkubernetes. If it is meant to run every time, then it sounds 
>>  > like we need a little extra work shown in recent PRs to make that 
>>  > easier, but then, this test code should just be the 'test' artifact 
>>  > parts of the kubernetes module, no?
>> 
>>  -
>>  To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 
>> >>
>> >>
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: KryoSerializer Implementation - Not using KryoPool

2018-10-25 Thread Patrick Brown
Based on my (limited) read-through of the code that uses this, it seems
like a new KryoSerializerInstance is often created for whatever task and
then falls out of scope, instead of being reused. I did notice the comment
about a pool size of 1; however, if the usage is generally how I just
described, I don't think that really does anything.

I would be happy to open a PR but, pardon my ignorance, how would I go
about doing that properly? Do I need to open a JIRA issue first? Also how
would I demonstrate performance gains? Do you guys use something like
ScalaMeter?

Thanks for your help!

On Wed, Oct 24, 2018 at 2:37 PM Sean Owen  wrote:

> I don't know; possibly just because it wasn't available whenever Kryo
> was first used in the project.
>
> Skimming the code, the KryoSerializerInstance looks like a wrapper
> that provides a Kryo object to do work. It already maintains a 'pool'
> of just 1 instance. Is the point that KryoSerializer can share a
> KryoPool across KryoSerializerInstances that provides them with a Kryo
> rather than allocate a new one? makes sense, though I believe the
> concern is always whether that somehow shares state or config in a way
> that breaks something. I see there's already a reset() call in here to
> try to avoid that.
>
> Well, seems worth a PR, especially if you can demonstrate some
> performance gains.
>
> On Wed, Oct 24, 2018 at 3:09 PM Patrick Brown
>  wrote:
> >
> > Hi,
> >
> > I am wondering about the implementation of KryoSerializer, specifically
> the lack of use of KryoPool, which is recommended by Kryo themselves.
> >
> > Looking at the code, it seems that frequently KryoSerializer.newInstance
> is called, followed by a serialize and then this instance goes out of
> scope, this seems like it causes frequent creation of Kryo instances,
> something which the Kryo documentation says is expensive.
> >
> > By doing flame graphs on our own running software (it processes a lot of
> small jobs) it seems like a good amount of time is spent on this.
> >
> > I have a small patch we are using internally which implements a reused
> KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to
> avoid the creation of many Kryo instances. I am wonder if I am missing
> something as to why this isn't done already. If not I am wondering if this
> might be a patch that Spark would be interested in merging in, and how I
> might go about that.
> >
> > Thanks,
> >
> > Patrick
>


Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Stavros Kontopoulos
>
> I think it's worth getting in a change to just not enable this module,
> which ought to be entirely safe, and avoid two of the issues we
> identified.
>

Besides disabling it, when someone wants to run the tests with 2.12, they
should be able to do so. So propagating the Scala profile still makes sense,
but it is not related to the release beyond making sure things work fine.

On Thu, Oct 25, 2018 at 7:02 PM, Sean Owen  wrote:

> I think it's worth getting in a change to just not enable this module,
> which ought to be entirely safe, and avoid two of the issues we
> identified.
> that said it didn't block RC4 so need not block RC5.
> But should happen today if we're doing it.
> On Thu, Oct 25, 2018 at 10:47 AM Xiao Li  wrote:
> >
> > Hopefully, this will not delay RC5. Since this is not a blocker ticket,
> RC5 will start if all the blocker tickets are resolved.
> >
> > Thanks,
> >
> > Xiao
> >
> >> Sean Owen  wrote on Thu, Oct 25, 2018 at 8:44 AM:
> >>
> >> Yes, I agree, and perhaps you are best placed to do that for 2.4.0 RC5
> :)
> >>
> >> On Thu, Oct 25, 2018 at 10:41 AM Stavros Kontopoulos
> >>  wrote:
> >> >
> >> > I agree these tests should be manual for now but should be run
> somehow before a release to make sure things are working right?
> >> >
> >> > For the other issue: https://issues.apache.org/
> jira/browse/SPARK-25835 .
> >> >
> >> >
> >> > On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
> >> >>
> >> >> I will open a jira for the profile propagation issue and have a look
> to fix it.
> >> >>
> >> >> Stavros
> >> >>
> >> >> On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson 
> wrote:
> >> >>>
> >> >>>
> >> >>> I would be comfortable making the integration testing manual for
> now.  A JIRA for ironing out how to make it reliable for automatic as a
> goal for 3.0 seems like a good idea.
> >> >>>
> >> >>> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen  wrote:
> >> 
> >>  Forking this thread.
> >> 
> >>  Because we'll have another RC, we could possibly address these two
> >>  issues. Only if we have a reliable change of course.
> >> 
> >>  Is it easy enough to propagate the -Pscala-2.12 profile? can't
> hurt.
> >> 
> >>  And is it reasonable to essentially 'disable'
> >>  kubernetes/integration-tests by removing it from the kubernetes
> >>  profile? it doesn't mean it goes away, just means it's run
> manually,
> >>  not automatically. Is that actually how it's meant to be used
> anyway?
> >>  in the short term? given the discussion around its requirements and
> >>  minikube and all that?
> >> 
> >>  (Actually, this would also 'solve' the Scala 2.12 build problem
> too)
> >> 
> >>  On Tue, Oct 23, 2018 at 2:45 PM Sean Owen 
> wrote:
> >>  >
> >>  > To be clear I'm currently +1 on this release, with much
> commentary.
> >>  >
> >>  > OK, the explanation for kubernetes tests makes sense. Yes I
> think we need to propagate the scala-2.12 build profile to make it work. Go
> for it, if you have a lead on what the change is.
> >>  > This doesn't block the release as it's an issue for tests, and
> only affects 2.12. However if we had a clean fix for this and there were
> another RC, I'd include it.
> >>  >
> >>  > Dongjoon has a good point about the 
> >>  > spark-kubernetes-integration-tests
> artifact. That doesn't sound like it should be published in this way,
> though, of course, we publish the test artifacts from every module already.
> This is only a bit odd in being a non-test artifact meant for testing. But
> it's special testing! So I also don't think that needs to block a release.
> >>  >
> >>  > This happens because the integration tests module is enabled
> with the 'kubernetes' profile too, and also this output is copied into the
> release tarball at kubernetes/integration-tests/tests. Do we need that in
> a binary release?
> >>  >
> >>  > If these integration tests are meant to be run ad hoc, manually,
> not part of a normal test cycle, then I think we can just not enable it
> with -Pkubernetes. If it is meant to run every time, then it sounds like we
> need a little extra work shown in recent PRs to make that easier, but then,
> this test code should just be the 'test' artifact parts of the kubernetes
> module, no?
> >> 
> >>  
> -
> >>  To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >> 
> >> >>
> >> >>
> >> >
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>



-- 
Stavros Kontopoulos

*Senior Software Engineer*
*Lightbend, Inc.*

*p:  +30 6977967274*
*e: stavros.kontopou...@lightbend.com* 


Re: KryoSerializer Implementation - Not using KryoPool

2018-10-25 Thread Sean Owen
It's not so much the KryoSerializerInstance that's the problem, but
that it will always make a new Kryo (although at most 1). You mean to
supply it with a reference to a pool instead, shared across all
KryoSerializerInstances? Plausible, yeah.

See https://spark.apache.org/contributing.html for guidance but
basically you'd make a JIRA and then a pull request at apache/spark,
with the JIRA number in the title and some other conventions.
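
A minimal sketch of the idea under discussion -- one Kryo pool shared across
serializer instances -- might look roughly like the following. This is
illustrative only, not Patrick's internal patch and not Spark's actual
KryoSerializer; only the com.esotericsoftware.kryo.pool calls are real Kryo
API, and names such as PooledKryoSerializer and newKryo() are stand-ins.

import java.nio.ByteBuffer
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output
import com.esotericsoftware.kryo.pool.{KryoFactory, KryoPool}

class PooledKryoSerializer {
  // Stand-in for however the serializer builds a fully configured Kryo
  // (class registrations, registrators, etc.); that part is unchanged.
  private def newKryo(): Kryo = new Kryo()

  // One shared, thread-safe pool of Kryo instances per serializer, instead of
  // a fresh Kryo for every short-lived serializer instance.
  private val pool: KryoPool = new KryoPool.Builder(new KryoFactory {
    override def create(): Kryo = newKryo()
  }).softReferences().build()

  def serialize[T](t: T): ByteBuffer = {
    val kryo = pool.borrow()          // reuses an idle Kryo if one is available
    try {
      val out = new Output(4096, -1)  // 4 KB initial buffer, unbounded growth
      kryo.writeClassAndObject(out, t)
      ByteBuffer.wrap(out.toBytes)
    } finally {
      pool.release(kryo)              // hand the Kryo back for the next caller
    }
  }
}

The same borrow/release pattern would apply to deserialization, and whether a
reset() is still needed when releasing is exactly the shared-state concern
mentioned elsewhere in this thread.
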
On Thu, Oct 25, 2018 at 11:51 AM Patrick Brown
 wrote:
>
> Based on my, limited, read through of the code that uses this, it seems like 
> often a new KryoSerializerInstance is created for whatever task and then it 
> falls out of scope, instead of being reused. I did notice that comment about 
> a pool size of 1, however if the use is generally how I just described, I 
> don't think that really does anything.
>
> I would be happy to open a PR but, pardon my ignorance, how would I go about 
> doing that properly? Do I need to open a JIRA issue first? Also how would I 
> demonstrate performance gains? Do you guys use something like ScalaMeter?
>
> Thanks for your help!
>
> On Wed, Oct 24, 2018 at 2:37 PM Sean Owen  wrote:
>>
>> I don't know; possibly just because it wasn't available whenever Kryo
>> was first used in the project.
>>
>> Skimming the code, the KryoSerializerInstance looks like a wrapper
>> that provides a Kryo object to do work. It already maintains a 'pool'
>> of just 1 instance. Is the point that KryoSerializer can share a
>> KryoPool across KryoSerializerInstances that provides them with a Kryo
>> rather than allocate a new one? makes sense, though I believe the
>> concern is always whether that somehow shares state or config in a way
>> that breaks something. I see there's already a reset() call in here to
>> try to avoid that.
>>
>> Well, seems worth a PR, especially if you can demonstrate some
>> performance gains.
>>
>> On Wed, Oct 24, 2018 at 3:09 PM Patrick Brown
>>  wrote:
>> >
>> > Hi,
>> >
>> > I am wondering about the implementation of KryoSerializer, specifically 
>> > the lack of use of KryoPool, which is recommended by Kryo themselves.
>> >
>> > Looking at the code, it seems that frequently KryoSerializer.newInstance 
>> > is called, followed by a serialize and then this instance goes out of 
>> > scope, this seems like it causes frequent creation of Kryo instances, 
>> > something which the Kryo documentation says is expensive.
>> >
>> > By doing flame graphs on our own running software (it processes a lot of 
>> > small jobs) it seems like a good amount of time is spent on this.
>> >
>> > I have a small patch we are using internally which implements a reused 
>> > KryoPool inside KryoSerializer (not KryoSerializerInstance) in order to 
>> > avoid the creation of many Kryo instances. I am wonder if I am missing 
>> > something as to why this isn't done already. If not I am wondering if this 
>> > might be a patch that Spark would be interested in merging in, and how I 
>> > might go about that.
>> >
>> > Thanks,
>> >
>> > Patrick

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: What's a blocker?

2018-10-25 Thread Tom Graves
Ignoring everything else in this thread to put a sharper point on one issue: in
the PR, multiple people argued it's not a blocker because it was also a
bug/dropped feature in the previous release (one phrased it slightly
differently, saying it is not a regression, which I read as not a regression
from the previous feature release). If multiple people think this way, others
may as well, so I think we need a discuss thread on it.

My reason for disagreeing is that it goes against our documented versioning
policy. The JIRA claims we essentially broke proper support for Hive UDAFs; we
specifically state in our docs that we support Hive UDAFs; I consider that an
API; and our versioning docs say we won't break API compatibility in feature
releases. It shouldn't matter whether that was 1 feature release ago or 10:
until we do a major release, we shouldn't break or drop that compatibility.

So we should not be using that as a reason to decide whether a JIRA is a
blocker or not.

Tom
 
On Thu, Oct 25, 2018 at 9:39 AM, Sean Owen wrote:

What does "PMC members aren't saying its a block for reasons other then the actual
impact the jira has" mean that isn't already widely agreed? Likewise
"Committers and PMC members should not be saying its not a blocker because they
personally or their company doesn't care about this feature or api". It sounds
like insinuation, and I'd rather make it explicit -- call out the bad actions
-- or keep it to observable technical issues.
Likewise one could say there's a problem just because A thinks X should be a 
blocker and B disagrees. I see no bad faith, process problem, or obvious 
errors. Do you? I see disagreement, and it's tempting to suspect motives. I 
have seen what I think are actual bad-faith decisions in the past in this 
project, too. I don't see it here though and want to stick to 'now'.

(Aside: the implication is that those representing vendors are steam-rolling a 
release. Actually, the cynical incentives cut the other way here. Blessing the 
latest changes as OSS Apache Spark is predominantly beneficial to users of OSS, 
not distros. In fact, it forces distros to make changes. And broadly, vendors 
have much more accountability for quality of releases, because they're paid to.)

I'm still not sure what, specifically, the objection here is to. I
understand a lot is in flight and nobody agrees with every decision made, but,
what else is new? Concretely: the release is held again to fix a few issues, in 
the end. For the map_filter issue, that seems like the right call, and there 
are a few other important issues that could be quickly fixed too. All is well 
there, yes?
This has surfaced some implicit reasoning about releases that we could make
explicit, like:

- (Sure, if you want to write down things like "release blockers should be
decided in the interests of the project by the PMC", OK.)
- We have a time-based release schedule, so time matters. There is an
opportunity cost to not releasing. The bar for blockers goes up over time.
- Not all regressions are blockers. Would you hold a release over a trivial
regression? But then which must or should block? There's no objective answer,
but a reasonable rule is: non-trivial regressions from minor release x.y to
x.{y+1} block releases. Regressions from x.{y-1} to x.{y+1} should, but not
necessarily, block the release.
- We try hard to avoid regressions in x.y.0 releases because these are
generally consumed by aggressive upgraders, who are on x.{y-1}.z now. If a bug
already exists in x.{y-1}, they're either not affected or have worked around
it. The cautious upgrader goes from maybe x.{y-2}.z to x.y.1 later; they're
affected, but not before, maybe, a maintenance release. A crude argument, and
it's not an argument that regressions are OK. It's an argument that 'old'
regressions matter less. And maybe it's reasonable to draw the "must" vs
"should" line between them.



On Thu, Oct 25, 2018 at 8:51 AM Tom Graves  wrote:

 So just to clarify a few things in case people didn't read the entire thread 
in the PR, the discussion is what is the criteria for a blocker and really my 
concerns are what people are using as criteria for not marking a jira as a 
blocker.
The only thing we have documented to mark a jira as a blocker is for 
correctness issues: http://spark.apache.org/contributing.html.  And really I 
think that is initially mark it as a blocker to bring attention to it.The final 
decision as to whether something is a blocker is up to the PMC who votes on 
whether a release passes.  I think it would be impossible to properly define 
what a blocker is with strict rules.
Personally from this thread I would like to make sure committers and PMC 
members aren't saying its a block for reasons other then the actual impact the 
jira has and if its at all in question it should be brought to the PMC's 
attention for a vote.  I agree with others that if its during an RC it should 
be talked about on the R

Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-10-25 Thread Reynold Xin
I incorporated the feedbacks here and updated the SPIP page:
https://github.com/apache/spark-website/pull/156

The new version is live now:
https://spark.apache.org/improvement-proposals.html


On Fri, Aug 31, 2018 at 4:35 PM Ryan Blue  wrote:

> +1
>
> I think this is a great suggestion. I agree a bit with Sean, but I think
> it is really about mapping these questions into some of the existing
> structure. These are a great way to think about projects, but they're
> general and it would help to rephrase them for a software project, like
> Matei's comment on considering cost. Similarly, we might rephrase
> objectives to be goals/non-goals and add something to highlight that we
> expect absolutely no Jargon. A design sketch is needed to argue how long it
> will take, what is new, and why it would be successful; adding these
> questions will help people understand how to go from that design sketch to
> an argument for that design. I think these will guide people to write
> proposals that are persuasive and well-formed.
>
> rb
>
> On Fri, Aug 31, 2018 at 4:17 PM Jules Damji  wrote:
>
>> +1
>>
>> One could argue that the litany of questions is really a
>> double-click on the essence: why, what, how. The three interrogatives ought
>> to be the essence and distillation of any proposal or technical exposition.
>>
>> Cheers
>> Jules
>>
>> Sent from my iPhone
>> Pardon the dumb thumb typos :)
>>
>> On Aug 31, 2018, at 11:23 AM, Reynold Xin  wrote:
>>
>> I helped craft the current SPIP template
>>  last year. I was
>> recently (re-)introduced to the Heilmeier Catechism, a set of questions
>> DARPA developed to evaluate proposals. The set of questions are:
>>
>> - What are you trying to do? Articulate your objectives using absolutely
>> no jargon.
>> - How is it done today, and what are the limits of current practice?
>> - What is new in your approach and why do you think it will be successful?
>> - Who cares? If you are successful, what difference will it make?
>> - What are the risks?
>> - How much will it cost?
>> - How long will it take?
>> - What are the mid-term and final “exams” to check for success?
>>
>> When I read the above list, it resonates really well because they are
>> almost always the same set of questions I ask myself and others before I
>> decide whether something is worth doing. In some ways, our SPIP template
>> tries to capture some of these (e.g. target persona), but are not as
>> explicit and well articulated.
>>
>> What do people think about replacing the current SPIP template with the
>> above?
>>
>> At a high level, I think the Heilmeier Catechism emphasizes the "how" less,
>> and the "why" and "what" more, which is what I'd argue SPIPs should be
>> about. The hows should be left in design docs for larger projects.
>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-25 Thread Reynold Xin
I have some pretty serious concerns over this proposal. I agree that there
are many things that can be improved, but at the same time I also think the
cost of introducing a new IR in the middle is extremely high. Having
participated in designing some of the IRs in other systems, I've seen more
failures than successes. The failures typically come from two sources: (1)
in general it is extremely difficult to design IRs that are both expressive
enough and simple enough; (2) typically another layer of indirection
increases the complexity a lot more, beyond the level of understanding and
expertise that most contributors can obtain without spending years in the
code base and learning about all the gotchas.

In any case, I'm not saying "no, please don't do this". This is one of
those cases where the devil is in details that cannot be captured
by a high-level document, and I want to explicitly express my concern here.




On Thu, Oct 25, 2018 at 12:10 AM Kazuaki Ishizaki 
wrote:

> Hi Xiao,
> Thank you very much for becoming a shepherd.
> If you feel the discussion settles, we would appreciate it if you would
> start a voting.
>
> Regards,
> Kazuaki Ishizaki
>
>
>
> From:Xiao Li 
> To:Kazuaki Ishizaki 
> Cc:dev , Takeshi Yamamuro <
> linguin@gmail.com>
> Date:2018/10/22 16:31
> Subject:Re: SPIP: SPARK-25728 Structured Intermediate
> Representation (Tungsten IR) for generating Java code
> --
>
>
>
> Hi, Kazuaki,
>
> Thanks for your great SPIP! I am willing to be the shepherd of this SPIP.
>
> Cheers,
>
> Xiao
>
>
> On Mon, Oct 22, 2018 at 12:05 AM Kazuaki Ishizaki <*ishiz...@jp.ibm.com*
> > wrote:
> Hi Yamamuro-san,
> Thank you for your comments. This SPIP gets several valuable comments and
> feedback on Google Doc:
> *https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing*
> 
> .
> I hope that this SPIP could go forward based on these feedback.
>
> Based on this SPIP procedure
> *http://spark.apache.org/improvement-proposals.html*
> , can I ask one or
> more PMCs to become a shepherd of this SPIP?
> I would appreciate your kindness and cooperation.
>
> Best Regards,
> Kazuaki Ishizaki
>
>
>
> From:Takeshi Yamamuro <*linguin@gmail.com*
> >
> To:Spark dev list <*dev@spark.apache.org* >
> Cc:*ishiz...@jp.ibm.com* 
> Date:2018/10/15 12:12
> Subject:Re: SPIP: SPARK-25728 Structured Intermediate
> Representation (Tungsten IR) for generating Java code
> --
>
>
>
> Hi, ishizaki-san,
>
> Cool activity, I left some comments on the doc.
>
> best,
> takeshi
>
>
> On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki <*ishiz...@jp.ibm.com*
> > wrote:
> Hello community,
>
> I am writing this e-mail in order to start a discussion about adding
> structure intermediate representation for generating Java code from a
> program using DataFrame or Dataset API, in addition to the current
> String-based representation.
> This addition is based on the discussions in a thread at
> *https://github.com/apache/spark/pull/21537#issuecomment-413268196*
> 
>
> Please feel free to comment on the JIRA ticket or Google Doc.
>
> JIRA ticket: *https://issues.apache.org/jira/browse/SPARK-25728*
> 
> Google Doc:
> *https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing*
> 
>
> Looking forward to hear your feedback
>
> Best Regards,
> Kazuaki Ishizaki
>
>
> --
> ---
> Takeshi Yamamuro
>
>
>
> --
>
> 
>
>


DataSourceV2 hangouts sync

2018-10-25 Thread Ryan Blue
Hi everyone,

There's been some great discussion for DataSourceV2 in the last few months,
but it has been difficult to resolve some of the discussions and I don't
think that we have a very clear roadmap for getting the work done.

To coordinate better as a community, I'd like to start a regular sync-up
over google hangouts. We use this in the Parquet community to have more
effective community discussions about thorny technical issues and to get
aligned on an overall roadmap. It is really helpful in that community and I
think it would help us get DSv2 done more quickly.

Here's how it works: people join the hangout, we go around the list to
gather topics, have about an hour-long discussion, and then send a summary
of the discussion to the dev list for anyone that couldn't participate.
That way we can move topics along, but we keep the broader community in the
loop as well for further discussion on the mailing list.

I'll volunteer to set up the sync and send invites to anyone that wants to
attend. If you're interested, please reply with the email address you'd
like to put on the invite list (if there's a way to do this without
specific invites, let me know). Also for the first sync, please note what
times would work for you so we can try to account for people in different
time zones.

For the first one, I was thinking some day next week (time TBD by those
interested) and starting off with a general roadmap discussion before
diving into specific technical topics.

Thanks,

rb

-- 
Ryan Blue
Software Engineer
Netflix


Re: DataSourceV2 hangouts sync

2018-10-25 Thread Felix Cheung
Yes please!



From: Ryan Blue 
Sent: Thursday, October 25, 2018 1:10 PM
To: Spark Dev List
Subject: DataSourceV2 hangouts sync

Hi everyone,

There's been some great discussion for DataSourceV2 in the last few months, but 
it has been difficult to resolve some of the discussions and I don't think that 
we have a very clear roadmap for getting the work done.

To coordinate better as a community, I'd like to start a regular sync-up over 
google hangouts. We use this in the Parquet community to have more effective 
community discussions about thorny technical issues and to get aligned on an 
overall roadmap. It is really helpful in that community and I think it would 
help us get DSv2 done more quickly.

Here's how it works: people join the hangout, we go around the list to gather 
topics, have about an hour-long discussion, and then send a summary of the 
discussion to the dev list for anyone that couldn't participate. That way we 
can move topics along, but we keep the broader community in the loop as well 
for further discussion on the mailing list.

I'll volunteer to set up the sync and send invites to anyone that wants to 
attend. If you're interested, please reply with the email address you'd like to 
put on the invite list (if there's a way to do this without specific invites, 
let me know). Also for the first sync, please note what times would work for 
you so we can try to account for people in different time zones.

For the first one, I was thinking some day next week (time TBD by those 
interested) and starting off with a general roadmap discussion before diving 
into specific technical topics.

Thanks,

rb

--
Ryan Blue
Software Engineer
Netflix


Re: DataSourceV2 hangouts sync

2018-10-25 Thread John Zhuge
Great idea!

On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue  wrote:

> Hi everyone,
>
> There's been some great discussion for DataSourceV2 in the last few
> months, but it has been difficult to resolve some of the discussions and I
> don't think that we have a very clear roadmap for getting the work done.
>
> To coordinate better as a community, I'd like to start a regular sync-up
> over google hangouts. We use this in the Parquet community to have more
> effective community discussions about thorny technical issues and to get
> aligned on an overall roadmap. It is really helpful in that community and I
> think it would help us get DSv2 done more quickly.
>
> Here's how it works: people join the hangout, we go around the list to
> gather topics, have about an hour-long discussion, and then send a summary
> of the discussion to the dev list for anyone that couldn't participate.
> That way we can move topics along, but we keep the broader community in the
> loop as well for further discussion on the mailing list.
>
> I'll volunteer to set up the sync and send invites to anyone that wants to
> attend. If you're interested, please reply with the email address you'd
> like to put on the invite list (if there's a way to do this without
> specific invites, let me know). Also for the first sync, please note what
> times would work for you so we can try to account for people in different
> time zones.
>
> For the first one, I was thinking some day next week (time TBD by those
> interested) and starting off with a general roadmap discussion before
> diving into specific technical topics.
>
> Thanks,
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
John Zhuge


Re: DataSourceV2 hangouts sync

2018-10-25 Thread Li Jin
Although I am not specifically involved in DSv2, I think having this kind
of meeting is definitely helpful to discuss, move certain effort forward
and keep people on the same page. Glad to see this kind of working group
happening.

On Thu, Oct 25, 2018 at 5:58 PM John Zhuge  wrote:

> Great idea!
>
> On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue 
> wrote:
>
>> Hi everyone,
>>
>> There's been some great discussion for DataSourceV2 in the last few
>> months, but it has been difficult to resolve some of the discussions and I
>> don't think that we have a very clear roadmap for getting the work done.
>>
>> To coordinate better as a community, I'd like to start a regular sync-up
>> over google hangouts. We use this in the Parquet community to have more
>> effective community discussions about thorny technical issues and to get
>> aligned on an overall roadmap. It is really helpful in that community and I
>> think it would help us get DSv2 done more quickly.
>>
>> Here's how it works: people join the hangout, we go around the list to
>> gather topics, have about an hour-long discussion, and then send a summary
>> of the discussion to the dev list for anyone that couldn't participate.
>> That way we can move topics along, but we keep the broader community in the
>> loop as well for further discussion on the mailing list.
>>
>> I'll volunteer to set up the sync and send invites to anyone that wants
>> to attend. If you're interested, please reply with the email address you'd
>> like to put on the invite list (if there's a way to do this without
>> specific invites, let me know). Also for the first sync, please note what
>> times would work for you so we can try to account for people in different
>> time zones.
>>
>> For the first one, I was thinking some day next week (time TBD by those
>> interested) and starting off with a general roadmap discussion before
>> diving into specific technical topics.
>>
>> Thanks,
>>
>> rb
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> John Zhuge
>


Re: DataSourceV2 hangouts sync

2018-10-25 Thread Reynold Xin
+1



On Thu, Oct 25, 2018 at 4:12 PM Li Jin  wrote:

> Although I am not specifically involved in DSv2, I think having this kind
> of meeting is definitely helpful to discuss, move certain effort forward
> and keep people on the same page. Glad to see this kind of working group
> happening.
>
> On Thu, Oct 25, 2018 at 5:58 PM John Zhuge  wrote:
>
>> Great idea!
>>
>> On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> There's been some great discussion for DataSourceV2 in the last few
>>> months, but it has been difficult to resolve some of the discussions and I
>>> don't think that we have a very clear roadmap for getting the work done.
>>>
>>> To coordinate better as a community, I'd like to start a regular sync-up
>>> over google hangouts. We use this in the Parquet community to have more
>>> effective community discussions about thorny technical issues and to get
>>> aligned on an overall roadmap. It is really helpful in that community and I
>>> think it would help us get DSv2 done more quickly.
>>>
>>> Here's how it works: people join the hangout, we go around the list to
>>> gather topics, have about an hour-long discussion, and then send a summary
>>> of the discussion to the dev list for anyone that couldn't participate.
>>> That way we can move topics along, but we keep the broader community in the
>>> loop as well for further discussion on the mailing list.
>>>
>>> I'll volunteer to set up the sync and send invites to anyone that wants
>>> to attend. If you're interested, please reply with the email address you'd
>>> like to put on the invite list (if there's a way to do this without
>>> specific invites, let me know). Also for the first sync, please note what
>>> times would work for you so we can try to account for people in different
>>> time zones.
>>>
>>> For the first one, I was thinking some day next week (time TBD by those
>>> interested) and starting off with a general roadmap discussion before
>>> diving into specific technical topics.
>>>
>>> Thanks,
>>>
>>> rb
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> John Zhuge
>>
>


Re: DataSourceV2 hangouts sync

2018-10-25 Thread Xiao Li
+1

On Thu, Oct 25, 2018 at 4:16 PM, Reynold Xin  wrote:

> +1
>
>
>
> On Thu, Oct 25, 2018 at 4:12 PM Li Jin  wrote:
>
>> Although I am not specifically involved in DSv2, I think having this kind
>> of meeting is definitely helpful to discuss, move certain effort forward
>> and keep people on the same page. Glad to see this kind of working group
>> happening.
>>
>> On Thu, Oct 25, 2018 at 5:58 PM John Zhuge  wrote:
>>
>>> Great idea!
>>>
>>> On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue 
>>> wrote:
>>>
 Hi everyone,

 There's been some great discussion for DataSourceV2 in the last few
 months, but it has been difficult to resolve some of the discussions and I
 don't think that we have a very clear roadmap for getting the work done.

 To coordinate better as a community, I'd like to start a regular
 sync-up over google hangouts. We use this in the Parquet community to have
 more effective community discussions about thorny technical issues and to
 get aligned on an overall roadmap. It is really helpful in that community
 and I think it would help us get DSv2 done more quickly.

 Here's how it works: people join the hangout, we go around the list to
 gather topics, have about an hour-long discussion, and then send a summary
 of the discussion to the dev list for anyone that couldn't participate.
 That way we can move topics along, but we keep the broader community in the
 loop as well for further discussion on the mailing list.

 I'll volunteer to set up the sync and send invites to anyone that wants
 to attend. If you're interested, please reply with the email address you'd
 like to put on the invite list (if there's a way to do this without
 specific invites, let me know). Also for the first sync, please note what
 times would work for you so we can try to account for people in different
 time zones.

 For the first one, I was thinking some day next week (time TBD by those
 interested) and starting off with a general roadmap discussion before
 diving into specific technical topics.

 Thanks,

 rb

 --
 Ryan Blue
 Software Engineer
 Netflix

>>>
>>>
>>> --
>>> John Zhuge
>>>
>>


Re: DataSourceV2 hangouts sync

2018-10-25 Thread Dongjoon Hyun
+1. Thank you for volunteering, Ryan!

Bests,
Dongjoon.


On Thu, Oct 25, 2018 at 4:19 PM Xiao Li  wrote:

> +1
>
> Reynold Xin  于2018年10月25日周四 下午4:16写道:
>
>> +1
>>
>>
>>
>> On Thu, Oct 25, 2018 at 4:12 PM Li Jin  wrote:
>>
>>> Although I am not specifically involved in DSv2, I think having this
>>> kind of meeting is definitely helpful to discuss, move certain effort
>>> forward and keep people on the same page. Glad to see this kind of working
>>> group happening.
>>>
>>> On Thu, Oct 25, 2018 at 5:58 PM John Zhuge  wrote:
>>>
 Great idea!

 On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue 
 wrote:

> Hi everyone,
>
> There's been some great discussion for DataSourceV2 in the last few
> months, but it has been difficult to resolve some of the discussions and I
> don't think that we have a very clear roadmap for getting the work done.
>
> To coordinate better as a community, I'd like to start a regular
> sync-up over google hangouts. We use this in the Parquet community to have
> more effective community discussions about thorny technical issues and to
> get aligned on an overall roadmap. It is really helpful in that community
> and I think it would help us get DSv2 done more quickly.
>
> Here's how it works: people join the hangout, we go around the list to
> gather topics, have about an hour-long discussion, and then send a summary
> of the discussion to the dev list for anyone that couldn't participate.
> That way we can move topics along, but we keep the broader community in 
> the
> loop as well for further discussion on the mailing list.
>
> I'll volunteer to set up the sync and send invites to anyone that
> wants to attend. If you're interested, please reply with the email address
> you'd like to put on the invite list (if there's a way to do this without
> specific invites, let me know). Also for the first sync, please note what
> times would work for you so we can try to account for people in different
> time zones.
>
> For the first one, I was thinking some day next week (time TBD by
> those interested) and starting off with a general roadmap discussion 
> before
> diving into specific technical topics.
>
> Thanks,
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


 --
 John Zhuge

>>>


Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Wenchen Fan
Any updates on this topic? https://github.com/apache/spark/pull/22827 is
merged and 2.4 is unblocked.

I'll cut RC5 shortly after the weekend, and it will be great to include the
change proposed here.

Thanks,
Wenchen

On Fri, Oct 26, 2018 at 12:55 AM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> I think it's worth getting in a change to just not enable this module,
>> which ought to be entirely safe, and avoid two of the issues we
>> identified.
>>
>
> Besides disabling it, when someone wants to run the tests with 2.12 he
> should be able to do so. So propagating the Scala profile still makes sense
> but it is not related to the release other than making sure things work
> fine.
>
> On Thu, Oct 25, 2018 at 7:02 PM, Sean Owen  wrote:
>
>> I think it's worth getting in a change to just not enable this module,
>> which ought to be entirely safe, and avoid two of the issues we
>> identified.
>> that said it didn't block RC4 so need not block RC5.
>> But should happen today if we're doing it.
>> On Thu, Oct 25, 2018 at 10:47 AM Xiao Li  wrote:
>> >
>> > Hopefully, this will not delay RC5. Since this is not a blocker ticket,
>> RC5 will start if all the blocker tickets are resolved.
>> >
>> > Thanks,
>> >
>> > Xiao
>> >
>> > Sean Owen  于2018年10月25日周四 上午8:44写道:
>> >>
>> >> Yes, I agree, and perhaps you are best placed to do that for 2.4.0 RC5
>> :)
>> >>
>> >> On Thu, Oct 25, 2018 at 10:41 AM Stavros Kontopoulos
>> >>  wrote:
>> >> >
>> >> > I agree these tests should be manual for now but should be run
>> somehow before a release to make sure things are working right?
>> >> >
>> >> > For the other issue:
>> https://issues.apache.org/jira/browse/SPARK-25835 .
>> >> >
>> >> >
>> >> > On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>> >> >>
>> >> >> I will open a jira for the profile propagation issue and have a
>> look to fix it.
>> >> >>
>> >> >> Stavros
>> >> >>
>> >> >> On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson <
>> eerla...@redhat.com> wrote:
>> >> >>>
>> >> >>>
>> >> >>> I would be comfortable making the integration testing manual for
>> now.  A JIRA for ironing out how to make it reliable for automatic as a
>> goal for 3.0 seems like a good idea.
>> >> >>>
>> >> >>> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen 
>> wrote:
>> >> 
>> >>  Forking this thread.
>> >> 
>> >>  Because we'll have another RC, we could possibly address these two
>> >>  issues. Only if we have a reliable change of course.
>> >> 
>> >>  Is it easy enough to propagate the -Pscala-2.12 profile? can't
>> hurt.
>> >> 
>> >>  And is it reasonable to essentially 'disable'
>> >>  kubernetes/integration-tests by removing it from the kubernetes
>> >>  profile? it doesn't mean it goes away, just means it's run
>> manually,
>> >>  not automatically. Is that actually how it's meant to be used
>> anyway?
>> >>  in the short term? given the discussion around its requirements
>> and
>> >>  minikube and all that?
>> >> 
>> >>  (Actually, this would also 'solve' the Scala 2.12 build problem
>> too)
>> >> 
>> >>  On Tue, Oct 23, 2018 at 2:45 PM Sean Owen 
>> wrote:
>> >>  >
>> >>  > To be clear I'm currently +1 on this release, with much
>> commentary.
>> >>  >
>> >>  > OK, the explanation for kubernetes tests makes sense. Yes I
>> think we need to propagate the scala-2.12 build profile to make it work. Go
>> for it, if you have a lead on what the change is.
>> >>  > This doesn't block the release as it's an issue for tests, and
>> only affects 2.12. However if we had a clean fix for this and there were
>> another RC, I'd include it.
>> >>  >
>> >>  > Dongjoon has a good point about the
>> spark-kubernetes-integration-tests artifact. That doesn't sound like it
>> should be published in this way, though, of course, we publish the test
>> artifacts from every module already. This is only a bit odd in being a
>> non-test artifact meant for testing. But it's special testing! So I also
>> don't think that needs to block a release.
>> >>  >
>> >>  > This happens because the integration tests module is enabled
>> with the 'kubernetes' profile too, and also this output is copied into the
>> release tarball at kubernetes/integration-tests/tests. Do we need that in a
>> binary release?
>> >>  >
>> >>  > If these integration tests are meant to be run ad hoc,
>> manually, not part of a normal test cycle, then I think we can just not
>> enable it with -Pkubernetes. If it is meant to run every time, then it
>> sounds like we need a little extra work shown in recent PRs to make that
>> easier, but then, this test code should just be the 'test' artifact parts
>> of the kubernetes module, no?
>> >> 
>> >> 
>> -
>> >>  To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >> 
>> >> >>
>> >> >>
>>

Re: DataSourceV2 hangouts sync

2018-10-25 Thread Hyukjin Kwon
+1 !

On Fri, Oct 26, 2018 at 7:21 AM, Dongjoon Hyun  wrote:

> +1. Thank you for volunteering, Ryan!
>
> Bests,
> Dongjoon.
>
>
> On Thu, Oct 25, 2018 at 4:19 PM Xiao Li  wrote:
>
>> +1
>>
>> Reynold Xin  于2018年10月25日周四 下午4:16写道:
>>
>>> +1
>>>
>>>
>>>
>>> On Thu, Oct 25, 2018 at 4:12 PM Li Jin  wrote:
>>>
 Although I am not specifically involved in DSv2, I think having this
 kind of meeting is definitely helpful to discuss, move certain effort
 forward and keep people on the same page. Glad to see this kind of working
 group happening.

 On Thu, Oct 25, 2018 at 5:58 PM John Zhuge  wrote:

> Great idea!
>
> On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue 
> wrote:
>
>> Hi everyone,
>>
>> There's been some great discussion for DataSourceV2 in the last few
>> months, but it has been difficult to resolve some of the discussions and 
>> I
>> don't think that we have a very clear roadmap for getting the work done.
>>
>> To coordinate better as a community, I'd like to start a regular
>> sync-up over google hangouts. We use this in the Parquet community to 
>> have
>> more effective community discussions about thorny technical issues and to
>> get aligned on an overall roadmap. It is really helpful in that community
>> and I think it would help us get DSv2 done more quickly.
>>
>> Here's how it works: people join the hangout, we go around the list
>> to gather topics, have about an hour-long discussion, and then send a
>> summary of the discussion to the dev list for anyone that couldn't
>> participate. That way we can move topics along, but we keep the broader
>> community in the loop as well for further discussion on the mailing list.
>>
>> I'll volunteer to set up the sync and send invites to anyone that
>> wants to attend. If you're interested, please reply with the email 
>> address
>> you'd like to put on the invite list (if there's a way to do this without
>> specific invites, let me know). Also for the first sync, please note what
>> times would work for you so we can try to account for people in different
>> time zones.
>>
>> For the first one, I was thinking some day next week (time TBD by
>> those interested) and starting off with a general roadmap discussion 
>> before
>> diving into specific technical topics.
>>
>> Thanks,
>>
>> rb
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> John Zhuge
>



Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Sean Owen
Yep, we're going to merge a change to separate the k8s tests into a
separate profile, and fix up the Scala 2.12 thing. While non-critical, those
are pretty nice to have for 2.4. I think that's doable within the next 12
hours even.

@skonto I think there's one last minor thing needed on this PR?
https://github.com/apache/spark/pull/22838/files#r228363727

On Thu, Oct 25, 2018 at 6:42 PM Wenchen Fan  wrote:

> Any updates on this topic? https://github.com/apache/spark/pull/22827 is
> merged and 2.4 is unblocked.
>
> I'll cut RC5 shortly after the weekend, and it will be great to include
> the change proposed here.
>
> Thanks,
> Wenchen
>
> On Fri, Oct 26, 2018 at 12:55 AM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> I think it's worth getting in a change to just not enable this module,
>>> which ought to be entirely safe, and avoid two of the issues we
>>> identified.
>>>
>>
>> Besides disabling it, when someone wants to run the tests with 2.12 he
>> should be able to do so. So propagating the Scala profile still makes sense
>> but it is not related to the release other than making sure things work
>> fine.
>>
>> On Thu, Oct 25, 2018 at 7:02 PM, Sean Owen  wrote:
>>
>>> I think it's worth getting in a change to just not enable this module,
>>> which ought to be entirely safe, and avoid two of the issues we
>>> identified.
>>> that said it didn't block RC4 so need not block RC5.
>>> But should happen today if we're doing it.
>>> On Thu, Oct 25, 2018 at 10:47 AM Xiao Li  wrote:
>>> >
>>> > Hopefully, this will not delay RC5. Since this is not a blocker
>>> ticket, RC5 will start if all the blocker tickets are resolved.
>>> >
>>> > Thanks,
>>> >
>>> > Xiao
>>> >
>>> > Sean Owen  于2018年10月25日周四 上午8:44写道:
>>> >>
>>> >> Yes, I agree, and perhaps you are best placed to do that for 2.4.0
>>> RC5 :)
>>> >>
>>> >> On Thu, Oct 25, 2018 at 10:41 AM Stavros Kontopoulos
>>> >>  wrote:
>>> >> >
>>> >> > I agree these tests should be manual for now but should be run
>>> somehow before a release to make sure things are working right?
>>> >> >
>>> >> > For the other issue:
>>> https://issues.apache.org/jira/browse/SPARK-25835 .
>>> >> >
>>> >> >
>>> >> > On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos <
>>> stavros.kontopou...@lightbend.com> wrote:
>>> >> >>
>>> >> >> I will open a jira for the profile propagation issue and have a
>>> look to fix it.
>>> >> >>
>>> >> >> Stavros
>>> >> >>
>>> >> >> On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson <
>>> eerla...@redhat.com> wrote:
>>> >> >>>
>>> >> >>>
>>> >> >>> I would be comfortable making the integration testing manual for
>>> now.  A JIRA for ironing out how to make it reliable for automatic as a
>>> goal for 3.0 seems like a good idea.
>>> >> >>>
>>> >> >>> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen 
>>> wrote:
>>> >> 
>>> >>  Forking this thread.
>>> >> 
>>> >>  Because we'll have another RC, we could possibly address these
>>> two
>>> >>  issues. Only if we have a reliable change of course.
>>> >> 
>>> >>  Is it easy enough to propagate the -Pscala-2.12 profile? can't
>>> hurt.
>>> >> 
>>> >>  And is it reasonable to essentially 'disable'
>>> >>  kubernetes/integration-tests by removing it from the kubernetes
>>> >>  profile? it doesn't mean it goes away, just means it's run
>>> manually,
>>> >>  not automatically. Is that actually how it's meant to be used
>>> anyway?
>>> >>  in the short term? given the discussion around its requirements
>>> and
>>> >>  minikube and all that?
>>> >> 
>>> >>  (Actually, this would also 'solve' the Scala 2.12 build problem
>>> too)
>>> >> 
>>> >>  On Tue, Oct 23, 2018 at 2:45 PM Sean Owen 
>>> wrote:
>>> >>  >
>>> >>  > To be clear I'm currently +1 on this release, with much
>>> commentary.
>>> >>  >
>>> >>  > OK, the explanation for kubernetes tests makes sense. Yes I
>>> think we need to propagate the scala-2.12 build profile to make it work. Go
>>> for it, if you have a lead on what the change is.
>>> >>  > This doesn't block the release as it's an issue for tests, and
>>> only affects 2.12. However if we had a clean fix for this and there were
>>> another RC, I'd include it.
>>> >>  >
>>> >>  > Dongjoon has a good point about the
>>> spark-kubernetes-integration-tests artifact. That doesn't sound like it
>>> should be published in this way, though, of course, we publish the test
>>> artifacts from every module already. This is only a bit odd in being a
>>> non-test artifact meant for testing. But it's special testing! So I also
>>> don't think that needs to block a release.
>>> >>  >
>>> >>  > This happens because the integration tests module is enabled
>>> with the 'kubernetes' profile too, and also this output is copied into the
>>> release tarball at kubernetes/integration-tests/tests. Do we need that in a
>>> binary release?
>>> >>  >
>>> >>  > If these integration tests are meant to be 

Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Reynold Xin
I also think we should get this in:
https://github.com/apache/spark/pull/22841

It's to deprecate a confusing & broken window function API, so we can
remove it in 3.0 and design a better one. See
https://issues.apache.org/jira/browse/SPARK-25841 for more information.


On Thu, Oct 25, 2018 at 4:55 PM Sean Owen  wrote:

> Yep, we're going to merge a change to separate the k8s tests into a
> separate profile, and fix up the Scala 2.12 thing. While non-critical those
> are pretty nice to have for 2.4. I think that's doable within the next 12
> hours even.
>
> @skonto I think there's one last minor thing needed on this PR?
> https://github.com/apache/spark/pull/22838/files#r228363727
>
> On Thu, Oct 25, 2018 at 6:42 PM Wenchen Fan  wrote:
>
>> Any updates on this topic? https://github.com/apache/spark/pull/22827 is
>> merged and 2.4 is unblocked.
>>
>> I'll cut RC5 shortly after the weekend, and it will be great to include
>> the change proposed here.
>>
>> Thanks,
>> Wenchen
>>
>> On Fri, Oct 26, 2018 at 12:55 AM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> I think it's worth getting in a change to just not enable this module,
 which ought to be entirely safe, and avoid two of the issues we
 identified.

>>>
>>> Besides disabling it, when someone wants to run the tests with 2.12 he
>>> should be able to do so. So propagating the Scala profile still makes sense
>>> but it is not related to the release other than making sure things work
>>> fine.
>>>
>>> On Thu, Oct 25, 2018 at 7:02 PM, Sean Owen  wrote:
>>>
 I think it's worth getting in a change to just not enable this module,
 which ought to be entirely safe, and avoid two of the issues we
 identified.
 that said it didn't block RC4 so need not block RC5.
 But should happen today if we're doing it.
 On Thu, Oct 25, 2018 at 10:47 AM Xiao Li  wrote:
 >
 > Hopefully, this will not delay RC5. Since this is not a blocker
 ticket, RC5 will start if all the blocker tickets are resolved.
 >
 > Thanks,
 >
 > Xiao
 >
 > Sean Owen  于2018年10月25日周四 上午8:44写道:
 >>
 >> Yes, I agree, and perhaps you are best placed to do that for 2.4.0
 RC5 :)
 >>
 >> On Thu, Oct 25, 2018 at 10:41 AM Stavros Kontopoulos
 >>  wrote:
 >> >
 >> > I agree these tests should be manual for now but should be run
 somehow before a release to make sure things are working right?
 >> >
 >> > For the other issue:
 https://issues.apache.org/jira/browse/SPARK-25835 .
 >> >
 >> >
 >> > On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos <
 stavros.kontopou...@lightbend.com> wrote:
 >> >>
 >> >> I will open a jira for the profile propagation issue and have a
 look to fix it.
 >> >>
 >> >> Stavros
 >> >>
 >> >> On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson <
 eerla...@redhat.com> wrote:
 >> >>>
 >> >>>
 >> >>> I would be comfortable making the integration testing manual for
 now.  A JIRA for ironing out how to make it reliable for automatic as a
 goal for 3.0 seems like a good idea.
 >> >>>
 >> >>> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen 
 wrote:
 >> 
 >>  Forking this thread.
 >> 
 >>  Because we'll have another RC, we could possibly address these
 two
 >>  issues. Only if we have a reliable change of course.
 >> 
 >>  Is it easy enough to propagate the -Pscala-2.12 profile? can't
 hurt.
 >> 
 >>  And is it reasonable to essentially 'disable'
 >>  kubernetes/integration-tests by removing it from the kubernetes
 >>  profile? it doesn't mean it goes away, just means it's run
 manually,
 >>  not automatically. Is that actually how it's meant to be used
 anyway?
 >>  in the short term? given the discussion around its requirements
 and
 >>  minikube and all that?
 >> 
 >>  (Actually, this would also 'solve' the Scala 2.12 build problem
 too)
 >> 
 >>  On Tue, Oct 23, 2018 at 2:45 PM Sean Owen 
 wrote:
 >>  >
 >>  > To be clear I'm currently +1 on this release, with much
 commentary.
 >>  >
 >>  > OK, the explanation for kubernetes tests makes sense. Yes I
 think we need to propagate the scala-2.12 build profile to make it work. Go
 for it, if you have a lead on what the change is.
 >>  > This doesn't block the release as it's an issue for tests,
 and only affects 2.12. However if we had a clean fix for this and there
 were another RC, I'd include it.
 >>  >
 >>  > Dongjoon has a good point about the
 spark-kubernetes-integration-tests artifact. That doesn't sound like it
 should be published in this way, though, of course, we publish the test
 artifacts from every module already. This is only a bit odd in being a
 non-t

Re: DataSourceV2 hangouts sync

2018-10-25 Thread Wenchen Fan
Big +1 on this!

I live in UTC+8 and I'm available from 8 am, which is 5 pm in the bay area.
Hopefully we can coordinate a time that fits everyone.

Thanks
Wenchen



On Fri, Oct 26, 2018 at 7:21 AM Dongjoon Hyun 
wrote:

> +1. Thank you for volunteering, Ryan!
>
> Bests,
> Dongjoon.
>
>
> On Thu, Oct 25, 2018 at 4:19 PM Xiao Li  wrote:
>
>> +1
>>
>> Reynold Xin  于2018年10月25日周四 下午4:16写道:
>>
>>> +1
>>>
>>>
>>>
>>> On Thu, Oct 25, 2018 at 4:12 PM Li Jin  wrote:
>>>
 Although I am not specifically involved in DSv2, I think having this
 kind of meeting is definitely helpful to discuss, move certain effort
 forward and keep people on the same page. Glad to see this kind of working
 group happening.

 On Thu, Oct 25, 2018 at 5:58 PM John Zhuge  wrote:

> Great idea!
>
> On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue 
> wrote:
>
>> Hi everyone,
>>
>> There's been some great discussion for DataSourceV2 in the last few
>> months, but it has been difficult to resolve some of the discussions and 
>> I
>> don't think that we have a very clear roadmap for getting the work done.
>>
>> To coordinate better as a community, I'd like to start a regular
>> sync-up over google hangouts. We use this in the Parquet community to 
>> have
>> more effective community discussions about thorny technical issues and to
>> get aligned on an overall roadmap. It is really helpful in that community
>> and I think it would help us get DSv2 done more quickly.
>>
>> Here's how it works: people join the hangout, we go around the list
>> to gather topics, have about an hour-long discussion, and then send a
>> summary of the discussion to the dev list for anyone that couldn't
>> participate. That way we can move topics along, but we keep the broader
>> community in the loop as well for further discussion on the mailing list.
>>
>> I'll volunteer to set up the sync and send invites to anyone that
>> wants to attend. If you're interested, please reply with the email 
>> address
>> you'd like to put on the invite list (if there's a way to do this without
>> specific invites, let me know). Also for the first sync, please note what
>> times would work for you so we can try to account for people in different
>> time zones.
>>
>> For the first one, I was thinking some day next week (time TBD by
>> those interested) and starting off with a general roadmap discussion 
>> before
>> diving into specific technical topics.
>>
>> Thanks,
>>
>> rb
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> John Zhuge
>



Re: DataSourceV2 hangouts sync

2018-10-25 Thread Ryan Blue
Since not many people have replied with a time window, how about we aim for
5PM PDT? That should work for Wenchen and most people here in the bay area.

If that makes it so some people can't attend, we can do the next one
earlier for people in Europe.

If we go with 5PM PDT, then what day works best for everyone?

On Thu, Oct 25, 2018 at 5:01 PM Wenchen Fan  wrote:

> Big +1 on this!
>
> I live in UTC+8 and I'm available from 8 am, which is 5 pm in the bay
> area. Hopefully we can coordinate a time that fits everyone.
>
> Thanks
> Wenchen
>
>
>
> On Fri, Oct 26, 2018 at 7:21 AM Dongjoon Hyun 
> wrote:
>
>> +1. Thank you for volunteering, Ryan!
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Oct 25, 2018 at 4:19 PM Xiao Li  wrote:
>>
>>> +1
>>>
>>> Reynold Xin  于2018年10月25日周四 下午4:16写道:
>>>
 +1



 On Thu, Oct 25, 2018 at 4:12 PM Li Jin  wrote:

> Although I am not specifically involved in DSv2, I think having this
> kind of meeting is definitely helpful to discuss, move certain effort
> forward and keep people on the same page. Glad to see this kind of working
> group happening.
>
> On Thu, Oct 25, 2018 at 5:58 PM John Zhuge  wrote:
>
>> Great idea!
>>
>> On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> There's been some great discussion for DataSourceV2 in the last few
>>> months, but it has been difficult to resolve some of the discussions 
>>> and I
>>> don't think that we have a very clear roadmap for getting the work done.
>>>
>>> To coordinate better as a community, I'd like to start a regular
>>> sync-up over google hangouts. We use this in the Parquet community to 
>>> have
>>> more effective community discussions about thorny technical issues and 
>>> to
>>> get aligned on an overall roadmap. It is really helpful in that 
>>> community
>>> and I think it would help us get DSv2 done more quickly.
>>>
>>> Here's how it works: people join the hangout, we go around the list
>>> to gather topics, have about an hour-long discussion, and then send a
>>> summary of the discussion to the dev list for anyone that couldn't
>>> participate. That way we can move topics along, but we keep the broader
>>> community in the loop as well for further discussion on the mailing 
>>> list.
>>>
>>> I'll volunteer to set up the sync and send invites to anyone that
>>> wants to attend. If you're interested, please reply with the email 
>>> address
>>> you'd like to put on the invite list (if there's a way to do this 
>>> without
>>> specific invites, let me know). Also for the first sync, please note 
>>> what
>>> times would work for you so we can try to account for people in 
>>> different
>>> time zones.
>>>
>>> For the first one, I was thinking some day next week (time TBD by
>>> those interested) and starting off with a general roadmap discussion 
>>> before
>>> diving into specific technical topics.
>>>
>>> Thanks,
>>>
>>> rb
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> John Zhuge
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: DataSourceV2 hangouts sync

2018-10-25 Thread Wenchen Fan
Friday at the bay area is Saturday at my side, it will be great if we can
pick a day from Monday to Thursday.

On Fri, Oct 26, 2018 at 8:08 AM Ryan Blue  wrote:

> Since not many people have replied with a time window, how about we aim
> for 5PM PDT? That should work for Wenchen and most people here in the bay
> area.
>
> If that makes it so some people can't attend, we can do the next one
> earlier for people in Europe.
>
> If we go with 5PM PDT, then what day works best for everyone?
>
> On Thu, Oct 25, 2018 at 5:01 PM Wenchen Fan  wrote:
>
>> Big +1 on this!
>>
>> I live in UTC+8 and I'm available from 8 am, which is 5 pm in the bay
>> area. Hopefully we can coordinate a time that fits everyone.
>>
>> Thanks
>> Wenchen
>>
>>
>>
>> On Fri, Oct 26, 2018 at 7:21 AM Dongjoon Hyun 
>> wrote:
>>
>>> +1. Thank you for volunteering, Ryan!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Thu, Oct 25, 2018 at 4:19 PM Xiao Li  wrote:
>>>
 +1

 Reynold Xin  于2018年10月25日周四 下午4:16写道:

> +1
>
>
>
> On Thu, Oct 25, 2018 at 4:12 PM Li Jin  wrote:
>
>> Although I am not specifically involved in DSv2, I think having this
>> kind of meeting is definitely helpful to discuss, move certain effort
>> forward and keep people on the same page. Glad to see this kind of 
>> working
>> group happening.
>>
>> On Thu, Oct 25, 2018 at 5:58 PM John Zhuge  wrote:
>>
>>> Great idea!
>>>
>>> On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue 
>>> wrote:
>>>
 Hi everyone,

 There's been some great discussion for DataSourceV2 in the last few
 months, but it has been difficult to resolve some of the discussions 
 and I
 don't think that we have a very clear roadmap for getting the work 
 done.

 To coordinate better as a community, I'd like to start a regular
 sync-up over google hangouts. We use this in the Parquet community to 
 have
 more effective community discussions about thorny technical issues and 
 to
 get aligned on an overall roadmap. It is really helpful in that 
 community
 and I think it would help us get DSv2 done more quickly.

 Here's how it works: people join the hangout, we go around the list
 to gather topics, have about an hour-long discussion, and then send a
 summary of the discussion to the dev list for anyone that couldn't
 participate. That way we can move topics along, but we keep the broader
 community in the loop as well for further discussion on the mailing 
 list.

 I'll volunteer to set up the sync and send invites to anyone that
 wants to attend. If you're interested, please reply with the email 
 address
 you'd like to put on the invite list (if there's a way to do this 
 without
 specific invites, let me know). Also for the first sync, please note 
 what
 times would work for you so we can try to account for people in 
 different
 time zones.

 For the first one, I was thinking some day next week (time TBD by
 those interested) and starting off with a general roadmap discussion 
 before
 diving into specific technical topics.

 Thanks,

 rb

 --
 Ryan Blue
 Software Engineer
 Netflix

>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: DataSourceV2 hangouts sync

2018-10-25 Thread Ryan Blue
Good point. How about Monday or Wednesday at 5PM PDT then?

Everyone, please reply to me (no need to spam the list) with which option
works for you and I'll send an invite for the one with the most votes.

On Thu, Oct 25, 2018 at 5:14 PM Wenchen Fan  wrote:

> Friday at the bay area is Saturday at my side, it will be great if we can
> pick a day from Monday to Thursday.
>
> On Fri, Oct 26, 2018 at 8:08 AM Ryan Blue  wrote:
>
>> Since not many people have replied with a time window, how about we aim
>> for 5PM PDT? That should work for Wenchen and most people here in the bay
>> area.
>>
>> If that makes it so some people can't attend, we can do the next one
>> earlier for people in Europe.
>>
>> If we go with 5PM PDT, then what day works best for everyone?
>>
>> On Thu, Oct 25, 2018 at 5:01 PM Wenchen Fan  wrote:
>>
>>> Big +1 on this!
>>>
>>> I live in UTC+8 and I'm available from 8 am, which is 5 pm in the bay
>>> area. Hopefully we can coordinate a time that fits everyone.
>>>
>>> Thanks
>>> Wenchen
>>>
>>>
>>>
>>> On Fri, Oct 26, 2018 at 7:21 AM Dongjoon Hyun 
>>> wrote:
>>>
 +1. Thank you for volunteering, Ryan!

 Bests,
 Dongjoon.


 On Thu, Oct 25, 2018 at 4:19 PM Xiao Li  wrote:

> +1
>
> Reynold Xin  于2018年10月25日周四 下午4:16写道:
>
>> +1
>>
>>
>>
>> On Thu, Oct 25, 2018 at 4:12 PM Li Jin  wrote:
>>
>>> Although I am not specifically involved in DSv2, I think having this
>>> kind of meeting is definitely helpful to discuss, move certain effort
>>> forward and keep people on the same page. Glad to see this kind of 
>>> working
>>> group happening.
>>>
>>> On Thu, Oct 25, 2018 at 5:58 PM John Zhuge 
>>> wrote:
>>>
 Great idea!

 On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue 
 wrote:

> Hi everyone,
>
> There's been some great discussion for DataSourceV2 in the last
> few months, but it has been difficult to resolve some of the 
> discussions
> and I don't think that we have a very clear roadmap for getting the 
> work
> done.
>
> To coordinate better as a community, I'd like to start a regular
> sync-up over google hangouts. We use this in the Parquet community to 
> have
> more effective community discussions about thorny technical issues 
> and to
> get aligned on an overall roadmap. It is really helpful in that 
> community
> and I think it would help us get DSv2 done more quickly.
>
> Here's how it works: people join the hangout, we go around the
> list to gather topics, have about an hour-long discussion, and then 
> send a
> summary of the discussion to the dev list for anyone that couldn't
> participate. That way we can move topics along, but we keep the 
> broader
> community in the loop as well for further discussion on the mailing 
> list.
>
> I'll volunteer to set up the sync and send invites to anyone that
> wants to attend. If you're interested, please reply with the email 
> address
> you'd like to put on the invite list (if there's a way to do this 
> without
> specific invites, let me know). Also for the first sync, please note 
> what
> times would work for you so we can try to account for people in 
> different
> time zones.
>
> For the first one, I was thinking some day next week (time TBD by
> those interested) and starting off with a general roadmap discussion 
> before
> diving into specific technical topics.
>
> Thanks,
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


 --
 John Zhuge

>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: DataSourceV2 hangouts sync

2018-10-25 Thread Hyukjin Kwon
I didn't know I live in the same time zone as you, Wenchen :D.
Monday or Wednesday at 5PM PDT sounds good to me too FWIW.

On Fri, Oct 26, 2018 at 8:29 AM, Ryan Blue wrote:

> Good point. How about Monday or Wednesday at 5PM PDT then?
>
> Everyone, please reply to me (no need to spam the list) with which option
> works for you and I'll send an invite for the one with the most votes.
>
> On Thu, Oct 25, 2018 at 5:14 PM Wenchen Fan  wrote:
>
>> Friday at the bay area is Saturday at my side, it will be great if we can
>> pick a day from Monday to Thursday.
>>
>> On Fri, Oct 26, 2018 at 8:08 AM Ryan Blue  wrote:
>>
>>> Since not many people have replied with a time window, how about we aim
>>> for 5PM PDT? That should work for Wenchen and most people here in the bay
>>> area.
>>>
>>> If that makes it so some people can't attend, we can do the next one
>>> earlier for people in Europe.
>>>
>>> If we go with 5PM PDT, then what day works best for everyone?
>>>
>>> On Thu, Oct 25, 2018 at 5:01 PM Wenchen Fan  wrote:
>>>
 Big +1 on this!

 I live in UTC+8 and I'm available from 8 am, which is 5 pm in the bay
 area. Hopefully we can coordinate a time that fits everyone.

 Thanks
 Wenchen



 On Fri, Oct 26, 2018 at 7:21 AM Dongjoon Hyun 
 wrote:

> +1. Thank you for volunteering, Ryan!
>
> Bests,
> Dongjoon.
>
>
> On Thu, Oct 25, 2018 at 4:19 PM Xiao Li  wrote:
>
>> +1
>>
>> Reynold Xin wrote on Thu, Oct 25, 2018 at 4:16 PM:
>>
>>> +1
>>>
>>>
>>>
>>> On Thu, Oct 25, 2018 at 4:12 PM Li Jin 
>>> wrote:
>>>
 Although I am not specifically involved in DSv2, I think having
 this kind of meeting is definitely helpful to discuss, move certain 
 effort
 forward and keep people on the same page. Glad to see this kind of 
 working
 group happening.

 On Thu, Oct 25, 2018 at 5:58 PM John Zhuge 
 wrote:

> Great idea!
>
> On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue
>  wrote:
>
>> Hi everyone,
>>
>> There's been some great discussion for DataSourceV2 in the last
>> few months, but it has been difficult to resolve some of the 
>> discussions
>> and I don't think that we have a very clear roadmap for getting the 
>> work
>> done.
>>
>> To coordinate better as a community, I'd like to start a regular
>> sync-up over google hangouts. We use this in the Parquet community 
>> to have
>> more effective community discussions about thorny technical issues 
>> and to
>> get aligned on an overall roadmap. It is really helpful in that 
>> community
>> and I think it would help us get DSv2 done more quickly.
>>
>> Here's how it works: people join the hangout, we go around the
>> list to gather topics, have about an hour-long discussion, and then 
>> send a
>> summary of the discussion to the dev list for anyone that couldn't
>> participate. That way we can move topics along, but we keep the 
>> broader
>> community in the loop as well for further discussion on the mailing 
>> list.
>>
>> I'll volunteer to set up the sync and send invites to anyone that
>> wants to attend. If you're interested, please reply with the email 
>> address
>> you'd like to put on the invite list (if there's a way to do this 
>> without
>> specific invites, let me know). Also for the first sync, please note 
>> what
>> times would work for you so we can try to account for people in 
>> different
>> time zones.
>>
>> For the first one, I was thinking some day next week (time TBD by
>> those interested) and starting off with a general roadmap discussion 
>> before
>> diving into specific technical topics.
>>
>> Thanks,
>>
>> rb
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> John Zhuge
>

>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: DataSourceV2 hangouts sync

2018-10-25 Thread Saikat Kanjilal
Ditto, I'd also like to join. I'm in Seattle, and afternoons generally work 
better for me.

Sent from my iPhone

On Oct 25, 2018, at 5:02 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

Big +1 on this!

I live in UTC+8 and I'm available from 8 am, which is 5 pm in the bay area. 
Hopefully we can coordinate a time that fits everyone.

Thanks
Wenchen



On Fri, Oct 26, 2018 at 7:21 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
+1. Thank you for volunteering, Ryan!

Bests,
Dongjoon.


On Thu, Oct 25, 2018 at 4:19 PM Xiao Li <gatorsm...@gmail.com> wrote:
+1

Reynold Xin <r...@databricks.com> wrote on Thu, Oct 25, 2018 at 4:16 PM:
+1



On Thu, Oct 25, 2018 at 4:12 PM Li Jin <ice.xell...@gmail.com> wrote:
Although I am not specifically involved in DSv2, I think having this kind of 
meeting is definitely helpful to discuss, move certain effort forward and keep 
people on the same page. Glad to see this kind of working group happening.

On Thu, Oct 25, 2018 at 5:58 PM John Zhuge <jzh...@apache.org> wrote:
Great idea!

On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
Hi everyone,

There's been some great discussion for DataSourceV2 in the last few months, but 
it has been difficult to resolve some of the discussions and I don't think that 
we have a very clear roadmap for getting the work done.

To coordinate better as a community, I'd like to start a regular sync-up over 
google hangouts. We use this in the Parquet community to have more effective 
community discussions about thorny technical issues and to get aligned on an 
overall roadmap. It is really helpful in that community and I think it would 
help us get DSv2 done more quickly.

Here's how it works: people join the hangout, we go around the list to gather 
topics, have about an hour-long discussion, and then send a summary of the 
discussion to the dev list for anyone that couldn't participate. That way we 
can move topics along, but we keep the broader community in the loop as well 
for further discussion on the mailing list.

I'll volunteer to set up the sync and send invites to anyone that wants to 
attend. If you're interested, please reply with the email address you'd like to 
put on the invite list (if there's a way to do this without specific invites, 
let me know). Also for the first sync, please note what times would work for 
you so we can try to account for people in different time zones.

For the first one, I was thinking some day next week (time TBD by those 
interested) and starting off with a general roadmap discussion before diving 
into specific technical topics.

Thanks,

rb

--
Ryan Blue
Software Engineer
Netflix


--
John Zhuge