Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Zhan Zhang
Thanks Reynold. Not sure why doExecute is not invoked, since CollectLimit does not support wholeStage:

case class CollectLimit(limit: Int, child: SparkPlan) extends UnaryNode {

I will dig further into this. Zhan Zhang On Apr 18, 2016, at 10:36 PM, Reynold Xin

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Reynold Xin
Anyway we can verify this easily. I just added a println to each row and verified that only limit + 1 row was printed after the join and before the limit. It'd be great if you do some debugging yourself and see if it is going through some other code path. On Mon, Apr 18, 2016 at 10:35 PM,

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Reynold Xin
But doExecute is not called? On Mon, Apr 18, 2016 at 10:32 PM, Zhan Zhang wrote: > Hi Reynold, > > I just check the code for CollectLimit, there is a shuffle happening to > collect them in one partition. > > protected override def doExecute(): RDD[InternalRow] = { >

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Zhan Zhang
Hi Reynold, I just checked the code for CollectLimit; there is a shuffle happening to collect the rows into one partition.

protected override def doExecute(): RDD[InternalRow] = {
  val shuffled = new ShuffledRowRDD(
    ShuffleExchange.prepareShuffleDependency(
      child.execute(), child.output,
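For readers following along, the idea behind that snippet can be sketched on plain Scala collections (a hedged sketch of the operator's shape, not the verbatim Spark source): each partition is pruned to its first `limit` rows, the survivors are shuffled into a single partition, and the limit is applied once more there.

```scala
// Sketch only: the shape of a single-partition collect-limit, expressed on
// plain Scala collections instead of Spark's RDD/ShuffledRowRDD machinery.
object CollectLimitSketch {
  def collectLimit[T](partitions: Seq[Seq[T]], limit: Int): Seq[T] = {
    val perPartition = partitions.map(_.take(limit)) // map-side pruning
    val single = perPartition.flatten                // "shuffle" into one partition
    single.take(limit)                               // final limit
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6, 7, 8, 9))
    println(collectLimit(parts, 4)) // List(1, 2, 3, 4)
  }
}
```

The shuffle is the point of the thread: even though each partition only contributes `limit` rows, bringing them into one partition still costs an exchange.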

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Reynold Xin
Unless I'm really missing something I don't think so. As I said, it goes through an iterator and after processing each stream side we do a shouldStop check. The generated code looks like /* 094 */ protected void processNext() throws java.io.IOException { /* 095 */ /*** PRODUCE: Project
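The early-termination behavior Reynold describes can be illustrated with a small self-contained sketch (the real code is Java emitted by WholeStageCodegen; the names below are illustrative, not the generated code): the produce loop checks shouldStop() after handing each row downstream, so a limit stops the join from draining its input.

```scala
// Illustrative sketch only: mimics how whole-stage codegen's produce loop
// checks shouldStop() after emitting each row, so a downstream limit
// terminates the scan early instead of consuming the whole stream side.
object ShouldStopSketch {
  def main(args: Array[String]): Unit = {
    val limit = 3
    var produced = 0
    def shouldStop(): Boolean = produced >= limit + 1 // stop after limit + 1 rows

    val stream = Iterator.from(1) // infinite "stream side"
    val out = scala.collection.mutable.ArrayBuffer.empty[Int]
    while (stream.hasNext && !shouldStop()) {
      out += stream.next() // "consume" one joined row
      produced += 1
    }
    println(out.length) // 4 = limit + 1, matching the println experiment above
  }
}
```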

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
Thanks a lot for commenting. We are getting great feedback on this thread. The take-aways are: 1. In general people prefer having explicit reasons why pull requests should be closed. We should push committers to leave messages that are more explicit about why certain PRs should be closed or not. I

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Zhan Zhang
From the physical plan, the limit is one level above the WholeStageCodegen. Thus, I don't think shouldStop would work here. To make it work, the limit has to be part of the WholeStageCodegen. Correct me if I am wrong. Thanks. Zhan Zhang On Apr 18, 2016, at 11:09 AM, Reynold Xin

Re: more uniform exception handling?

2016-04-18 Thread Zhan Zhang
+1 Both of these would be very helpful in debugging. Thanks. Zhan Zhang On Apr 18, 2016, at 1:18 PM, Evan Chan wrote: > +1000. > > Especially if the UI can help correlate exceptions, and we can reduce > some exceptions. > > There are some exceptions which are in

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
bq. there should be more committers or they are asked to be more active. Bingo. bq. they can't be closed only because it is "expired" with a copy and pasted message. +1 On Mon, Apr 18, 2016 at 9:14 PM, Hyukjin Kwon wrote: > I don't think asking committers to be more

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Saisai Shao
> By the way, some people noted that closing PRs may discourage contributors. I think our open PR count alone is very discouraging. Under what circumstances would you feel encouraged to open a PR against a project that has hundreds of open PRs, some from many, many months ago

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Hyukjin Kwon
I don't think asking committers to be more active is impractical. I am not too sure if other projects apply the same rules here, but if a project is becoming more popular, I think it is appropriate that there should be more committers or they are asked to be more active. In addition, I

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Nicholas Chammas
Relevant: https://github.com/databricks/spark-pr-dashboard/issues/1 A lot of this was discussed a while back when the PR Dashboard was first introduced, and several times before and after that as well. (e.g. August 2014

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
During the months of November / December, the 30 day period should be relaxed. Some people (at least in the US) may take extended vacation during that time. For Chinese developers, Spring Festival would bear similar circumstances. On Mon, Apr 18, 2016 at 7:25 PM, Hyukjin Kwon

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Hyukjin Kwon
I also think this might not have to be closed only because it is inactive. How about closing issues after 30 days when a committer's comment is the last one, with no response from the author? IMHO, if the committers are not sure whether the patch would be useful, then I think they should

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Saisai Shao
It would be better to have a specific technical reason why this PR should be closed: either the implementation is not good, or the problem is not valid, or something else. That will actually help the contributor to shape their code and reopen the PR again. Otherwise reasons like "feel free to

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Mark Grover
Thanks for responding, Reynold, Marcelo and Marcin. >And I think that's really what Mark is proposing. Basically, "don't >intentionally break backwards compatibility unless it's really >required" (e.g. SPARK-12130). That would allow option B to work. Yeah, that's exactly what Option B is

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Marcelo Vanzin
On Mon, Apr 18, 2016 at 3:09 PM, Reynold Xin wrote: > IIUC, the reason for that PR is that they found the string comparison to > increase the size in large shuffles. Maybe we should add the ability to > support the short name to Spark 1.6.2? Is that something that really

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Reynold Xin
Got it. So Mark is pushing for "best-effort" support. IIUC, the reason for that PR is that they found the string comparison to increase the size in large shuffles. Maybe we should add the ability to support the short name to Spark 1.6.2? On Mon, Apr 18, 2016 at 3:05 PM, Marcelo Vanzin

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Marcelo Vanzin
On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin wrote: > The bigger problem is that it is much easier to maintain backward > compatibility rather than dictating forward compatibility. For example, as > Marcin said, if we come up with a slightly different shuffle layout to >

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Reynold Xin
Yea I re-read the email again. It'd work in this case. The bigger problem is that it is much easier to maintain backward compatibility rather than dictating forward compatibility. For example, as Marcin said, if we come up with a slightly different shuffle layout to improve shuffle performance,

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Marcelo Vanzin
On Mon, Apr 18, 2016 at 1:53 PM, Reynold Xin wrote: > That's not the only one. For example, the hash shuffle manager has been off > by default since Spark 1.2, and we'd like to remove it in 2.0: > https://github.com/apache/spark/pull/12423 If I understand things correctly,

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Marcin Tustin
I'm good with option B at least until it blocks something utterly wonderful (like shuffles are 10x faster). On Mon, Apr 18, 2016 at 4:51 PM, Mark Grover wrote: > Hi all, > If you don't use Spark on YARN, you probably don't need to read further. > > Here's the *user scenario*: >

Re: YARN Shuffle service and its compatibility

2016-04-18 Thread Reynold Xin
That's not the only one. For example, the hash shuffle manager has been off by default since Spark 1.2, and we'd like to remove it in 2.0: https://github.com/apache/spark/pull/12423 How difficult is it to just change the package name to, say, v2? On Mon, Apr 18, 2016 at 1:51 PM, Mark Grover

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Sean Busbey
Having a PR closed, especially if due to committers not having the bandwidth to check on things, will be very discouraging to new folks. Doubly so for those inexperienced with open source. Even if the message says "feel free to reopen for so-and-so reason", new folks who lack confidence are going

YARN Shuffle service and its compatibility

2016-04-18 Thread Mark Grover
Hi all, If you don't use Spark on YARN, you probably don't need to read further. Here's the *user scenario*: There are going to be folks who may be interested in running two versions of Spark (say Spark 1.6.x and Spark 2.x) on the same YARN cluster. And, here's the *problem*: That's all fine,
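Concretely, the clash is at the NodeManager level. A sketch of the relevant yarn-site.xml fragment (the service name and class below match the Spark 1.x docs, but treat the exact values as assumptions for this sketch): every NodeManager loads exactly one implementation per auxiliary-service name, so Spark 1.6.x and 2.x can only coexist if whichever shuffle service jar is deployed stays wire-compatible with both.

```xml
<!-- Sketch of the NodeManager configuration (yarn-site.xml). Only one
     implementation can be registered under the spark_shuffle name, which
     is why two Spark versions on one cluster must share a compatible
     shuffle service. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```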

Re: more uniform exception handling?

2016-04-18 Thread Evan Chan
+1000. Especially if the UI can help correlate exceptions, and we can reduce some exceptions. There are some exceptions which are in practice very common, such as the nasty ClassNotFoundException, that most folks end up spending tons of time debugging. On Mon, Apr 18, 2016 at 12:16 PM, Reynold

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
The cost of "reopen" is close to zero, because it is just clicking a button. I think you were referring to the cost of closing the pull request, and you are assuming people look at the pull requests that have been inactive for a long time. That seems equally likely (or unlikely) as committers

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
From committers' perspective, would they look at closed PRs? If not, the cost is not close to zero. Meaning, some potentially useful PRs would never see the light of day. My two cents. On Mon, Apr 18, 2016 at 12:43 PM, Reynold Xin wrote: > Part of it is how difficult it

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
Part of it is how difficult it is to automate this. We can build a perfect engine with a lot of rules that understand everything. But the more complicated the rules we need, the less likely any of this is to happen. So I'd rather do this and create a nice enough message to tell contributors

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
bq. close the ones where they don't respond for a week Does this imply that the script understands responses from humans? Meaning, would the script use some regex which signifies that the contributor is willing to close the PR? If the contributor is willing to close, why wouldn't he / she do it

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Marcin Tustin
+1 and at the same time maybe surface a report to this list of PRs which need committer action and have only had submitters responding to pings in the last 30 days? On Mon, Apr 18, 2016 at 3:33 PM, Holden Karau wrote: > Personally I'd rather err on the side of keeping PRs

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Holden Karau
Personally I'd rather err on the side of keeping PRs open, but I understand wanting to keep the open PRs limited to ones which have a reasonable chance of being merged. What about if we filtered for non-mergeable PRs or instead left a comment asking the author to respond if they are still

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
Cody, Thanks for commenting. "Inactive" here means no code push nor comments, so any "ping" would actually keep the PR in the open queue. Getting auto-closed also by no means indicates that the pull request can't be reopened. On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger wrote:

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Ted Yu
I had one PR which got merged after 3 months. If the inactivity was due to contributor, I think it can be closed after 30 days. But if the inactivity was due to lack of review, the PR should be kept open. On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger wrote: > For what

more uniform exception handling?

2016-04-18 Thread Reynold Xin
Josh's pull request on rpc exception handling got me to think ... In my experience, a few exception-related things have created a lot of trouble for us in production debugging: 1. Some exception is thrown, but is caught by some

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Cody Koeninger
For what it's worth, I have definitely had PRs that sat inactive for more than 30 days due to committers not having time to look at them, but did eventually end up successfully being merged. I guess if this just ends up being a committer ping and reopening the PR, it's fine, but I don't know if

inter spark application communication

2016-04-18 Thread Soumitra Johri
Hi, I have two applications: App1 and App2. On a single cluster I have to spawn 5 instances of App1 and 1 instance of App2. What would be the best way to send data from the 5 App1 instances to the single App2 instance? Right now I am using Kafka to send data from one spark application to the

auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Reynold Xin
We have hit a new high in open pull requests: 469 today. While we can certainly get more review bandwidth, many of these are old and still open for other reasons. Some are stale because the original authors have become busy and inactive, and some others are stale because the committers are not

More elaborate toString for StreamExecution?

2016-04-18 Thread Jacek Laskowski
Hi, I'd love a more elaborate toString for StreamExecution:

scala> sqlContext.streams.active.foreach(println)
Continuous Query - memStream [state = ACTIVE]
Continuous Query - hello2 [state = ACTIVE]
Continuous Query - hello [state = ACTIVE]

Any work in this area? trigger is something it
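A self-contained sketch of what a richer toString might carry (the field names here are illustrative and not StreamExecution's actual internals):

```scala
// Sketch: a toString carrying name, state, trigger, and source info, in the
// spirit of the "Continuous Query - name [state = ACTIVE]" output above.
// The fields are illustrative; StreamExecution's real internals differ.
final case class QueryStatus(name: String, state: String,
                             trigger: String, sources: Seq[String]) {
  override def toString: String =
    s"Continuous Query - $name [state = $state, trigger = $trigger, " +
      s"sources = ${sources.mkString("[", ", ", "]")}]"
}

object ToStringSketch {
  def main(args: Array[String]): Unit =
    println(QueryStatus("memStream", "ACTIVE", "ProcessingTime(10 seconds)", Seq("memory")))
    // Continuous Query - memStream [state = ACTIVE,
    //   trigger = ProcessingTime(10 seconds), sources = [memory]]
}
```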

Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-18 Thread Luciano Resende
Evan, As long as you meet the criteria we discussed on this thread, you are welcome to join. Having said that, I have already seen other contributors that are very active on some of the connectors but are not Apache Committers yet, and I wanted to be fair, and also avoid using the project as an

Re: Implicit from ProcessingTime to scala.concurrent.duration.Duration?

2016-04-18 Thread Reynold Xin
Nope. It is unclear whether they would be useful enough or not. But when designing APIs we always need to anticipate future changes. On Monday, April 18, 2016, Jacek Laskowski wrote: > When you say "in the future", do you have any specific timeframe in > mind? You got me

Re: SparkSQL - Limit pushdown on BroadcastHashJoin

2016-04-18 Thread Andrew Ray
While you can't automatically push the limit *through* the join, we could push it *into* the join (stop processing after generating 10 records). I believe that is what Rajesh is suggesting. On Tue, Apr 12, 2016 at 7:46 AM, Herman van Hövell tot Westerflier < hvanhov...@questtec.nl> wrote: > I am
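The "push the limit into the join" idea can be sketched on plain collections (a sketch of the idea only; the real change would live in BroadcastHashJoin's code generation): because the join's output iterator is consumed lazily, wrapping it with take(n) stops probing the stream side after n matches.

```scala
// Sketch: a hash join whose output is consumed lazily, so .take(limit)
// stops probing the stream side after `limit` matches are produced.
object LimitIntoJoinSketch {
  def main(args: Array[String]): Unit = {
    val buildSide = Map(1 -> "a", 2 -> "b", 3 -> "c")             // broadcast (hashed) side
    var probed = 0
    val streamSide = Iterator.from(1).map { k => probed += 1; k } // "infinite" stream side

    // Inner hash join: look each stream row up in the broadcast hash map.
    val joined = streamSide.flatMap(k => buildSide.get(k).map(v => (k, v)))
    val result = joined.take(2).toList
    println(result) // List((1,a), (2,b))
    println(probed) // 2 -- only two stream rows were probed, not the whole input
  }
}
```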

Re: Implicit from ProcessingTime to scala.concurrent.duration.Duration?

2016-04-18 Thread Jacek Laskowski
When you say "in the future", do you have any specific timeframe in mind? You got me curious :) Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Mon, Apr 18, 2016

Re: Implicit from ProcessingTime to scala.concurrent.duration.Duration?

2016-04-18 Thread Reynold Xin
The problem with this is that we might introduce event time based trigger in the future, and then it would be more confusing... On Monday, April 18, 2016, Jacek Laskowski wrote: > Hi, > > While working with structured streaming (aka SparkSQL Streams :)) I > thought about adding

Re: [build system] issue w/jenkins

2016-04-18 Thread shane knapp
somehow DNS, internal to berkeley, got borked and the redirect failed. we've hard-coded some entries into /etc/hosts, and re-ordered our nameservers, and are still trying to figure out what happened. anyways, we're back: https://amplab.cs.berkeley.edu/jenkins/ On Mon, Apr 18, 2016 at 10:22

Implicit from ProcessingTime to scala.concurrent.duration.Duration?

2016-04-18 Thread Jacek Laskowski
Hi, While working with structured streaming (aka SparkSQL Streams :)) I thought about adding

implicit def toProcessingTime(duration: Duration) = ProcessingTime(duration)

What do you think? I think it'd improve the API:

.trigger(ProcessingTime(10 seconds))

vs

.trigger(10 seconds)

(since
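The proposal, spelled out as compiling code. ProcessingTime is stood in for by a local case class here, since this sketch deliberately avoids a Spark dependency; only the implicit itself comes from the message above.

```scala
import scala.concurrent.duration._
import scala.language.implicitConversions

// Stand-in for Spark's ProcessingTime, so the sketch compiles without Spark.
final case class ProcessingTime(intervalMs: Long)

object TriggerSketch {
  // The proposed implicit: lets callers pass a Duration wherever a
  // ProcessingTime is expected, e.g. .trigger(10.seconds).
  implicit def toProcessingTime(duration: Duration): ProcessingTime =
    ProcessingTime(duration.toMillis)

  def trigger(pt: ProcessingTime): String = s"trigger every ${pt.intervalMs} ms"

  def main(args: Array[String]): Unit =
    println(trigger(10.seconds)) // trigger every 10000 ms
}
```

Reynold's objection below is about API evolution, not mechanics: once an event-time trigger exists, a bare Duration no longer names which trigger it means.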

Re: [build system] issue w/jenkins

2016-04-18 Thread shane knapp
for now, you can log in to jenkins by ignoring the http reverse proxy: https://hadrian.ist.berkeley.edu/jenkins/ this still doesn't allow for things like the pull request builder and whatnot to run... i'm still digging in to this. thanks, shane On Mon, Apr 18, 2016 at 10:02 AM, shane knapp

Re: BytesToBytes and unaligned memory

2016-04-18 Thread Ted Yu
bq. run the tests claiming to require unaligned memory access on a platform where unaligned memory access is definitely not supported for shorts/ints/longs. That would help us understand interactions on s390x platform better. On Mon, Apr 18, 2016 at 6:49 AM, Adam Roberts

Re: BytesToBytes and unaligned memory

2016-04-18 Thread Adam Roberts
Ted, yes, with the forced true value all tests pass; we use the unaligned check in 15 other suites. Our java.nio.Bits.unaligned() function checks that the detected os.arch value matches a list of known implementations (not including s390x). We could add it to the known architectures in the
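For reference, a sketch of the reflective check being discussed (it mirrors the approach described above; the details are a sketch, not Spark's exact source, and it only needs the JDK):

```scala
// Sketch: ask the JDK's internal java.nio.Bits whether unaligned accesses
// are supported, falling back to an os.arch allow-list if reflection fails.
object UnalignedCheck {
  def unaligned(): Boolean =
    try {
      val bits = Class.forName("java.nio.Bits")
      val m = bits.getDeclaredMethod("unaligned")
      m.setAccessible(true)
      m.invoke(null).asInstanceOf[Boolean]
    } catch {
      case _: Throwable =>
        // Architectures known to tolerate unaligned access; note s390x is
        // absent, which is the gap discussed in this thread.
        val arch = System.getProperty("os.arch", "")
        Set("x86_64", "amd64", "i386", "x86").contains(arch)
    }

  def main(args: Array[String]): Unit =
    println(s"unaligned memory access supported: ${unaligned()}")
}
```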

Re: Code freeze?

2016-04-18 Thread Sean Owen
FWIW, here's what I do to look at JIRA's answer to this:
1) Go download http://almworks.com/jiraclient/overview.html
2) Set up a query for "target = 2.0.0 and status = Open, In Progress, Reopened"
3) Set up sub-queries for bugs vs non-bugs, and for critical, blocker and other
Right now there are
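Roughly the same query can be typed into JIRA's advanced search as JQL, without a desktop client. The "Target Version/s" field name is an assumption and may differ per JIRA instance:

```
project = SPARK AND "Target Version/s" = 2.0.0
  AND status in (Open, "In Progress", Reopened)
  AND type = Bug
ORDER BY priority DESC
```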

Re: Code freeze?

2016-04-18 Thread Pete Robbins
Is there a list of Jiras to be considered for 2.0? I would really like to get https://issues.apache.org/jira/browse/SPARK-13745 in so that Big Endian platforms are not broken. Cheers, On Wed, 13 Apr 2016 at 08:51 Reynold Xin wrote: > I think the main things are API things