Re: Decision forests don't work with non-trivial categorical features

2014-10-13 Thread Manish Amde
Sean, sorry for missing out on the discussion. Evan, you are correct, we are using the heuristic Sean suggested during the multiclass PR for ordering high-arity categorical variables using the impurity values for each categorical feature. Joseph, thanks for fixing the bug which I think was a regr

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Krishna Sankar
Well done, guys. The MapReduce sort was a good feat at the time, and Spark has now raised the bar with the ability to sort a PB. Like some of the folks on the list, I think a summary of what worked (and didn't), as well as the monitoring practices, would be good. Cheers P.S.: What are you folks planning next? O

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Ilya Ganelin
Thank you for the details! Would you mind speaking to what tools proved most useful as far as identifying bottlenecks or bugs? Thanks again. On Oct 13, 2014 5:36 PM, "Matei Zaharia" wrote: > The biggest scaling issue was supporting a large number of reduce tasks > efficiently, which the JIRAs in

Re: new jenkins update + tentative release date

2014-10-13 Thread Nicholas Chammas
*fingers crossed* On Mon, Oct 13, 2014 at 5:54 PM, shane knapp wrote: > ok, i found something that may help: > > https://issues.jenkins-ci.org/browse/JENKINS-20445?focusedCommentId=195638&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-195638 > > i set this to 20 mi

Re: new jenkins update + tentative release date

2014-10-13 Thread shane knapp
ok, i found something that may help: https://issues.jenkins-ci.org/browse/JENKINS-20445?focusedCommentId=195638&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-195638 i set this to 20 minutes... let's see if that helps. On Mon, Oct 13, 2014 at 2:48 PM, Nicholas Cham

Re: new jenkins update + tentative release date

2014-10-13 Thread Nicholas Chammas
Ah, that sucks. Thank you for looking into this. On Mon, Oct 13, 2014 at 5:43 PM, shane knapp wrote: > On Mon, Oct 13, 2014 at 2:28 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Thanks for doing this work Shane. >> >> So is Jenkins in the new datacenter now? Do you know if the

Re: new jenkins update + tentative release date

2014-10-13 Thread shane knapp
On Mon, Oct 13, 2014 at 2:28 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Thanks for doing this work Shane. > > So is Jenkins in the new datacenter now? Do you know if the problems with > checking out patches from GitHub should be resolved now? Here's an > example from the past hour

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Matei Zaharia
The biggest scaling issue was supporting a large number of reduce tasks efficiently, which the JIRAs in that post handle. In particular, our current default shuffle (the hash-based one) has each map task open a separate file output stream for each reduce task, which wastes a lot of memory (since
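A rough sketch of why one output stream per reduce task gets expensive. The helper name `shuffle_buffer_bytes` is hypothetical, and the reduce-task count, concurrent map tasks, and per-stream buffer size are illustrative assumptions, not figures confirmed in this message:

```python
def shuffle_buffer_bytes(num_reduce_tasks, concurrent_map_tasks, buffer_bytes):
    """Bytes held by stream buffers when every running map task keeps
    one open output stream per reduce task (hash-based shuffle)."""
    open_streams = num_reduce_tasks * concurrent_map_tasks
    return open_streams * buffer_bytes


# Assumed, illustrative inputs (not taken from the message above).
reducers = 250_000          # assumed number of reduce tasks
running_maps = 32           # assumed map tasks running at once on a node
buffer_size = 32 * 1024     # assumed 32 KiB buffer per open stream

total = shuffle_buffer_bytes(reducers, running_maps, buffer_size)
print(f"{total / 2**30:.1f} GiB of stream buffers on one node")
```

The product grows linearly in the number of reduce tasks, which is why a very large reducer count strains a per-reducer-stream design.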

Re: new jenkins update + tentative release date

2014-10-13 Thread Nicholas Chammas
Thanks for doing this work Shane. So is Jenkins in the new datacenter now? Do you know if the problems with checking out patches from GitHub should be resolved now? Here's an example from the past hour. Nick On

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
One thing that confused me a lot during debugging is the error message. The important one, "WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@xxx:50278] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].", is logged only at level WARN. Jianshi

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
It turned out this was caused by this issue: https://issues.apache.org/jira/browse/SPARK-3923 Setting spark.akka.heartbeat.interval to 100 solved it. Jianshi On Mon, Oct 13, 2014 at 4:24 PM, Jianshi Huang wrote: > Hmm... it failed again, just lasted a little bit longer. > > Jianshi > > On Mon, Oct 13,

Re: Decision forests don't work with non-trivial categorical features

2014-10-13 Thread Joseph Bradley
I think this is the fix: In this file: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala methods "getFeatureOffset" and "getLeftRightFeatureOffsets" have sanity checks ("require") which are correct for DecisionTree but not for RandomForest. You can remove those. I'v

Re: Decision forests don't work with non-trivial categorical features

2014-10-13 Thread Sean Owen
Great, we'll confer then. I'm using master / 1.2.0-SNAPSHOT. I'll send some details directly under separate cover. On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley wrote: > Hi Sean, > > Sorry I didn't see this thread earlier! (Thanks Ameet for pinging me.) > > Short version: That exception should

Re: Decision forests don't work with non-trivial categorical features

2014-10-13 Thread Joseph Bradley
Hi Sean, Sorry I didn't see this thread earlier! (Thanks Ameet for pinging me.) Short version: That exception should not be thrown, so there is a bug somewhere. The intended logic for handling high-arity categorical features is about the best one can do, as far as I know. Bug finding: For my c

Re: new jenkins update + tentative release date

2014-10-13 Thread shane knapp
AND WE ARE LIIIVE! https://amplab.cs.berkeley.edu/jenkins/ have at it, folks! On Mon, Oct 13, 2014 at 10:15 AM, shane knapp wrote: > quick update: we should be back up and running in the next ~60mins. > > On Mon, Oct 13, 2014 at 7:54 AM, shane knapp wrote: > >> Jenkins is in quiet mode a

Re: Scalastyle improvements / large code reformatting

2014-10-13 Thread Erik Erlandson
- Original Message - > I'm also against these huge reformattings. They slow down development and > backporting for trivial reasons. Let's not do that at this point, the style > of the current code is quite consistent and we have plenty of other things > to worry about. Instead, what you c

Re: Scalastyle improvements / large code reformatting

2014-10-13 Thread Nicholas Chammas
On Mon, Oct 13, 2014 at 11:57 AM, Patrick Wendell wrote: > That would even work for imports as well, > you'd just have a thing where if anyone modified some imports they > would have to fix all the imports in that file. It's at least worth a > try. > OK, that sounds like a fair compromise. I've

Re: new jenkins update + tentative release date

2014-10-13 Thread shane knapp
quick update: we should be back up and running in the next ~60mins. On Mon, Oct 13, 2014 at 7:54 AM, shane knapp wrote: > Jenkins is in quiet mode and the move will be starting after i have my > coffee. :) > > On Sun, Oct 12, 2014 at 11:26 PM, Josh Rosen wrote: > >> Reminder: this Jenkins mig

Re: Scalastyle improvements / large code reformatting

2014-10-13 Thread Marcelo Vanzin
Another option is to add new style rules that trigger too many errors as warnings, and slowly clean them up. This means that reviewers will be burdened with manually enforcing the rules for a while, and we need to remember to turn them to errors once some threshold is reached. (The Hadoop build ha

Re: Scalastyle improvements / large code reformatting

2014-10-13 Thread Patrick Wendell
Hey Nick, I think the best solution is really to find a way to only apply certain rules to code modified after a certain date. I also don't think it would be that hard to implement because git can output per-line information about modification times. So you'd just run the scalastyle rules and then
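A minimal sketch of the per-line idea, assuming the modification times come from `git blame --line-porcelain <file>`, whose output has a header line per source line, an `author-time` field in Unix seconds, and the source text prefixed with a tab. The function names and the `(line_no, message)` shape of a style violation are hypothetical; the real scalastyle output format is not shown in this thread:

```python
import re

# Header line: <40-hex sha> <orig-line> <final-line> [<lines-in-group>]
_HEADER = re.compile(r"^[0-9a-f]{40} \d+ (\d+)(?: \d+)?$")


def line_author_times(porcelain_text):
    """Map each final line number to its author-time (Unix seconds)."""
    times = {}
    current_line = None
    current_time = None
    for raw in porcelain_text.split("\n"):
        header = _HEADER.match(raw)
        if header:
            current_line = int(header.group(1))
        elif raw.startswith("author-time "):
            current_time = int(raw.split(" ", 1)[1])
        elif raw.startswith("\t") and current_line is not None:
            # The tab-prefixed source line closes the record for this line.
            times[current_line] = current_time
    return times


def violations_to_enforce(violations, times, cutoff):
    """Keep only (line_no, message) pairs on lines modified at or after cutoff."""
    return [(n, msg) for n, msg in violations if times.get(n, 0) >= cutoff]
```

Lines with no blame record default to time 0 here, so they are treated as old code and skipped; a stricter checker might choose the opposite default.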

Re: new jenkins update + tentative release date

2014-10-13 Thread shane knapp
Jenkins is in quiet mode and the move will be starting after i have my coffee. :) On Sun, Oct 12, 2014 at 11:26 PM, Josh Rosen wrote: > Reminder: this Jenkins migration is happening tomorrow morning (Monday). > > On Fri, Oct 10, 2014 at 1:01 PM, shane knapp wrote: > >> reminder: this IS happe

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread 欧阳晋(欧阳晋)
Great news! I still have some questions about this sort benchmark: 1. How many map tasks ran in the sort? I think the 250,000 is the # of reduce tasks. (If the unsorted file is read from HDFS using the default 64MB chunk size, will the # of map tasks be about 1PB/64MB?) 2. How long a single Reduce task r
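The 1PB/64MB estimate in question 1 can be worked out as below. This is only the arithmetic, under two assumptions: one map task per 64MB block and binary units (1PB = 2^50 bytes). Whether the benchmark actually used that split size is not stated here, and the helper name is hypothetical:

```python
def num_input_splits(total_bytes, block_bytes):
    """Blocks needed to cover total_bytes, rounding up (ceiling division)."""
    return -(-total_bytes // block_bytes)


PB = 2**50  # assumption: binary petabyte
MB = 2**20  # assumption: binary megabyte

splits = num_input_splits(PB, 64 * MB)
print(f"{splits:,} map tasks if each reads one 64MB block")
```

With decimal units (10^15 bytes and 64 x 10^6 bytes) the same division gives 15,625,000, so either reading puts the estimate in the tens of millions.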

Re: Scalastyle improvements / large code reformatting

2014-10-13 Thread Nicholas Chammas
The arguments against large scale refactorings make sense. Doing them, if at all, during QA cycles or around releases sounds like a promising idea. Coupled with that, would it be useful to implement new rules outside of these potential windows for refactoring in such a way that they report on styl

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
Hmm... it failed again, just lasted a little bit longer. Jianshi On Mon, Oct 13, 2014 at 4:15 PM, Jianshi Huang wrote: > https://issues.apache.org/jira/browse/SPARK-3106 > > I'm having the same errors described in SPARK-3106 (no other types of > errors confirmed), running a bunch of sql queries

Re: Decision forests don't work with non-trivial categorical features

2014-10-13 Thread Sean Owen
Hm, no I don't think I'm quite right there. There's an issue but that's not quite it. So I have a categorical feature with 40 values and 300 bins. The error I see in the end is: java.lang.IllegalArgumentException: requirement failed: DTStatsAggregator.getLeftRightFeatureOffsets is for unordered f

SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
https://issues.apache.org/jira/browse/SPARK-3106 I'm seeing the same errors described in SPARK-3106 (no other types of errors confirmed) when running a bunch of SQL queries on Spark 1.2.0 built from the latest master HEAD. Any updates on this issue? My main task is to join a huge fact table with a dozen

Re: Decision forests don't work with non-trivial categorical features

2014-10-13 Thread Sean Owen
I'm looking at this bit of code in DecisionTreeMetadata ... val maxCategoriesForUnorderedFeature = ((math.log(maxPossibleBins / 2 + 1) / math.log(2.0)) + 1).floor.toInt strategy.categoricalFeaturesInfo.foreach { case (featureIndex, numCategories) => // Decide if some categorical features shoul
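For reference, the threshold expression quoted above can be evaluated outside Spark. This is a Python port of that one expression only, assuming `maxPossibleBins` is an integer so that `/ 2` is integer division as in Scala; the function name is a hypothetical snake_case rendering, and the rest of the (truncated) logic is not reproduced:

```python
import math


def max_categories_for_unordered_feature(max_possible_bins):
    """Port of floor(log2(maxPossibleBins / 2 + 1) + 1) from the quoted snippet."""
    half_plus_one = max_possible_bins // 2 + 1  # integer division, as for a Scala Int
    return int(math.floor(math.log(half_plus_one) / math.log(2.0) + 1))


# 300 bins is the setting mentioned elsewhere in this thread.
print(max_categories_for_unordered_feature(300))
```

With 300 bins the value is 8 (log2(151) is a little over 7), well below the 40 categories mentioned elsewhere in this thread.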