Decision forests don't work with non-trivial categorical features
I'm having trouble getting decision forests to work with categorical features. I have a dataset with a categorical feature with 40 values. It seems to be treated as a continuous/numeric value by the implementation.

Digging deeper, I see there is some logic in the code indicating that a categorical feature with N values does not work unless the number of bins is at least 2*((2^N - 1) - 1). I understand this as the naive brute-force condition, wherein the decision tree tests all possible splits of the categorical values. However, this becomes unusable quickly, since the number of bins should be tens or hundreds at most, so this requirement rules out categorical features with more than about 10 values. But of course it's not unusual to have categorical features with high cardinality. It's almost common.

There are some pretty good heuristics for selecting 'bins' over categorical features when the number of bins is far fewer than the complete, exhaustive set.

Before I open a JIRA or continue: does anyone know what I am talking about, or am I mistaken? Is this a real limitation, and is it worth pursuing these heuristics? I can't figure out how to proceed with decision forests in MLlib otherwise.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
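The combinatorial blow-up behind that condition is easy to quantify: an N-valued categorical feature has 2^(N-1) - 1 distinct non-trivial binary splits (each split then contributes two bins). A quick sketch in plain Python, purely for illustration (not MLlib code; `num_categorical_splits` is a made-up helper):

```python
# Number of distinct non-trivial binary splits of a categorical feature
# with n values: each value goes to the left or right branch, the two
# halves are symmetric, and the all-left/all-right split is excluded.
def num_categorical_splits(n: int) -> int:
    return 2 ** (n - 1) - 1

for n in [3, 10, 40]:
    print(n, num_categorical_splits(n))
# n=10 already needs 511 candidate splits; n=40 needs 549755813887,
# which is why an exhaustive search is hopeless at that cardinality.
```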
reading/writing parquet decimal type
Hello,

I'm interested in reading/writing parquet SchemaRDDs that support the Parquet Decimal converted type. The first thing I did was update the Spark parquet dependency to version 1.5.0, as this version introduced support for decimals in parquet. However, conversion between the catalyst decimal type and the parquet decimal type is complicated by the fact that the catalyst type does not specify a decimal precision and scale, but the parquet type requires them.

I'm wondering if perhaps we could add an optional precision and scale to the catalyst decimal type? The catalyst decimal type would have unspecified precision and scale by default for backwards compatibility, but users who want to serialize a SchemaRDD with decimal(s) to parquet would have to narrow their decimal type(s) by specifying a precision and scale.

Thoughts?

Michael
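To make the precision/scale issue concrete: Parquet represents a decimal as an unscaled integer together with a (precision, scale) pair fixed in the file schema, so a writer must know both up front. A minimal sketch of that conversion in plain Python (`to_unscaled` is a hypothetical helper, not a Spark or Parquet API):

```python
from decimal import Decimal

# Parquet's Decimal converted type stores value = unscaled * 10**(-scale),
# with (precision, scale) declared in the schema. Without a precision and
# scale on the catalyst side, there is no way to pick this representation.
def to_unscaled(value: Decimal, precision: int, scale: int) -> int:
    unscaled = int(value.scaleb(scale))  # shift the decimal point right by `scale`
    if len(str(abs(unscaled))) > precision:
        raise ValueError("value does not fit in decimal(%d, %d)" % (precision, scale))
    return unscaled

print(to_unscaled(Decimal("123.45"), precision=5, scale=2))  # prints 12345
```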
Re: reading/writing parquet decimal type
Hi Michael,

I've been working on this in my repo: https://github.com/mateiz/spark/tree/decimal. I'll make some pull requests with these features soon, but in the meantime you can try this branch. See https://github.com/mateiz/spark/compare/decimal for the individual commits that went into it. It has exactly the precision support you need, plus some optimizations for working with decimals.

Matei

On Oct 12, 2014, at 1:51 PM, Michael Allman wrote: [...]
Re: Decision forests don't work with non-trivial categorical features
I was under the impression that we were using the usual sort-by-average-response-value heuristic when storing histogram bins (and searching for optimal splits) in the tree code. Maybe Manish or Joseph can clarify?

On Oct 12, 2014, at 2:50 PM, Sean Owen wrote: [...]
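The heuristic mentioned above can be sketched in plain Python. This is illustrative only, not the MLlib implementation, and `candidate_splits` is a made-up helper; the guarantee that the optimal split lies among these n-1 candidates holds for regression (variance impurity) and binary classification:

```python
# Sort-by-average-response heuristic: order the categories by their mean
# label, then consider only the n-1 "prefix" subsets of that ordering as
# candidate left branches. For regression and binary classification, the
# optimal binary split is known to be among these candidates, reducing
# the search from 2^(n-1) - 1 subsets down to n-1.
def candidate_splits(labels_by_category):
    """labels_by_category: dict mapping category -> list of labels."""
    mean = lambda xs: sum(xs) / len(xs)
    ordered = sorted(labels_by_category, key=lambda c: mean(labels_by_category[c]))
    return [set(ordered[:i]) for i in range(1, len(ordered))]

# 3 categories -> only 2 candidate splits instead of 3 possible subsets
splits = candidate_splits({"a": [0.0, 1.0], "b": [1.0, 1.0], "c": [0.0, 0.0]})
```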
Re: reading/writing parquet decimal type
Hi Matei,

Thanks, I can see you've been hard at work on this! I examined your patch and do have a question. It appears you're limiting the precision of decimals written to parquet to those that will fit in a long, yet you're writing the values as a parquet binary type. Why not write them using the int64 parquet type instead?

Cheers,

Michael

On Oct 12, 2014, at 3:32 PM, Matei Zaharia wrote: [...]
Scalastyle improvements / large code reformatting
There are a number of open pull requests that aim to extend Spark’s automated style checks (see https://issues.apache.org/jira/browse/SPARK-3849 for an umbrella JIRA). These fixes are mostly good, but I have some concerns about merging these patches. Several of these patches make large reformatting changes in nearly every file of Spark, which makes it more difficult to use `git blame` and has the potential to introduce merge conflicts with all open PRs and all backport patches.

I feel that most of the value of automated style checks comes from allowing reviewers/committers to focus on the technical content of pull requests rather than their formatting. My concern is that the convenience added by these new style rules will not outweigh the overhead that these reformatting patches will create for the committers.

If possible, it would be great if we could extend the style checker to enforce the more stringent rules only for new code additions / deletions. If not, I don’t think we should proceed with the reformatting. Others might disagree, though, so I welcome comments / discussion.

- Josh
Re: Scalastyle improvements / large code reformatting
I actually think we should just bite the bullet and follow through with the reformatting. Many rules are simply not possible to enforce only on deltas (e.g. import ordering).

That said, maybe there are better windows in which to do this, e.g. during the QA period.

On Sun, Oct 12, 2014 at 9:37 PM, Josh Rosen wrote: [...]
Re: Scalastyle improvements / large code reformatting
Another big problem with these patches is that they make it almost impossible to backport changes to older branches cleanly (there is close to a 100% chance of a merge conflict).

One proposal is to do this:
1. We only consider new style rules at the end of a release cycle, when there is the smallest chance of wanting to backport stuff.
2. We require that they are submitted in individual patches with (a) the new style rule and (b) the associated changes. Then we can also evaluate on a case-by-case basis how large the change is for each rule. For rules that require sweeping changes across the codebase, personally I'd vote against them. For rules like import ordering that won't cause too much pain on the diff (it's pretty easy to deal with those conflicts), I'd be okay with it.

If we went with this, we'd also have to warn people that we might not accept new style rules if they are too costly to enforce. I'm guessing people will still contribute even with those expectations.

- Patrick

On Sun, Oct 12, 2014 at 9:39 PM, Reynold Xin wrote: [...]
Re: Scalastyle improvements / large code reformatting
I'm also against these huge reformattings. They slow down development and backporting for trivial reasons. Let's not do that at this point; the style of the current code is quite consistent, and we have plenty of other things to worry about. Instead, what you can do is fix up style issues you see in a file as you edit it while working on a feature. Or, as Josh suggested, some way to make this apply only to new files would help.

Matei

On Oct 12, 2014, at 10:16 PM, Patrick Wendell wrote: [...]
Re: reading/writing parquet decimal type
The fixed-length binary type can hold fewer bytes than an int64, though many encodings of int64 can probably do the right thing. We can look into supporting multiple ways to do this -- the spec does say that you should at least be able to read int32s and int64s.

Matei

On Oct 12, 2014, at 8:20 PM, Michael Allman wrote: [...]
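To make the size trade-off concrete: the number of bytes needed to hold a decimal's two's-complement unscaled value grows with its precision, and for precisions up to 18 it fits in 8 bytes (an int64) while smaller precisions need fewer. A rough sketch in plain Python (`min_bytes_for_precision` is a made-up helper, not from any Parquet library):

```python
# Minimum bytes needed to hold the two's-complement unscaled value of a
# decimal with the given number of base-10 digits (precision). This is
# why a fixed-length binary can be shorter than an int64 for small
# precisions, even when the value would also fit in a long.
def min_bytes_for_precision(precision: int) -> int:
    n = 1
    while 256 ** n // 2 - 1 < 10 ** precision - 1:  # 2^(8n-1) - 1 < 10^p - 1
        n += 1
    return n

for p in [5, 9, 18]:
    print(p, min_bytes_for_precision(p))
# precision 5 needs 3 bytes, 9 needs 4, and 18 needs exactly 8 (an int64)
```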
Re: new jenkins update + tentative release date
Reminder: this Jenkins migration is happening tomorrow morning (Monday).

On Fri, Oct 10, 2014 at 1:01 PM, shane knapp wrote:
> reminder: this IS happening, first thing monday morning PDT. :)
>
> On Wed, Oct 8, 2014 at 3:01 PM, shane knapp wrote:
>> greetings!
>>
>> i've got some updates regarding our new jenkins infrastructure, as well as the initial date and plan for rolling things out:
>>
>> *** current testing/build break whack-a-mole:
>> a lot of out of date artifacts are cached in the current jenkins, which has caused a few builds during my testing to break due to dependency resolution failure[1][2].
>>
>> bumping these versions can cause your builds to fail, due to public api changes and the like. consider yourself warned that some projects might require some debugging... :)
>>
>> tomorrow, i will be at databricks working w/@joshrosen to make sure that the spark builds have any bugs hammered out.
>>
>> *** deployment plan:
>> unless something completely horrible happens, THE NEW JENKINS WILL GO LIVE ON MONDAY (october 13th).
>>
>> all jenkins infrastructure will be DOWN for the entirety of the day (starting at ~8am). this means no builds, period. i'm hoping that the downtime will be much shorter than this, but we'll have to see how everything goes.
>>
>> all test/build history WILL BE PRESERVED. i will be rsyncing the jenkins jobs/ directory over, complete w/history, as part of the deployment.
>>
>> once i'm feeling good about the state of things, i'll point the original url to the new instances and send out an all clear.
>>
>> if you are a student at UC berkeley, you can log in to jenkins using your LDAP login and (by default) view but not change plans. if you do not have a UC berkeley LDAP login, you can still view plans anonymously.
>>
>> IF YOU ARE A PLAN ADMIN, THEN PLEASE REACH OUT, ASAP, PRIVATELY, AND I WILL SET UP ADMIN ACCESS TO YOUR BUILDS.
>>
>> *** post deployment plan:
>> fix all of the things that break!
>>
>> i will be keeping a VERY close eye on the builds, checking for breaks, and helping out where i can. if the situation is dire, i can always roll back to the old jenkins infra... but i hope we never get to that point! :)
>>
>> i'm hoping that things will go smoothly, but please be patient as i'm certain we'll hit a few bumps in the road.
>>
>> please let me know if you guys have any comments/questions/concerns... :)
>>
>> shane
>>
>> 1 - https://github.com/bigdatagenomics/bdg-services/pull/18
>> 2 - https://github.com/bigdatagenomics/avocado/pull/111