Decision forests don't work with non-trivial categorical features

2014-10-12 Thread Sean Owen
I'm having trouble getting decision forests to work with categorical
features. I have a dataset with a categorical feature with 40 values.
It seems to be treated as a continuous/numeric value by the
implementation.

Digging deeper, I see there is some logic in the code indicating that
categorical features with N values do not work unless the number of
bins is at least 2 * (2^(N-1) - 1). I understand this as the naive
brute-force condition, wherein the decision tree tests every possible
binary split of the categorical values.
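
To make the blow-up concrete, here is a rough back-of-the-envelope
sketch (a made-up standalone helper, not MLlib code) of how many bins
that condition demands:

object BinCountSketch {
  // Bins required by the exhaustive rule for an unordered categorical
  // feature with n values: two bins per candidate subset split.
  def binsRequired(n: Int): BigInt = BigInt(2) * (BigInt(2).pow(n - 1) - 1)

  def main(args: Array[String]): Unit =
    Seq(5, 10, 40).foreach(n => println(s"n = $n -> ${binsRequired(n)} bins"))
}

For n = 40 that works out to 2^40 - 2, roughly 1.1 trillion bins, which
is why the check trips immediately on my dataset.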

However, this becomes unusable quickly: the number of bins should be
at most tens or hundreds, so this requirement effectively rules out
categorical features with more than about 10 values. But, of course,
it's not unusual to have categorical features with high cardinality;
if anything, it's common.

There are some good heuristics for selecting 'bins' over categorical
features when the number of bins is far smaller than the complete,
exhaustive set.

Before I open a JIRA or go further: does anyone know what I'm talking
about, or am I mistaken? Is this a real limitation, and is it worth
pursuing these heuristics? I can't figure out how to proceed with
decision forests in MLlib otherwise.




reading/writing parquet decimal type

2014-10-12 Thread Michael Allman
Hello,

I'm interested in reading/writing parquet SchemaRDDs that support the Parquet 
Decimal converted type. The first thing I did was update the Spark parquet 
dependency to version 1.5.0, as this version introduced support for decimals in 
parquet. However, conversion between the catalyst decimal type and the parquet 
decimal type is complicated by the fact that the catalyst type does not specify 
a decimal precision and scale but the parquet type requires them.

I'm wondering if perhaps we could add an optional precision and scale to the 
catalyst decimal type? The catalyst decimal type would have unspecified 
precision and scale by default for backwards compatibility, but users who want 
to serialize a SchemaRDD with decimal(s) to parquet would have to narrow their 
decimal type(s) by specifying a precision and scale.
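
To make the shape of the proposal concrete, here is a rough Scala
sketch; the names are made up and this is not the actual catalyst type:

// Rough sketch only -- not the real catalyst DecimalType.
// Precision/scale are optional so existing code keeps its "unlimited"
// behavior; writing to parquet would require them to be specified.
case class DecimalTypeSketch(precisionAndScale: Option[(Int, Int)] = None) {
  def isNarrowed: Boolean = precisionAndScale.isDefined
}

object DecimalTypeSketch {
  // Unspecified precision/scale, as today.
  val Unlimited: DecimalTypeSketch = DecimalTypeSketch(None)

  // Narrowed decimal, which is what parquet serialization would require.
  def apply(precision: Int, scale: Int): DecimalTypeSketch = {
    require(scale >= 0 && scale <= precision, "need 0 <= scale <= precision")
    DecimalTypeSketch(Some((precision, scale)))
  }
}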

Thoughts?

Michael



Re: reading/writing parquet decimal type

2014-10-12 Thread Matei Zaharia
Hi Michael,

I've been working on this in my repo: 
https://github.com/mateiz/spark/tree/decimal. I'll make some pull requests with 
these features soon, but meanwhile you can try this branch. See 
https://github.com/mateiz/spark/compare/decimal for the individual commits that 
went into it. It has exactly the precision stuff you need, plus some 
optimizations for working on decimals.

Matei

On Oct 12, 2014, at 1:51 PM, Michael Allman  wrote:

> Hello,
> 
> I'm interested in reading/writing parquet SchemaRDDs that support the Parquet 
> Decimal converted type. The first thing I did was update the Spark parquet 
> dependency to version 1.5.0, as this version introduced support for decimals 
> in parquet. However, conversion between the catalyst decimal type and the 
> parquet decimal type is complicated by the fact that the catalyst type does 
> not specify a decimal precision and scale but the parquet type requires them.
> 
> I'm wondering if perhaps we could add an optional precision and scale to the 
> catalyst decimal type? The catalyst decimal type would have unspecified 
> precision and scale by default for backwards compatibility, but users who 
> want to serialize a SchemaRDD with decimal(s) to parquet would have to narrow 
> their decimal type(s) by specifying a precision and scale.
> 
> Thoughts?
> 
> Michael
> 





Re: Decision forests don't work with non-trivial categorical features

2014-10-12 Thread Evan Sparks
I was under the impression that we were using the usual
sort-by-average-response-value heuristic when storing histogram bins (and
searching for optimal splits) in the tree code.
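
For anyone who hasn't seen it, here is a minimal sketch of that
heuristic for a single categorical feature (illustrative Scala, not the
MLlib implementation): order the categories by mean label, then only
consider the contiguous prefixes of that ordering as candidate splits.

// Illustrative sketch of the sort-by-average-response heuristic
// (Breiman's ordering trick), not the MLlib code.
object CategoricalSplitSketch {
  // rows: (categoryId, label) pairs. Returns the candidate "left" subsets
  // to test: one per contiguous prefix of the categories sorted by mean
  // label, i.e. numCategories - 1 candidates instead of 2^(N-1) - 1.
  def candidateSplits(rows: Seq[(Int, Double)]): Seq[Set[Int]] = {
    val orderedCategories = rows
      .groupBy(_._1)
      .map { case (cat, rs) => cat -> rs.map(_._2).sum / rs.size }
      .toSeq
      .sortBy(_._2)
      .map(_._1)
    (1 until orderedCategories.size).map(i => orderedCategories.take(i).toSet)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq((0, 1.0), (0, 0.0), (1, 1.0), (1, 1.0), (2, 0.0))
    println(candidateSplits(rows)) // Vector(Set(2), Set(2, 0))
  }
}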

Maybe Manish or Joseph can clarify?

> On Oct 12, 2014, at 2:50 PM, Sean Owen  wrote:
> 
> I'm having trouble getting decision forests to work with categorical
> features. I have a dataset with a categorical feature with 40 values.
> It seems to be treated as a continuous/numeric value by the
> implementation.
> 
> Digging deeper, I see there is some logic in the code indicating that
> categorical features with N values do not work unless the number of
> bins is at least 2 * (2^(N-1) - 1). I understand this as the naive
> brute-force condition, wherein the decision tree tests every possible
> binary split of the categorical values.
> 
> However, this becomes unusable quickly: the number of bins should be
> at most tens or hundreds, so this requirement effectively rules out
> categorical features with more than about 10 values. But, of course,
> it's not unusual to have categorical features with high cardinality;
> if anything, it's common.
> 
> There are some good heuristics for selecting 'bins' over categorical
> features when the number of bins is far smaller than the complete,
> exhaustive set.
> 
> Before I open a JIRA or go further: does anyone know what I'm talking
> about, or am I mistaken? Is this a real limitation, and is it worth
> pursuing these heuristics? I can't figure out how to proceed with
> decision forests in MLlib otherwise.
> 
> 




Re: reading/writing parquet decimal type

2014-10-12 Thread Michael Allman
Hi Matei,

Thanks, I can see you've been hard at work on this! I examined your patch and 
do have a question. It appears you're limiting the precision of decimals 
written to parquet to those that will fit in a long, yet you're writing the 
values as a parquet binary type. Why not write them using the int64 parquet 
type instead?

Cheers,

Michael

On Oct 12, 2014, at 3:32 PM, Matei Zaharia  wrote:

> Hi Michael,
> 
> I've been working on this in my repo: 
> https://github.com/mateiz/spark/tree/decimal. I'll make some pull requests 
> with these features soon, but meanwhile you can try this branch. See 
> https://github.com/mateiz/spark/compare/decimal for the individual commits 
> that went into it. It has exactly the precision stuff you need, plus some 
> optimizations for working on decimals.
> 
> Matei
> 
> On Oct 12, 2014, at 1:51 PM, Michael Allman  wrote:
> 
>> Hello,
>> 
>> I'm interested in reading/writing parquet SchemaRDDs that support the 
>> Parquet Decimal converted type. The first thing I did was update the Spark 
>> parquet dependency to version 1.5.0, as this version introduced support for 
>> decimals in parquet. However, conversion between the catalyst decimal type 
>> and the parquet decimal type is complicated by the fact that the catalyst 
>> type does not specify a decimal precision and scale but the parquet type 
>> requires them.
>> 
>> I'm wondering if perhaps we could add an optional precision and scale to the 
>> catalyst decimal type? The catalyst decimal type would have unspecified 
>> precision and scale by default for backwards compatibility, but users who 
>> want to serialize a SchemaRDD with decimal(s) to parquet would have to 
>> narrow their decimal type(s) by specifying a precision and scale.
>> 
>> Thoughts?
>> 
>> Michael
>> 
> 





Scalastyle improvements / large code reformatting

2014-10-12 Thread Josh Rosen
There are a number of open pull requests that aim to extend Spark’s automated 
style checks (see https://issues.apache.org/jira/browse/SPARK-3849 for an 
umbrella JIRA).  These fixes are mostly good, but I have some concerns about 
merging these patches.  Several of these patches make large reformatting 
changes in nearly every file of Spark, which makes it more difficult to use 
`git blame` and has the potential to introduce merge conflicts with all open 
PRs and all backport patches.

I feel that most of the value of automated style-checks comes from allowing 
reviewers/committers to focus on the technical content of pull requests rather 
than their formatting.  My concern is that the convenience added by these new 
style rules will not outweigh the other overheads that these reformatting 
patches will create for the committers.

If possible, it would be great if we could extend the style checker to enforce 
the more stringent rules only for new code additions / deletions.  If not, I 
don’t think that we should proceed with the reformatting.  Others might 
disagree, though, so I welcome comments / discussion.

- Josh

Re: Scalastyle improvements / large code reformatting

2014-10-12 Thread Reynold Xin
I actually think we should just bite the bullet and follow through with the
reformatting. Many rules are simply not possible to enforce only on deltas
(e.g. import ordering).

That said, maybe there are better windows to do this, e.g. during the QA
period.

On Sun, Oct 12, 2014 at 9:37 PM, Josh Rosen  wrote:

> There are a number of open pull requests that aim to extend Spark’s
> automated style checks (see
> https://issues.apache.org/jira/browse/SPARK-3849 for an umbrella JIRA).
> These fixes are mostly good, but I have some concerns about merging these
> patches.  Several of these patches make large reformatting changes in
> nearly every file of Spark, which makes it more difficult to use `git
> blame` and has the potential to introduce merge conflicts with all open PRs
> and all backport patches.
>
> I feel that most of the value of automated style-checks comes from
> allowing reviewers/committers to focus on the technical content of pull
> requests rather than their formatting.  My concern is that the convenience
> added by these new style rules will not outweigh the other overheads that
> these reformatting patches will create for the committers.
>
> If possible, it would be great if we could extend the style checker to
> enforce the more stringent rules only for new code additions / deletions.
> If not, I don’t think that we should proceed with the reformatting.  Others
> might disagree, though, so I welcome comments / discussion.
>
> - Josh


Re: Scalastyle improvements / large code reformatting

2014-10-12 Thread Patrick Wendell
Another big problem with these patches is that they make it almost
impossible to backport changes to older branches cleanly (there is
essentially a 100% chance of a merge conflict).

One proposal is to do this:
1. We only consider new style rules at the end of a release cycle,
when there is the smallest chance of wanting to backport stuff.
2. We require that they are submitted in individual patches with (a) a
new style rule and (b) the associated changes. Then we can also
evaluate on a case-by-case basis how large the change is for each
rule. For rules that require sweeping changes across the codebase,
personally I'd vote against them. For rules like import ordering that
won't cause too much pain on the diff (it's pretty easy to deal with
those conflicts) I'd be okay with it.

If we went with this, we'd also have to warn people that we might not
accept new style rules if they are too costly to enforce. I'm guessing
people will still contribute even with those expectations.

- Patrick

On Sun, Oct 12, 2014 at 9:39 PM, Reynold Xin  wrote:
> I actually think we should just bite the bullet and follow through with the
> reformatting. Many rules are simply not possible to enforce only on deltas
> (e.g. import ordering).
>
> That said, maybe there are better windows to do this, e.g. during the QA
> period.
>
> On Sun, Oct 12, 2014 at 9:37 PM, Josh Rosen  wrote:
>
>> There are a number of open pull requests that aim to extend Spark's
>> automated style checks (see
>> https://issues.apache.org/jira/browse/SPARK-3849 for an umbrella JIRA).
>> These fixes are mostly good, but I have some concerns about merging these
>> patches.  Several of these patches make large reformatting changes in
>> nearly every file of Spark, which makes it more difficult to use `git
>> blame` and has the potential to introduce merge conflicts with all open PRs
>> and all backport patches.
>>
>> I feel that most of the value of automated style-checks comes from
>> allowing reviewers/committers to focus on the technical content of pull
>> requests rather than their formatting.  My concern is that the convenience
>> added by these new style rules will not outweigh the other overheads that
>> these reformatting patches will create for the committers.
>>
>> If possible, it would be great if we could extend the style checker to
>> enforce the more stringent rules only for new code additions / deletions.
>> If not, I don't think that we should proceed with the reformatting.  Others
>> might disagree, though, so I welcome comments / discussion.
>>
>> - Josh




Re: Scalastyle improvements / large code reformatting

2014-10-12 Thread Matei Zaharia
I'm also against these huge reformattings. They slow down development and
backporting for trivial reasons. Let's not do that at this point; the style of
the current code is quite consistent, and we have plenty of other things to
worry about. Instead, what you can do is fix up any style issues you see in a
file as you edit it while working on a feature. Or, as Josh suggested, some way
to make this apply only to new files would help.

Matei

On Oct 12, 2014, at 10:16 PM, Patrick Wendell  wrote:

> Another big problem with these patches is that they make it almost
> impossible to backport changes to older branches cleanly (there is
> essentially a 100% chance of a merge conflict).
> 
> One proposal is to do this:
> 1. We only consider new style rules at the end of a release cycle,
> when there is the smallest chance of wanting to backport stuff.
> 2. We require that they are submitted in individual patches with (a) a
> new style rule and (b) the associated changes. Then we can also
> evaluate on a case-by-case basis how large the change is for each
> rule. For rules that require sweeping changes across the codebase,
> personally I'd vote against them. For rules like import ordering that
> won't cause too much pain on the diff (it's pretty easy to deal with
> those conflicts) I'd be okay with it.
> 
> If we went with this, we'd also have to warn people that we might not
> accept new style rules if they are too costly to enforce. I'm guessing
> people will still contribute even with those expectations.
> 
> - Patrick
> 
> On Sun, Oct 12, 2014 at 9:39 PM, Reynold Xin  wrote:
>> I actually think we should just bite the bullet and follow through with the
>> reformatting. Many rules are simply not possible to enforce only on deltas
>> (e.g. import ordering).
>> 
>> That said, maybe there are better windows to do this, e.g. during the QA
>> period.
>> 
>> On Sun, Oct 12, 2014 at 9:37 PM, Josh Rosen  wrote:
>> 
>>> There are a number of open pull requests that aim to extend Spark's
>>> automated style checks (see
>>> https://issues.apache.org/jira/browse/SPARK-3849 for an umbrella JIRA).
>>> These fixes are mostly good, but I have some concerns about merging these
>>> patches.  Several of these patches make large reformatting changes in
>>> nearly every file of Spark, which makes it more difficult to use `git
>>> blame` and has the potential to introduce merge conflicts with all open PRs
>>> and all backport patches.
>>> 
>>> I feel that most of the value of automated style-checks comes from
>>> allowing reviewers/committers to focus on the technical content of pull
>>> requests rather than their formatting.  My concern is that the convenience
>>> added by these new style rules will not outweigh the other overheads that
>>> these reformatting patches will create for the committers.
>>> 
>>> If possible, it would be great if we could extend the style checker to
>>> enforce the more stringent rules only for new code additions / deletions.
>>> If not, I don't think that we should proceed with the reformatting.  Others
>>> might disagree, though, so I welcome comments / discussion.
>>> 
>>> - Josh
> 
> 





Re: reading/writing parquet decimal type

2014-10-12 Thread Matei Zaharia
The fixed-length binary type can hold fewer bytes than an int64, though many 
encodings of int64 can probably do the right thing. We can look into supporting 
multiple ways to do this -- the spec does say that you should at least be able 
to read int32s and int64s.
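
As a quick illustration of the space trade-off (assuming the unscaled
value is stored as a big-endian two's-complement integer; this helper
is made up for the example, not code from the branch):

object DecimalWidthSketch {
  // Smallest fixed_len_byte_array length (in bytes) whose signed value
  // can hold the largest unscaled number of the given precision,
  // i.e. 10^precision - 1.
  def minBytesForPrecision(precision: Int): Int = {
    val bitsNeeded = precision * (math.log(10) / math.log(2)) + 1 // value bits + sign
    math.ceil(bitsNeeded / 8.0).toInt
  }

  def main(args: Array[String]): Unit =
    Seq(5, 9, 16, 18).foreach(p =>
      println(s"precision $p -> ${minBytesForPrecision(p)} bytes"))
  // precision 5 -> 3, 9 -> 4, 16 -> 7, 18 -> 8 (the width of an int64),
  // so for smaller precisions the fixed-length binary saves space.
}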

Matei

On Oct 12, 2014, at 8:20 PM, Michael Allman  wrote:

> Hi Matei,
> 
> Thanks, I can see you've been hard at work on this! I examined your patch and 
> do have a question. It appears you're limiting the precision of decimals 
> written to parquet to those that will fit in a long, yet you're writing the 
> values as a parquet binary type. Why not write them using the int64 parquet 
> type instead?
> 
> Cheers,
> 
> Michael
> 
> On Oct 12, 2014, at 3:32 PM, Matei Zaharia  wrote:
> 
>> Hi Michael,
>> 
>> I've been working on this in my repo: 
>> https://github.com/mateiz/spark/tree/decimal. I'll make some pull requests 
>> with these features soon, but meanwhile you can try this branch. See 
>> https://github.com/mateiz/spark/compare/decimal for the individual commits 
>> that went into it. It has exactly the precision stuff you need, plus some 
>> optimizations for working on decimals.
>> 
>> Matei
>> 
>> On Oct 12, 2014, at 1:51 PM, Michael Allman  wrote:
>> 
>>> Hello,
>>> 
>>> I'm interested in reading/writing parquet SchemaRDDs that support the 
>>> Parquet Decimal converted type. The first thing I did was update the Spark 
>>> parquet dependency to version 1.5.0, as this version introduced support for 
>>> decimals in parquet. However, conversion between the catalyst decimal type 
>>> and the parquet decimal type is complicated by the fact that the catalyst 
>>> type does not specify a decimal precision and scale but the parquet type 
>>> requires them.
>>> 
>>> I'm wondering if perhaps we could add an optional precision and scale to 
>>> the catalyst decimal type? The catalyst decimal type would have unspecified 
>>> precision and scale by default for backwards compatibility, but users who 
>>> want to serialize a SchemaRDD with decimal(s) to parquet would have to 
>>> narrow their decimal type(s) by specifying a precision and scale.
>>> 
>>> Thoughts?
>>> 
>>> Michael
>>> 
>> 
> 





Re: new jenkins update + tentative release date

2014-10-12 Thread Josh Rosen
Reminder: this Jenkins migration is happening tomorrow morning (Monday).

On Fri, Oct 10, 2014 at 1:01 PM, shane knapp  wrote:

> reminder:  this IS happening, first thing monday morning PDT.  :)
>
> On Wed, Oct 8, 2014 at 3:01 PM, shane knapp  wrote:
>
> > greetings!
> >
> > i've got some updates regarding our new jenkins infrastructure, as well
> as
> > the initial date and plan for rolling things out:
> >
> > *** current testing/build break whack-a-mole:
> > a lot of out of date artifacts are cached in the current jenkins, which
> > has caused a few builds during my testing to break due to dependency
> > resolution failure[1][2].
> >
> > bumping these versions can cause your builds to fail, due to public api
> > changes and the like.  consider yourself warned that some projects might
> > require some debugging...  :)
> >
> > tomorrow, i will be at databricks working w/@joshrosen to make sure that
> > the spark builds have any bugs hammered out.
> >
> > ***  deployment plan:
> > unless something completely horrible happens, THE NEW JENKINS WILL GO
> LIVE
> > ON MONDAY (october 13th).
> >
> > all jenkins infrastructure will be DOWN for the entirety of the day
> > (starting at ~8am).  this means no builds, period.  i'm hoping that the
> > downtime will be much shorter than this, but we'll have to see how
> > everything goes.
> >
> > all test/build history WILL BE PRESERVED.  i will be rsyncing the jenkins
> > jobs/ directory over, complete w/history as part of the deployment.
> >
> > once i'm feeling good about the state of things, i'll point the original
> > url to the new instances and send out an all clear.
> >
> > if you are a student at UC berkeley, you can log in to jenkins using your
> > LDAP login, and (by default) view but not change plans.  if you do not
> have
> > a UC berkeley LDAP login, you can still view plans anonymously.
> >
> > IF YOU ARE A PLAN ADMIN, THEN PLEASE REACH OUT, ASAP, PRIVATELY AND I
> WILL
> > SET UP ADMIN ACCESS TO YOUR BUILDS.
> >
> > ***  post deployment plan:
> > fix all of the things that break!
> >
> > i will be keeping a VERY close eye on the builds, checking for breaks,
> and
> > helping out where i can.  if the situation is dire, i can always roll
> back
> > to the old jenkins infra...  but i hope we never get to that point!  :)
> >
> > i'm hoping that things will go smoothly, but please be patient as i'm
> > certain we'll hit a few bumps in the road.
> >
> > please let me know if you guys have any comments/questions/concerns...
> :)
> >
> > shane
> >
> > 1 - https://github.com/bigdatagenomics/bdg-services/pull/18
> > 2 - https://github.com/bigdatagenomics/avocado/pull/111
> >
>