Re: [jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices

2008-10-20 Thread Ted Dunning
I would love to hear from Karl.  Whatever we choose to do has implications
on what kinds of attributes are reasonable to serialize.  He was the driving
force behind having something fancier than simple strings as attribute
values.

On Mon, Oct 20, 2008 at 10:03 PM, Jeff Eastman
<[EMAIL PROTECTED]> wrote:

> Ted Dunning wrote:
>
>> I see what you mean.
>>
>> To repeat in other words, the problems that need to be solved are:
>>
>> a) there are many uses already so adding attributes should be transparent
>> to those who don't use them
>>
>> b) the encoding should not be ad hoc because this would be our second ad
>> hoc encoding and only one should ever be allowed before using a standard
>>
>>
> +1
>
>> So here is a (kind of) concrete proposal:
>>
>> a) use JSON or Thrift for concrete syntax
>>
>>
> Any preferences here? This might also impact other Mahout packages in the
> future, so everybody please weigh in. In general, it seems that having a
> common, public encoding for matrix and vector data would help users mix and
> match the Mahout services. What are the requirements of these other
> services? From inspection, it looks like only the clustering packages use
> them currently.
>
> Jeff
>



-- 
ted


Re: [jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices

2008-10-20 Thread Jeff Eastman

Ted Dunning wrote:

> I see what you mean.
>
> To repeat in other words, the problems that need to be solved are:
>
> a) there are many uses already so adding attributes should be transparent to
> those who don't use them
>
> b) the encoding should not be ad hoc because this would be our second ad hoc
> encoding and only one should ever be allowed before using a standard

+1

> So here is a (kind of) concrete proposal:
>
> a) use JSON or Thrift for concrete syntax

Any preferences here? This might also impact other Mahout packages in 
the future, so everybody please weigh in. In general, it seems that 
having a common, public encoding for matrix and vector data would help 
users mix and match the Mahout services. What are the requirements of 
these other services? From inspection, it looks like only the clustering 
packages use them currently.


Jeff



Re: More proposed changes across code

2008-10-20 Thread Sean Owen
Ugh, yeah, thinking about that a couple more seconds would have been
useful. That is completely right.

On Mon, Oct 20, 2008 at 3:33 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> Double.longBitsToDouble(Random.nextLong())
>
>
> NO!!


Re: More proposed changes across code

2008-10-20 Thread Ted Dunning
On Mon, Oct 20, 2008 at 7:00 AM, Sean Owen <[EMAIL PROTECTED]> wrote:

> ..
> If you're looking for a good value to initialize some kind of "max"
> variable to, try {Float,Double}.NEGATIVE_INFINITY. It is smaller than
> every actual float/double.


Try using the first element that you are comparing to!

Anything else is pretty dangerous.


> To generate a truly random float or double, try
>
> Double.longBitsToDouble(Random.nextLong())


NO!!

This will not be the uniformly distributed number that you are probably
wanting.  It will have a very strange distribution that includes
denormalized numbers, NaNs, and infinities of various kinds.

Try gen.nextDouble() where gen is a java.util.Random or better random number
generator.

Even (double) Random.nextLong() / Long.MAX_VALUE would be better.
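
A minimal Java sketch of the point above, for illustration only. The class
and method names (RandomDoubleSketch, countNonFinite) are made up here and
are not Mahout code. It counts how many doubles built from raw random bits
come out as NaN or infinite, and shows the nextDouble() alternative.

```java
import java.util.Random;

class RandomDoubleSketch {

    // Counts how many of n doubles built from raw random bits are NaN or infinite.
    // Any bit pattern whose 11 exponent bits are all ones is one or the other.
    static int countNonFinite(Random gen, int n) {
        int bad = 0;
        for (int i = 0; i < n; i++) {
            double d = Double.longBitsToDouble(gen.nextLong());
            if (Double.isNaN(d) || Double.isInfinite(d)) {
                bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        Random gen = new Random(42L);

        // Raw bits: roughly one pattern in 2048 is NaN or an infinity.
        System.out.println("non-finite raw-bit doubles out of 1000000: "
                + countNonFinite(gen, 1000000));

        // Two specific bit patterns that raw bits can produce.
        System.out.println(Double.longBitsToDouble(0x7FF0000000000000L)); // Infinity
        System.out.println(Double.longBitsToDouble(0x7FF8000000000000L)); // NaN

        // nextDouble() is uniform in [0.0, 1.0).
        System.out.println("nextDouble sample: " + gen.nextDouble());
    }
}
```

The raw-bit values are also spread evenly over exponents rather than over
the number line, so even the finite ones are nowhere near uniform.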


-- 
ted


Re: More proposed changes across code

2008-10-20 Thread Sean Owen
PS Looks like Hadoop does have a DoubleWritable now -- I'll switch to
that where appropriate as part of this pretty-big change list I am
preparing.

Also, while scouring the code I spotted what I believe is a common
misconception that has led to a bug or two: {Double,Float}.MIN_VALUE
is *not* the most negative value that a double/float can take on -- it
is the smallest *positive* value it can take on.

If you're looking for a good value to initialize some kind of "max"
variable to, try {Float,Double}.NEGATIVE_INFINITY. It is smaller than
every actual float/double.

If you really need the most negative value, I think you want the
negative of MAX_VALUE, but not 100% sure on that. Actually be careful
there because in some cases I saw code trying to compute:

(Float.MAX_VALUE - Float.MIN_VALUE)

If you fix this to

(Float.MAX_VALUE - (-Float.MAX_VALUE))

well you see the overflow problem.
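
A small Java sketch of the pitfalls described above, as an illustration
rather than code from the project. The names FloatLimitsSketch, buggyMaxOf,
maxOf, naiveRange and safeRange are hypothetical.

```java
class FloatLimitsSketch {

    // Buggy: Double.MIN_VALUE is the smallest *positive* double, so an
    // all-negative input never replaces the sentinel.
    static double buggyMaxOf(double[] values) {
        double max = Double.MIN_VALUE;
        for (double v : values) {
            if (v > max) {
                max = v;
            }
        }
        return max;
    }

    // NEGATIVE_INFINITY is smaller than every finite double.
    static double maxOf(double[] values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            if (v > max) {
                max = v;
            }
        }
        return max;
    }

    // Overflows in float arithmetic: the result is Infinity.
    static float naiveRange() {
        return Float.MAX_VALUE - (-Float.MAX_VALUE);
    }

    // Widening to double first keeps the result finite.
    static double safeRange() {
        return (double) Float.MAX_VALUE - (double) (-Float.MAX_VALUE);
    }

    public static void main(String[] args) {
        double[] negatives = {-3.0, -1.5};
        System.out.println("Float.MIN_VALUE  = " + Float.MIN_VALUE);    // positive
        System.out.println("-Float.MAX_VALUE = " + (-Float.MAX_VALUE)); // most negative finite float
        System.out.println("buggy max = " + buggyMaxOf(negatives));
        System.out.println("max       = " + maxOf(negatives));
        System.out.println("naive range = " + naiveRange());
        System.out.println("safe range  = " + safeRange());
    }
}
```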

To generate a truly random float or double, try

Double.longBitsToDouble(Random.nextLong())

though once again I would have to think about whether this actually
generates NaN or INFINITY in some cases!


Re: More proposed changes across code

2008-10-20 Thread Sean Owen
OK proceeding with these changes but will hold back submitting a bit
for more comments.

On Mon, Oct 20, 2008 at 12:06 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> No cases that I know of where floats actually help in our code and there are
> bound to be places where they hurt.

I agree, I think this is best.

If the reason was just that Hadoop, shockingly, doesn't have a
DoubleWritable, we can write one (and submit to Hadoop if desired) --
but we can also use FloatWritable too! you lose precision, yes, but
just at the end. It doesn't mean floats should be used everywhere.


> Furthermore, I try to follow the Spring philosophy that if the intermediate
> caller doesn't have much of a chance of fixing the problem, then it should
> be an unchecked exception.
>
> Thus, index out of bounds, numerical instability, bad argument and bad
> encoding cases should be runtime exceptions.  Any code where you find people
> repacking exceptions into runtime exceptions to meet external API
> requirements or where exceptions have to be declared but nobody ever catches
> them except the framework are prime candidates for this treatment.

This is a tangent -- I am not going to modify any exceptions -- but I
suppose I follow a somewhat different rule.

I have always understood RuntimeException to be for situations that
should not happen when an API is called correctly, in a debugged
program. So, NullPointerException, IllegalArgumentException, etc. are
not checked. It is unreasonable to force the caller to tell you what
happens if it's calling an API wrong.

However anything else, any other situation that can come up in a
correctly-programmed bit of code, should be a checked exception. So I
think stuff like an IOException or SQLException are properly checked
exceptions. You can totally correctly call an HTTP API and still fail
to get a web page if the internet is not available. You *should* be
asked to say what happens if your method invocation can't complete
normally -- or else why have exceptions in the first place? So in that
sense I differ with the Spring/Hibernate theory.
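
To make the distinction concrete, here is a rough Java sketch under my own
assumptions. ExceptionStyleSketch, elementAt and fetch are hypothetical
names, and the networkUp flag just stands in for a real network call.

```java
import java.io.IOException;

class ExceptionStyleSketch {

    // Caller error: a correct, debugged program never passes a bad index,
    // so this is unchecked and callers are not forced to handle it.
    static double elementAt(double[] vector, int index) {
        if (index < 0 || index >= vector.length) {
            throw new IndexOutOfBoundsException(
                    "index " + index + " for length " + vector.length);
        }
        return vector[index];
    }

    // Environmental failure: can happen even when the call is correct,
    // so it is checked and the caller must say what happens next.
    static String fetch(boolean networkUp) throws IOException {
        if (!networkUp) {
            throw new IOException("network unavailable");
        }
        return "page contents";
    }

    public static void main(String[] args) {
        System.out.println(elementAt(new double[] {1.0, 2.0}, 1));
        try {
            System.out.println(fetch(false));
        } catch (IOException e) {
            // The compiler forces this decision on the caller.
            System.out.println("fetch failed: " + e.getMessage());
        }
    }
}
```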

... and that said I do not see any RuntimeExceptions which I think
should be checked exceptions! just a tangent.


Re: In case you haven't noticed

2008-10-20 Thread Robin Anil
Hi, when I was working on the Bayes Classifier, I did feel that float would
overflow/lose precision in some extreme cases. But the reason for using
float was a limitation of Hadoop: there was no DoubleWritable
(equivalent to FloatWritable) which could be used in M/R mappers and
reducers. I would prefer sed s/float/double/g .

Robin

On Mon, Oct 20, 2008 at 3:36 PM, Sean Owen <[EMAIL PROTECTED]> wrote:

> On Sun, Oct 19, 2008 at 11:57 PM, Ted Dunning <[EMAIL PROTECTED]>
> wrote:
> >> I see some more complex cases (particularly the Float fields in many
> >> classes) that probably could be improved too but am being
> >> conservative.
> >
> > Can you point to an example?  You have made me very curious.
>
> Try BayesThetaNormalizerMapper -- I suspect, though am not totally
> sure, that the Float fields can be primitives. I don't see the need
> for it to be an object.
>
> >> Incidentally why aren't we using doubles? In cases where storage isn't
> >> a concern.
> >
> >
> > I have no idea.  There may be some hold-over in traditions from Lucene,
> but
> > there are not many places any more where floats are truly better.  Most
> > importantly, there are many cases where the extremely limited precision
> of
> > floats causes complete loss of all data.
>
> I agree. The above is another example where a double would be more
> appropriate I think.
>


Re: In case you haven't noticed

2008-10-20 Thread Sean Owen
On Sun, Oct 19, 2008 at 11:57 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> I see some more complex cases (particularly the Float fields in many
>> classes) that probably could be improved too but am being
>> conservative.
>
> Can you point to an example?  You have made me very curious.

Try BayesThetaNormalizerMapper -- I suspect, though am not totally
sure, that the Float fields can be primitives. I don't see the need
for it to be an object.

>> Incidentally why aren't we using doubles? In cases where storage isn't
>> a concern.
>
>
> I have no idea.  There may be some hold-over in traditions from Lucene, but
> there are not many places any more where floats are truly better.  Most
> importantly, there are many cases where the extremely limited precision of
> floats causes complete loss of all data.

I agree. The above is another example where a double would be more
appropriate I think.
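
A minimal Java sketch of the kind of precision loss being discussed, for
illustration only. FloatPrecisionSketch and its methods are hypothetical
names, not code from the Bayes classes. A float has a 24-bit significand,
so a running float sum of ones stops growing at 2^24 = 16777216.

```java
class FloatPrecisionSketch {

    // Adds 1.0f n times in float arithmetic.
    static float sumAsFloat(int n) {
        float sum = 0f;
        for (int i = 0; i < n; i++) {
            sum += 1f;
        }
        return sum;
    }

    // Same accumulation in double arithmetic.
    static double sumAsDouble(int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += 1.0;
        }
        return sum;
    }

    public static void main(String[] args) {
        // 16777216 + 1 is not representable as a float and rounds back down.
        System.out.println(16777216f + 1f == 16777216f); // true

        int n = 20000000;
        System.out.println("float sum  = " + sumAsFloat(n));  // stuck at 2^24
        System.out.println("double sum = " + sumAsDouble(n)); // exact
    }
}
```

Every increment past 2^24 is silently dropped in the float version, which
is the "complete loss" case for counts or accumulated weights.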


Re: More proposed changes across code

2008-10-20 Thread deneche abdelhakim



--- On Sun 19.10.08, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> From: Grant Ingersoll <[EMAIL PROTECTED]>
> Subject: Re: More proposed changes across code
> To: mahout-dev@lucene.apache.org
> Date: Sunday, October 19, 2008, 6:30 PM
> On Oct 19, 2008, at 11:16 AM, Sean Owen wrote:
> 
> > On Sun, Oct 19, 2008 at 4:07 PM, Grant Ingersoll
> > <[EMAIL PROTECTED]> wrote:
> >> Doesn't the javadoc tool use @inherit to fill in the inherited
> >> docs when viewing?
> >
> > Yes... I suppose I find that redundant. The subclass method gets
> > documented exactly as the superclass does. It looks like the subclass
> > had been explicitly documented, when it hadn't been. I think its
> > intent is to copy in documentation and add to it; I am thinking only
> > of cases where the javadoc only has a single element, [EMAIL PROTECTED]
> >
> >
> >>> 3. UpdatableFloat/Long -- just use Float[1] / Long[1]? these classes
> >>> don't seem to be used.
> >>
> >> Hmmm, they were used, but sure that works too.
> >
> > I can't find any usages of these classes, where are they?
>
> Right, they aren't used any longer.  Feel free to remove.
>
> >
> >
> >
> >>> 5. BruteForceTravellingSalesman says "copyright Daniel Dwyer" -- can
> >>> this be replaced by the standard copyright header?
> >>
> >> No, this is in fact his code, licensed under the ASL.  I believe
> >> the current way we are handling it is correct.  The original code
> >> is his, and the mods are ours.
> >
> > Roger that, will leave it. But two notes then...
> > - what about all the other code that came from watchmaker? all the
> > classes in the package say they came from watchmaker
> > - I was told that for my stuff, yeah, I still own the code/copyright
> > but am licensing a copy to this project, and so it all just gets
> > licensed within Mahout according to the boilerplate which says
> > "Licensed to the ASF..."
> >
> > I'm not a lawyer and don't want to pick nits but I do want to take
> > extra care to get licensing right.
>
> Right.  I believe the difference is you donated your code to the ASF,
> Daniel has merely published his code under the ASL, but has not
> donated to the ASF.  It's a subtle distinction, I suppose.  Any of
> the classes that came from watchmaker should say that, although I know
> many were developed by Deneche for the Watchmaker API.  We can go
> review them again.

In the case of the travellingSalesman example, I modified the original code to
use Mahout where needed. My own modifications amount to a couple of lines in
two or three classes. I included a readme.txt that describes the modified code
and links to the original. I replaced all the copyright headers with the
standard one (I forgot BruteForceTravellingSalesman.java), and added a link to
the original code in the class comments.
I've been reading the Apache License 2.0. I'm not a lawyer, but if I'm not
mistaken, the "travellingSalesman" code included with Mahout is a "Derivative
Work" of the original code, so we need to:
. State in the modified files that they have been changed. These files are:
StrategyPanel.java, TravellingSalesman.java and
EvolutionaryTravellingSalesman.java.
. Because the Watchmaker library contains a NOTICE.TXT file, Mahout must
include a readable copy of the attribution notices contained within
Watchmaker's NOTICE file.
