Re: [jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices
I would love to hear from Karl. Whatever we choose to do has implications on what kinds of attributes are reasonable to serialize. He was the driving force behind having something fancier than simple strings as attribute values.

On Mon, Oct 20, 2008 at 10:03 PM, Jeff Eastman <[EMAIL PROTECTED]> wrote:

> Ted Dunning wrote:
>
>> I see what you mean.
>>
>> To repeat in other words, the problems that need to be solved are:
>>
>> a) there are many uses already so adding attributes should be transparent
>> to those who don't use them
>>
>> b) the encoding should not be ad hoc because this would be our second ad
>> hoc encoding and only one should ever be allowed before using a standard
>
> +1
>
>> So here is a (kind of) concrete proposal:
>>
>> a) use JSON or Thrift for concrete syntax
>
> Any preferences here? This might also impact other Mahout packages in the
> future, so everybody please weigh in. In general, it seems that having a
> common, public encoding for matrix and vector data would help users mix and
> match the Mahout services. What are the requirements of these other
> services? From inspection, it looks like only the clustering packages use
> them currently.
>
> Jeff

-- ted
Re: [jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices
Ted Dunning wrote:

> I see what you mean.
>
> To repeat in other words, the problems that need to be solved are:
>
> a) there are many uses already so adding attributes should be transparent
> to those who don't use them
>
> b) the encoding should not be ad hoc because this would be our second ad
> hoc encoding and only one should ever be allowed before using a standard

+1

> So here is a (kind of) concrete proposal:
>
> a) use JSON or Thrift for concrete syntax

Any preferences here? This might also impact other Mahout packages in the future, so everybody please weigh in. In general, it seems that having a common, public encoding for matrix and vector data would help users mix and match the Mahout services. What are the requirements of these other services? From inspection, it looks like only the clustering packages use them currently.

Jeff
Re: More proposed changes across code
Ugh, yeah, thinking about that a couple more seconds would have been useful. That is completely right.

On Mon, Oct 20, 2008 at 3:33 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>> Double.longBitsToDouble(Random.nextLong())
>
> NO!!
Re: More proposed changes across code
On Mon, Oct 20, 2008 at 7:00 AM, Sean Owen <[EMAIL PROTECTED]> wrote:

> ..
> If you're looking for a good value to initialize some kind of "max"
> variable to, try {Float,Double}.NEGATIVE_INFINITY. It is smaller than
> every actual float/double.

Try using the first element that you are comparing to! Anything else is pretty dangerous.

> To generate a truly random float or double, try
>
> Double.longBitsToDouble(Random.nextLong())

NO!!

This will not be the uniformly distributed number that you are probably wanting. It will have a very strange distribution that includes denormalized numbers, NaNs, and infinities of various kinds.

Try gen.nextDouble(), where gen is a java.util.Random or a better random number generator. Even (double) Random.nextLong() / Long.MAX_VALUE would be better.

-- ted
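To make the objection above concrete, here is a minimal sketch, not code from Mahout. The class and method names (RandomDoubleSketch, countNonFinite, allInUnitInterval) and the seed are made up for illustration. It shows that raw bit patterns can decode to NaN, infinity, or a denormalized value, while java.util.Random.nextDouble() stays in [0.0, 1.0).

```java
import java.util.Random;

// Hypothetical illustration class; nothing here is part of Mahout.
class RandomDoubleSketch {

    /** Counts how many of n doubles built from random bit patterns are NaN or infinite. */
    static int countNonFinite(Random gen, int n) {
        int count = 0;
        for (int i = 0; i < n; i++) {
            double d = Double.longBitsToDouble(gen.nextLong());
            if (Double.isNaN(d) || Double.isInfinite(d)) {
                count++;
            }
        }
        return count;
    }

    /** Checks that n draws from nextDouble() all fall in [0.0, 1.0). */
    static boolean allInUnitInterval(Random gen, int n) {
        for (int i = 0; i < n; i++) {
            double d = gen.nextDouble();
            if (d < 0.0 || d >= 1.0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Specific bit patterns that a random long can hit:
        // an all-ones exponent with a non-zero mantissa decodes to NaN,
        // an all-ones exponent with a zero mantissa decodes to infinity,
        // and the bit pattern 1 decodes to the smallest positive (denormalized) double.
        System.out.println(Double.isNaN(Double.longBitsToDouble(0x7FF8000000000000L)));
        System.out.println(Double.isInfinite(Double.longBitsToDouble(0x7FF0000000000000L)));
        System.out.println(Double.longBitsToDouble(1L) == Double.MIN_VALUE);

        Random gen = new Random(42L); // arbitrary seed, chosen only for repeatability
        System.out.println("non-finite values in 1,000,000 bit-pattern draws: "
                + countNonFinite(gen, 1_000_000));
        System.out.println("nextDouble() always in [0, 1): "
                + allInUnitInterval(gen, 1_000_000));
    }
}
```

Roughly one random bit pattern in 2048 has an all-ones exponent, so the non-finite count is small but not zero. The remaining values are spread across every binary order of magnitude, which is the "very strange distribution" described above.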
Re: More proposed changes across code
PS Looks like Hadoop does have a DoubleWritable now -- I'll switch to that where appropriate as part of this pretty-big change list I am preparing.

Also, while scouring the code I spotted what I believe is a common misconception that has led to a bug or two: {Double,Float}.MIN_VALUE is *not* the most negative value that a double/float can take on -- it is the smallest *positive* value it can take on.

If you're looking for a good value to initialize some kind of "max" variable to, try {Float,Double}.NEGATIVE_INFINITY. It is smaller than every actual float/double.

If you really need the most negative value, I think you want the negative of MAX_VALUE, but not 100% sure on that. Actually be careful there because in some cases I saw code trying to compute:

(Float.MAX_VALUE - Float.MIN_VALUE)

If you fix this to

(Float.MAX_VALUE - (-Float.MAX_VALUE))

well you see the overflow problem.

To generate a truly random float or double, try

Double.longBitsToDouble(Random.nextLong())

though once again I would have to think about whether this actually generates NaN or INFINITY in some cases!
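The MIN_VALUE pitfall described above can be reproduced in a few lines. This is only a sketch: the class FloatingPointLimitsSketch and its methods are hypothetical names, not code found in Mahout.

```java
// Hypothetical illustration class; nothing here is part of Mahout.
class FloatingPointLimitsSketch {

    /** Running maximum seeded with NEGATIVE_INFINITY, which is below every finite double. */
    static double maxOf(double[] values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            if (v > max) {
                max = v;
            }
        }
        return max;
    }

    /** The buggy variant: Double.MIN_VALUE is a tiny positive number, not a large negative one. */
    static double maxOfSeededWithMinValue(double[] values) {
        double max = Double.MIN_VALUE;
        for (double v : values) {
            if (v > max) {
                max = v;
            }
        }
        return max;
    }

    /** The "fixed" range computation from the message; in float arithmetic it overflows. */
    static float fullFloatRange() {
        return Float.MAX_VALUE - (-Float.MAX_VALUE);
    }

    public static void main(String[] args) {
        double[] allNegative = {-3.0, -1.5, -2.0};
        System.out.println(Double.MIN_VALUE > 0.0);               // true: smallest positive value
        System.out.println(maxOf(allNegative));                   // -1.5
        System.out.println(maxOfSeededWithMinValue(allNegative) == Double.MIN_VALUE); // true: wrong answer
        System.out.println(Float.isInfinite(fullFloatRange()));   // true: the subtraction overflows
    }
}
```

With all-negative input the MIN_VALUE-seeded version never updates its maximum, which is the kind of bug the message warns about.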
Re: More proposed changes across code
OK proceeding with these changes but will hold back submitting a bit for more comments.

On Mon, Oct 20, 2008 at 12:06 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> No cases that I know of where floats actually help in our code and there are
> bound to be places where they hurt.

I agree, I think this is best. If the reason was just that Hadoop, shockingly, doesn't have a DoubleWritable, we can write one (and submit to Hadoop if desired) -- but we can also use FloatWritable too! You lose precision, yes, but just at the end. It doesn't mean floats should be used everywhere.

> Furthermore, I try to follow the Spring philosophy that if the intermediate
> caller doesn't have much of a chance of fixing the problem, then it should
> be an unchecked exception.
>
> Thus, index out of bounds, numerical instability, bad argument and bad
> encoding cases should be runtime exceptions. Any code where you find people
> repacking exceptions into runtime exceptions to meet external API
> requirements or where exceptions have to be declared but nobody ever catches
> them except the framework are prime candidates for this treatment.

This is a tangent -- I am not going to modify any exceptions -- but I suppose I follow a somewhat different rule. I have always understood RuntimeException to be for situations that should not happen when an API is called correctly, in a debugged program. So, NullPointerException, IllegalArgumentException, etc. are not checked. It is unreasonable to force the caller to tell you what happens if it's calling an API wrong.

However anything else, any other situation that can come up in a correctly-programmed bit of code, should be a checked exception. So I think stuff like an IOException or SQLException are properly checked exceptions. You can totally correctly call an HTTP API and still fail to get a web page if the internet is not available. You *should* be asked to say what happens if your method invocation can't complete normally -- or else why have exceptions in the first place? So in that sense I differ with the Spring/Hibernate theory.

... and that said I do not see any RuntimeExceptions which I think should be checked exceptions! Just a tangent.
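As a rough sketch of the rule described in this message: misuse of an API surfaces as an unchecked exception, while a failure that can happen even in a correct program is declared as a checked one. The class ExceptionStyleSketch and both methods are hypothetical examples, not Mahout code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration class; nothing here is part of Mahout.
class ExceptionStyleSketch {

    /**
     * Calling this with no data is a programming error, so it is reported with an
     * unchecked IllegalArgumentException that callers are not forced to handle.
     */
    static double average(double[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values must be non-empty");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /**
     * A missing or unreadable file can occur even when the call itself is correct,
     * so the checked IOException stays in the signature and the caller must decide
     * what to do about it.
     */
    static String readFirstLine(Path path) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) {
        System.out.println(average(new double[] {1.0, 2.0, 3.0}));
        try {
            System.out.println(readFirstLine(Paths.get("no-such-directory", "missing.txt")));
        } catch (IOException e) {
            System.out.println("could not read file: " + e);
        }
    }
}
```

Under the Spring-style rule quoted above, readFirstLine would instead wrap the IOException in a runtime exception whenever the immediate caller has little chance of recovering from it.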
Re: In case you haven't noticed
Hi,

When I was working on the Bayes Classifier, I did feel that float would overflow or lose precision in some extreme cases. But the reason for using float was a limitation of Hadoop: there was no DoubleWritable (equivalent to FloatWritable) that could be used in M/R mappers and reducers. I would prefer sed s/float/double/g .

Robin

On Mon, Oct 20, 2008 at 3:36 PM, Sean Owen <[EMAIL PROTECTED]> wrote:

> On Sun, Oct 19, 2008 at 11:57 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >> I see some more complex cases (particularly the Float fields in many
> >> classes) that probably could be improved too but am being
> >> conservative.
> >
> > Can you point to an example? You have made me very curious.
>
> Try BayesThetaNormalizerMapper -- I suspect, though am not totally
> sure, that the Float fields can be primitives. I don't see the need
> for it to be an object.
>
> >> Incidentally why aren't we using doubles? In cases where storage isn't
> >> a concern.
> >
> > I have no idea. There may be some hold-over in traditions from Lucene, but
> > there are not many places any more where floats are truly better. Most
> > importantly, there are many cases where the extremely limited precision of
> > floats causes complete loss of all data.
>
> I agree. The above is another example where a double would be more
> appropriate I think.
Re: In case you haven't noticed
On Sun, Oct 19, 2008 at 11:57 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>> I see some more complex cases (particularly the Float fields in many
>> classes) that probably could be improved too but am being
>> conservative.
>
> Can you point to an example? You have made me very curious.

Try BayesThetaNormalizerMapper -- I suspect, though am not totally sure, that the Float fields can be primitives. I don't see the need for it to be an object.

>> Incidentally why aren't we using doubles? In cases where storage isn't
>> a concern.
>
> I have no idea. There may be some hold-over in traditions from Lucene, but
> there are not many places any more where floats are truly better. Most
> importantly, there are many cases where the extremely limited precision of
> floats causes complete loss of all data.

I agree. The above is another example where a double would be more appropriate I think.
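One way to see the precision limit discussed in this thread is to accumulate a long sum in each type. The sketch below is illustrative only. FloatPrecisionSketch and its methods are hypothetical names, and the loop stands in for any accumulation of many small contributions.

```java
// Hypothetical illustration class; nothing here is part of Mahout.
class FloatPrecisionSketch {

    /** Adds 1.0f to a float accumulator n times. */
    static float sumAsFloat(int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum += 1.0f;
        }
        return sum;
    }

    /** Adds 1.0 to a double accumulator n times. */
    static double sumAsDouble(int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += 1.0;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 20_000_000;
        // A float has a 24-bit significand, so once the sum reaches 2^24 = 16777216
        // adding 1.0f no longer changes it and every later contribution is lost.
        System.out.println(sumAsFloat(n) == 16_777_216f);   // true
        System.out.println(sumAsDouble(n) == 20_000_000.0); // true
    }
}
```

The float sum stalls at 2^24 no matter how many more terms are added, whereas the double accumulator, with its 53-bit significand, represents this sum exactly.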
Re: More proposed changes across code
--- On Sun, 19.10.08, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> From: Grant Ingersoll <[EMAIL PROTECTED]>
> Subject: Re: More proposed changes across code
> To: mahout-dev@lucene.apache.org
> Date: Sunday, October 19, 2008, 6:30 PM
>
> On Oct 19, 2008, at 11:16 AM, Sean Owen wrote:
>
> > On Sun, Oct 19, 2008 at 4:07 PM, Grant Ingersoll
> > <[EMAIL PROTECTED]> wrote:
> >> Doesn't the javadoc tool use @inherit to fill in the inherited
> >> docs when viewing?
> >
> > Yes... I suppose I find that redundant. The subclass method gets
> > documented exactly as the superclass does. It looks like the subclass
> > had been explicitly documented, when it hadn't been. I think its
> > intent is to copy in documentation and add to it; I am thinking only
> > of cases where the javadoc only has a single element, [EMAIL PROTECTED]
> >
> >>> 3. UpdatableFloat/Long -- just use Float[1] / Long[1]? these classes
> >>> don't seem to be used.
> >>
> >> Hmmm, they were used, but sure that works too.
> >
> > I can't find any usages of these classes, where are they?
>
> Right, they aren't used any longer. Feel free to remove.
>
> >>> 5. BruteForceTravellingSalesman says "copyright Daniel Dwyer" -- can
> >>> this be replaced by the standard copyright header?
> >>
> >> No, this is in fact his code, licensed under the ASL. I believe
> >> the current way we are handling it is correct. The original code is
> >> his, and the mods are ours.
> >
> > Roger that, will leave it. But two notes then...
> > - what about all the other code that came from watchmaker? all the
> > classes in the package say they came from watchmaker
> > - I was told that for my stuff, yeah, I still own the code/copyright
> > but am licensing a copy to this project, and so it all just gets
> > licensed within Mahout according to the boilerplate which says
> > "Licensed to the ASF..."
> >
> > I'm not a lawyer and don't want to pick nits but I do want to take
> > extra care to get licensing right.
>
> Right. I believe the difference is you donated your code to the ASF,
> Daniel has merely published his code under the ASL, but has not
> donated to the ASF. It's a subtle distinction, I suppose. Any of
> the classes that came from watchmaker should say that, although I know
> many were developed by Deneche for the Watchmaker API. We can go
> review them again.

In the case of the travellingSalesman example, I modified the original code to use Mahout where needed. My own modifications are a couple of lines in two or three classes. I included a readme.txt that describes the modified code and links to the original. I replaced all the copyright headers with the standard one (I forgot BruteForceTravellingSalesman.java) and added a link to the original code in the class comments.

I've been reading the Apache License 2.0. I'm not a lawyer, but if I'm not mistaken, the "travellingSalesman" code included with Mahout is a "Derivative Work" of the original code, so we need to:

. State in the modified files that they have been changed. These files are StrategyPanel.java, TravellingSalesman.java and EvolutionaryTravellingSalesman.java.
. Because the Watchmaker library contains a NOTICE.TXT file, include in Mahout a readable copy of the attribution notices contained within Watchmaker's NOTICE file.