Re: Publish code quality reports on web-site?

2009-11-28 Thread deneche abdelhakim
df/mapred works with the old Hadoop API
df/mapreduce works with the Hadoop 0.20 API

On Saturday, November 28, 2009, Sean Owen  wrote:
> I'm all for generating and publishing this.
>
>
> The CPD results highlight a question I had: what's up with the amount
> of duplication between org/apache/mahout/df/mapred and
> org/apache/mahout/df/mapreduce -- what is the difference supposed to
> be?
>
>
> PMD is complaining a lot about the "foo == false" vs "!foo" style. I
> prefer the latter too but we had agreed to use the former, so we could
> disable this check if possible.
>
>
> Checkstyle: can we set it to allow a 120 character line, and adjust it
> to consider an indent to be 2 spaces? It's flagging like every line of
> code right now!
>
>
> On that note, if possible, I would suggest disabling the following
> FindBugs checks, as they are flagging a lot of stuff that isn't
> 'wrong', to me.
>
> SE_NO_SERIALVERSIONID
> I completely disagree with it. serialVersionUID itself is bad
> practice, in my book.
>
> EI_EXPOSE_REP2
> it's a fair point but only relevant to security, and we have no such
> issue. The items it flags are done on purpose for performance, it
> looks like.
>
> SQL_PREPARED_STATEMENT_GENERATED_FROM_NONCONSTANT_STRING
> SQL_NONCONSTANT_STRING_PASSED_TO_EXECUTE
> It's a good point in general, but I'm the only one writing JDBC code,
> and there is actually no security issue here. It's a false positive
> and we could disable this.
>
> SE_BAD_FIELD
> This one is a little aggressive. It assumes that types not known to be
> Serializable must not be Serializable, which isn't true.
>
> RV_RETURN_VALUE_IGNORED
> It's a decent idea but flags a lot of legitimate code. For example
> it's complaining about ignoring Queue.poll(), which, like a lot of
> Collection API methods, returns a value that callers often legitimately ignore.
>
> UWF_FIELD_NOT_INITIALIZED_IN_CONSTRUCTOR
> I don't necessarily agree with this one. Explicitly setting fields to
> null and primitives to zero is tidy, but I'm not used to it.
>
>
> I didn't see anything big flagged, good, but we should all have a look
> at the results and tweak accordingly. In some cases it had a good
> small point, or I was indifferent about the approach it was suggesting
> versus what was in the code, so I changed to comply with the check.
>
>
> On Fri, Nov 27, 2009 at 8:26 PM, Isabel Drost  wrote:
>>
>> Hello,
>>
>> I just ran several code analysis reports over the Mahout source code.
>> Results are published at
>>
>> http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html
>>
>> It includes several reports on code quality, test coverage, javadocs
>> and the like. When generated regularly, say on Hudson, I think it could
>> be beneficial both for us (for getting a quick impression of where
>> cleanup is most necessary) as well as for potential users.
>>
>> I would like to see a third tab added to our homepage that points to
>> a page containing reports for each of our modules. I would try to clean up the
>> generated site a little beforehand - we certainly do not need the "Project
>> information" stuff in there, as most of this is already generated
>> through Forrest. In addition I can take care of setting up a Hudson
>> job to recreate the site on a regular schedule.
>>
>> Cheers,
>> Isabel
>>
>> --
>>  |\      _,,,---,,_       Web:   
>>  /,`.-'`'    -.  ;-;;,_
>>  |,4-  ) )-,_..;\ (  `'-'
>> '---''(_/--'  `-'\_) (fL)  IM:  
>>
>>
>


Re: Publish code quality reports on web-site?

2009-11-28 Thread Isabel Drost
On Saturday 28 November 2009 21:29:05 Drew Farris wrote:
> It will be interesting to see the reports for the other modules as
> well: examples, utils, matrix.

As a little preview: Just substitute mahout-core with mahout- in 
the url below:

http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html

Fixing the report links is on my list already ;)

Isabel







[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-11-28 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-11:
--

Attachment: MAHOUT-11-all-cleanup-20091128.patch

MAHOUT-11-all-cleanup-20091128.patch eliminates the use of static fields for 
configuration in the clustering code in all cases where it was present: canopy, 
kmeans, fuzzykmeans and meanshift. It retains Isabel's original patch to the 
kmeans package, with the exception of the items discussed previously, and adds 
similar changes to the other packages. It also includes the fix to and unit 
test for RandomSeedGenerator previously included.

Applied against rev 883446, all unit tests are passing, and I've run the kmeans 
code on real data. It would be really great if someone could double check the 
changes and comment.
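
The shape of the change can be sketched outside Hadoop (the class names below are illustrative, not Mahout's actual API): configuration held in a static field is shared by every job in the JVM, while configuration passed to each instance, the way Hadoop's JobConf is handed to configure(), stays isolated per job.

```python
# Illustrative sketch only -- these classes are not Mahout's API.
# A class-level ("static") field is shared by every job in the process,
# so a second job's configure() silently clobbers the first job's value.

class StaticCanopyMapper:
    distance_threshold = None      # static-style configuration

    @classmethod
    def configure(cls, threshold):
        cls.distance_threshold = threshold


class InstanceCanopyMapper:
    def __init__(self):
        self.distance_threshold = None

    def configure(self, conf):     # conf stands in for Hadoop's JobConf
        self.distance_threshold = conf["distance_threshold"]


# Two jobs configured in the same JVM/process:
StaticCanopyMapper.configure(0.5)
StaticCanopyMapper.configure(2.0)            # job 2 overwrites job 1's setting

job1, job2 = InstanceCanopyMapper(), InstanceCanopyMapper()
job1.configure({"distance_threshold": 0.5})
job2.configure({"distance_threshold": 2.0})  # each job keeps its own value
```

With the static variant, job 1's threshold is simply gone after job 2 configures itself, which is exactly the hazard the patch removes.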


> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
>  Issue Type: Bug
>  Components: Clustering
>Affects Versions: 0.1
>Reporter: Dawid Weiss
> Fix For: 0.3
>
>     Attachments: MAHOUT-11-all-cleanup-20091128.patch, 
> MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, 
> MAHOUT-11.patch
>
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current 
> code the information is exchanged via static fields (for example, distance 
> measure and thresholds for Canopies are static fields). Is it always true in 
> Hadoop that one job runs inside one JVM with exclusive access? I haven't seen 
> it anywhere in Hadoop documentation and my impression was that everything 
> uses JobConf to pass configuration to jobs, but jobs are configured on a 
> per-object basis (a job is an object, a mapper is an object and everything 
> else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is 
> a limitation and bug in our code that needs to be addressed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Publish code quality reports on web-site?

2009-11-28 Thread Drew Farris
Isabel,

Wow, this looks great. There's lots of information in here. Sean
definitely has a point: it would be very nice to eliminate the
information about things we're not really concerned with. Also, I
wonder if there are cases where we need to add more checks. Is there
some report that tracks the usage of deprecated classes/methods, with
optional whitelisting of certain cases (e.g. the hadoop/mapred APIs)?

In addition to the reports generated nightly from trunk, it would be
nice if some of the reports (e.g. Javadoc, JIRA, or some sort of
changelog) could be generated for each release and preserved on the
site regardless of the current state of trunk. This way, for example,
the javadoc for each release can be found. Other projects don't always
do this, and it is kind of a pain if one is using an older version of
the API and trying to find docs for that version.

It will be interesting to see the reports for the other modules as
well: examples, utils, matrix. This will certainly help with
determining where unit tests need to be written for the new matrix
code inherited from colt, etc.

Drew



On Fri, Nov 27, 2009 at 3:26 PM, Isabel Drost  wrote:
>
> Hello,
>
> I just ran several code analysis reports over the Mahout source code.
> Results are published at
>
> http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html
>
> It includes several reports on code quality, test coverage, javadocs
> and the like. When generated regularly, say on Hudson, I think it could
> be beneficial both for us (for getting a quick impression of where
> cleanup is most necessary) as well as for potential users.
>
> I would like to see a third tab added to our homepage that points to
> a page containing reports for each of our modules. I would try to clean up the
> generated site a little beforehand - we certainly do not need the "Project
> information" stuff in there, as most of this is already generated
> through Forrest. In addition I can take care of setting up a Hudson
> job to recreate the site on a regular schedule.
>
> Cheers,
> Isabel
>
> --
>
>


Re: NMF for Taste

2009-11-28 Thread Ted Dunning
Jake,

Do you have any concrete information about how much difference there
actually is in these decompositions?

On Sat, Nov 28, 2009 at 8:31 AM, Jake Mannix  wrote:

> or more precisely, a sparse SVD which doesn't treat
> missing data as the numerical 0 or mean of the values
>



-- 
Ted Dunning, CTO
DeepDyve


Re: NMF for Taste

2009-11-28 Thread Ted Dunning
Restricted Boltzmann machines are of real interest, but again, I repeat the
obligatory warning about replicating all things from the Netflix
competition.

To take a few concrete examples,

- user biases were a huge advance in terms of RMS error, but they don't
affect the ordering of the results presented to a user and thus are of no
interest for almost all production recommender applications

- temporal dynamics as a time variation in user bias is just like the first
case.

- temporal dynamics in the sense of block-busters that decay quickly are
often of little interest in a production recommender because blockbusters
are typically presented outside of the context of recommendations which are
instead used to help users find back catalog items of interest *in spite* of
current popularity trends.  This means that tracking this kind of temporal
dynamics is good in the Netflix challenge, but neutral or bad in most
recommendation applications.  An exception is magical navigation links that
populate themselves using recommendation technology.

On the other side,

- portfolio approaches that increase the diversity of results presented to
the user increase the probability of user clicking, but decrease RMS score

- dithering of results to give a changing set of recommendations increases
users' click rates, but decreases RMS score

The take-away is that the Netflix results can't be used as a blueprint for
all recommendation needs.
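
The first point above can be checked in a few lines (the ratings here are made up for illustration): adding a constant per-user bias to every predicted score can reduce the RMS error, but it cannot change the order in which items are presented to that user.

```python
# Hypothetical predicted and true ratings for one user's candidate items.
preds = {"item_a": 3.2, "item_b": 4.1, "item_c": 2.7}
truth = {"item_a": 4.0, "item_b": 5.0, "item_c": 3.0}

def ranking(scores):
    """Items ordered best-first."""
    return sorted(scores, key=scores.get, reverse=True)

def rmse(p, t):
    return (sum((p[i] - t[i]) ** 2 for i in t) / len(t)) ** 0.5

bias = 0.8  # per-user offset of the kind learned to reduce RMS error
shifted = {item: score + bias for item, score in preds.items()}

# The bias improves RMSE here, yet the presented order is identical.
print(ranking(preds) == ranking(shifted))           # prints: True
print(rmse(shifted, truth) < rmse(preds, truth))    # prints: True
```

A monotone shift of every score leaves the sort order untouched, which is why RMSE gains of this kind need not translate into better recommendations.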

On Sat, Nov 28, 2009 at 8:31 AM, Jake Mannix  wrote:

> Machine based recommender, because this makes the final leap from linear
> and quasi-linear decompositions to the truly nonlinear case (my friend on
> the executive team over at Netflix tells me that it was pretty apparent
> that the
> winners were going to be blendings of the RBM and SVD-based approaches
> pretty early on - and he was right!)
>



-- 
Ted Dunning, CTO
DeepDyve


Re: NMF for Taste

2009-11-28 Thread Jake Mannix
On Fri, Nov 27, 2009 at 11:23 PM, Ted Dunning  wrote:

> Summarize yes.
>


> But this is, actually, theoretically better because the summarization
> introduces useful smoothing.  That way you get recommendations for items
> even if there is no direct overlap.
>

Summarize, smooth, and enhance clustering: distances are *not* preserved in
truncated decompositions, and the *hope* is that the "meaningful" distances
are decreased, and the less meaningful distances are not.  This can be seen
in a simple example of user preferences (on the Netflix scale of 1-5):

user1: item1: 4, item2: 1, item3: 5
user2: item2: 1, item5: 1, item7: 3, item8: 3
user3: item4: 4, item5: 1, item6: 5

A first-order recommender won't be able to infer any similarity or
dissimilarity between user1 and user3 (although it can tell there is some
similarity between users 1 and 2, and between users 3 and 2).
A decomposing recommender will notice that user1 and user2 both hated item2,
and that another item which user2 hated was the same item that user3 hated,
and infer transitive similarity, not just to 2nd degree as in this example,
but to nth order.

The difference between the various decompositional approaches is how they
approximate these transitive similarities - LDA would be best in the very
low
overlap case, and SVD (or more precisely, a sparse SVD which doesn't treat
missing data as the numerical 0 or mean of the values) approaching that
level
of quality in the bigger data case (but SVD / randomized SVD should be a lot
faster than LDA on the big big data case).
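
The toy matrix above can be run through a truncated SVD directly, here with numpy and with the missing entries filled in as 0. That fill-in is exactly the shortcut criticized above, used only to keep the sketch short. After truncation to rank 2, user1's row picks up a nonzero affinity for item7, which only user2 rated: similarity inferred transitively.

```python
import numpy as np

# The three users over items 1-8; unrated cells set to 0 for simplicity.
A = np.array([
    [4, 1, 5, 0, 0, 0, 0, 0],   # user1
    [0, 1, 0, 0, 1, 0, 3, 3],   # user2
    [0, 0, 0, 4, 1, 5, 0, 0],   # user3
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] * s[:k] @ Vt[:k, :]   # rank-2 reconstruction

# user1 never rated item7 (column index 6), but the truncated
# reconstruction assigns it a small positive affinity, carried over
# transitively through user2, who overlaps with user1 on item2.
print(A[0, 6], A_k[0, 6])
```

The effect is small on a 3x8 example, but it is precisely the smoothing discussed above: the decomposition produces nonzero scores for items a user has no direct overlap with.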

What I'd really like to see (once I get this decomposer stuff in - soon! We've
got good linear primitives now, so I'm working on it!) is also a Restricted
Boltzmann Machine based recommender, because this makes the final leap from
linear and quasi-linear decompositions to the truly nonlinear case (my friend
on the executive team over at Netflix tells me that it was pretty apparent
that the winners were going to be blendings of the RBM and SVD-based
approaches pretty early on - and he was right!)


  -jake



> Your point about noise is trenchant: small count data is inherently
> noisy because you can't have an exact 0.04 of an observation.  Small counts
> dominate in recommendations.
>
> On Fri, Nov 27, 2009 at 10:00 PM, Sean Owen  wrote:
>
> >
> > Correct me if I'm wrong, but my impression of matrix factorization
> > approaches is that they're just a way to effectively "summarize" input
> > data. They're not a theoretically better, or even different, approach
> > to recommendation, but more a transformation of the input into
> > conventional algorithms. (Though this process of simplification could,
> > I imagine, sometimes be an improvement on the input, if it's noisy.)
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


[jira] Created: (MAHOUT-210) Publish code quality reports through maven

2009-11-28 Thread Isabel Drost (JIRA)
Publish code quality reports through maven
--

 Key: MAHOUT-210
 URL: https://issues.apache.org/jira/browse/MAHOUT-210
 Project: Mahout
  Issue Type: New Feature
  Components: Website
Affects Versions: 0.1, 0.2
Reporter: Isabel Drost
 Fix For: 0.3


We should use mvn site:site to generate code reports and publish them online 
for users to review and developers to easily spot problems.

A first version, whose checks still need to be adjusted to our needs, is 
available online at:

http://people.apache.org/~isabel/mahout_site/mahout-core/project-reports.html

Further discussion on-list at

http://www.lucidimagination.com/search/document/a13aa5127b47fda3/publish_code_quality_reports_on_web_site##a13aa5127b47fda3

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Publish code quality reports on web-site?

2009-11-28 Thread Isabel Drost
On Saturday 28 November 2009 08:30:26 Sean Owen wrote:
> I'm all for generating and publishing this.

Great. Then I will go and tweak the checks to match our guidelines, twiddle a 
bit with the output format, and then integrate the stuff into our nightly 
build.


> I didn't see anything big flagged, good, but we should all have a look
> at the results and tweak accordingly. In some cases it had a good
> small point, or I was indifferent about the approach it was suggesting
> versus what was in the code, so I changed to comply with the check.

The reports generated are just "examples" - I am all for adjusting any checks 
(or adding new ones) that do not fit our needs. I will go through your list, 
make the proposed changes, and re-upload the site so everyone can have a look.

Isabel




