Egor,

I was pursuing a different requirement: comparing HTML documents of
different kinds. For that purpose, comparing the number of documented
items against undocumented ones doesn't work. For source files, your
metric is the best.

Let me comment on unique words. The goal was to filter out words like
Copyright, Apache, Software, Foundation, which appear on every page.

I debugged my metrics using the following pattern: I compared pages
visually and then compared their metrics. If the two comparisons
disagreed, I fixed the metric.

Thanks, Alexei

On 03 Nov 2006 15:17:38 +0600, Egor Pasko <[EMAIL PROTECTED]> wrote:
On the 0x216 day of Apache Harmony Alexei Fedotov wrote:
> Egor,
>
> Thank you for your interest.

We definitely need to improve our documentation. Necessity is not a
real interest :)

> Here is an algorithm:
>
> 1. Create a list of words from HTML files.
> 2. Merge a dictionary of all words used in documentation.
> 3. Remove the most frequently used half of the words from the dictionary
> - I believe they do not add much sense.
> 4. Remove misspelled words (including identifiers) from the dictionary.
> 5. Give a page +1 for each rare, correctly spelled word according to
> the dictionary.
> 6. Divide by the total number of words on the page.
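
The six steps above can be sketched roughly as follows. This is only
an illustration, not the actual script: the names tokenize and
rare_word_scores are made up, the known_words set stands in for a
real spell checker, tags are stripped with a crude regex, and step 5
is assumed to count each distinct rare word once per page.

```python
import re
from collections import Counter

def tokenize(html):
    """Crudely strip tags and return lowercase alphabetic words."""
    text = re.sub(r"<[^>]*>", " ", html)
    return re.findall(r"[a-z]+", text.lower())

def rare_word_scores(pages, known_words):
    """pages: dict of page name -> HTML text.
    known_words: set of correctly spelled words (assumed spell-check stand-in)."""
    # Step 1: list of words per page.
    words_by_page = {name: tokenize(html) for name, html in pages.items()}
    # Step 2: one dictionary of all words with their total frequency.
    counts = Counter(w for ws in words_by_page.values() for w in ws)
    # Step 3: drop the most frequent half, keep the rarer half.
    ranked = [w for w, _ in counts.most_common()]
    rare = set(ranked[len(ranked) // 2:])
    # Step 4: drop misspelled words and identifiers.
    rare &= known_words
    scores = {}
    for name, ws in words_by_page.items():
        # Step 5: +1 per distinct rare, correctly spelled word (assumption).
        hits = len(set(ws) & rare)
        # Step 6: divide by the total number of words on the page.
        scores[name] = hits / len(ws) if ws else 0.0
    return scores
```

With this sketch, words such as Copyright or Apache that occur on
every page land in the frequent half and contribute nothing.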

Hm, strange heuristic. Having more unique, correctly spelled words is
beneficial, but it gives no clue about the overall quality of the
documentation, which is rather confusing.

I thought of something more natural: the number of documented items
vs. the number of undocumented ones, plus a penalty for the relative
number of misspelled words.
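
One possible reading of this metric, as a sketch only: the function
name doc_quality and the penalty_weight parameter are assumptions,
since the thread does not say how the penalty is combined with the
documented ratio.

```python
def doc_quality(documented, undocumented, misspelled, total_words,
                penalty_weight=1.0):
    """Share of documented items minus a penalty for misspelled words.

    penalty_weight is a hypothetical knob; the thread gives no value.
    """
    items = documented + undocumented
    coverage = documented / items if items else 0.0
    misspelled_ratio = misspelled / total_words if total_words else 0.0
    return coverage - penalty_weight * misspelled_ratio
```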

> I've collected nice RFEs from your letter. Most of them make me think
> and I like them.
> a. Update an ASF block comment
> b. Improve readability. Some things are really easy - like removing
> awk and rewriting most things in perl. Others are a bit more complex -
> I targeted script performance when I created the auto-generated perl
> script. Also, the initial algorithm was a bit more complex - different
> words had a different cost based on their popularity.
> c. Use junit test output format to integrate with
> http://harmonytest.org. I believe I need a feature request for that
> site as well - we need some way to import performance-like rankings to
> the site.

Yes, I thought of the RFE to harmonytest. At least, put the doc items
on a separate page from the build items.

> d. I will think of parsing sources. But I don't think we need to
> maintain both scripts. The generic rule is simple - improve your .h
> and .java files - .cpp files don't count. I suggest better to link
> .html files to contributors.

Can you calculate a list of relevant filenames from a doc page? Give a
filename +1 for each documented item and -1 for each undocumented one,
then divide by the number of items. Is it easy to implement? Maybe
doxygen has some features to assist with this?
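
A minimal sketch of that per-file score, assuming the (filename,
is_documented) pairs have already been extracted from the doc pages
somehow; the extraction itself and the name file_scores are
hypothetical.

```python
from collections import defaultdict

def file_scores(items):
    """items: iterable of (filename, is_documented) pairs."""
    totals = defaultdict(lambda: [0, 0])  # filename -> [net score, item count]
    for filename, is_documented in items:
        totals[filename][0] += 1 if is_documented else -1
        totals[filename][1] += 1
    # Score ranges from -1.0 (nothing documented) to 1.0 (everything documented).
    return {name: net / count for name, (net, count) in totals.items()}
```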

> Thank you for the ideas. I will certainly update the script. I just
> want to wait a bit - many scripts die just because people are not
> interested in running them a second time. Also, if anyone suggests any
> changes to the algorithm or any other RFEs, I want to implement them
> all at once.
>
> Nadya, could you please point us to a good documentation file so we
> can use it as a pattern?

--
Egor Pasko

