Hi.
There hasn't been much activity on this list, so I thought I'd share
how I see my little side project working.
I have to write a web content analysis tool which will take the HTML
from a site and compute various metrics from it (e.g. page size, number
of JS calls, etc.).
What I was planning to do was use Nutch to fetch the URL data into
segments, and then write a custom tool to extract the HTML out of each
segment and run it through my code, similar to what the 'crawl' command
does, but dumping the metrics into a MySQL database.
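To make the metrics step concrete, here is a rough sketch of what "run
it through my code" could look like. It is only an illustration: it is
in Python using the standard library's html.parser, it assumes the HTML
has already been pulled out of a segment as a string, and it treats
"number of JS calls" as a count of <script> tags (external vs inline),
which is an assumption about what that metric means. The names
ScriptCounter and page_metrics are made up for this example.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Counts <script> tags, split into external (src=...) and inline."""

    def __init__(self):
        super().__init__()
        self.external_scripts = 0
        self.inline_scripts = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names for us.
        if tag != "script":
            return
        if any(name == "src" and value for name, value in attrs):
            self.external_scripts += 1
        else:
            self.inline_scripts += 1


def page_metrics(html: str) -> dict:
    """Return a few simple metrics for one page of HTML (hypothetical helper)."""
    counter = ScriptCounter()
    counter.feed(html)
    counter.close()
    return {
        # Size of the markup when encoded as UTF-8 (assumed encoding).
        "page_size_bytes": len(html.encode("utf-8")),
        "external_scripts": counter.external_scripts,
        "inline_scripts": counter.inline_scripts,
    }


if __name__ == "__main__":
    sample = (
        '<html><head><script src="a.js"></script>'
        "<script>var x = 1;</script></head>"
        "<body><p>hi</p></body></html>"
    )
    print(page_metrics(sample))
```

Each returned dict would map naturally onto one row in the metrics
table, keyed by URL; the segment extraction and the database insert are
left out here.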
Is this similar to what you guys had in mind with Tika?
Regards
Ian