Hi.

There hasn't been much activity on this list, so I thought I'd throw out my idea on how I see my little side project working.

I have to write a web content analysis tool that takes the HTML from a site and computes various metrics from it (e.g. page size, number of JS calls, etc.).
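
To make the kind of metrics concrete, here is a rough sketch in Python using only the standard library. It is not the actual tool: the names ScriptCounter and page_metrics are made up for illustration, and I am assuming "JS calls" means the number of <script> tags, split into external ones (with a src attribute) and inline ones.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Counts <script> tags; the name and the metric choice are illustrative."""

    def __init__(self):
        super().__init__()
        self.external_scripts = 0  # <script src="...">
        self.inline_scripts = 0    # <script> with no src attribute

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "script":
            return
        if any(name == "src" and value for name, value in attrs):
            self.external_scripts += 1
        else:
            self.inline_scripts += 1


def page_metrics(html):
    """Return a dict of simple metrics for one page of HTML text."""
    counter = ScriptCounter()
    counter.feed(html)
    counter.close()
    return {
        # Size of the markup once encoded as UTF-8, not the size on the wire.
        "page_size_bytes": len(html.encode("utf-8")),
        "external_scripts": counter.external_scripts,
        "inline_scripts": counter.inline_scripts,
    }


if __name__ == "__main__":
    sample = (
        '<html><head><script src="a.js"></script>'
        "<script>var x = 1;</script></head><body>hi</body></html>"
    )
    print(page_metrics(sample))
```

Real pages would need more care (scripts loaded dynamically, event handler attributes and so on), so treat the counts as a starting point only.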

What I was planning to do was use Nutch to fetch the URL data into segments, and then write a custom tool that extracts the HTML from each segment and runs it through my code, similar to what the 'crawl' command does, but dumping the metrics into a MySQL DB.
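
The last step of that plan, writing one row of metrics per URL, could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the table name page_metrics, its columns, and the store_metrics helper are hypothetical, and sqlite3 from the standard library stands in for MySQL so the example runs without a server. Reading the HTML out of the Nutch segments is not shown.

```python
import sqlite3

# Hypothetical schema: one row per fetched URL.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS page_metrics (
    url             TEXT PRIMARY KEY,
    page_size_bytes INTEGER NOT NULL,
    script_count    INTEGER NOT NULL
)
"""


def store_metrics(conn, url, metrics):
    """Insert or overwrite the metrics row for one URL."""
    conn.execute(CREATE_TABLE)
    # INSERT OR REPLACE is SQLite syntax; MySQL would need a different
    # upsert statement (for example REPLACE INTO).
    conn.execute(
        "INSERT OR REPLACE INTO page_metrics VALUES (?, ?, ?)",
        (url, metrics["page_size_bytes"], metrics["script_count"]),
    )
    conn.commit()


if __name__ == "__main__":
    # In-memory database and made-up numbers, for demonstration only.
    conn = sqlite3.connect(":memory:")
    store_metrics(
        conn,
        "http://example.com/",
        {"page_size_bytes": 1024, "script_count": 3},
    )
    for row in conn.execute("SELECT * FROM page_metrics"):
        print(row)
```

Keying the table on the URL means a re-crawl overwrites the old numbers; keeping history would need a fetch timestamp in the key instead.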

Is this similar to what you guys had in mind with Tika?

Regards
Ian