Hi Drew,

Currently we are using an HTML filter module from the University of Duisburg-Essen, which can be found here: http://www.is.informatik.uni-duisburg.de/projects/java-unidu/filter.html

Another idea was to try Jericho or NekoHTML.
http://www.java2s.com/Product/Java/Development/HTML-Parser.htm

Thanks for your advice. We will test it and let you know whether it works well.

Marc

Drew Farris wrote:
Hi Marc,

How are you planning on cleaning up the HTML documents?

Perhaps something like this would be useful. I came across an
interesting approach a few days ago, and I would like to hear
more from someone who has tried something like it:
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/

It is described further, with Java implementations, here:
http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html

Drew

On Sat, Nov 28, 2009 at 2:57 PM, Marc Hofer <[email protected]> wrote:
Hello everybody,

Having already presented the draft of our architecture, I would now like to
discuss the second layer in more detail. As mentioned before, we have chosen
UIMA for this layer. The main aggregate currently consists of the Whitespace
Tokenizer Annotator, the Snowball Annotator (stemming) and a list-based
StopwordFilter. Before running this aggregate in a map-only Hadoop job, we
want to strip all HTML tags and pass only the preprocessed text to the
aggregate. The reason is that it is difficult to change a document's text
while UIMA is processing it, and it is impractical to keep working on
documents that still contain HTML tags.
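
To make the pre-filtering step concrete, here is a minimal sketch in plain
Java. It is not the Duisburg-Essen filter and it does not use the UIMA or
Hadoop APIs. The class HtmlPreprocessor and its methods are hypothetical
names chosen for illustration. The regular expressions are a naive stand-in
for a real HTML parser: they do not decode entities and will fail on badly
broken markup. The stopword list is passed in by the caller, and stemming
is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of UIMA or Hadoop.
class HtmlPreprocessor {
    // script/style elements are dropped together with their content.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces markup with spaces, then collapses runs of whitespace.
    // Naive assumption: entities such as &amp; are left untouched.
    static String stripTags(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    // Splits on whitespace and drops tokens found in the stopword list.
    // Stopwords are expected in lower case.
    static List<String> tokenize(String text, Set<String> stopwords) {
        List<String> tokens = new ArrayList<>();
        for (String token : WHITESPACE.split(text.trim())) {
            if (token.isEmpty()) {
                continue;
            }
            if (!stopwords.contains(token.toLowerCase(Locale.ROOT))) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>The parser <b>strips</b> a tag</p>"
                + "<script>var x = 1;</script></body></html>";
        String text = stripTags(html);
        System.out.println(text);
        System.out.println(tokenize(text, Set.of("the", "a")));
    }
}
```

In a map-only job, something like stripTags would run on each document
before the text is handed to the aggregate, so the annotators only ever
see plain text.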

Furthermore, we are planning to add the Tagger Annotator, which implements a
Hidden Markov Model tagger. Here we aren't sure which tokens to keep for
feature extraction and which to delete, based on their part-of-speech tags.
One option would be to use only nouns and verbs at the very beginning.
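
The idea of keeping only nouns (substantives) and verbs could be sketched
as follows. This is only an illustration under stated assumptions: the
class PosFilter and the TaggedToken holder are hypothetical, and the tags
are assumed to follow a Penn-Treebank-style scheme in which noun tags
start with "NN" and verb tags start with "VB". The tag set actually
produced depends on the tagger model in use, so the prefixes would need
to be adapted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; in UIMA the tag would be read from the token annotation.
class PosFilter {
    // Simple holder pairing a token with its part-of-speech tag.
    static final class TaggedToken {
        final String text;
        final String posTag;

        TaggedToken(String text, String posTag) {
            this.text = text;
            this.posTag = posTag;
        }
    }

    // Assumption: Penn-Treebank-style tags (NN, NNS, NNP, VB, VBD, ...).
    static boolean isNounOrVerb(String posTag) {
        return posTag != null
                && (posTag.startsWith("NN") || posTag.startsWith("VB"));
    }

    // Returns the text of every token tagged as a noun or verb, in order.
    static List<String> keepNounsAndVerbs(List<TaggedToken> tokens) {
        List<String> kept = new ArrayList<>();
        for (TaggedToken token : tokens) {
            if (isNounOrVerb(token.posTag)) {
                kept.add(token.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<TaggedToken> sentence = List.of(
                new TaggedToken("The", "DT"),
                new TaggedToken("dogs", "NNS"),
                new TaggedToken("bark", "VBP"),
                new TaggedToken("loudly", "RB"));
        System.out.println(keepNounsAndVerbs(sentence));
    }
}
```

Keeping the tag test in one small method makes it easy to try other
selections later, for example adding adjectives, without touching the
rest of the feature extraction.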

We are very interested in your comments and remarks and it would be nice to
hear from you.

Cheers,
Marc



