Hi Marc,

How are you planning on cleaning up the HTML documents?
Perhaps something like this would be useful: I came across an interesting approach a few days ago, and it would be interesting to hear from someone who has tried something like it:
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/

It is described further, with Java implementations, here:
http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html

I've also pasted a couple of rough sketches below the quoted mail: one of that text-to-tag-ratio idea, and one for the nouns-and-verbs filtering you mention.

Drew

On Sat, Nov 28, 2009 at 2:57 PM, Marc Hofer <[email protected]> wrote:
> Hello everybody,
>
> Having already presented the draft of our architecture, I would now like to
> discuss the second layer in more detail. As mentioned before, we have chosen
> UIMA for this layer. The main aggregate currently consists of the Whitespace
> Tokenizer Annotator, the Snowball Annotator (stemming) and a list-based
> StopwordFilter. Before running this aggregate in a map-only Hadoop job, we
> want to strip all HTML tags and forward only the preprocessed text to the
> aggregate. The reason is that it is difficult to change a document during
> processing in UIMA, and it is impractical to work the whole time on documents
> that still contain HTML tags.
>
> Furthermore, we are planning to add the Tagger Annotator, which implements a
> Hidden Markov Model tagger. Here we are not yet sure which tokens, based on
> their part-of-speech tags, to keep or discard before using them for feature
> extraction. One option could be to start with only nouns and verbs.
>
> We are very interested in your comments and remarks and it would be nice to
> hear from you.
>
> Cheers,
> Marc
>
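
In case it helps, here is a rough sketch of the text-to-tag-ratio idea from
the two links above. It is only my reading of the approach, not the code from
those posts: the class name, regexes and threshold are placeholders, and the
real implementations smooth the ratio over neighbouring lines and learn the
cut-off instead of hard-coding it.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Keeps only the lines of an HTML page whose ratio of readable text to
// markup is high; navigation, ads and footers are mostly tags and drop out.
public class TextDensityExtractor {

    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final double THRESHOLD = 10.0;

    public static String extract(String html) {
        // Scripts and stylesheets contribute markup but no readable text.
        String cleaned = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        StringBuilder out = new StringBuilder();

        for (String line : cleaned.split("\\r?\\n")) {
            // Count the tags on the line and the characters left once tags are gone.
            int tags = 0;
            Matcher m = TAG.matcher(line);
            while (m.find()) {
                tags++;
            }
            String text = TAG.matcher(line).replaceAll(" ").trim();

            double ratio = text.length() / (double) (tags + 1);
            if (ratio >= THRESHOLD && !text.isEmpty()) {
                out.append(text).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = String.join("\n",
                "<div class=\"nav\"><a href=\"/\">Home</a> <a href=\"/about\">About</a></div>",
                "<p>This paragraph carries the actual article text, with far more readable",
                "characters than markup, so its text-to-tag ratio stays high.</p>",
                "<div class=\"footer\"><a href=\"/impressum\">Imprint</a></div>");
        System.out.println(extract(html)); // prints only the paragraph text
    }
}

Something like that, run before the map-only job, would mean the aggregate
only ever sees the surviving plain text.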

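On the question at the end of your mail, one low-effort way to try the
nouns-and-verbs idea is a small annotator placed after the Tagger Annotator
that drops every token whose tag is neither. Again only a sketch: I don't
know which token type and POS feature your tagger writes, so the two names
below are placeholders, and the NN*/VB* prefixes assume Penn-Treebank-style
tags; adjust them to whatever tag set the model was trained on.

import java.util.ArrayList;
import java.util.List;

import org.apache.uima.analysis_component.CasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.FeatureStructure;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;

// Removes from the CAS indexes every token whose POS tag is not a noun or a
// verb, so the downstream feature extraction only ever sees those tokens.
public class PosFilterAnnotator extends CasAnnotator_ImplBase {

    // Placeholder names: replace with the token type and POS feature that
    // the Tagger Annotator in your aggregate actually produces.
    private static final String TOKEN_TYPE = "org.apache.uima.TokenAnnotation";
    private static final String POS_FEATURE = "posTag";

    @Override
    public void process(CAS cas) throws AnalysisEngineProcessException {
        Type tokenType = cas.getTypeSystem().getType(TOKEN_TYPE);
        Feature posFeature = tokenType.getFeatureByBaseName(POS_FEATURE);

        // Collect first, then remove, so the index is not modified while iterating.
        List<FeatureStructure> toRemove = new ArrayList<FeatureStructure>();
        FSIterator<AnnotationFS> it = cas.getAnnotationIndex(tokenType).iterator();
        while (it.hasNext()) {
            AnnotationFS token = it.next();
            String pos = token.getStringValue(posFeature);
            // Assuming Penn-Treebank-style tags: NN* for nouns, VB* for verbs.
            boolean keep = pos != null && (pos.startsWith("NN") || pos.startsWith("VB"));
            if (!keep) {
                toRemove.add(token);
            }
        }
        for (FeatureStructure fs : toRemove) {
            cas.removeFsFromIndexes(fs);
        }
    }
}

Adding it as the last delegate in the aggregate would make it easy to compare
a nouns-plus-verbs run against a run that keeps all tokens.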