Hi Marc,

How are you planning on cleaning up the HTML documents?
Perhaps something like this would be useful: I came across an interesting approach a few days ago, and it would be interesting to hear from someone who has tried something like it:
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/

It is described further, with Java implementations, here:
http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html

I've also pasted a couple of rough sketches below the quoted mail: one of that text-to-tag-ratio idea, and one for the nouns-and-verbs filtering you mention.

Drew

On Sat, Nov 28, 2009 at 2:57 PM, Marc Hofer <[email protected]> wrote:
> Hello everybody,
>
> Having already presented the draft of our architecture, I would now like to
> discuss the second layer in more detail. As mentioned before, we have chosen
> UIMA for this layer. The main aggregate currently consists of the Whitespace
> Tokenizer Annotator, the Snowball Annotator (stemming) and a list-based
> StopwordFilter. Before running this aggregate in a map-only Hadoop job, we
> want to strip all HTML tags and forward only the preprocessed text to the
> aggregate. The reason is that it is difficult to change a document during
> processing in UIMA, and it is impractical to work the whole time on documents
> that still contain HTML tags.
>
> Furthermore, we are planning to add the Tagger Annotator, which implements a
> Hidden Markov Model tagger. Here we are not yet sure which tokens, based on
> their part-of-speech tags, to keep or discard before using them for feature
> extraction. One option could be to start with only nouns and verbs.
>
> We are very interested in your comments and remarks and it would be nice to
> hear from you.
>
> Cheers,
> Marc
>
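
In case it helps, here is a rough sketch of the text-to-tag-ratio idea from
the two links above. It is only my reading of the approach, not the code from
those posts: the class name, regexes and threshold are placeholders, and the
real implementations smooth the ratio over neighbouring lines and learn the
cut-off instead of hard-coding it.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Keeps only the lines of an HTML page whose ratio of readable text to
// markup is high; navigation, ads and footers are mostly tags and drop out.
public class TextDensityExtractor {

    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final double THRESHOLD = 10.0;

    public static String extract(String html) {
        // Scripts and stylesheets contribute markup but no readable text.
        String cleaned = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        StringBuilder out = new StringBuilder();

        for (String line : cleaned.split("\\r?\\n")) {
            // Count the tags on the line and the characters left once tags are gone.
            int tags = 0;
            Matcher m = TAG.matcher(line);
            while (m.find()) {
                tags++;
            }
            String text = TAG.matcher(line).replaceAll(" ").trim();

            double ratio = text.length() / (double) (tags + 1);
            if (ratio >= THRESHOLD && !text.isEmpty()) {
                out.append(text).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = String.join("\n",
                "<div class=\"nav\"><a href=\"/\">Home</a> <a href=\"/about\">About</a></div>",
                "<p>This paragraph carries the actual article text, with far more readable",
                "characters than markup, so its text-to-tag ratio stays high.</p>",
                "<div class=\"footer\"><a href=\"/impressum\">Imprint</a></div>");
        System.out.println(extract(html)); // prints only the paragraph text
    }
}

Something like that, run before the map-only job, would mean the aggregate
only ever sees the surviving plain text.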

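On the question at the end of your mail, one low-effort way to try the
nouns-and-verbs idea is a small annotator placed after the Tagger Annotator
that drops every token whose tag is neither. Again only a sketch: I don't
know which token type and POS feature your tagger writes, so the two names
below are placeholders, and the NN*/VB* prefixes assume Penn-Treebank-style
tags; adjust them to whatever tag set the model was trained on.

import java.util.ArrayList;
import java.util.List;

import org.apache.uima.analysis_component.CasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.FeatureStructure;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;

// Removes from the CAS indexes every token whose POS tag is not a noun or a
// verb, so the downstream feature extraction only ever sees those tokens.
public class PosFilterAnnotator extends CasAnnotator_ImplBase {

    // Placeholder names: replace with the token type and POS feature that
    // the Tagger Annotator in your aggregate actually produces.
    private static final String TOKEN_TYPE = "org.apache.uima.TokenAnnotation";
    private static final String POS_FEATURE = "posTag";

    @Override
    public void process(CAS cas) throws AnalysisEngineProcessException {
        Type tokenType = cas.getTypeSystem().getType(TOKEN_TYPE);
        Feature posFeature = tokenType.getFeatureByBaseName(POS_FEATURE);

        // Collect first, then remove, so the index is not modified while iterating.
        List<FeatureStructure> toRemove = new ArrayList<FeatureStructure>();
        FSIterator<AnnotationFS> it = cas.getAnnotationIndex(tokenType).iterator();
        while (it.hasNext()) {
            AnnotationFS token = it.next();
            String pos = token.getStringValue(posFeature);
            // Assuming Penn-Treebank-style tags: NN* for nouns, VB* for verbs.
            boolean keep = pos != null && (pos.startsWith("NN") || pos.startsWith("VB"));
            if (!keep) {
                toRemove.add(token);
            }
        }
        for (FeatureStructure fs : toRemove) {
            cas.removeFsFromIndexes(fs);
        }
    }
}

Adding it as the last delegate in the aggregate would make it easy to compare
a nouns-plus-verbs run against a run that keeps all tokens.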