Extracting triples tags or hash tags from html

2011-07-17 Thread lewis john mcgibbney

Hi, is this currently possible with Tika 0.9 in Nutch branch 1.4? I would have thought that this would be dealt with in Tika; however, I have seen no mention of anyone having problems extracting this from web documents when fetching with Nutch, or even discussing it. For example say I had

Re: Extracting triples tags or hash tags from html

2011-07-17 Thread Julien Nioche

You simply need to write an HTMLParser; these receive the DOM representation of the page from parse-tika (or parse-html). See JIRA for the entry on the metatag parser for an example and discussion. There is usually no need to modify parse-html or Tika at all.

Julien

On 17 July 2011 16:23, lewis
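
As a rough illustration of the approach described in the reply, the sketch below walks a DOM tree and collects hash tags from its text nodes, using only the JDK's `org.w3c.dom` API. Everything here is an assumption made for illustration: the class name `HashTagExtractor` is hypothetical, the regular expression is one possible definition of a hash tag, and the wiring into Nutch is left out. As far as I know, the extension point in Nutch 1.x is `HtmlParseFilter`, whose filter method is handed the page as a `DocumentFragment`; that node could be passed to `extract` as-is. Triple tags could presumably be handled the same way with a different pattern.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.w3c.dom.Node;

// Hypothetical helper, not part of Nutch or Tika.
final class HashTagExtractor {

    // Assumed definition of a hash tag: '#' followed by ASCII letters, digits
    // or '_', and not preceded by a word character (so "a#b" is skipped).
    private static final Pattern HASH_TAG = Pattern.compile("(?<!\\w)#(\\w+)");

    private HashTagExtractor() {}

    /** Collects hash tags (without the '#') from all text nodes under root, in document order. */
    static Set<String> extract(Node root) {
        Set<String> tags = new LinkedHashSet<>();
        collect(root, tags);
        return tags;
    }

    // Depth-first walk: match text nodes, then recurse into children.
    private static void collect(Node node, Set<String> tags) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            Matcher m = HASH_TAG.matcher(node.getNodeValue());
            while (m.find()) {
                tags.add(m.group(1));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, tags);
        }
    }
}
```

Inside a parse filter, the resulting set could then be added to the parse metadata so that an indexing filter can pick it up. Note that this walk also scans text inside script and style elements; a real plugin would probably want to skip those.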