Hi,

I've written a plugin for Nutch 1.0 that implements HtmlParseFilter so that
I can extract some additional information from the HTML document.

All the parameters arrive in the filter method as expected, and I want to
run XPath queries against the DocumentFragment object.
I tried something simple, extracting all "h1" tags, but no matter
what I do I always get 0 results.

What is the relationship between DocumentFragment and XPath?
Is it even possible to evaluate XPath expressions against a DocumentFragment object?

Here is my filter method:
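import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.NodeList;

public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc)
{
    Parse parse = parseResult.get(content.getUrl());
    Metadata metadata = parse.getData().getParseMeta();

    XPathFactory factory = XPathFactory.newInstance();
    XPath xpath = factory.newXPath();
    try
    {
        XPathExpression expr = xpath.compile("//h1");
        // Evaluate against the DocumentFragment passed in by Nutch.
        Object result = expr.evaluate(doc, XPathConstants.NODESET);
        NodeList nodes = (NodeList) result;
        System.out.println("Found " + nodes.getLength() + " matches!");
        for (int i = 0; i < nodes.getLength(); i++)
        {
            // getNodeValue() is null for element nodes; getTextContent()
            // returns the text inside the element.
            System.out.println(nodes.item(i).getTextContent());
        }
    }
    catch (XPathExpressionException e)
    {
        System.out.println("Error: " + e);
    }
    return parseResult;
}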
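
To narrow this down outside of Nutch, a small standalone check might help. The sketch below builds a DocumentFragment by hand and queries it with the standard javax.xml.xpath API. It rests on an assumption that is not confirmed anywhere in this post: that the HTML parser behind Nutch hands over element names in upper case (e.g. "H1"), in which case a lowercase name test would match nothing, because XPath name tests are case-sensitive. The class name FragmentXPathDemo and its helper methods are made up for illustration, and the queries use the relative form ".//" so that they are anchored at the fragment itself.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class; not part of Nutch.
class FragmentXPathDemo {

    // Builds a detached fragment holding <DIV><H1>Title</H1></DIV>.
    // The upper-case names are an assumption about what the parser emits.
    static DocumentFragment buildFragment() {
        try {
            Document owner = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            DocumentFragment fragment = owner.createDocumentFragment();
            Element div = owner.createElement("DIV");
            Element h1 = owner.createElement("H1");
            h1.setTextContent("Title");
            div.appendChild(h1);
            fragment.appendChild(div);
            return fragment;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates an XPath expression with the fragment as the context node.
    static NodeList select(DocumentFragment fragment, String expression) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, fragment, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DocumentFragment fragment = buildFragment();

        // Case-sensitive name tests: only the spelling used in the tree matches.
        System.out.println("lowercase: " + select(fragment, ".//h1").getLength());
        System.out.println("uppercase: " + select(fragment, ".//H1").getLength());

        // Matches H1 or h1, and ignores any namespace, by comparing local-name().
        String anyCase = ".//*[translate(local-name(), 'H', 'h') = 'h1']";
        NodeList nodes = select(fragment, anyCase);
        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println("text: " + nodes.item(i).getTextContent());
        }
    }
}
```

If the uppercase query finds the heading here while the lowercase one does not, trying "//H1" or the local-name() form inside the filter would be a reasonable next step. Printing doc.getFirstChild().getNodeName() from within the plugin should show which spelling the tree actually uses.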
Thanks,
Eran