Hi Ziqi,

Have you checked out the XPath extractor suite [0] in the extractor package? I've not used it myself, but it looks like exactly what you are after.
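
To make the idea concrete, here is a minimal sketch of recovering the path to the element that holds a given value. It is not Any23 code and does not use the Any23 API: it is plain Python with the standard library, and the function name `xpaths_for_text` and the sample markup are made up for illustration. It also assumes the page has already been cleaned into well-formed XML, which real-world HTML usually is not.

```python
import xml.etree.ElementTree as ET


def xpaths_for_text(root, needle):
    """Return positional XPaths of elements whose own text contains `needle`.

    Hypothetical helper for illustration only; matching is case-insensitive.
    """
    needle = needle.lower()
    matches = []

    def walk(elem, path):
        if elem.text and needle in elem.text.lower():
            matches.append(path)
        # Count same-tag siblings so each step gets a 1-based position.
        seen = {}
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return matches


# Made-up, well-formed stand-in for the relevant part of a page.
SAMPLE = """<html><body>
<div class="txt-block"><a>Al Pacino</a></div>
<div class="txt-block"><a>Robert De Niro</a></div>
</body></html>"""

root = ET.fromstring(SAMPLE)
paths = xpaths_for_text(root, "al pacino")
print(paths)  # ['/html[1]/body[1]/div[1]/a[1]']
```

The same walk could record the path for every literal as it is visited, so that each extracted object value can be paired with the location it came from.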

hth
Lewis

[0] http://any23.apache.org/apidocs/index.html?org/apache/any23/extractor/xpath/package-summary.html

On Mon, Oct 29, 2012 at 10:13 AM, Ziqi Zhang <[email protected]> wrote:
> Hi all
>
> We have a special need in our work: we need not only to extract triples
> from a page, but also to know the contexts of the triples. By context I
> mean the html elements containing the Subject or Object of the triple, and
> the xpath to them. For example, on this page
> http://www.imdb.com/title/tt0071562/, let's suppose there are the triples
> "_:nodexyz <rdfs:type> http://schema.org/Movie" and
> "_:nodexyz <schema.org/itemprop/actors> Al Pacino".
>
> I would like to be able to know that "al pacino" is in an html element that
> has this xpath: <html><body><blahblah><div class="txt-block"><a>
>
> Can you give some general suggestions on which classes I should extend,
> or a starting point?
>
> Many thanks!
>
> --
> Ziqi Zhang
>

--
Lewis
