Yeah, I tried:
//xhtml:div[@class='bibliographicData']/descendant::node()
also tried
//xhtml:div[@class='bibliographicData']
Neither worked. The DIV I need also has an ID attribute, and I tried both
variations using the ID as well. Still nothing.
Tika's XPath handling seems to be pretty basic. I want to restrict the
HTML that is returned by Tika to essentially:
/xhtml:html/xhtml:body//xhtml:div[@class='bibliographicData']/descendant::node()
but it seems that the XPath class being used does not support the '//'
syntax.
Is there any way to configure Tika to use a different XPath engine?
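For what it's worth, the expression itself is fine under a full XPath 1.0 engine, e.g. the JDK's built-in javax.xml.xpath, which does support the '//' shorthand and descendant::. This is only a sketch of evaluating such an expression outside Tika (the class name and sample markup are made up), not a way to plug a different engine into Tika's parser:

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathEngineDemo {

    // Evaluate a '//' expression against an XHTML string using the JDK's
    // full XPath 1.0 engine, with the xhtml: prefix bound to the XHTML namespace.
    static String extract(String xhtml, String expression) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);  // required for the xhtml: prefix to resolve
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "xhtml".equals(prefix)
                        ? "http://www.w3.org/1999/xhtml" : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) {
                return Collections.<String>emptyList().iterator();
            }
        });

        Node node = (Node) xpath.evaluate(expression, doc, XPathConstants.NODE);
        return node == null ? null : node.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
                + "<div class='bibliographicData'>Title: Example</div></body></html>";
        // Prints the text content of the matched div.
        System.out.println(extract(xhtml, "//xhtml:div[@class='bibliographicData']"));
    }
}
```

So one option is to let Tika emit the whole XHTML body and run the real XPath selection yourself afterwards.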
I am submitting HTML documents to Solr using the ExtractingRequestHandler
(ERH). Is it possible to store the contents of the document (including
all markup) in a field? Using fmap.content (I am assuming this comes from
Tika) stores the extracted text of the document in a field, but not the
markup. I want the whole unmodified document.
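One workaround (a sketch, not something ERH does for you): keep using ERH for the extracted text, but also send the raw markup yourself in a separate stored string field in a plain JSON update. The field name raw_html below is hypothetical; the helper just shows building an update body with the markup escaped intact:

```java
public class RawHtmlUpdate {

    // Minimal JSON string escaping (quotes, backslashes, control characters).
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build a Solr JSON update body that carries the unmodified markup.
    // "raw_html" is a hypothetical stored string field in the schema.
    static String toUpdateJson(String id, String html) {
        return "[{\"id\":\"" + jsonEscape(id)
                + "\",\"raw_html\":\"" + jsonEscape(html) + "\"}]";
    }

    public static void main(String[] args) {
        String html = "<div class=\"bibliographicData\">Title: Example</div>";
        // POST this body to /solr/<core>/update with Content-Type: application/json
        System.out.println(toUpdateJson("doc-1", html));
    }
}
```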
Can I use an XPathEntityProcessor in conjunction with an
ExtractingRequestHandler? Also, is the scripting language that
XPathEntityProcessor uses/supports just ECMA/JavaScript?
Or is XPathEntityProcessor only supported for use in conjunction with the
DataImportHandler?
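For reference, XPathEntityProcessor is wired up through the DataImportHandler, not the ExtractingRequestHandler, and it implements its own streaming XPath subset (also limited, though it does allow leading //). A hedged sketch of a data-config.xml, with the URL and field names made up:

```xml
<dataConfig>
  <dataSource type="URLDataSource"/>
  <document>
    <entity name="page"
            processor="XPathEntityProcessor"
            url="http://example.com/page.xhtml"
            forEach="/html">
      <!-- hypothetical mapping: div text into a Solr field named biblio -->
      <field column="biblio" xpath="/html/body/div"/>
    </entity>
  </document>
</dataConfig>
```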
Thanks.
Hello Erik,
I will take a look at both:
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
and
org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
and figure out what I need to extend to handle processing in the way I
am looking for. I am assumi
On 12/05/2011 01:52 PM, Michael Kelleher wrote:
I am crawling a bunch of HTML pages within a site (using ManifoldCF),
that will be sent to Solr for indexing. I want to extract some content
out of the pages, each piece of content to be stored as its own field
BEFORE indexing in Solr.
My guess would be that I should use a Document processing
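The extraction step itself could look something like the sketch below, assuming the crawled pages are well-formed XHTML. The class and field names are illustrative, and the wiring into a Solr UpdateRequestProcessor or a ManifoldCF transformation is left out; this only shows pulling page fragments into a map of field values before indexing:

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class FieldExtractor {

    // fieldXPaths: target Solr field name -> XPath locating its content.
    // Parsing here is namespace-unaware, so plain (unprefixed) names match.
    static Map<String, String> extractFields(String xhtml,
                                             Map<String, String> fieldXPaths)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        Map<String, String> fields = new HashMap<>();
        for (Map.Entry<String, String> e : fieldXPaths.entrySet()) {
            Node node = (Node) xpath.evaluate(e.getValue(), doc,
                    XPathConstants.NODE);
            if (node != null) {
                fields.put(e.getKey(), node.getTextContent().trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div class='bibliographicData'>Dublin, 1922</div>"
                + "<h1>Ulysses</h1></body></html>";

        Map<String, String> wanted = new HashMap<>();
        wanted.put("biblio", "//div[@class='bibliographicData']");
        wanted.put("title", "//h1");

        System.out.println(extractFields(page, wanted));
    }
}
```

Each entry in the resulting map would then become one field on the Solr document.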