Re: XPath with ExtractingRequestHandler

2011-12-15 Thread Michael Kelleher
Yeah, I tried: //xhtml:div[@class='bibliographicData']/descendant:node() and also tried: //xhtml:div[@class='bibliographicData'] Neither worked. The DIV I need also had an ID value, and I tried both variations on the ID as well. Still nothing. XPath handling for Tika seems to be pretty basic and

XPath with ExtractingRequestHandler

2011-12-14 Thread Michael Kelleher
I want to restrict the HTML that is returned by Tika to basically: /xhtml:html/xhtml:body//xhtml:div[@class='bibliographicData']/descendant:node() and it seems that the XPath class being used does not support the '//' syntax. Is there any way to configure Tika to use a different XPath e
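
For illustration only, and not something proposed in the thread: one possible workaround is to select the DIV before the document ever reaches Solr. The sketch below uses Python's standard-library ElementTree, which understands a small XPath subset including `//` and `[@attr='value']` predicates. It assumes the page is well-formed XHTML in the standard XHTML namespace; the helper names `select_divs` and `inner_text` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the pages are well-formed XHTML in the standard XHTML namespace.
XHTML_NS = "http://www.w3.org/1999/xhtml"


def select_divs(xhtml_text, css_class="bibliographicData"):
    """Return every <div> whose class attribute equals css_class exactly."""
    root = ET.fromstring(xhtml_text)
    # ElementTree uses Clark notation ({namespace}tag) rather than prefixes.
    path = ".//{%s}div[@class='%s']" % (XHTML_NS, css_class)
    return root.findall(path)


def inner_text(element):
    """Concatenate all text nodes below the element."""
    return "".join(element.itertext()).strip()


# Hypothetical sample page, made up for this sketch.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="nav">Home</div>
    <div class="bibliographicData"><p>Title: Example</p></div>
  </body>
</html>"""

matches = select_divs(SAMPLE)
texts = [inner_text(m) for m in matches]
```

The extracted text could then be sent to Solr as an ordinary field value. Real-world HTML is often not well-formed, in which case a tolerant HTML parser would be needed in place of `ET.fromstring`.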

ExtractingRequestHandler and HTML

2011-12-12 Thread Michael Kelleher
I am submitting HTML documents to Solr using the ERH. Is it possible to store the contents of the document (including all markup) into a field? Using fmap.content (I am assuming this comes from Tika) stores the extracted text of the document in a field, but not the markup. I want the whole un

XPathEntityProcessor and ExtractingRequestHandler

2011-12-07 Thread Michael Kelleher
Can I use an XPathEntityProcessor in conjunction with an ExtractingRequestHandler? Also, the scripting language that XPathEntityProcessor uses/supports, is that just ECMA/JavaScript? Or is XPathEntityProcessor only supported for use in conjunction with the DataImportHandler? Thanks.

Re: Document Processing

2011-12-05 Thread Michael Kelleher
Hello Erik, I will take a look at both: org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor and org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor and figure out what I need to extend to handle processing in the way I am looking for. I am assumi

Re: Document Processing

2011-12-05 Thread Michael Kelleher
On 12/05/2011 01:52 PM, Michael Kelleher wrote: I am crawling a bunch of HTML pages within a site (using ManifoldCF), that will be sent to Solr for indexing. I want to extract some content out of the pages, each piece of content to be stored as its own field BEFORE indexing in Solr. My

Document Processing

2011-12-05 Thread Michael Kelleher
I am crawling a bunch of HTML pages within a site (using ManifoldCF), that will be sent to Solr for indexing. I want to extract some content out of the pages, each piece of content to be stored as its own field BEFORE indexing in Solr. My guess would be that I should use a Document processing