Re: XPath with ExtractingRequestHandler
Yeah, I tried: //xhtml:div[@class='bibliographicData']/descendant:node() and also tried //xhtml:div[@class='bibliographicData']. Neither worked. The DIV I need also has an ID value, and I tried both variations with the ID as well. Still nothing. Tika's XPath handling seems to be pretty basic and does not appear to support most XPath query syntax, probably because it is using a streaming SAX parser, but I don't know for sure. I guess I will have to write something custom to get it to do what I need. Thanks for the reply though. I will post a follow-up with how I fixed this. --mike
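One possible shape for the "something custom" mentioned above (a sketch, not what the author actually ended up doing): do the full-featured XPath matching yourself before the document reaches Tika. Python's stdlib ElementTree supports descendant searches with attribute predicates, which covers this case. The class name matches the thread; the HTML sample is made up, and real XHTML input would additionally need a namespace map for the xhtml: prefix.

```python
# Sketch: extract the bibliographicData div with a real descendant
# search, instead of relying on Tika's restricted XPath subset.
import xml.etree.ElementTree as ET

html = """<html><body>
  <div class="navigation">skip me</div>
  <div class="bibliographicData"><span>Author: Smith</span></div>
</body></html>"""

root = ET.fromstring(html)
# ".//div[@class='bibliographicData']" is ElementTree's equivalent of
# //div[@class='bibliographicData'] -- a descendant search from the root.
div = root.find(".//div[@class='bibliographicData']")
text = "".join(div.itertext())
print(text)  # -> Author: Smith
```

Note that ET.fromstring requires well-formed markup, so messy real-world HTML would first need a cleanup pass (e.g. through an HTML-to-XHTML converter, as discussed later in this thread).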
XPath with ExtractingRequestHandler
I want to restrict the HTML that is returned by Tika to basically: /xhtml:html/xhtml:body//xhtml:div[@class='bibliographicData']/descendant:node() but it seems that the XPath class being used does not support the '//' syntax. Is there any way to configure Tika to use a different XPath evaluation class?
ExtractingRequestHandler and HTML
I am submitting HTML documents to Solr using the ERH. Is it possible to store the contents of a document (including all markup) in a field? Using fmap.content (I am assuming this comes from Tika) stores the extracted text of the document in a field, but not the markup. I want the whole un-altered document. Is this possible? thanks --mike
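One workaround (a sketch only, not confirmed ERH behavior): skip the extraction handler for the raw copy and post the un-altered markup yourself through the plain update handler. The field name "raw_html" is hypothetical and would need to exist as a stored field in the schema.

```python
# Sketch: build a JSON update payload that stores the full markup
# verbatim in a (hypothetical) stored string field "raw_html".
import json

def build_update_payload(doc_id, html_source):
    """Return a Solr JSON update body carrying the un-altered document."""
    return json.dumps([{"id": doc_id, "raw_html": html_source}])

payload = build_update_payload("doc1", "<html><body><p>hi</p></body></html>")
print(payload)
# This string would then be POSTed to the collection's /update handler
# with Content-Type: application/json.
```

The extracted-text field from Tika and this raw field would then live side by side on the same document, at the cost of sending the content to Solr twice.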
XPathEntityProcessor and ExtractingRequestHandler
Can I use an XPathEntityProcessor in conjunction with an ExtractingRequestHandler? Also, is the scripting language that XPathEntityProcessor uses/supports just ECMA/JavaScript? Or is XPathEntityProcessor only supported for use in conjunction with the DataImportHandler? Thanks.
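For reference, XPathEntityProcessor is wired up inside a DataImportHandler data-config file rather than inside the ExtractingRequestHandler. A minimal sketch (the file path and field names here are made up):

```xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- forEach selects the repeating element; the xpath attributes use
         the limited streaming XPath subset that DIH supports -->
    <entity name="page"
            processor="XPathEntityProcessor"
            url="/data/pages.xml"
            forEach="/pages/page"
            stream="true">
      <field column="id"    xpath="/pages/page/@id"/>
      <field column="title" xpath="/pages/page/title"/>
    </entity>
  </document>
</dataConfig>
```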
Re: Document Processing
Hello Erik, I will take a look at both org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor and org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor and figure out what I need to extend to handle processing the way I am looking for. I am assuming that "component" configuration is handled in a standard way, such that I can configure my new UpdateProcessor the same way I would configure any other UpdateProcessor "component"? Thanks for the suggestion. One more question: given that I am probably going to convert the HTML to XML so I can use XPath expressions to "extract" my content, do you think this kind of processing will overload Solr? This Solr instance will be used solely for indexing, and will only ever have a single ManifoldCF crawling job feeding it documents at a time. --mike
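For what it's worth, custom update processors are indeed configured in solrconfig.xml the same way as the stock ones: a factory class inside an updateRequestProcessorChain. A sketch, where the custom factory class and its xpath parameter are hypothetical:

```xml
<updateRequestProcessorChain name="html-extract">
  <!-- hypothetical custom factory; the package/class name is made up -->
  <processor class="com.example.solr.HtmlFieldExtractProcessorFactory">
    <str name="xpath">//div[@class='bibliographicData']</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain is then selected per request with the update.chain parameter, or set as a default on the update request handler.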
Re: Document Processing
On 12/05/2011 01:52 PM, Michael Kelleher wrote: I am crawling a bunch of HTML pages within a site (using ManifoldCF) that will be sent to Solr for indexing. I want to extract some content out of the pages, with each piece of content stored as its own field BEFORE indexing in Solr. My guess is that I should use a document-processing pipeline in Solr, like UIMA or something similar. What would be the best way of handling this kind of processing? Would it be preferable to use a document-processing pipeline such as OpenPipe, UIMA, etc.? Should this be handled externally, or would the DataImportHandler suffice? The Solr server being used for this will be used solely for indexing, and the "submit" jobs from the crawler will be very controlled, and not high volume after the initial crawl. thanks.