Re: XPath with ExtractingRequestHandler

2011-12-15 Thread Michael Kelleher

Yeah, I tried:

//xhtml:div[@class='bibliographicData']/descendant:node()

also tried

//xhtml:div[@class='bibliographicData']

Neither worked.  The DIV I need also has an ID attribute, and I tried both 
variations with the ID as well.  Still nothing.



XPath handling in Tika seems to be pretty basic and does not support most 
XPath query syntax.  Probably that is because it evaluates expressions 
against a SAX event stream rather than building a DOM, but I don't know for 
certain.  I guess I will have to write something custom to get it to do 
what I need.
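
In case it helps anyone following along, my current plan is a small custom 
SAX handler along these lines (an untested sketch; the class name is mine, 
and the depth counter is there so nested elements don't end the capture 
early):

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Untested sketch: capture only the text inside
// <div class="bibliographicData">, using a depth counter so that
// nested elements do not end the capture early.
public class BibliographicDivHandler extends DefaultHandler {

    private final StringBuilder captured = new StringBuilder();
    private int depth = 0;  // > 0 while inside the target div

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (depth > 0) {
            depth++;  // any element nested inside the target div
        } else if ("div".equals(localName)
                && "bibliographicData".equals(atts.getValue("class"))) {
            depth = 1;  // entered the div we care about
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            captured.append(ch, start, length);
        }
    }

    public String getCapturedText() {
        return captured.toString();
    }
}

The idea would be to pass this handler to Tika's parse() call in place of 
the usual content handler and index whatever it captures.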


Thanks for the reply though.

I will post a follow-up with how I fixed this.

--mike


XPath with ExtractingRequestHandler

2011-12-14 Thread Michael Kelleher

I want to restrict the HTML that is returned by Tika to basically:


/xhtml:html/xhtml:body//xhtml:div[@class='bibliographicData']/descendant:node()



and it seems that the XPath class being used does not support the '//' 
syntax.


Is there any way to configure Tika to use a different XPath evaluation class?
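
For reference, this is roughly how I understand Tika applies its XPath 
subset internally, sketched with the org.apache.tika.sax.xpath classes 
(untested; the path below is the commonly documented body default, with 
the single-colon descendant:node() spelling that Tika appears to use, not 
the expression I actually need):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
import org.xml.sax.ContentHandler;

public class TikaXPathSubset {
    public static void main(String[] args) throws Exception {
        // Tika's XPathParser understands only a small subset of XPath:
        // fixed element steps plus descendant:node().  Attribute
        // predicates like [@class='...'] do not seem to be part of
        // the subset.
        XPathParser xpathParser =
                new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
        Matcher matcher =
                xpathParser.parse("/xhtml:html/xhtml:body/descendant:node()");

        // only SAX events that match the expression reach the delegate
        BodyContentHandler body = new BodyContentHandler();
        ContentHandler filtered = new MatchingContentHandler(body, matcher);

        InputStream in = new FileInputStream("page.html");
        try {
            new AutoDetectParser().parse(in, filtered, new Metadata(),
                    new ParseContext());
        } finally {
            in.close();
        }
        System.out.println(body.toString());
    }
}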




ExtractingRequestHandler and HTML

2011-12-12 Thread Michael Kelleher
I am submitting HTML documents to Solr using the ExtractingRequestHandler 
(ERH).  Is it possible to store the contents of a document (including all 
markup) in a field?  Using fmap.content (I am assuming this comes from 
Tika) stores the extracted text of the document in a field, but not the 
markup.  I want the whole unaltered document.
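
The only workaround I can think of so far is to read the file myself and 
pass the raw markup as a literal field on the same extract request.  A 
rough SolrJ sketch (untested; "html_raw" is just a made-up field name that 
would need to be a stored field in my schema):

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class RawHtmlIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        File html = new File("page.html");

        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
        req.addFile(html);  // Tika extracts text into the mapped content field
        req.setParam("literal.id", "page-1");
        // pass the unaltered markup as a literal so it is stored verbatim
        req.setParam("literal.html_raw",
                FileUtils.readFileToString(html, "UTF-8"));

        solr.request(req);
        solr.commit();
    }
}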


Is this possible to do directly, without a workaround like that?

thanks

--mike


XPathEntityProcessor and ExtractingRequestHandler

2011-12-07 Thread Michael Kelleher
Can I use an XPathEntityProcessor in conjunction with an 
ExtractingRequestHandler?  Also, is the scripting language that 
XPathEntityProcessor supports just ECMA/JavaScript?


Or is XPathEntityProcessor only supported for use in conjunction with the 
DataImportHandler?


Thanks.


Re: Document Processing

2011-12-05 Thread Michael Kelleher

Hello Erik,

I will take a look at both:

org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor

and

org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor


and figure out what I need to extend to handle processing the way I am 
looking for.  I am assuming that "component" configuration is handled in a 
standard way, so that I can configure my new UpdateProcessor the same way 
I would configure any other UpdateProcessor "component"; is that right?
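
In case it's useful to anyone else, this is the shape I'm expecting the 
extension to take (a sketch only; the class and field names here are 
hypothetical):

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Sketch only: class and field names are hypothetical.
public class BiblioExtractProcessorFactory extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new BiblioExtractProcessor(next);
    }

    static class BiblioExtractProcessor extends UpdateRequestProcessor {

        BiblioExtractProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            Object html = doc.getFieldValue("content");
            if (html != null) {
                // the real extraction (XPath over the converted XML)
                // would go here; setField just shows the shape
                doc.setField("bibliographic_data", html.toString());
            }
            super.processAdd(cmd);  // hand the document down the chain
        }
    }
}

If the configuration really is standard, the factory would just be listed 
in an updateRequestProcessorChain in solrconfig.xml like any other 
processor.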


Thanks for the suggestion.


One more question: given that I am probably going to convert the HTML to 
XML so that I can use XPath expressions to "extract" my content, do you 
think this kind of processing will overload Solr?  This Solr instance will 
be used solely for indexing, and will only ever have a single ManifoldCF 
crawling job feeding it documents at any one time.
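
For concreteness, the conversion step I have in mind looks roughly like 
this (a sketch; TagSoup is one option for getting well-formed XML out of 
real-world HTML, and the file name is illustrative):

import java.io.FileReader;

import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;

public class HtmlToXPath {
    public static void main(String[] args) throws Exception {
        // TagSoup turns real-world HTML into well-formed (XHTML) SAX
        // events; an identity transform builds a DOM from them.
        SAXSource source = new SAXSource(new Parser(),
                new InputSource(new FileReader("page.html")));
        DOMResult dom = new DOMResult();
        TransformerFactory.newInstance().newTransformer()
                .transform(source, dom);

        // Full XPath 1.0 applies here, including '//' and attribute
        // predicates.  local-name() sidesteps the XHTML namespace
        // that TagSoup puts on every element.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String biblio = (String) xpath.evaluate(
                "//*[local-name()='div'][@class='bibliographicData']",
                dom.getNode(), XPathConstants.STRING);
        System.out.println(biblio);
    }
}

My hope is that since only one crawl job feeds this instance at a time, 
the extra DOM building per document will be tolerable.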


--mike


Document Processing

2011-12-05 Thread Michael Kelleher
I am crawling a bunch of HTML pages within a site (using ManifoldCF) that 
will be sent to Solr for indexing.  I want to extract some content out of 
the pages, with each piece of content stored as its own field BEFORE 
indexing in Solr.


My guess would be that I should use a document processing pipeline in Solr, 
like UIMA or something similar.


What would be the best way of handling this kind of processing?  Would it 
be preferable to use a document processing pipeline such as OpenPipe, 
UIMA, etc.?  Should this be handled externally, or would the 
DataImportHandler suffice?


The Solr server used for this will be dedicated to indexing, and the 
"submit" jobs from the crawler will be very controlled and not high volume 
after the initial crawl.


thanks.