Re: [fcrepo-user] GSearch feeding invalid characters to SOLR via SOAP? (was: Does Islandora 7.x index FULL_TEXT DS from PDF SP by default?)

Peter Murray Tue, 19 Mar 2013 06:10:19 -0700

Greetings, Gert!  Thanks for replying.

I'm using the supplied foxmlToSolr.xslt from the 2.5 distribution, which 
uses option (ii), the Tika call.  I dug in a little further and found that 
the problem was not a literal 0x0C character but a "&#12;" entity.  If I add 
a filter for that in 
gsearch/FgsSolr2/src/java/dk/defxws/fgssolr/OperationsImpl.java then the XML 
document is added to the SOLR index without error:


        StringBuffer sb = (new GTransformer()).transform(
                        xsltPath,
                        new StreamSource(foxmlStream),
                        config.getURIResolver(indexName),
                        params);
        // Strip the form feed character reference that Solr's XML parser rejects.
        StringBuffer sb2 = new StringBuffer(sb.toString().replace("&#12;", ""));

Not elegant, but it seems to be working.
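
The filter above only catches the one decimal reference "&#12;". A slightly more general version might look like the sketch below, which is an illustration and not GSearch code: the class name XmlCharFilter and the method sanitize are hypothetical. It assumes the transformer output is available as a String, treats any numeric character reference (decimal or hex) that points outside the XML 1.0 Char range as illegal, and replaces it with a space, which mirrors the replacement Gert describes for case (i). Literal illegal characters are handled the same way.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of GSearch. */
final class XmlCharFilter {

    // Numeric character references: decimal (&#12;) or hexadecimal (&#xC;).
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    private XmlCharFilter() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces illegal character references and illegal literal characters with a space. */
    static String sanitize(String xml) {
        // Pass 1: character references such as &#12; or &#xC;.
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean valid;
            try {
                int cp = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                valid = isValidXmlChar(cp);
            } catch (NumberFormatException e) {
                valid = false; // number too large to be a code point
            }
            m.appendReplacement(out,
                    valid ? Matcher.quoteReplacement(m.group()) : " ");
        }
        m.appendTail(out);

        // Pass 2: literal characters (an unpaired surrogate also counts as illegal).
        String s = out.toString();
        StringBuilder clean = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            clean.appendCodePoint(isValidXmlChar(cp) ? cp : ' ');
            i += Character.charCount(cp);
        }
        return clean.toString();
    }
}
```

With a helper like this, the second line of the patch could become something like new StringBuffer(XmlCharFilter.sanitize(sb.toString())). Replacing with a space rather than deleting keeps words on either side of a form feed from running together in the indexed text.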


Peter

On Mar 19, 2013, at 7:08 AM, Gert Schmeltz Pedersen <[email protected]> wrote:
> 
> Peter, I will try to clarify some things. 
> 
> As you know, GSearch uses PDFBox to extract text from PDF files. 
> 
> It is done from foxmlToSolr.xslt, calling either (i) exts:getDatastreamText() 
> or (ii) exts:getDatastreamFromTika()  .
> 
> In case (i) GSearch calls PDFBox directly, and it replaces characters below 
> space (including 0xc, but not 0x9, 0xa, or 0xd) with a space. You may see 
> the replacements in DEBUG log lines.
> 
> In case (ii) GSearch calls Tika directly, which calls its internal PDFBox, 
> and there is no replacement.
> 
> Tika has been included in GSearch since version 2.4.
> 
> Please confirm that you get the illegal character when you call case (ii) 
> and not when you call case (i).
> 
> If confirmed, you may simply go on calling case (i). The next version of 
> GSearch may include the character replacement in case (ii).
> 
> Gert
> 
> On 19/03/2013, at 02.48, Peter Murray wrote:
> 
>> This message started in [email protected], but appears to be a more 
>> general problem with gsearch, so I'm also copying this to fedora-users.
>> 
>> On Mar 18, 2013, at 8:06 PM, Peter Murray <[email protected]> wrote:
>>> Does the default configuration of GSearch for Islandora-7.x index the 
>>> FULL_TEXT datastream of objects created by the PDF Solution Pack?  The 
>>> search engine appears to index the metadata without fail.  I've even gone 
>>> into the GSearch updateIndex web screen and updated all of the FOXML 
>>> files.  I'm using the GSearch 2.5 (the version before the one released 
>>> today) 'fgsconfig-basic-for-islandora.properties' file, updated with the 
>>> passwords and locations specific to my setup.
>> 
>> 
>> I've dug a little deeper on this, and am still stymied.  It looks like 
>> objects with PDFs are not getting indexed.  GSearch is showing this error:
>> 
>> DEBUG 2013-03-18 21:34:09,583 (Config) insertSystemProperties propertyValue=http://localhost:8080/solr
>> DEBUG 2013-03-18 21:34:09,594 (OperationsImpl) closeIndexSearcher indexName=FgsIndex
>> DEBUG 2013-03-18 21:34:09,595 (OperationsImpl) closeIndexReader indexName=FgsIndex docCount=45
>> ERROR 2013-03-18 21:34:09,597 (UpdateListener) Unable to perform index update due to Exception: Mon Mar 18 21:34:09 EDT 2013 Connection error (is Solr running at http://localhost:8080/solr/update ?): java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8080/solr/update
>> dk.defxws.fedoragsearch.server.errors.GenericSearchException: Mon Mar 18 21:34:09 EDT 2013 Connection error (is Solr running at http://localhost:8080/solr/update ?): java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8080/solr/update
>>        at dk.defxws.fgssolr.OperationsImpl.postData(OperationsImpl.java:653)
>>        at dk.defxws.fgssolr.OperationsImpl.indexDoc(OperationsImpl.java:473)
>>        at dk.defxws.fgssolr.OperationsImpl.fromPid(OperationsImpl.java:413)
>> 
>> 
>> Which correlates to this SOLR error in catalina.out:
>> 
>> Mar 18, 2013 9:34:09 PM org.apache.solr.common.SolrException log
>> SEVERE: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0xc) not a valid XML character
>> at [row,col {unknown-source}]: [1668,5]
>>        at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
>>        at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
>>        at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
>>        at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>>        at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
>>        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
>>        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
>> 
>> 
>> The discussions I'm seeing on Stack Exchange about the "…not a valid XML 
>> character" error point to XML that is being generated with characters that 
>> are invalid in XML.  (In this case 0xC, the "form feed" character.)
>> 
>> Before I start tracing around the guts of GSearch, is this sounding familiar 
>> to anyone?
>> 
>> 
>> Peter


--
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
[email protected]
+1 678-235-2955
800.999.8558 x2955



_______________________________________________
Fedora-commons-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users