Re: [fcrepo-user] GSearch feeding invalid characters to SOLR via SOAP? (was: Does Islandora 7.x index FULL_TEXT DS from PDF SP by default?)

Peter Murray Tue, 19 Mar 2013 19:11:35 -0700

Ah, excellent.  Thank you for the additional clues, Gert, on the right place to 
filter out the illegal XML characters.  I'd like to stick with Tika because it 
will extract text from a wider variety of document formats, so I've copied this 
code into the end of that function.  It is working as expected.
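
For anyone who wants to try the filter outside GSearch, here is a minimal
standalone sketch of the same idea. It assumes the extracted text is held in a
StringBuffer, as in the snippet quoted below. The class and method names
(ControlCharFilter, isIllegalControl, replaceIllegalControls) are hypothetical
and not part of GSearch.

```java
/** Standalone sketch of the control-character filter; names are hypothetical, not GSearch API. */
final class ControlCharFilter {

    /** True for characters below 32 other than tab (9), line feed (10), and carriage return (13). */
    static boolean isIllegalControl(char c) {
        return c < 32 && c != 9 && c != 10 && c != 13;
    }

    /** Overwrites each illegal control character with a space, in place; returns how many were replaced. */
    static int replaceIllegalControls(StringBuffer text) {
        int replaced = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isIllegalControl(text.charAt(i))) {
                text.setCharAt(i, ' ');
                replaced++;
            }
        }
        return replaced;
    }

    public static void main(String[] args) {
        // "\f" is the form feed (0xC) that PDF text extraction can leave between pages.
        StringBuffer docText = new StringBuffer("page one\fpage two\ttabbed\n");
        int n = replaceIllegalControls(docText);
        System.out.println(n + " replaced: " + docText);
    }
}
```

The sketch uses setCharAt rather than replace(i, i+1, " ") because each swap is
one character for one character; the effect on the text should be the same.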


Best,


Peter

On Mar 19, 2013, at 1:24 PM, Gert Schmeltz Pedersen <[email protected]> wrote:
> 
> You are right, that takes care of that character (btw, 0xc and &#12; are one
> and the same: the character with integer value 12). However, if you place the
> code there, then all contents of the index docs are checked for that
> character. It is better to place it where it applies only to the result of
> text extraction from PDFBox, and there it should apply to all the characters
> that would be illegal, namely 00-31, except 09, 10, and 13. That is why I have
> this code for case (i) at the end of
> dk.defxws.fedoragsearch.server.TransformerToText.getTextFromPDF():
> 
> //      put space instead of characters not allowed in the indexing stylesheet
>         char c;
>         for (int i = 0; i < docText.length(); i++) {
>             c = docText.charAt(i);
>             if (c < 32 && c != 9 && c != 10 && c != 13) {
>                 if (logger.isDebugEnabled())
>                     logger.debug("getTextFromPDF index=" + i + " char=" + c
>                             + " set to 32");
>                 docText.replace(i, i + 1, " ");
>             }
>         }
> 
> The same code for case (ii) at the end of 
> dk.defxws.fedoragsearch.server.TransformerToText.getFromTika()  will apply to 
> all text extractions with Tika. It should probably be done in the next 
> version of GSearch.
> 
> You can switch between the cases in your foxmlToSolr.xslt by uncommenting
> case (i) and commenting out case (ii); that is, call exts:getDatastreamText()
> instead of exts:getDatastreamFromTika().
> 
> Gert
> 
> 
> On 19/03/2013, at 14.08, Peter Murray wrote:
> 
>> Greetings, Gert!  Thanks for replying.
>> 
>> I'm using the supplied foxmlToSolr.xslt from the 2.5 distribution, which is 
>> using (ii) -- the Tika call.  I dug in a little further and found that it 
>> was not a literal 0x0C character that was causing the problem but a "&#12;" 
>> entity.  If I add a filter for that in 
>> gsearch/FgsSolr2/src/java/dk/defxws/fgssolr/OperationsImpl.java then the XML
>> document is added to the SOLR index without error:
>> 
>>      StringBuffer sb = (new GTransformer()).transform(
>>                      xsltPath, 
>>                      new StreamSource(foxmlStream),
>>                      config.getURIResolver(indexName),
>>                      params);
>>      StringBuffer sb2 = new StringBuffer(
>>                      sb.toString().replaceAll("&#12;", ""));
>> 
>> Not elegant, but it seems to be working.
>> 
>> 
>> Peter
>> 
>> On Mar 19, 2013, at 7:08 AM, Gert Schmeltz Pedersen <[email protected]> 
>> wrote:
>>> 
>>> Peter, I will try to clarify some things. 
>>> 
>>> As you know, GSearch uses PDFBox to extract text from PDF files. 
>>> 
>>> It is done from foxmlToSolr.xslt, calling either (i)
>>> exts:getDatastreamText() or (ii) exts:getDatastreamFromTika().
>>>
>>> In case (i) GSearch calls PDFBox directly and replaces characters below
>>> space (including 0xc, but not 0x9, 0xa, and 0xd) with a space; you can see
>>> the replacements in DEBUG log lines.
>>>
>>> In case (ii) GSearch calls Tika directly, which calls its internal PDFBox,
>>> and there is no replacement.
>>>
>>> Tika was included in GSearch from version 2.4.
>>> 
>>> Could you please confirm that you get the illegal character when you
>>> call case (ii), and not when you call case (i)?
>>> 
>>> If confirmed, you may simply go on calling case (i). The next version of 
>>> GSearch may include the character replacement in case (ii).
>>> 
>>> Gert
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 19/03/2013, at 02.48, Peter Murray wrote:
>>> 
>>>> This message started in [email protected], but appears to be a 
>>>> more general problem with gsearch, so I'm also copying this to 
>>>> fedora-users.
>>>> 
>>>> On Mar 18, 2013, at 8:06 PM, Peter Murray <[email protected]> wrote:
>>>>> Does the default configuration of GSearch for Islandora 7.x index the
>>>>> FULL_TEXT datastream of objects created by the PDF Solution Pack?  The
>>>>> search engine appears to index the metadata without fail.  I've even gone
>>>>> into the GSearch updateIndex web screen and updated all of the FOXML
>>>>> files.  I'm using GSearch 2.5 (the version previous to the one
>>>>> released today) with 'fgsconfig-basic-for-islandora.properties' updated
>>>>> with the passwords and locations specific to my setup.
>>>> 
>>>> 
>>>> I've dug a little deeper on this, and am still coming up stymied.  It
>>>> looks like objects with PDFs are not getting indexed.  GSearch is showing
>>>> this error:
>>>> 
>>>> DEBUG 2013-03-18 21:34:09,583 (Config) insertSystemProperties 
>>>> propertyValue=http://localhost:8080/solr
>>>> DEBUG 2013-03-18 21:34:09,594 (OperationsImpl) closeIndexSearcher 
>>>> indexName=FgsIndex
>>>> DEBUG 2013-03-18 21:34:09,595 (OperationsImpl) closeIndexReader 
>>>> indexName=FgsIndex docCount=45
>>>> ERROR 2013-03-18 21:34:09,597 (UpdateListener) Unable to perform index 
>>>> update due to Exception: Mon Mar 18 21:34:09 EDT 2013 Connection error (is 
>>>> Solr running at http://localhost:8080/solr/update ?): java.io.IOException: 
>>>> Server returned HTTP response code: 500 for URL: 
>>>> http://localhost:8080/solr/update
>>>> dk.defxws.fedoragsearch.server.errors.GenericSearchException: Mon Mar 18 
>>>> 21:34:09 EDT 2013 Connection error (is Solr running at 
>>>> http://localhost:8080/solr/update ?): java.io.IOException: Server returned 
>>>> HTTP response code: 500 for URL: http://localhost:8080/solr/update
>>>>       at dk.defxws.fgssolr.OperationsImpl.postData(OperationsImpl.java:653)
>>>>       at dk.defxws.fgssolr.OperationsImpl.indexDoc(OperationsImpl.java:473)
>>>>       at dk.defxws.fgssolr.OperationsImpl.fromPid(OperationsImpl.java:413)
>>>> 
>>>> 
>>>> Which correlates to this SOLR error in catalina.out:
>>>> 
>>>> Mar 18, 2013 9:34:09 PM org.apache.solr.common.SolrException log
>>>> SEVERE: [com.ctc.wstx.exc.WstxLazyException] 
>>>> com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion 
>>>> character (code 0xc) not a valid XML character
>>>> at [row,col {unknown-source}]: [1668,5]
>>>>       at 
>>>> com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
>>>>       at 
>>>> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
>>>>       at 
>>>> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
>>>>       at 
>>>> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>>>>       at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
>>>>       at 
>>>> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
>>>>       at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
>>>> 
>>>> 
>>>> The discussions I'm seeing on Stack Exchange about the "…not a valid XML
>>>> character" error point to XML that is being generated with characters that
>>>> are invalid in XML.  (In this case, 0xC, the "form feed" character.)
>>>> 
>>>> Before I start tracing around the guts of GSearch, is this sounding 
>>>> familiar to anyone?
>>>> 
>>>> 
>>>> Peter
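
The failure can also be reproduced outside Solr. The sketch below is only a
stand-in: it uses the JDK's built-in XML parser rather than the Woodstox
parser that appears in the stack trace above, and the class and method names
(FormFeedCheck, parses) are hypothetical. It checks whether a small document
containing the &#12; reference, a literal form feed, or a plain space is
accepted as well-formed. XML 1.0 allows only tab, LF, and CR below 0x20, so
the first two should be rejected whether the form feed arrives as a raw
character or as a character reference.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Well-formedness probe using the JDK parser; names are hypothetical, not Solr or GSearch API. */
final class FormFeedCheck {

    /** Returns true if the JDK's XML parser accepts the string as well-formed XML. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Parse errors, including illegal characters, end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string with a default parser configuration.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<field>page one&#12;page two</field>")); // form feed as a reference
        System.out.println(parses("<field>page one\fpage two</field>"));    // form feed as a raw character
        System.out.println(parses("<field>page one page two</field>"));     // after replacement with a space
    }
}
```

The JDK parser may also print its own "[Fatal Error]" line to standard error
for the rejected inputs; that output is separate from the boolean result.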

--
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
[email protected]
+1 678-235-2955
800.999.8558 x2955


