Ah, excellent.  Thank you for the additional clues, Gert, on the right place to 
filter out the illegal XML characters.  I'd like to stick with Tika because it 
will extract text from a wider variety of document formats, so I've copied this 
code into the end of that function.  It is working as expected.
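
For anyone finding this in the archives, here is a minimal standalone sketch of
that filter. The class and method names are hypothetical and not part of
GSearch; in GSearch the same loop would sit at the end of getFromTika(), as
Gert describes below, and operate on the extracted text there.

```java
// Hypothetical helper for illustration only; not part of GSearch.
// Replaces characters below 0x20 (except tab, LF and CR) with a space,
// because XML 1.0 does not allow them in a document.
final class XmlTextSanitizer {

    static StringBuffer replaceIllegalControlChars(StringBuffer text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                text.setCharAt(i, ' ');
            }
        }
        return text;
    }

    public static void main(String[] args) {
        // \f is the form feed (0x0C) that PDF text extraction can emit
        // between pages.
        StringBuffer extracted = new StringBuffer("page one\fpage two\tend");
        System.out.println(replaceIllegalControlChars(extracted));
    }
}
```

This only covers the range discussed in this thread. XML 1.0 also excludes
U+FFFE, U+FFFF and unpaired surrogates, which this sketch does not touch.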

Best,


Peter

On Mar 19, 2013, at 1:24 PM, Gert Schmeltz Pedersen <[email protected]> wrote:
> 
> You are right, that takes care of that character (btw, 0xc and &#12; are one 
> and the same: the character with integer value 12). However, if you place the 
> code there, then the entire contents of every index document are checked for 
> that character. It is better to place it where it applies only to the result 
> of text extraction from PDFBox, and there it should apply to all the 
> characters that would be illegal, namely 0-31 except 9, 10, and 13. That is 
> why I have this code for case (i) at the end of 
> dk.defxws.fedoragsearch.server.TransformerToText.getTextFromPDF():
> 
> // put space instead of characters not allowed in the indexing stylesheet
> char c;
> for (int i = 0; i < docText.length(); i++) {
>     c = docText.charAt(i);
>     if (c < 32 && c != 9 && c != 10 && c != 13) {
>         if (logger.isDebugEnabled())
>             logger.debug("getTextFromPDF index=" + i + " char=" + c + " set to 32");
>         docText.replace(i, i + 1, " ");
>     }
> }
> 
> The same code for case (ii) at the end of 
> dk.defxws.fedoragsearch.server.TransformerToText.getFromTika()  will apply to 
> all text extractions with Tika. It should probably be done in the next 
> version of GSearch.
> 
> You can switch between the cases in your foxmlToSolr.xslt by uncommenting 
> case (i) and commenting out case (ii); that is, call exts:getDatastreamText() 
> instead of exts:getDatastreamFromTika().
> 
> Gert
> 
> 
> On 19/03/2013, at 14.08, Peter Murray wrote:
> 
>> Greetings, Gert!  Thanks for replying.
>> 
>> I'm using the supplied foxmlToSolr.xslt from the 2.5 distribution, which is 
>> using (ii) -- the Tika call.  I dug in a little further and found that it 
>> was not a literal 0x0C character that was causing the problem but a "&#12;" 
>> entity.  If I add a filter for that in 
>> gsearch/FgsSolr2/src/java/dk/defxws/fgssolr/OperationsImpl.java then the XML 
>> document is added to the SOLR index without error:
>> 
>> StringBuffer sb = (new GTransformer()).transform(
>>         xsltPath,
>>         new StreamSource(foxmlStream),
>>         config.getURIResolver(indexName),
>>         params);
>> StringBuffer sb2 = new StringBuffer(sb.toString().replaceAll("&#12;", ""));
>> 
>> Not elegant, but it seems to be working.
>> 
>> 
>> Peter
>> 
>> On Mar 19, 2013, at 7:08 AM, Gert Schmeltz Pedersen <[email protected]> 
>> wrote:
>>> 
>>> Peter, I will try to clarify some things. 
>>> 
>>> As you know, GSearch uses PDFBox to extract text from PDF files. 
>>> 
>>> It is done from foxmlToSolr.xslt, calling either (i) 
>>> exts:getDatastreamText() or (ii) exts:getDatastreamFromTika()  .
>>> 
>>> In case (i) GSearch calls PDFBox directly and replaces characters below 
>>> space (including 0xc, but not 0x9, 0xa, and 0xd) with a space; you can see 
>>> the replacements in DEBUG log lines.
>>> 
>>> In case (ii) GSearch calls Tika directly, which calls its internal PDFBox, 
>>> and there is no replacement.
>>> 
>>> Tika was included in GSearch from version 2.4.
>>> 
>>> Could you please confirm that you get the illegal character when you call 
>>> case (ii) but not when you call case (i)?
>>> 
>>> If confirmed, you may simply go on calling case (i). The next version of 
>>> GSearch may include the character replacement in case (ii).
>>> 
>>> Gert
>>> 
>>> On 19/03/2013, at 02.48, Peter Murray wrote:
>>> 
>>>> This message started in [email protected], but appears to be a 
>>>> more general problem with gsearch, so I'm also copying this to 
>>>> fedora-users.
>>>> 
>>>> On Mar 18, 2013, at 8:06 PM, Peter Murray <[email protected]> wrote:
>>>>> Does the default configuration of GSearch for Islandora-7.x index the 
>>>>> FULL_TEXT datastream of objects created by the PDF Solution Pack?  The 
>>>>> search engine appears to index the metadata without fail.  I've even gone 
>>>>> into the GSearch updateIndex web screen and updated all of the FOXML 
>>>>> files.  I'm using GSearch 2.5 (the version previous to the one released 
>>>>> today) with 'fgsconfig-basic-for-islandora.properties' updated with the 
>>>>> passwords and locations specific to my setup.
>>>> 
>>>> 
>>>> I've dug a little deeper on this, and am still stymied.  It looks like 
>>>> objects with PDFs are not getting indexed.  GSearch is showing this 
>>>> error:
>>>> 
>>>> DEBUG 2013-03-18 21:34:09,583 (Config) insertSystemProperties 
>>>> propertyValue=http://localhost:8080/solr
>>>> DEBUG 2013-03-18 21:34:09,594 (OperationsImpl) closeIndexSearcher 
>>>> indexName=FgsIndex
>>>> DEBUG 2013-03-18 21:34:09,595 (OperationsImpl) closeIndexReader 
>>>> indexName=FgsIndex docCount=45
>>>> ERROR 2013-03-18 21:34:09,597 (UpdateListener) Unable to perform index 
>>>> update due to Exception: Mon Mar 18 21:34:09 EDT 2013 Connection error (is 
>>>> Solr running at http://localhost:8080/solr/update ?): java.io.IOException: 
>>>> Server returned HTTP response code: 500 for URL: 
>>>> http://localhost:8080/solr/update
>>>> dk.defxws.fedoragsearch.server.errors.GenericSearchException: Mon Mar 18 
>>>> 21:34:09 EDT 2013 Connection error (is Solr running at 
>>>> http://localhost:8080/solr/update ?): java.io.IOException: Server returned 
>>>> HTTP response code: 500 for URL: http://localhost:8080/solr/update
>>>>       at dk.defxws.fgssolr.OperationsImpl.postData(OperationsImpl.java:653)
>>>>       at dk.defxws.fgssolr.OperationsImpl.indexDoc(OperationsImpl.java:473)
>>>>       at dk.defxws.fgssolr.OperationsImpl.fromPid(OperationsImpl.java:413)
>>>> 
>>>> 
>>>> Which correlates to this SOLR error in catalina.out:
>>>> 
>>>> Mar 18, 2013 9:34:09 PM org.apache.solr.common.SolrException log
>>>> SEVERE: [com.ctc.wstx.exc.WstxLazyException] 
>>>> com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion 
>>>> character (code 0xc) not a valid XML character
>>>> at [row,col {unknown-source}]: [1668,5]
>>>>       at 
>>>> com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
>>>>       at 
>>>> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
>>>>       at 
>>>> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
>>>>       at 
>>>> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>>>>       at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
>>>>       at 
>>>> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
>>>>       at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
>>>> 
>>>> 
>>>> The discussions I'm seeing on Stack Exchange about the "…not a valid XML 
>>>> character" error point to XML that is being generated with characters 
>>>> that are invalid in XML.  (In this case 0xC, the "form feed" character.)
>>>> 
>>>> Before I start tracing around the guts of GSearch, is this sounding 
>>>> familiar to anyone?
>>>> 
>>>> 
>>>> Peter
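
The failure in the original report can likely be reproduced without Solr or
GSearch. The sketch below is an illustration under stated assumptions: the
class and method names are hypothetical, it uses the JDK's built-in XML parser
rather than the Woodstox parser seen in the Solr stack trace, and it replaces
an illegal character reference with a space where the one-line fix quoted
above removes it. It generalises that replaceAll("&#12;", "") to any decimal
or hexadecimal reference whose code point XML 1.0 does not allow.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of GSearch or Solr.
final class CharRefCheck {

    // Matches decimal (&#12;) and hexadecimal (&#xC;) character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    // The Char production of XML 1.0.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces references to characters that XML 1.0 forbids with a space.
    static String stripIllegalCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean hex = m.group(1) != null;
            String digits = hex ? m.group(1) : m.group(2);
            boolean legal;
            try {
                legal = isXmlChar(Integer.parseInt(digits, hex ? 16 : 10));
            } catch (NumberFormatException tooLarge) {
                legal = false;
            }
            m.appendReplacement(out,
                    legal ? Matcher.quoteReplacement(m.group()) : " ");
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns true if the JDK parser accepts the document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps them off stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<field name=\"text\">page one&#12;page two</field>";
        System.out.println(isWellFormed(doc));
        System.out.println(isWellFormed(stripIllegalCharRefs(doc)));
    }
}
```

The first check is expected to fail and the second to pass. Note that the
regular expression does not know about CDATA sections, where "&#12;" is
literal text, so this is a sketch for diagnosis rather than a general filter.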

--
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
[email protected]
+1 678-235-2955
800.999.8558 x2955



_______________________________________________
Fedora-commons-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
