Re: [fcrepo-user] GSearch feeding invalid characters to SOLR via SOAP? (was: Does Islandora 7.x index FULL_TEXT DS from PDF SP by default?)

Gert Schmeltz Pedersen Tue, 19 Mar 2013 10:25:39 -0700

You are right, that takes care of that character (by the way, 0xc and &#12; are 
one and the same: the character with integer value 12). However, if you place 
the code there, the entire contents of every index document are checked for 
that character. It is better to place it where it applies only to the result 
of text extraction from PDFBox, and there it should cover all the characters 
that would be illegal, namely decimal codes 0-31 except 9 (tab), 10 (line 
feed), and 13 (carriage return). That is why I have this code for case (i) at 
the end of dk.defxws.fedoragsearch.server.TransformerToText.getTextFromPDF():


//      put space instead of characters not allowed in the indexing stylesheet
        char c;
        for (int i=0; i<docText.length(); i++) {
                c = docText.charAt(i);
                if (c < 32 && c != 9 && c != 10 && c != 13) {
                        if (logger.isDebugEnabled())
                                logger.debug("getTextFromPDF index="+i+" char="+c+" set to 32");
                        docText.replace(i, i+1, " ");
                }
        }

Placing the same code at the end of 
dk.defxws.fedoragsearch.server.TransformerToText.getFromTika() for case (ii) 
will apply it to all text extracted with Tika. That should probably be done in 
the next version of GSearch.
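
As a rough sketch only (this is not code from GSearch), the replacement loop above could be pulled into one small helper and called at the end of both extraction methods. The class name ControlCharFilter and its method names are hypothetical, and the sketch assumes the extracted text is held in a StringBuffer, as the replace(int, int, String) call above suggests.

```java
/** Hypothetical helper, not part of GSearch; all names here are assumptions. */
final class ControlCharFilter {

    private ControlCharFilter() {}

    /** True for characters below 32, except tab (9), line feed (10), and carriage return (13). */
    static boolean isIllegalControlChar(char c) {
        return c < 32 && c != 9 && c != 10 && c != 13;
    }

    /** Replaces each such character in place with a space; returns how many were replaced. */
    static int replaceIllegalControlChars(StringBuffer text) {
        int replaced = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isIllegalControlChar(text.charAt(i))) {
                text.setCharAt(i, ' ');
                replaced++;
            }
        }
        return replaced;
    }

    public static void main(String[] args) {
        // "\f" is the form feed character (0xc) that Solr rejected.
        StringBuffer docText = new StringBuffer("page one\fpage two");
        int n = replaceIllegalControlChars(docText);
        System.out.println(n + " replaced: " + docText);
    }
}
```

Returning the count keeps logging in the caller, so each extraction method could still write its own DEBUG line.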

You can switch between the cases in your foxmlToSolr.xslt by uncommenting case 
(i) and commenting out case (ii); that is, call exts:getDatastreamText() 
instead of exts:getDatastreamFromTika().
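
If you do keep a filter on the transformed index document, as in the OperationsImpl change quoted below, it could be generalized beyond the single "&#12;" entity. The following is only a sketch under assumptions: CharRefFilter and its methods are hypothetical names, not GSearch code. It uses java.util.regex to find decimal and hexadecimal numeric character references and replaces those that point at the illegal range (below 32, except 9, 10, and 13) with a space, where the quoted filter removes them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of GSearch; all names here are assumptions. */
final class CharRefFilter {

    // Matches decimal (&#12;) and hexadecimal (&#xC;) numeric character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    private CharRefFilter() {}

    /** Replaces references to illegal control characters with a space; others are kept. */
    static String replaceIllegalCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = isIllegal(m) ? " " : Matcher.quoteReplacement(m.group());
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    private static boolean isIllegal(Matcher m) {
        try {
            int code = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            return code < 32 && code != 9 && code != 10 && code != 13;
        } catch (NumberFormatException e) {
            return false; // value too large to parse; left untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(replaceIllegalCharRefs("<field>page one&#12;page two</field>"));
    }
}
```

This still scans the whole index document, so the remark above applies: filtering the extracted text itself touches less data.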

Gert


On 19/03/2013, at 14.08, Peter Murray wrote:

> Greetings, Gert!  Thanks for replying.
> 
> I'm using the supplied foxmlToSolr.xslt from the 2.5 distribution, which is 
> using (ii) -- the Tika call.  I dug in a little further and found that it was 
> not a literal 0x0C character that was causing the problem but a "&#12;" 
> entity.  If I add a filter for that in 
> gsearch/FgsSolr2/src/java/dk/defxws/fgssolr/OperationsImpl.java then the XML 
> document is added to the SOLR index without error:
> 
>       StringBuffer sb = (new GTransformer()).transform(
>                       xsltPath, 
>                       new StreamSource(foxmlStream),
>                       config.getURIResolver(indexName),
>                       params);
>       StringBuffer sb2 = new StringBuffer(sb.toString().replaceAll("&#12;", ""));
> 
> Not elegant, but it seems to be working.
> 
> 
> Peter
> 
> On Mar 19, 2013, at 7:08 AM, Gert Schmeltz Pedersen <[email protected]> 
> wrote:
>> 
>> Peter, I will try to clarify some things. 
>> 
>> As you know, GSearch uses PDFBox to extract text from PDF files. 
>> 
>> It is done from foxmlToSolr.xslt, calling either (i) 
>> exts:getDatastreamText() or (ii) exts:getDatastreamFromTika().
>> 
>> In case (i) GSearch calls PDFBox directly, and it replaces characters below 
>> space (including 0xc, excepting 0x9, 0xa, and 0xd) with space; you may see 
>> the replacements in DEBUG log lines.
>> 
>> In case (ii) GSearch calls Tika directly, which calls its internal PDFBox, 
>> and there is no replacement.
>> 
>> Tika was included in GSearch from version 2.4
>> 
>> I would like you to confirm that you get the illegal character, when you 
>> call case (ii), and not in case (i), please.
>> 
>> If confirmed, you may simply go on calling case (i). The next version of 
>> GSearch may include the character replacement in case (ii).
>> 
>> Gert
>> 
>> 
>> 
>> 
>> 
>> 
>> On 19/03/2013, at 02.48, Peter Murray wrote:
>> 
>>> This message started in [email protected], but appears to be a 
>>> more general problem with gsearch, so I'm also copying this to fedora-users.
>>> 
>>> On Mar 18, 2013, at 8:06 PM, Peter Murray <[email protected]> wrote:
>>>> Does the default configuration of GSearch for Islandora-7.x index the 
>>>> FULL_TEXT datastream of objects created by the PDF Solution Pack?  The 
>>>> search engine appears to index the metadata without fail.  I've even gone 
>>>> into the GSearch updateIndex web screen and updated all of the FOXML 
>>>> files.  I'm using the GSearch 2.5 (the version previous to the one 
>>>> released today) 'fgsconfig-basic-for-islandora.properties' updated with 
>>>> the passwords and locations specific to my setup.
>>> 
>>> 
>>> I've dug a little deeper on this, and am still coming up stymied.  It looks 
>>> like objects with PDFs are not getting indexed.  GSearch is showing this 
>>> error:
>>> 
>>> DEBUG 2013-03-18 21:34:09,583 (Config) insertSystemProperties 
>>> propertyValue=http://localhost:8080/solr
>>> DEBUG 2013-03-18 21:34:09,594 (OperationsImpl) closeIndexSearcher 
>>> indexName=FgsIndex
>>> DEBUG 2013-03-18 21:34:09,595 (OperationsImpl) closeIndexReader 
>>> indexName=FgsIndex docCount=45
>>> ERROR 2013-03-18 21:34:09,597 (UpdateListener) Unable to perform index 
>>> update due to Exception: Mon Mar 18 21:34:09 EDT 2013 Connection error (is 
>>> Solr running at http://localhost:8080/solr/update ?): java.io.IOException: 
>>> Server returned HTTP response code: 500 for URL: 
>>> http://localhost:8080/solr/update
>>> dk.defxws.fedoragsearch.server.errors.GenericSearchException: Mon Mar 18 
>>> 21:34:09 EDT 2013 Connection error (is Solr running at 
>>> http://localhost:8080/solr/update ?): java.io.IOException: Server returned 
>>> HTTP response code: 500 for URL: http://localhost:8080/solr/update
>>>       at dk.defxws.fgssolr.OperationsImpl.postData(OperationsImpl.java:653)
>>>       at dk.defxws.fgssolr.OperationsImpl.indexDoc(OperationsImpl.java:473)
>>>       at dk.defxws.fgssolr.OperationsImpl.fromPid(OperationsImpl.java:413)
>>> 
>>> 
>>> Which correlates to this SOLR error in catalina.out:
>>> 
>>> Mar 18, 2013 9:34:09 PM org.apache.solr.common.SolrException log
>>> SEVERE: [com.ctc.wstx.exc.WstxLazyException] 
>>> com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion 
>>> character (code 0xc) not a valid XML character
>>> at [row,col {unknown-source}]: [1668,5]
>>>       at 
>>> com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
>>>       at 
>>> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
>>>       at 
>>> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
>>>       at 
>>> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>>>       at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
>>>       at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
>>>       at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
>>> 
>>> 
>>> The discussions I'm seeing on Stack Exchange about the "…not a valid XML 
>>> character" point to XML that is being generated with characters that are 
>>> invalid in XML.  (In this case 0xC -- or "form feed" character.)
>>> 
>>> Before I start tracing around the guts of GSearch, is this sounding 
>>> familiar to anyone?
>>> 
>>> 
>>> Peter
> 
> 
> --
> Peter Murray
> Assistant Director, Technology Services Development
> LYRASIS
> [email protected]
> +1 678-235-2955
> 800.999.8558 x2955
> 
> 
> 
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_mar
> _______________________________________________
> Fedora-commons-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users

