Ah, excellent. Thank you for the additional clues, Gert, on the right place to filter out the illegal XML characters. I'd like to stick with Tika because it will extract text from a wider variety of document formats, so I've copied this code into the end of that function. It is working as expected.
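For anyone who finds this thread later, here is a minimal standalone sketch of the kind of filter described above. It is not the GSearch source: the class and method names (XmlTextCleaner, replaceIllegalControlChars) are hypothetical, and it assumes the extracted text is available as a CharSequence. The rule itself follows Gert's loop: every character below 32 except tab (9), line feed (10), and carriage return (13) is replaced with a space.

```java
// Hypothetical helper, not part of GSearch: names are for illustration only.
public final class XmlTextCleaner {

    private XmlTextCleaner() {
        // static utility, no instances
    }

    /**
     * Returns a copy of the text in which every control character below 0x20,
     * other than tab (0x09), line feed (0x0A), and carriage return (0x0D),
     * is replaced by a single space.
     */
    public static String replaceIllegalControlChars(CharSequence text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean illegal = c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
            out.append(illegal ? ' ' : c);
        }
        return out.toString();
    }
}
```

Calling XmlTextCleaner.replaceIllegalControlChars(text) on the extracted text before it reaches the indexing stylesheet would turn a form feed between two pages into a space. Building a new string rather than editing the buffer in place is just a choice made for the sketch.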
Best,
Peter

On Mar 19, 2013, at 1:24 PM, Gert Schmeltz Pedersen <[email protected]> wrote:
>
> You are right, that takes care of that character (btw 0xc and  are one
> and the same, the character with integer value 12). However, if you place the
> code there, then all contents of the index docs are checked for that
> character. Better place it, where it applies only to the result of text
> extraction from PDFBox, and there it should apply to all the characters that
> would be illegal, namely 00-31, except 09, 10, and 13. That is why I have
> this code for case (i) at the end of
> dk.defxws.fedoragsearch.server.TransformerToText.getTextFromPDF():
>
>     // put space instead of characters not allowed in the indexing stylesheet
>     char c;
>     for (int i=0; i<docText.length(); i++) {
>         c = docText.charAt(i);
>         if (c < 32 && c != 9 && c != 10 && c != 13) {
>             if (logger.isDebugEnabled())
>                 logger.debug("getTextFromPDF index="+i+" char="+c+" set to 32");
>             docText.replace(i, i+1, " ");
>         }
>     }
>
> The same code for case (ii) at the end of
> dk.defxws.fedoragsearch.server.TransformerToText.getFromTika() will apply to
> all text extractions with Tika. It should probably be done in the next
> version of GSearch.
>
> You can switch between the cases in your foxmlToSolr.xslt by uncommenting
> case (i) and commenting case (ii) that is, call exts:getDatastreamText()
> instead of exts:getDatastreamFromTika().
>
> Gert
>
>
> On 19/03/2013, at 14.08, Peter Murray wrote:
>
>> Greetings, Gert! Thanks for replying.
>>
>> I'm using the supplied foxmlToSolr.xslt from the 2.5 distribution, which is
>> using (ii) -- the Tika call. I dug in a little further and found that it
>> was not a literal 0x0C character that was causing the problem but a ""
>> entity.
>> If I add a filter for that in
>> gsearch/FgsSolr2/src/java/dk/defxws/fgsolr/OperationsImpl.java then the XML
>> document is added to the SOLR index without error:
>>
>>     StringBuffer sb = (new GTransformer()).transform(
>>             xsltPath,
>>             new StreamSource(foxmlStream),
>>             config.getURIResolver(indexName),
>>             params);
>>     StringBuffer sb2 = new StringBuffer(sb.toString().replaceAll("", ""));
>>
>> Not elegant, but it seems to be working.
>>
>>
>> Peter
>>
>> On Mar 19, 2013, at 7:08 AM, Gert Schmeltz Pedersen <[email protected]>
>> wrote:
>>>
>>> Peter, I will try to clarify some things.
>>>
>>> As you know, GSearch uses PDFBox to extract text from PDF files.
>>>
>>> It is done from foxmlToSolr.xslt, calling either (i)
>>> exts:getDatastreamText() or (ii) exts:getDatastreamFromTika().
>>>
>>> In case (i) GSearch calls PDFBox direct, and it replaces characters below
>>> space (including 0xc, excepting 0x9, 0xa, and 0xd) with space, you may see
>>> the replacements in DEBUG log lines.
>>>
>>> In case (ii) GSearch calls Tika direct, which calls its internal PDFBox,
>>> and there is no replacement.
>>>
>>> Tika was included in GSearch from version 2.4
>>>
>>> I would like you to confirm that you get the illegal character, when you
>>> call case (ii), and not in case (i), please.
>>>
>>> If confirmed, you may simply go on calling case (i). The next version of
>>> GSearch may include the character replacement in case (ii).
>>>
>>> Gert
>>>
>>>
>>> On 19/03/2013, at 02.48, Peter Murray wrote:
>>>
>>>> This message started in [email protected], but appears to be a
>>>> more general problem with gsearch, so I'm also copying this to
>>>> fedora-users.
>>>>
>>>> On Mar 18, 2013, at 8:06 PM, Peter Murray <[email protected]> wrote:
>>>>> Does default configuration of GSearch for Islandora-7.x index the
>>>>> FULL_TEXT datastream of objects created by the PDF Solution Pack? The
>>>>> search engine appears to index the metadata without fail.
>>>>> I've even gone
>>>>> into the GSearch updateIndex web screen and updated all of the FOXML
>>>>> files. I'm using the GSearch 2.5 (the version previous to the one
>>>>> released today) 'fgsconfig-basic-for-islandora.properties' updated with
>>>>> the passwords and locations specific to my setup.
>>>>
>>>>
>>>> I've dug a little deeper on this, and am still coming up stymied. It
>>>> looks like objects with PDFs are not getting indexed. GSearch is showing
>>>> this error:
>>>>
>>>> DEBUG 2013-03-18 21:34:09,583 (Config) insertSystemProperties
>>>> propertyValue=http://localhost:8080/solr
>>>> DEBUG 2013-03-18 21:34:09,594 (OperationsImpl) closeIndexSearcher
>>>> indexName=FgsIndex
>>>> DEBUG 2013-03-18 21:34:09,595 (OperationsImpl) closeIndexReader
>>>> indexName=FgsIndex docCount=45
>>>> ERROR 2013-03-18 21:34:09,597 (UpdateListener) Unable to perform index
>>>> update due to Exception: Mon Mar 18 21:34:09 EDT 2013 Connection error (is
>>>> Solr running at http://localhost:8080/solr/update ?): java.io.IOException:
>>>> Server returned HTTP response code: 500 for URL:
>>>> http://localhost:8080/solr/update
>>>> dk.defxws.fedoragsearch.server.errors.GenericSearchException: Mon Mar 18
>>>> 21:34:09 EDT 2013 Connection error (is Solr running at
>>>> http://localhost:8080/solr/update ?): java.io.IOException: Server returned
>>>> HTTP response code: 500 for URL: http://localhost:8080/solr/update
>>>>     at dk.defxws.fgssolr.OperationsImpl.postData(OperationsImpl.java:653)
>>>>     at dk.defxws.fgssolr.OperationsImpl.indexDoc(OperationsImpl.java:473)
>>>>     at dk.defxws.fgssolr.OperationsImpl.fromPid(OperationsImpl.java:413)
>>>>
>>>>
>>>> Which correlates to this SOLR error in catalina.out:
>>>>
>>>> Mar 18, 2013 9:34:09 PM org.apache.solr.common.SolrException log
>>>> SEVERE: [com.ctc.wstx.exc.WstxLazyException]
>>>> com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion
>>>> character (code 0xc) not a valid XML character
>>>> at [row,col {unknown-source}]:
>>>> [1668,5]
>>>>     at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
>>>>     at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
>>>>     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
>>>>     at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>>>>     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
>>>>     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
>>>>     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
>>>>
>>>>
>>>> The discussions I'm seeing on Stack Exchange about the "…not a valid XML
>>>> character" point to XML that is being generated with characters that are
>>>> invalid in XML. (In this case 0xC -- or "form feed" character.)
>>>>
>>>> Before I start tracing around the guts of GSearch, is this sounding
>>>> familiar to anyone?
>>>>
>>>>
>>>> Peter

--
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
[email protected]
+1 678-235-2955
800.999.8558 x2955

_______________________________________________
Fedora-commons-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
