Peter, I will try to clarify some things.
As you know, GSearch uses PDFBox to extract text from PDF files.
This is done from foxmlToSolr.xslt by calling either (i) exts:getDatastreamText()
or (ii) exts:getDatastreamFromTika().
In case (i) GSearch calls PDFBox directly and replaces characters below space
(including 0xc, but not 0x9, 0xa, and 0xd) with a space. You can see the
replacements in the DEBUG log lines.
In case (ii) GSearch calls Tika directly, which calls its internal PDFBox, and
no replacement is made.
Tika has been included in GSearch since version 2.4.
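
To illustrate the replacement described for case (i), here is a minimal Java
sketch. It is not the GSearch source: the class and method names
(ControlCharCleaner, replaceControlChars) are hypothetical, and the rule is
simply the one stated above, assuming "below space" means code points
under 0x20.

```java
class ControlCharCleaner {

    // Replace every character below space (0x20) with a space,
    // except tab (0x9), line feed (0xa) and carriage return (0xd).
    // Hypothetical helper for illustration only, not GSearch code.
    static String replaceControlChars(String text) {
        StringBuilder cleaned = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = c >= 0x20 || c == 0x9 || c == 0xa || c == 0xd;
            cleaned.append(keep ? c : ' ');
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        // A form feed (0xc) between pages, as PDF text extraction may emit.
        String extracted = "page one\fpage two\n";
        System.out.print(replaceControlChars(extracted));
    }
}
```

Applying something like this to the extracted text before it is placed in the
Solr update document would keep 0xc out of the XML, which is the character
Solr rejects in the error quoted below.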
Could you please confirm that you get the illegal character when you call
case (ii), and not when you call case (i)?
If so, you may simply continue calling case (i). The next version of
GSearch may include the character replacement in case (ii).
Gert
On 19/03/2013, at 02.48, Peter Murray wrote:
> This message started in [email protected], but appears to be a more
> general problem with gsearch, so I'm also copying this to fedora-users.
>
> On Mar 18, 2013, at 8:06 PM, Peter Murray <[email protected]> wrote:
>> Does the default configuration of GSearch for Islandora-7.x index the FULL_TEXT
>> datastream of objects created by the PDF Solution Pack? The search engine
>> appears to index the metadata without fail. I've even gone into the GSearch
>> updateIndex web screen and updated all of the FOXML files. I'm using the
>> GSearch 2.5 (the version previous to the one released today)
>> 'fgsconfig-basic-for-islandora.properties' updated with the passwords and
>> locations specific to my setup.
>
>
> I've dug a little deeper on this, and am still coming up stymied. It looks
> like objects with PDFs are not getting indexed. GSearch is showing this error:
>
> DEBUG 2013-03-18 21:34:09,583 (Config) insertSystemProperties
> propertyValue=http://localhost:8080/solr
> DEBUG 2013-03-18 21:34:09,594 (OperationsImpl) closeIndexSearcher
> indexName=FgsIndex
> DEBUG 2013-03-18 21:34:09,595 (OperationsImpl) closeIndexReader
> indexName=FgsIndex docCount=45
> ERROR 2013-03-18 21:34:09,597 (UpdateListener) Unable to perform index update
> due to Exception: Mon Mar 18 21:34:09 EDT 2013 Connection error (is Solr
> running at http://localhost:8080/solr/update ?): java.io.IOException: Server
> returned HTTP response code: 500 for URL: http://localhost:8080/solr/update
> dk.defxws.fedoragsearch.server.errors.GenericSearchException: Mon Mar 18
> 21:34:09 EDT 2013 Connection error (is Solr running at
> http://localhost:8080/solr/update ?): java.io.IOException: Server returned
> HTTP response code: 500 for URL: http://localhost:8080/solr/update
> at dk.defxws.fgssolr.OperationsImpl.postData(OperationsImpl.java:653)
> at dk.defxws.fgssolr.OperationsImpl.indexDoc(OperationsImpl.java:473)
> at dk.defxws.fgssolr.OperationsImpl.fromPid(OperationsImpl.java:413)
>
>
> This correlates with the following Solr error in catalina.out:
>
> Mar 18, 2013 9:34:09 PM org.apache.solr.common.SolrException log
> SEVERE: [com.ctc.wstx.exc.WstxLazyException]
> com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion
> character (code 0xc) not a valid XML character
> at [row,col {unknown-source}]: [1668,5]
> at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
> at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
> at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
> at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
>
>
> The discussions I'm seeing on Stack Exchange about the "…not a valid XML
> character" point to XML that is being generated with characters that are
> invalid in XML. (In this case 0xC -- or "form feed" character.)
>
> Before I start tracing around the guts of GSearch, is this sounding familiar
> to anyone?
>
>
> Peter
> --
> Peter Murray
> Assistant Director, Technology Services Development
> LYRASIS
> [email protected]
> +1 678-235-2955
> 800.999.8558 x2955
>
>
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_mar
> _______________________________________________
> Fedora-commons-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users