Greetings, Gert! Thanks for replying.
I'm using the supplied foxmlToSolr.xslt from the 2.5 distribution, which is
using (ii) -- the Tika call. I dug in a little further and found that it was
not a literal 0x0C character that was causing the problem but a "&#12;" entity.
If I add a filter for that in
gsearch/FgsSolr2/src/java/dk/defxws/fgssolr/OperationsImpl.java then the XML
document is added to the SOLR index without error:

    StringBuffer sb = (new GTransformer()).transform(
            xsltPath,
            new StreamSource(foxmlStream),
            config.getURIResolver(indexName),
            params);
    // Strip the form-feed character entity that Solr's XML parser rejects.
    StringBuffer sb2 = new StringBuffer(
            sb.toString().replaceAll("&#12;", ""));

Not elegant, but it seems to be working.
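
A more general version of that filter might look something like the sketch
below. It is only an illustration, not GSearch code: the class name
XmlCharFilter and the method stripInvalidXmlChars are hypothetical, and it
assumes that replacing the offending characters with a space (as Gert
describes for case (i)) is acceptable. It handles both numeric character
references (decimal or hex) and literal characters that fall outside the
XML 1.0 Char range.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of GSearch: a sketch of a broader filter.
final class XmlCharFilter {

    // Numeric character references: hex (&#xC;) in group 1, decimal (&#12;) in group 2.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String stripInvalidXmlChars(String xml) {
        // Pass 1: replace references that expand to an invalid character.
        Matcher m = NUMERIC_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            try {
                int cp = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                replacement = isValidXmlChar(cp) ? m.group() : " ";
            } catch (NumberFormatException e) {
                replacement = " "; // value too large to be any code point
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);

        // Pass 2: replace literal characters that are invalid in XML 1.0.
        StringBuilder cleaned = new StringBuilder(out.length());
        for (int i = 0; i < out.length(); ) {
            int cp = out.codePointAt(i);
            cleaned.appendCodePoint(isValidXmlChar(cp) ? cp : ' ');
            i += Character.charCount(cp);
        }
        return cleaned.toString();
    }
}
```

With something like that in place, the second statement of my patch would
become new StringBuffer(XmlCharFilter.stripInvalidXmlChars(sb.toString())).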
Peter
On Mar 19, 2013, at 7:08 AM, Gert Schmeltz Pedersen <[email protected]> wrote:
>
> Peter, I will try to clarify some things.
>
> As you know, GSearch uses PDFBox to extract text from PDF files.
>
> It is done from foxmlToSolr.xslt, calling either (i) exts:getDatastreamText()
> or (ii) exts:getDatastreamFromTika() .
>
> In case (i) GSearch calls PDFBox directly and replaces characters below
> space (including 0xc, but excepting 0x9, 0xa, and 0xd) with a space. You may
> see the replacements in DEBUG log lines.
>
> In case (ii) GSearch calls Tika directly, which calls its internal PDFBox,
> and there is no replacement.
>
> Tika was included in GSearch from version 2.4.
>
> Please confirm that you get the illegal character when you call case (ii)
> and not when you call case (i).
>
> If confirmed, you may simply go on calling case (i). The next version of
> GSearch may include the character replacement in case (ii).
>
> Gert
>
> On 19/03/2013, at 02.48, Peter Murray wrote:
>
>> This message started in [email protected], but appears to be a more
>> general problem with gsearch, so I'm also copying this to fedora-users.
>>
>> On Mar 18, 2013, at 8:06 PM, Peter Murray <[email protected]> wrote:
>>> Does the default configuration of GSearch for Islandora-7.x index the FULL_TEXT
>>> datastream of objects created by the PDF Solution Pack? The search engine
>>> appears to index the metadata without fail. I've even gone into the
>>> GSearch updateIndex web screen and updated all of the FOXML files. I'm
>>> using GSearch 2.5 (the version previous to the one released today) with
>>> 'fgsconfig-basic-for-islandora.properties' updated with the passwords and
>>> locations specific to my setup.
>>
>>
>> I've dug a little deeper on this, and am still coming up stymied. It looks
>> like objects with PDFs are not getting indexed. GSearch is showing this error:
>>
>> DEBUG 2013-03-18 21:34:09,583 (Config) insertSystemProperties
>> propertyValue=http://localhost:8080/solr
>> DEBUG 2013-03-18 21:34:09,594 (OperationsImpl) closeIndexSearcher
>> indexName=FgsIndex
>> DEBUG 2013-03-18 21:34:09,595 (OperationsImpl) closeIndexReader
>> indexName=FgsIndex docCount=45
>> ERROR 2013-03-18 21:34:09,597 (UpdateListener) Unable to perform index
>> update due to Exception: Mon Mar 18 21:34:09 EDT 2013 Connection error (is
>> Solr running at http://localhost:8080/solr/update ?): java.io.IOException:
>> Server returned HTTP response code: 500 for URL:
>> http://localhost:8080/solr/update
>> dk.defxws.fedoragsearch.server.errors.GenericSearchException: Mon Mar 18
>> 21:34:09 EDT 2013 Connection error (is Solr running at
>> http://localhost:8080/solr/update ?): java.io.IOException: Server returned
>> HTTP response code: 500 for URL: http://localhost:8080/solr/update
>> at dk.defxws.fgssolr.OperationsImpl.postData(OperationsImpl.java:653)
>> at dk.defxws.fgssolr.OperationsImpl.indexDoc(OperationsImpl.java:473)
>> at dk.defxws.fgssolr.OperationsImpl.fromPid(OperationsImpl.java:413)
>>
>>
>> Which correlates to this SOLR error in catalina.out:
>>
>> Mar 18, 2013 9:34:09 PM org.apache.solr.common.SolrException log
>> SEVERE: [com.ctc.wstx.exc.WstxLazyException]
>> com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion
>> character (code 0xc) not a valid XML character
>> at [row,col {unknown-source}]: [1668,5]
>> at
>> com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
>> at
>> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
>> at
>> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
>> at
>> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>> at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
>> at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
>> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
>>
>>
>> The discussions I'm seeing on Stack Exchange about the "…not a valid XML
>> character" point to XML that is being generated with characters that are
>> invalid in XML. (In this case, 0xC, the "form feed" character.)
>>
>> Before I start tracing around the guts of GSearch, is this sounding familiar
>> to anyone?
>>
>>
>> Peter
--
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
[email protected]
+1 678-235-2955
800.999.8558 x2955