Yes, Tika extracts text and metadata from datastreams according to their 
mimetypes, and it does not try to extract text from image files, just metadata.
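
Since extraction depends on the type of the datastream, it may be worth checking whether the bytes of a problem file really look like a JPEG. The following is a minimal sketch in Python, not part of gsearch or Tika; the function names are made up for illustration, and it only tests the three-byte signature (FF D8 FF) that JPEG files start with, which is far less than what Tika's own detection does.

```python
# Hypothetical helper, not part of gsearch or Tika.
# JPEG files begin with the SOI marker (FF D8) followed by the
# first byte of the next marker (FF).
JPEG_SIGNATURE = b"\xff\xd8\xff"


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if the byte string starts with the JPEG signature."""
    return data[:3] == JPEG_SIGNATURE


def file_looks_like_jpeg(path: str) -> bool:
    """Read only the first three bytes of the file and check them."""
    with open(path, "rb") as handle:
        return looks_like_jpeg(handle.read(3))
```

A file that fails this check while being stored as image/jpeg would be a candidate for closer inspection.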

To find out why you get the SolrException, please run updateIndex fromPid 
on that Fedora object with debug logging enabled, then send me all the gsearch 
log lines from that operation, together with the FOXML record and the XSLT 
indexing stylesheet in use.
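
The SolrException quoted further down in this thread complains about a NULL character (unicode 0), which XML 1.0 does not allow. As an aid to reading the logs, here is a minimal sketch in Python, not gsearch code, that locates and strips characters outside the XML 1.0 character range from a piece of extracted text. The function names are hypothetical, and whether such a filter belongs in the indexing path at all is an assumption that the log lines would have to confirm.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text: str) -> list:
    """Return (offset, code point in hex) for each character XML 1.0 forbids."""
    return [(m.start(), hex(ord(m.group())))
            for m in _INVALID_XML_CHARS.finditer(text)]


def strip_invalid_xml_chars(text: str) -> str:
    """Drop every character that an XML 1.0 parser would reject."""
    return _INVALID_XML_CHARS.sub("", text)
```

Running find_invalid_xml_chars over the text extracted from the failing datastream would show whether a NUL is present and at which offset.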

Gert


On 24/07/2013, at 15.03, [email protected] wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> It uses Apache Tika, and to my understanding, extracts whatever Tika can 
> extract. But I invite comment from Gert so we can be sure about that.
> 
> If you mean the Java source files by which Tika extraction is made available 
> as an XSLT extension function, they are here:
> 
> https://github.com/fcrepo/gsearch/blob/master/FedoraGenericSearch/src/java/dk/defxws/fedoragsearch/server/GenericOperationsImpl.java
> 
> and for text extraction, here:
> 
> https://github.com/fcrepo/gsearch/blob/master/FedoraGenericSearch/src/java/dk/defxws/fedoragsearch/server/TransformerToText.java
> 
> - ---
> A. Soroka
> The University of Virginia Library
> 
> On Jul 24, 2013, at 8:49 AM, Alistair Young wrote:
> 
>> I can see how it's useful, but with it in, I have a jpeg file that can't
>> be indexed. What sort of technical assertions does it extract/infer? I
>> could see if there's something strange in the image file.
>> 
>> Alternately, what's the source file and I'll have a look...
>> 
>> Alistair
>> 
>> -- 
>> mov eax,1
>> mov ebx,0
>> int 80h
>> 
>> 
>> 
>> 
>> On 24/07/2013 13:42, "[email protected]" <[email protected]> wrote:
>> 
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>> 
>>> I was one of the people who instigated Gert to add that functionality. The
>>> motivation is to be able to extract technical assertions about binary
>>> datastreams and use them in indexing. It's not extracting content from
>>> images, although it could extract content from PDF files or other
>>> text-containing formats.
>>> 
>>> On perhaps a more useful note, you should definitely expect to alter the
>>> default indexing stylesheets, or even better, to create your own that are
>>> suited to your particular purposes.
>>> 
>>> - ---
>>> A. Soroka
>>> The University of Virginia Library
>>> 
>>> On Jul 24, 2013, at 8:32 AM, Alistair Young wrote:
>>> 
>>>> sorted it by removing the Apache Tika extraction from:
>>>> 
>>>> WEB-INF/classes/fgsconfigFinal/index/FgsIndex/foxmlToSolrGenerated.xslt
>>>> 
>>>> it seems it extracts the content and tries to index it. Not sure why it
>>>> would want to extract the content of an image but when it does it causes
>>>> Solr to fail to index the resource:
>>>> 
>>>> SEVERE: org.apache.solr.common.SolrException: Illegal character (NULL,
>>>> unicode 0) encountered: not valid in any content
>>>> 
>>>> It seems to think that only some jpg files are not jpg files.
>>>> 
>>>> Alistair
>>>> 
>>>> -- 
>>>> mov eax,1
>>>> mov ebx,0
>>>> int 80h
>>>> 
>>>> From: Alistair Young <[email protected]>
>>>> Reply-To: "Support and info exchange list for Fedora users."
>>>> <[email protected]>
>>>> Date: Wednesday, 24 July 2013 11:03
>>>> To: "Support and info exchange list for Fedora users."
>>>> <[email protected]>
>>>> Subject: Re: [fcrepo-user] Does gsearch index content with solr?
>>>> 
>>>> sorry should have mentioned, it's the content datastream, i.e.
>>>> image/jpeg
>>>> 
>>>> Alistair
>>>> 
>>>> -- 
>>>> mov eax,1
>>>> mov ebx,0
>>>> int 80h
>>>> 
>>>> From: Alistair Young <[email protected]>
>>>> Reply-To: "Support and info exchange list for Fedora users."
>>>> <[email protected]>
>>>> Date: Wednesday, 24 July 2013 10:59
>>>> To: "Support and info exchange list for Fedora users."
>>>> <[email protected]>
>>>> Subject: [fcrepo-user] Does gsearch index content with solr?
>>>> 
>>>> I have a weird problem. I dropped a foxml file into
>>>> FgsConfig/indexingXsltGenerator/foxml and configured etc., but certain
>>>> files, when uploaded, cause Solr to crash:
>>>> 
>>>> SEVERE: org.apache.solr.common.SolrException: Illegal character (NULL,
>>>> unicode 0) encountered: not valid in any content
>>>> 
>>>> If I don't include datastream in the foxml it doesn't cause the crash,
>>>> i.e. remove this:
>>>> 
>>>> <foxml:datastream ID="AUDIT" STATE="A" CONTROL_GROUP="X"
>>>> VERSIONABLE="false">
>>>> 
>>>> Should the foxml used to configure gsearch only contain 'metadata',
>>>> i.e. DC, RDF etc and not datastreams?
>>>> 
>>>> thanks,
>>>> 
>>>> Alistair
>>>> 
>>>> 
>>>> ------------------------------------------------------------------------------
>>>> See everything from the browser to the database with AppDynamics
>>>> Get end-to-end visibility with application monitoring from AppDynamics
>>>> Isolate bottlenecks and diagnose root cause in seconds.
>>>> Start your free trial of AppDynamics Pro today!
>>>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>>>> _______________________________________________
>>>> Fedora-commons-users mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>>> 
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
>>> Comment: GPGTools - http://gpgtools.org
>>> 
>>> iQEcBAEBAgAGBQJR78vHAAoJEATpPYSyaoIk8dsIALihgJB0b4OABcOcOnk2qthk
>>> 79JqHouayvOFwTNMHsHZMIPXQ9KlD7h/zrHVYPPOqXV8fvNb3+EeQEal5WJxs4Z3
>>> mMevFpEpBlOWUOBAiEqayNNfnxNCGQ3ARCRXNzeiaheM43ouFCluOGkX9p3fjqSV
>>> qq6QG862vDFvYF69rMH1NiFIUIA/QP8w/K/QzyI8qoblrzWCX2LmQ8NaH5b0oN1j
>>> Nb0NXIQv+XOVJZeHFvbHNEzGMGMEWHKs2QsZ1auirOKaO3ccV74+gVTuvDkmmuXL
>>> VjQQoxNBTqbkhSpoDsWPCkHE+fVGuWyFS/ffJQ/0heX1rWOkiOFgJhhGuwJOl2Y=
>>> =s4aM
>>> -----END PGP SIGNATURE-----
>>> 
>>> 
>> 
>> 
>> 
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG/MacGPG2 v2.0.19 (Darwin)
> Comment: GPGTools - http://gpgtools.org
> 
> iQEcBAEBAgAGBQJR79CgAAoJEATpPYSyaoIkplQH/RU81U3ekHrGhHDMf6Qn+R5k
> aiRSat1jZAQvCgND12GYmWywn9ap4mouDOiN8b4o+881HUUVClcFpgGueQQ3eK3d
> VCyhmWs4inO2rMz8RTNYWDwYfvBAB9qk4Ji6gSj2bM+VnTV6F64LuJRnhToqbVl+
> 3cLyTZwAFCgb9GHUuo8jPYomCFpSMKvA/Ohc5z5DXvw9HnHVF2AD2pM/3i5wTl84
> zJvgtK/SWCD6HvBZwQbUmXTne9O6h8hHMEZOTG5szxDyQhFAmj4cQChXXxG2u+sI
> Z5XZ7Ook43A/iVVM/0XoP7bwoM/uaUPpjlg0iAI0Ekk60BV+0InmCRgKVtwBY+Q=
> =GcmN
> -----END PGP SIGNATURE-----
> 

