[ 
https://issues.apache.org/jira/browse/SOLR-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242120#comment-13242120
 ] 

Uwe Schindler commented on SOLR-3295:
-------------------------------------

bq. It's some obscure data format that tika can convert to plain text. I've 
never seen it, don't know what it is. Uwe filed a bug for Tika.

LOL. It's obscure indeed, especially for people outside the climate community. 
As a representative of PANGAEA, I of course know this format (it's something 
like a container for huge multi-dimensional arrays of numeric data). The funny 
thing is: the only textual parts are metadata about the file in the header; the 
data itself never contains any text. The UCAR netcdf library, on the other 
hand, is not able to handle streaming file input, so TIKA loads the whole file 
into memory... and OOMs in most cases for the files I have seen (climate users 
produce files of up to several gigabytes, if not terabytes, depending on the 
use case; see the examples here: [Kleinen, T et al. (2011): Holocene carbon cycle dynamics, links 
to model files. 
doi:10.1594/PANGAEA.758219|http://doi.pangaea.de/10.1594/PANGAEA.758219?format=html]).
 So I don't really see the use case for supporting it in Solr. But if we remove 
the JAR file, people who try to index .nc files will get a 
ClassNotFoundException. So ideally we should also remove the file format from 
TIKA's META-INF in that case, or instruct the loader to ignore it. I always use 
a "TIKA loader" component in my programs that configures TIKA to provide only 
the formats I am interested in, and then I remove all the useless JAR files 
(but that's not easy). I can provide code showing how to configure TIKA for a 
subset of formats, which makes it easier for us to control which libraries are 
needed, so users won't get ClassNotFound/InvalidClassFileFormat errors.
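
To illustrate the allow-list idea only: the sketch below is not the "TIKA loader" code offered above and does not call the Tika API. It assumes the parsers are registered one class name per line in a ServiceLoader-style file such as META-INF/services/org.apache.tika.parser.Parser, and it filters those lines against a set of wanted parsers. The class ParserAllowList, its filter method, and the choice of parser names in main are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: keeps only the allow-listed entries of a service file. */
final class ParserAllowList {

    /**
     * Filters ServiceLoader-style lines ('#' starts a comment, blank lines are
     * ignored) and returns the class names that are in the allow-list,
     * in their original order.
     */
    static List<String> filter(List<String> serviceLines, Set<String> allowed) {
        List<String> kept = new ArrayList<>();
        for (String line : serviceLines) {
            int hash = line.indexOf('#');
            String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!name.isEmpty() && allowed.contains(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Example content; a real service file lists many more parsers.
        List<String> lines = Arrays.asList(
            "# parsers shipped with the extraction contrib",
            "org.apache.tika.parser.pdf.PDFParser",
            "org.apache.tika.parser.html.HtmlParser",
            "org.apache.tika.parser.netcdf.NetCDFParser  # needs the netcdf JAR");
        // Assumption: only PDF and HTML are wanted, so the netcdf JAR can go.
        Set<String> allowed = new HashSet<>(Arrays.asList(
            "org.apache.tika.parser.pdf.PDFParser",
            "org.apache.tika.parser.html.HtmlParser"));
        for (String name : filter(lines, allowed)) {
            System.out.println(name);
        }
    }
}
```

The filtered list could then be written back as the service file, or used to instantiate only those parsers, so a request for an unlisted format is rejected as unsupported instead of failing on a missing class.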
                
> Binaries contain 1.6 classes
> ----------------------------
>
>                 Key: SOLR-3295
>                 URL: https://issues.apache.org/jira/browse/SOLR-3295
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Dawid Weiss
>            Priority: Minor
>             Fix For: 3.6
>
>         Attachments: output.log
>
>
> I've run this tool (does the job): http://code.google.com/p/versioncheck/ on 
> the checkout of branch_3x. To my surprise there is a JAR which contains Java 
> 1.6 code:
> {noformat}
> Major.Minor Version : 50.0             JAVA compatibility : Java 1.6 
> platform: 45.3-50.0
> Number of classes : 60
> Classes are : 
> c:\Work\lucene-solr\.\solr\contrib\extraction\lib\netcdf-4.2-min.jar [:] 
> ucar/unidata/geoloc/Bearing.class
> ...
> {noformat}
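
For reference, the "Major.Minor Version : 50.0" figure in the report above comes from the class file header: four magic bytes (0xCAFEBABE), then a two-byte minor version and a two-byte major version, with major 50 corresponding to Java 1.6. The following is a minimal sketch of such a check, not the versioncheck tool itself; the class name ClassVersionCheck and the threshold of 50 are assumptions made for illustration.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Hypothetical checker: reports classes in a JAR compiled for Java 1.6 or newer. */
final class ClassVersionCheck {

    /** Returns the major version from the first 8 bytes of a class file. */
    static int majorVersion(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 header bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(header); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                 // skip the minor version
        return buf.getShort() & 0xFFFF; // major version, read as unsigned
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClassVersionCheck <file.jar>");
            return;
        }
        try (JarFile jar = new JarFile(args[0])) {
            for (Enumeration<JarEntry> e = jar.entries(); e.hasMoreElements();) {
                JarEntry entry = e.nextElement();
                if (!entry.getName().endsWith(".class")) {
                    continue;
                }
                byte[] header = new byte[8];
                try (DataInputStream in = new DataInputStream(jar.getInputStream(entry))) {
                    in.readFully(header);
                }
                int major = majorVersion(header);
                if (major >= 50) { // 50 = Java 1.6
                    System.out.println(entry.getName() + " major=" + major);
                }
            }
        }
    }
}
```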



