[ 
https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533722#comment-14533722
 ] 

Rishi Verma commented on TIKA-1577:
-----------------------------------

Hi Annie, All

I'm going to take a crack at this. Feel free to assign this to me!

My plan: leverage Tika's ParseContext to give a couple of content extraction 
"modes". Doing so will allow the developer to configure netCDF variable 
extraction to better scale with huge amounts of variable content. I'm aiming 
for the following modes:
  1. "Default Mode": default to either Zero Mode or Preview Mode.
  2. "Zero Mode": no variable content is read. This is the same as the current 
capability.
  3. "Preview Mode": a limited amount of variable content is read, starting 
from index zero. Probably one or two indices only, since the text buffer can 
become massive very quickly.
  4. "Custom Mode": provide the ability to specify a custom variable Range to 
extract for ALL variables. If the range is greater than the size of a 
respective dimension within a variable, then only the maximum size of that 
dimension will be extracted. I'm specifically targeting a custom Range that 
applies to all variables concurrently, because Tika's philosophy (to me) seems 
to presume limited knowledge of the actual data. Plus, if the user has a very 
specific use case, such as needing to extract a particular variable's 
slice + range + step, then IMO Tika is not the tool to use; instead, the 
netCDF library should be used (which gives this type of maximum flexibility).
  5. "Full Mode": extract all variable content. Note, this can result in a Tika 
exception if more than 100,000 characters are extracted when calling 
"handler.toString()".
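
To make the modes above a bit more concrete, here is a rough sketch of how 
each mode could map to the number of indices read along each dimension of a 
variable. All names here (NetCDFExtractionSketch, Mode, PREVIEW_LIMIT, 
extractCounts) are hypothetical and are not existing Tika or netCDF API. The 
sketch also assumes the custom Range always starts at index zero and is 
described by a length only, and it leaves out the ParseContext wiring so that 
it stands alone.

```java
/** Hypothetical sketch; none of these names exist in Tika or the netCDF library. */
final class NetCDFExtractionSketch {

    /** The proposed modes; "Default Mode" would simply alias ZERO or PREVIEW. */
    enum Mode { ZERO, PREVIEW, CUSTOM, FULL }

    /** Assumed preview size: "one or two indices only". */
    static final int PREVIEW_LIMIT = 2;

    /**
     * Returns how many indices to read along each dimension, starting at index zero.
     * customLength is only consulted in CUSTOM mode and is clamped to each dimension's size.
     */
    static int[] extractCounts(Mode mode, int[] shape, int customLength) {
        int[] counts = new int[shape.length];
        for (int i = 0; i < shape.length; i++) {
            switch (mode) {
                case ZERO:
                    counts[i] = 0;                                   // header only, no data
                    break;
                case PREVIEW:
                    counts[i] = Math.min(PREVIEW_LIMIT, shape[i]);
                    break;
                case CUSTOM:
                    // A range larger than the dimension is cut back to the dimension size.
                    counts[i] = Math.min(Math.max(customLength, 0), shape[i]);
                    break;
                case FULL:
                    counts[i] = shape[i];
                    break;
            }
        }
        return counts;
    }
}
```

For example, with a variable of shape {1, 128, 256} and a custom length of 
200, the one Range is applied to every dimension and clamped where needed, so 
the counts come out as {1, 128, 200}.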

In terms of XHTML structure, I'm thinking of a nested "<ul><li>" structure 
that starts with the left-most dimension of a given variable and generates 
inner "<ul><li>" structures for each subsequent dimension's data. Doing this 
will provide some visible structure when rendered on a viewer's screen, and 
will also make for much easier parsing via XML than a single giant list of 
variable content.
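
As a rough illustration of that nesting, the sketch below renders a variable's 
values as nested "<ul><li>" markup, with the left-most dimension outermost. 
The names (NestedListSketch, render, renderDim) are hypothetical. It assumes 
the values arrive as a flat row-major int array with at least one dimension, 
and it builds a plain String rather than emitting SAX events, which is what a 
real Tika parser would presumably do through its content handler.

```java
/** Hypothetical sketch of the nested list layout; not part of Tika. */
final class NestedListSketch {

    /** Renders row-major data of the given shape; the left-most dimension is outermost. */
    static String render(int[] data, int[] shape) {
        StringBuilder out = new StringBuilder();
        renderDim(data, shape, 0, 0, out);
        return out.toString();
    }

    /** Writes one <ul> for dimension dim and returns the next unread offset into data. */
    private static int renderDim(int[] data, int[] shape, int dim, int offset,
                                 StringBuilder out) {
        out.append("<ul>");
        for (int i = 0; i < shape[dim]; i++) {
            out.append("<li>");
            if (dim == shape.length - 1) {
                out.append(data[offset++]);                 // innermost dimension: a value
            } else {
                offset = renderDim(data, shape, dim + 1, offset, out);  // recurse inward
            }
            out.append("</li>");
        }
        out.append("</ul>");
        return offset;
    }
}
```

Calling render(new int[]{1, 2, 3, 4, 5, 6}, new int[]{2, 3}) gives an outer 
list with two items, each holding an inner list of three values.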

> NetCDF Data Extraction
> ----------------------
>
>                 Key: TIKA-1577
>                 URL: https://issues.apache.org/jira/browse/TIKA-1577
>             Project: Tika
>          Issue Type: Improvement
>          Components: handler, parser
>    Affects Versions: 1.7
>            Reporter: Ann Burgess
>            Assignee: Ann Burgess
>              Labels: features, handler
>             Fix For: 1.9
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> A netCDF classic or 64-bit offset dataset is stored as a single file 
> comprising two parts:
>  - a header, containing all the information about dimensions, attributes, and 
> variables except for the variable data;
>  - a data part, comprising fixed-size data, containing the data for variables 
> that don't have an unlimited dimension; and variable-size data, containing 
> the data for variables that have an unlimited dimension.
> The NetCDFparser currently extracts the "header part".  
>  -- text extracts file Dimensions and Variables
>  -- metadata extracts Global Attributes
> We want the option to extract the "data part" of NetCDF files.  
> Let's use the NetCDF test file for our dev testing:  
> tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_200001.nc
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
