[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538319#comment-14538319 ]

Rishi Verma commented on TIKA-1577:
-----------------------------------

Hey Chris - thanks for your suggestions.

bq. just have that many tds, each one representing the cols, and trs each one representing the rows.

I'm not sure that nested tds and trs by themselves would produce valid XHTML. For example, I tried the markup below against a W3 validator [1], but it did not pass.

2x2x2 matrix (3-dimensional):

    <table>
      <tr>
        <td>
          <tr> <td>102.1</td> <td>102.2</td> </tr>
          <tr> <td>102.3</td> <td>102.4</td> </tr>
        </td>
        <td>
          <tr> <td>202.1</td> <td>202.2</td> </tr>
          <tr> <td>202.3</td> <td>202.4</td> </tr>
        </td>
      </tr>
    </table>

[1] http://validator.w3.org/nu/

NetCDF Data Extraction
----------------------

                Key: TIKA-1577
                URL: https://issues.apache.org/jira/browse/TIKA-1577
            Project: Tika
         Issue Type: Improvement
         Components: handler, parser
   Affects Versions: 1.7
           Reporter: Ann Burgess
           Assignee: Ann Burgess
             Labels: features, handler
            Fix For: 1.9
  Original Estimate: 504h
 Remaining Estimate: 504h

A netCDF classic or 64-bit offset dataset is stored as a single file comprising two parts:
 - a header, containing all the information about dimensions, attributes, and variables except for the variable data;
 - a data part, comprising fixed-size data, containing the data for variables that don't have an unlimited dimension; and variable-size data, containing the data for variables that have an unlimited dimension.

The NetCDFParser currently extracts the header part.
 -- text extracts file Dimensions and Variables
 -- metadata extracts Global Attributes

We want the option to extract the data part of NetCDF files.

Let's use the NetCDF test file for our dev testing:
tika/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_21.nc

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538330#comment-14538330 ]

Rishi Verma commented on TIKA-1577:
-----------------------------------

That being said, fully nested tables, where a td contains an entire table structure within it, seem to be okay. This might make sense to do, so that the XHTML output is on par with the Tika Excel structured output (which I believe is limited to 2-D). I'm going to think about this a bit and see what may work.
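The fully nested-table idea described above can be sketched in a few lines. This is only an illustration, assuming a 3-D variable that has already been read into a double[][][]; the class and method names (NestedTableSketch, table2d, table3d) are hypothetical and not part of Tika. A real parser would more likely emit SAX events through Tika's XHTMLContentHandler than build a string.

```java
final class NestedTableSketch {

    /** Renders one 2-D slice as a plain table: one tr per row, one td per value. */
    static String table2d(double[][] slice) {
        StringBuilder sb = new StringBuilder("<table>");
        for (double[] row : slice) {
            sb.append("<tr>");
            for (double v : row) {
                sb.append("<td>").append(v).append("</td>");
            }
            sb.append("</tr>");
        }
        return sb.append("</table>").toString();
    }

    /**
     * Renders a 3-D variable as an outer table with a single row.
     * Each outer td holds a complete inner table for one slice of the
     * left-most dimension, so no tr is ever a direct child of a td.
     */
    static String table3d(double[][][] cube) {
        StringBuilder sb = new StringBuilder("<table><tr>");
        for (double[][] slice : cube) {
            sb.append("<td>").append(table2d(slice)).append("</td>");
        }
        return sb.append("</tr></table>").toString();
    }

    public static void main(String[] args) {
        // The 2x2x2 values from the validator experiment in this thread.
        double[][][] cube = {
            {{102.1, 102.2}, {102.3, 102.4}},
            {{202.1, 202.2}, {202.3, 202.4}}
        };
        System.out.println(table3d(cube));
    }
}
```

Variables with more than three dimensions would need the same wrapping applied recursively, which this sketch does not attempt.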
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538380#comment-14538380 ]

Chris A. Mattmann commented on TIKA-1577:
-----------------------------------------

Hey Rishi, I wasn't intending nested tds and trs. I simply meant a regular valid XHTML table, e.g., Nx3 dimensional output:

    <table>
      <th> <td>Variable1 Value</td> <td>Latitude</td> <td>Longitude</td> </th>
      <tr> <td>52.1</td> <td>134.1</td> <td>-50.0</td> </tr>
      <!-- more data -->
    </table>
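A flat Nx3 table of this kind could be produced roughly as follows. This is a sketch under stated assumptions: the variable is 2-D over latitude and longitude, the coordinate arrays are already in memory, and the names FlatTableSketch and render are hypothetical rather than anything in Tika. The header row is written here as th cells inside a tr, which is the form a validator accepts.

```java
final class FlatTableSketch {

    /**
     * Emits one row per (lat, lon) cell: the variable value followed by
     * its two coordinates. Escaping of varName is omitted for brevity.
     */
    static String render(String varName, double[] lat, double[] lon, double[][] values) {
        StringBuilder sb = new StringBuilder("<table>\n");
        sb.append("<tr><th>").append(varName).append(" Value</th>")
          .append("<th>Latitude</th><th>Longitude</th></tr>\n");
        for (int i = 0; i < lat.length; i++) {
            for (int j = 0; j < lon.length; j++) {
                sb.append("<tr><td>").append(values[i][j]).append("</td>")
                  .append("<td>").append(lat[i]).append("</td>")
                  .append("<td>").append(lon[j]).append("</td></tr>\n");
            }
        }
        return sb.append("</table>").toString();
    }

    public static void main(String[] args) {
        // Same sample values as the example row in the comment above.
        System.out.println(render("Variable1",
                new double[] {134.1}, new double[] {-50.0},
                new double[][] {{52.1}}));
    }
}
```

The row count grows as the product of all dimension sizes, so this layout stays 2-D at the cost of repeating coordinate values on every row.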
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534847#comment-14534847 ]

Ann Burgess commented on TIKA-1577:
-----------------------------------

Take it away [~riverma]!
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534854#comment-14534854 ]

Chris A. Mattmann commented on TIKA-1577:
-----------------------------------------

Rishi, I love everything about the proposal but the structure - why use a UL when this is inherently tabular data (aka matrix oriented)?
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533722#comment-14533722 ]

Rishi Verma commented on TIKA-1577:
-----------------------------------

Hi Annie, All,

I'm going to take a crack at this. Feel free to assign this to me!

My plan: leverage Tika's ParseContext to offer a few content extraction modes. Doing so will let the developer configure netCDF variable extraction so that it scales better with huge amounts of variable content.

I'm aiming for the following modes:

1. Default Mode: default to either Zero Mode or Preview Mode.
2. Zero Mode: no variable content is read. This is the same as the current capability.
3. Preview Mode: a limited amount of variable content is read, starting from index zero. Probably one or two indices only, since the text buffer can become massive very quickly.
4. Custom Mode: provide the ability to specify a custom variable Range to extract for ALL variables. If the range is greater than the size of a respective dimension within a variable, then only up to the maximum size of that dimension will be extracted. I'm specifically targeting a custom Range that applies to all variables at once, because Tika's philosophy (to me) seems to presume limited knowledge of the actual data. Plus, if the user has a very specific use case, such as extracting a particular variable's slice + range + step, then IMO Tika is not the tool to use; instead, the netCDF library should be used (which gives this type of maximum flexibility).
5. Full Mode: extract all variable content. Note, this can result in a Tika exception if more than 100,000 characters are extracted when calling handler.toString().

In terms of XHTML structure, I'm thinking of a nested <ul><li> structure that starts with the left-most dimension for a given variable and generates inner <ul><li> structures for each subsequent dimension's data. Doing this will provide some visible structure when rendering to a viewer's screen, and will also make parsing via XML much easier than a giant single list of variable content.
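The extraction modes in the plan above, including the rule that a requested range is cut down to the size of each dimension, might look something like the sketch below. Everything here is hypothetical: the Mode enum, the PREVIEW_COUNT value of 2, and the simplification that a custom range always starts at index zero are assumptions made for illustration, not an existing Tika type or setting. In a real parser such a setting would presumably be looked up from the ParseContext passed to parse().

```java
final class ExtractionModeSketch {

    /** Hypothetical mode names mirroring the proposal; not an existing Tika type. */
    enum Mode { ZERO, PREVIEW, CUSTOM, FULL }

    /** Assumed preview length ("one or two indices"). */
    static final int PREVIEW_COUNT = 2;

    /**
     * Returns how many indices to read along each dimension of a variable,
     * always starting at index zero (an assumption of this sketch).
     * A requested count larger than a dimension is clamped to that
     * dimension's size; customCount is only consulted in CUSTOM mode.
     */
    static int[] countsFor(Mode mode, int[] shape, int customCount) {
        int[] counts = new int[shape.length];
        for (int d = 0; d < shape.length; d++) {
            int wanted;
            switch (mode) {
                case ZERO:    wanted = 0; break;
                case PREVIEW: wanted = PREVIEW_COUNT; break;
                case CUSTOM:  wanted = Math.max(0, customCount); break;
                default:      wanted = shape[d]; break; // FULL
            }
            counts[d] = Math.min(wanted, shape[d]);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A made-up variable shape: 1 time step, 128 latitudes, 256 longitudes.
        int[] shape = {1, 128, 256};
        System.out.println(java.util.Arrays.toString(countsFor(Mode.CUSTOM, shape, 10)));
    }
}
```

Applying one count to every dimension keeps the caller from needing to know the variable names or shapes in advance, which matches the reasoning given for Custom Mode.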
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389901#comment-14389901 ]

Rishi Verma commented on TIKA-1577:
-----------------------------------

Hi Annie, Chris,

That architecture looks good, although I don't know if we'd be able to leverage any code from NCDumpW to help develop TikaParser or ScientificContentHandler.

We might want to give some thought to a CSV type output as well. I think that would have broad applicability for client applications.

NetCDF Data Extraction
----------------------

                Key: TIKA-1577
                URL: https://issues.apache.org/jira/browse/TIKA-1577
            Project: Tika
         Issue Type: Improvement
         Components: handler, parser
   Affects Versions: 1.7
           Reporter: Ann Burgess
           Assignee: Ann Burgess
             Labels: features, handler
            Fix For: 1.8
  Original Estimate: 504h
 Remaining Estimate: 504h
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385408#comment-14385408 ]

Chris A. Mattmann commented on TIKA-1577:
-----------------------------------------

Agreed, if we can reuse this, then great. The one catch is that I'm not sure that dump capability generates a table or something in an XHTML representation, which is our basis representation in Tika. I would like us to consider the output of this issue to be:

 - TikaParser generates XHTML tabular and other elements that represent the data in the NetCDF file
 - we create something like a ScientificContentHandler that can then take that output from the parser (in the data section) and then format it, e.g., like NCDump.

Sound good?
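The second item, a handler that takes the parser's tabular XHTML and reformats it, can be illustrated with plain SAX from the JDK. This is a rough sketch only: the class name ScientificContentHandlerSketch, its methods, and the output layout (loosely in the style of ncdump's data section) are all assumptions, and it handles a flat table of td cells, not nested tables. In Tika the handler would receive SAX events directly from the parser; the string-parsing helper here exists only so the example runs on its own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class ScientificContentHandlerSketch extends DefaultHandler {
    private final List<String> rows = new ArrayList<>();
    private final List<String> cells = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inCell;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("td".equals(qName)) {   // start collecting the text of one data cell
            inCell = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCell) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("td".equals(qName)) {
            cells.add(text.toString().trim());
            inCell = false;
        } else if ("tr".equals(qName) && !cells.isEmpty()) {
            // Rows made only of th cells (headers) leave cells empty and are skipped.
            rows.add(String.join(", ", cells));
            cells.clear();
        }
    }

    /** Comma-separated values, one table row per line, closed by " ;". */
    String formatted(String varName) {
        return varName + " =\n  " + String.join(",\n  ", rows) + " ;";
    }

    /** Convenience for this sketch: parse an XHTML string and format its data rows. */
    static String format(String varName, String xhtml) {
        ScientificContentHandlerSketch handler = new ScientificContentHandlerSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
        return handler.formatted(varName);
    }

    public static void main(String[] args) {
        System.out.println(format("Variable1",
                "<table><tr><td>52.1</td><td>134.1</td><td>-50.0</td></tr></table>"));
    }
}
```

Keeping the formatting in a handler means the parser only has to produce one XHTML representation, and other output styles such as CSV could be added as further handlers.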
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384323#comment-14384323 ]

Ann Burgess commented on TIKA-1577:
-----------------------------------

This is a great idea. I'm all for not re-creating code if it already exists in good form!
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383040#comment-14383040 ]

Rishi Verma commented on TIKA-1577:
-----------------------------------

Hi All,

Based on a chat with @Giuseppe, what about leveraging the pre-existing NCDump output capability when generating Tika plain text from a netCDF file?

See the below two classes we could leverage:
* http://www.unidata.ucar.edu/software/thredds/v4.5/netcdf-java/javadoc/ucar/nc2/NCdumpW.html
* http://www.unidata.ucar.edu/software/thredds/v4.5/netcdf-java/javadoc/ucar/nc2/NCdumpW.WantValues.html
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378993#comment-14378993 ]

Rishi Verma commented on TIKA-1577:
-----------------------------------

Thanks Ann. Appreciate it.

For others' reference, the latest docs (summarized) are here too: http://www.unidata.ucar.edu/software/netcdf/docs/index.html
[jira] [Commented] (TIKA-1577) NetCDF Data Extraction
[ https://issues.apache.org/jira/browse/TIKA-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370177#comment-14370177 ]

Ann Burgess commented on TIKA-1577:
-----------------------------------

[~riverma] this is a good place to start: http://www.unidata.ucar.edu/software/netcdf/old_docs/really_old/guide_toc.html

NetCDF Data Extraction
----------------------

                Key: TIKA-1577
                URL: https://issues.apache.org/jira/browse/TIKA-1577
            Project: Tika
         Issue Type: Improvement
         Components: handler, parser
   Affects Versions: 1.7
           Reporter: Ann Burgess
           Assignee: Ann Burgess
             Labels: features, handler
  Original Estimate: 504h
 Remaining Estimate: 504h