[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat
[ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559872#comment-16559872 ]

Hudson commented on TIKA-2462:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x #63 (See [https://builds.apache.org/job/tika-branch-1x/63/])
TIKA-2462 Initial parser for SAS7BDAT files powered by Parso (now (tallison: [https://github.com/apache/tika/commit/2d19fe0ad26f68c6cc2caeaed713cd28179dede7])
* (add) tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java
* (edit) tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

> Add a parser for sas7bdat
> -------------------------
>
>                 Key: TIKA-2462
>                 URL: https://issues.apache.org/jira/browse/TIKA-2462
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>             Fix For: 2.0, 1.19
>
>
> EPAM recently agreed to migrate to Apache 2.0 so that we can incorporate parso into Tika for sas7bdat files: https://github.com/epam/parso/issues/19
> !!!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat
[ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559815#comment-16559815 ]

Tim Allison commented on TIKA-2462:
-----------------------------------

I cherry-picked these into branch_1x so they'll be included in 1.19.
[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat
[ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455569#comment-16455569 ]

Hudson commented on TIKA-2462:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1475 (See [https://builds.apache.org/job/Tika-trunk/1475/])
TIKA-2462 Initial parser for SAS7BDAT files powered by Parso (now (nick: [https://github.com/apache/tika/commit/754fb4c93b7229abc3512168228df2c269fb5274])
* (add) tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java
* (edit) tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat
[ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455563#comment-16455563 ]

Hudson commented on TIKA-2462:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #240 (See [https://builds.apache.org/job/tika-2.x-windows/240/])
TIKA-2462 Initial parser for SAS7BDAT files powered by Parso (now (nick: rev 754fb4c93b7229abc3512168228df2c269fb5274)
* (add) tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java
* (edit) tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat
[ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455534#comment-16455534 ]

Nick Burch commented on TIKA-2462:
----------------------------------

Using the newly-released 2.0.9 version of parso, I've added a basic text-only parser in 754fb4c93b7229abc3512168228df2c269fb5274.

It still needs metadata and unit testing, which I'll aim to add in the next few days, unless someone beats me to it!
[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat
[ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445791#comment-16445791 ]

Nick Burch commented on TIKA-2462:
----------------------------------

The relicense has gone through. We're awaiting the 2.0.9 release, which gives easy access to the formatted values to output, before we add the parser.

Still needed: "moderately complicated" columnar test files for Excel / CSV, which we can then convert to the other formats (CSV, XLS, XLSX, ODS, sas7bdat, DB formats etc) and use to check for consistency between the parsers.
[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat
[ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308429#comment-16308429 ]

Nick Burch commented on TIKA-2462:
----------------------------------

While we wait for the re-license to go through, I've had a look at writing a parser.

Outputting as CSV is very easy, as they've got a great class to do all the work. SAX events of an HTML table will be trickier, as the logic to format a raw value in a given column to "a string of how it looks in SAS" is currently in a private method. I've raised [#24|https://github.com/epam/parso/issues/24] to see if that can be refactored out, to avoid us needing to duplicate lots of their code.

Tika questions on column metadata, test files etc still remain for us though!
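To illustrate what "format a raw value to how it looks in SAS" involves, here is a minimal sketch for date columns only. It relies on the SAS convention that dates are stored as days, and datetimes as seconds, since 1 January 1960. The class `SasDateSketch` and its method names are hypothetical and are not Parso's API; the sketch also assumes the column is displayed with a DATE9.-style layout, which the thread does not confirm.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.Locale;

// Hypothetical helper, not part of Parso or Tika: converts raw SAS numbers to dates.
class SasDateSketch {
    // SAS counts dates from 1 January 1960.
    private static final LocalDate SAS_EPOCH = LocalDate.of(1960, 1, 1);

    private static final String[] MONTHS = {
        "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
        "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"
    };

    // A DATE column holds a number of days since the SAS epoch.
    static LocalDate toLocalDate(Number rawDays) {
        return SAS_EPOCH.plusDays((long) Math.floor(rawDays.doubleValue()));
    }

    // A DATETIME column holds a number of seconds since the SAS epoch.
    static LocalDateTime toLocalDateTime(Number rawSeconds) {
        return SAS_EPOCH.atStartOfDay()
                .plusSeconds((long) Math.floor(rawSeconds.doubleValue()));
    }

    // Assumed DATE9.-style rendering, e.g. 06MAR2015.
    static String formatDate9(Number rawDays) {
        LocalDate d = toLocalDate(rawDays);
        return String.format(Locale.ROOT, "%02d%s%04d",
                d.getDayOfMonth(), MONTHS[d.getMonthValue() - 1], d.getYear());
    }

    public static void main(String[] args) {
        System.out.println(formatDate9(20153));           // 06MAR2015
        System.out.println(toLocalDateTime(1741288219L)); // 2015-03-06T19:10:19
    }
}
```

Other SAS formats (numeric widths, currency, time-of-day and so on) would each need their own handling, which is why reusing the library's own formatting logic is preferable to duplicating it.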
[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat
[ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167063#comment-16167063 ]

Nick Burch commented on TIKA-2462:
----------------------------------

I've just had a quick try with the library, against a test SAS file with 5 columns, each of a different type. Looking at the properties on the file, and on the columns, Parso is able to return:

{{{
u64 - false
compressionMethod - null
endianness - 1
encoding - windows-1252
sessionEncoding - null
name - SHEET1
fileType - DATA
dateCreated - Fri Mar 06 19:10:19 GMT 2015
dateModified - Fri Mar 06 19:10:19 GMT 2015
sasRelease - 9.0101M3
serverType - XP_PRO
osName -
osType -
headerLength - 1024
pageLength - 8192
pageCount - 1
rowLength - 96
rowCount - 31
mixPageRowCount - 69
columnsCount - 5

5 Columns defined:
 1 - A
   Label: A
   Format: $
   Size 58 of java.lang.String
 2 - B
   Label: B
   Format:
   Size 8 of java.lang.Number
 3 - C
   Label: C
   Format: DATE
   Size 8 of java.lang.Number
 4 - D
   Label: D
   Format: DATETIME
   Size 8 of java.lang.Number
 5 - E
   Label: E
   Format:
   Size 8 of java.lang.Number
}}}

I guess we'd want to map some of the file properties onto standard keys, and the rest onto custom ones?

For the data, I guess we output SAX events for an HTML-like table. I'm not sure about the column metadata: are there any patterns we can copy from any of the database formats or other scientific dataset formats?

Also, we only seem to have 1 fairly simple test sas7bdat file in the Tika Parsers test documents area. Do we have a standard "moderately complicated" tabular test file (e.g. XLS, CSV) which I could get a SAS version made of, so we can have largely the same test data between formats?
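The "SAX events for an HTML-like table" idea from the comment above can be sketched with JDK classes alone. This is only an illustration under stated assumptions: the class `TableSaxSketch`, its methods, and the sample rows are hypothetical, the cell values are assumed to be already formatted as strings, and a real Tika parser would write to the `ContentHandler` it is handed rather than serialising to a string.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch: emits a header row and data rows as table SAX events.
class TableSaxSketch {
    private static final AttributesImpl NO_ATTRS = new AttributesImpl();

    // Emit one element containing only text.
    private static void cell(ContentHandler h, String tag, String text) throws SAXException {
        h.startElement("", tag, tag, NO_ATTRS);
        char[] chars = text.toCharArray();
        h.characters(chars, 0, chars.length);
        h.endElement("", tag, tag);
    }

    // Column names go into <th> cells, each data row into <td> cells.
    static void emitTable(ContentHandler h, List<String> columns, List<List<String>> rows)
            throws SAXException {
        h.startElement("", "table", "table", NO_ATTRS);
        h.startElement("", "tr", "tr", NO_ATTRS);
        for (String name : columns) {
            cell(h, "th", name);
        }
        h.endElement("", "tr", "tr");
        for (List<String> row : rows) {
            h.startElement("", "tr", "tr", NO_ATTRS);
            for (String value : row) {
                cell(h, "td", value);
            }
            h.endElement("", "tr", "tr");
        }
        h.endElement("", "table", "table");
    }

    // Convenience for the demo: serialise the SAX events to a string.
    static String toXml(List<String> columns, List<List<String>> rows) {
        try {
            SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));
            handler.startDocument();
            emitTable(handler, columns, rows);
            handler.endDocument();
            return out.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not serialise table", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the test file described above.
        System.out.println(toXml(List.of("A", "B"), List.of(List.of("x", "1"), List.of("y", "2"))));
    }
}
```

Keeping `emitTable` dependent only on `ContentHandler` means the same method could be pointed at whatever handler the caller supplies, with the string serialisation used purely for inspection.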