[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat

2018-07-27 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559872#comment-16559872
 ] 

Hudson commented on TIKA-2462:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #63 (See 
[https://builds.apache.org/job/tika-branch-1x/63/])
TIKA-2462 Initial parser for SAS7BDAT files powered by Parso (now (tallison: 
[https://github.com/apache/tika/commit/2d19fe0ad26f68c6cc2caeaed713cd28179dede7])
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java
* (edit) 
tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser


> Add a parser for sas7bdat
> -
>
> Key: TIKA-2462
> URL: https://issues.apache.org/jira/browse/TIKA-2462
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
> Fix For: 2.0, 1.19
>
> EPAM recently agreed to migrate to Apache 2.0 so that we can incorporate 
> parso into Tika for sas7bdat files: https://github.com/epam/parso/issues/19 
> !!!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat

2018-07-27 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559815#comment-16559815
 ] 

Tim Allison commented on TIKA-2462:
---

I cherry-picked these into branch_1x so they'll be included in 1.19.



[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455569#comment-16455569
 ] 

Hudson commented on TIKA-2462:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1475 (See 
[https://builds.apache.org/job/Tika-trunk/1475/])
TIKA-2462 Initial parser for SAS7BDAT files powered by Parso (now (nick: 
[https://github.com/apache/tika/commit/754fb4c93b7229abc3512168228df2c269fb5274])
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java
* (edit) 
tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser




[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455563#comment-16455563
 ] 

Hudson commented on TIKA-2462:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #240 (See 
[https://builds.apache.org/job/tika-2.x-windows/240/])
TIKA-2462 Initial parser for SAS7BDAT files powered by Parso (now (nick: rev 
754fb4c93b7229abc3512168228df2c269fb5274)
* (add) 
tika-parsers/src/main/java/org/apache/tika/parser/sas/SAS7BDATParser.java
* (edit) 
tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser




[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat

2018-04-26 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455534#comment-16455534
 ] 

Nick Burch commented on TIKA-2462:
--

Using the newly released 2.0.9 version of Parso, I've added a basic text-only 
parser in 754fb4c93b7229abc3512168228df2c269fb5274. It still needs metadata 
support and unit tests, which I'll aim to add in the next few days, unless 
someone beats me to it!



[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat

2018-04-20 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445791#comment-16445791
 ] 

Nick Burch commented on TIKA-2462:
--

The relicense has gone through. We're now waiting for the Parso 2.0.9 release, 
which gives easy access to the formatted values we need to output, before we 
add the parser.

Still needed: "moderately complicated" columnar test files in Excel / CSV, 
which we can then convert to the other formats (CSV, XLS, XLSX, ODS, sas7bdat, 
DB formats etc) and use to check for consistency between the parsers.
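
As a rough illustration of that consistency check, here is a minimal sketch that compares the cell text extracted from the same test data in two formats. It assumes each parser's output has already been collected into rows of strings; the class and method names are hypothetical and nothing here is existing Tika test code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: compares two extracted tables cell by cell. */
final class TableConsistencySketch {

    /** Trims and collapses whitespace so that formatting noise is not reported as a difference. */
    static String normalise(String cell) {
        return cell == null ? "" : cell.trim().replaceAll("\\s+", " ");
    }

    /** Returns one human-readable message per mismatch; an empty list means the tables agree. */
    static List<String> differences(List<List<String>> reference, List<List<String>> candidate) {
        List<String> diffs = new ArrayList<>();
        if (reference.size() != candidate.size()) {
            diffs.add("row count: " + reference.size() + " vs " + candidate.size());
        }
        int rows = Math.min(reference.size(), candidate.size());
        for (int r = 0; r < rows; r++) {
            List<String> a = reference.get(r);
            List<String> b = candidate.get(r);
            if (a.size() != b.size()) {
                diffs.add("row " + r + " column count: " + a.size() + " vs " + b.size());
            }
            int cols = Math.min(a.size(), b.size());
            for (int c = 0; c < cols; c++) {
                if (!normalise(a.get(c)).equals(normalise(b.get(c)))) {
                    diffs.add("row " + r + ", column " + c + ": '" + a.get(c)
                            + "' vs '" + b.get(c) + "'");
                }
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        // Made-up rows standing in for CSV and sas7bdat extractions of the same data.
        List<List<String>> fromCsv = Arrays.asList(Arrays.asList("A", "B"), Arrays.asList("foo", "1"));
        List<List<String>> fromSas = Arrays.asList(Arrays.asList("A", "B"), Arrays.asList("foo ", "2"));
        System.out.println(differences(fromCsv, fromSas));
    }
}
```

Dates and numbers would probably need format-aware normalisation before a comparison like this is useful across parsers.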



[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat

2018-01-02 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308429#comment-16308429
 ] 

Nick Burch commented on TIKA-2462:
--

While we wait for the re-license to go through, I've had a look at writing a 
parser. Outputting as CSV is very easy, as Parso has a great class that does all 
the work. Emitting SAX events for an HTML table will be trickier, as the logic 
that formats a raw value in a given column as "a string of how it looks in SAS" 
is currently in a private method. I've raised 
[#24|https://github.com/epam/parso/issues/24] to see if that can be refactored 
out, so that we don't need to duplicate lots of their code.

The Tika questions on column metadata, test files etc still remain for us though!
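
For reference, the SAX side could look roughly like the sketch below. It assumes the cell values have already been formatted to strings (the part that currently sits in Parso's private method), uses only the JDK's `org.xml.sax` classes rather than Tika's own handler classes, and all class and method names are hypothetical. The `render` helper exists only to make the event order visible; it does no escaping.

```java
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: emit a header row and data rows as SAX events for an HTML table. */
final class SasTableSaxSketch {
    private static final String NS = ""; // no namespace, to keep the sketch short
    private static final Attributes NO_ATTRS = new AttributesImpl();

    /** Fires table/tr/th/td events; cell values are assumed to be pre-formatted strings. */
    static void emitTable(ContentHandler handler, List<String> columns,
                          List<List<String>> rows) throws SAXException {
        handler.startElement(NS, "table", "table", NO_ATTRS);
        emitRow(handler, "th", columns);
        for (List<String> row : rows) {
            emitRow(handler, "td", row);
        }
        handler.endElement(NS, "table", "table");
    }

    private static void emitRow(ContentHandler handler, String cellTag,
                                List<String> cells) throws SAXException {
        handler.startElement(NS, "tr", "tr", NO_ATTRS);
        for (String cell : cells) {
            handler.startElement(NS, cellTag, cellTag, NO_ATTRS);
            char[] text = (cell == null ? "" : cell).toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement(NS, cellTag, cellTag);
        }
        handler.endElement(NS, "tr", "tr");
    }

    /** Records the events as unescaped markup, purely so the event order can be inspected. */
    static String render(List<String> columns, List<List<String>> rows) {
        final StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            emitTable(recorder, columns, rows);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(Arrays.asList("A", "B"),
                Arrays.asList(Arrays.asList("foo", "1"), Arrays.asList("bar", "2"))));
    }
}
```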



[jira] [Commented] (TIKA-2462) Add a parser for sas7bdat

2017-09-14 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167063#comment-16167063
 ] 

Nick Burch commented on TIKA-2462:
--

I've just had a quick try with the library, against a test SAS file with 5 
columns each of different types. Looking at the properties on the file, and on 
the columns, Parso is able to return:
{{{
u64 - false
compressionMethod - null
endianness - 1
encoding - windows-1252
sessionEncoding - null
name - SHEET1
fileType - DATA
dateCreated - Fri Mar 06 19:10:19 GMT 2015
dateModified - Fri Mar 06 19:10:19 GMT 2015
sasRelease - 9.0101M3
serverType - XP_PRO
osName - 
osType - 
headerLength - 1024
pageLength - 8192
pageCount - 1
rowLength - 96
rowCount - 31
mixPageRowCount - 69
columnsCount - 5

5 Columns defined:
 1 - A
  Label: A
  Format: $
  Size 58 of java.lang.String
 2 - B
  Label: B
  Format: 
  Size 8 of java.lang.Number
 3 - C
  Label: C
  Format: DATE
  Size 8 of java.lang.Number
 4 - D
  Label: D
  Format: DATETIME
  Size 8 of java.lang.Number
 5 - E
  Label: E
  Format: 
  Size 8 of java.lang.Number
}}}
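
Columns C and D are numbers that carry a DATE / DATETIME format. SAS counts dates in days, and datetimes in seconds, from 1 January 1960, so a raw value could be turned into a Java date roughly as in the sketch below. This is illustration only: the class and method names are hypothetical, and Parso may well do this conversion itself before handing values back.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

/** Hypothetical sketch: convert raw SAS numeric date values to java.time types. */
final class SasEpochSketch {
    // SAS date values count from 1 January 1960.
    private static final LocalDate SAS_EPOCH_DATE = LocalDate.of(1960, 1, 1);
    private static final LocalDateTime SAS_EPOCH_DATETIME = SAS_EPOCH_DATE.atStartOfDay();

    /** A SAS DATE value is a number of days since the SAS epoch. */
    static LocalDate fromSasDate(double days) {
        return SAS_EPOCH_DATE.plusDays((long) Math.floor(days));
    }

    /** A SAS DATETIME value is a number of seconds since the SAS epoch. */
    static LocalDateTime fromSasDateTime(double seconds) {
        long whole = (long) Math.floor(seconds);
        long nanos = Math.round((seconds - whole) * 1_000_000_000L);
        return SAS_EPOCH_DATETIME.plusSeconds(whole).plusNanos(nanos);
    }

    public static void main(String[] args) {
        System.out.println(fromSasDate(0));
        System.out.println(fromSasDateTime(86400));
    }
}
```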

I guess we'd want to map some of the file properties onto standard keys, and 
the rest onto custom ones? For the data, I guess we'd output SAX events for an 
HTML-like table. I'm not sure about the column metadata: are there any patterns 
we can copy from the database formats or other scientific dataset formats?
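
To make the standard-versus-custom split concrete, here is a minimal sketch over a plain map of the property names listed above. The target key names and the `sas:` prefix are assumptions for illustration, not a settled mapping, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: route SAS file properties to standard or prefixed custom keys. */
final class SasMetadataMappingSketch {
    // Assumed mapping from Parso property names to standard-looking keys (illustrative only).
    private static final Map<String, String> STANDARD_KEYS = new LinkedHashMap<>();
    static {
        STANDARD_KEYS.put("name", "dc:title");
        STANDARD_KEYS.put("dateCreated", "dcterms:created");
        STANDARD_KEYS.put("dateModified", "dcterms:modified");
        STANDARD_KEYS.put("encoding", "Content-Encoding");
    }

    // Hypothetical prefix for everything without a standard equivalent.
    static final String CUSTOM_PREFIX = "sas:";

    /** Skips null or empty values (such as osName above) and maps the rest. */
    static Map<String, String> map(Map<String, String> sasProperties) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : sasProperties.entrySet()) {
            if (e.getValue() == null || e.getValue().isEmpty()) {
                continue;
            }
            String key = STANDARD_KEYS.getOrDefault(e.getKey(), CUSTOM_PREFIX + e.getKey());
            out.put(key, e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("name", "SHEET1");
        props.put("osName", "");
        props.put("rowCount", "31");
        System.out.println(map(props));
    }
}
```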

Also, we only seem to have 1 fairly simple test sas7bdat file in the Tika 
Parsers test documents area. Do we have a standard "moderately complicated" 
tabular test file (eg XLS, CSV) which I could get a SAS version made of, so we 
can have largely the same test data between formats?
