[jira] [Created] (TIKA-1456) Visual Sentiment API parser

2014-10-23 Thread Chris A. Mattmann (JIRA)
Chris A. Mattmann created TIKA-1456:
---

 Summary: Visual Sentiment API parser
 Key: TIKA-1456
 URL: https://issues.apache.org/jira/browse/TIKA-1456
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7


Integrate the Visual Sentibank API as a parser for images. We can use Aperture 
from CMU; it's released under the MIT license:

https://github.com/d8w/aperture



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1451) Add Recursive Metadata Parser Wrapper output to tika-app and gui

2014-10-23 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182476#comment-14182476
 ] 

Chris A. Mattmann commented on TIKA-1451:
-

Great work, Tim!

> Add Recursive Metadata Parser Wrapper output to tika-app and gui
> 
>
> Key: TIKA-1451
> URL: https://issues.apache.org/jira/browse/TIKA-1451
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
> Attachments: integrate_recursive_metadata_wrapper.patch
>
>
> It would be helpful to expose the output of the recursive metadata parser 
> wrapper in the gui and in the command line for tika-app.





Re: import (re)ordering?

2014-10-23 Thread Mattmann, Chris A (3980)
Hey Tim,

No big objections from me, but it will dilute things so glad we
have it noted if it happens.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: , "Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, October 21, 2014 at 1:59 PM
To: "dev@tika.apache.org" 
Subject: import (re)ordering?

>All,
>  I have IntelliJ set to order imports by javax, java, then other.  I
>think this is the most common pattern in Tika.  Is it OK if I make these
>(meaningless/formatting) changes when I commit other changes?
>  Thank you.
>
>   Best,
>
>  Tim
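
The ordering described above (javax first, then java, then everything else) can be sketched in plain Java. This is only an illustration of the grouping rule, not IntelliJ's actual implementation; the class name ImportOrder and the alphabetical ordering within each group are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: orders import lines as javax, then java, then other. */
final class ImportOrder {

    /** Rank of an import line: 0 for javax.*, 1 for java.*, 2 for anything else. */
    static int group(String importLine) {
        String name = importLine.trim()
                .replaceFirst("^import\\s+(static\\s+)?", "");
        if (name.startsWith("javax.")) {
            return 0;
        }
        if (name.startsWith("java.")) {
            return 1;
        }
        return 2;
    }

    /** Returns a new list sorted by group, then alphabetically (an assumption). */
    static List<String> sort(List<String> importLines) {
        List<String> sorted = new ArrayList<>(importLines);
        sorted.sort((a, b) -> {
            int byGroup = Integer.compare(group(a), group(b));
            return byGroup != 0 ? byGroup : a.compareTo(b);
        });
        return sorted;
    }
}
```

Because only the order of the lines changes, a commit that applies this ordering is a pure formatting change, which is the "dilution" concern raised in the reply.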



[jira] [Commented] (TIKA-443) Geographic Information Parser

2014-10-23 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182462#comment-14182462
 ] 

Chris A. Mattmann commented on TIKA-443:


Guys, I wonder if we should (now 4 years later) standardize on Apache SIS 
(http://sis.apache.org/) and incorporate its support for parsing ISO 19115 
metadata. It seems to have the same types of properties that FDO metadata XML 
has. 

I'm going to give a whirl at creating a GeoParser that extracts information 
from ISO 19115 XML files. [~desruisseaux] FYI [~adamestrada] FYI.

> Geographic Information Parser
> -
>
> Key: TIKA-443
> URL: https://issues.apache.org/jira/browse/TIKA-443
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Arturo Beltran
>Assignee: Chris A. Mattmann
> Attachments: getFDOMetadata.xml
>
>
> I'm working on the automatic description of geospatial resources, and I think 
> it might be interesting to incorporate new parsers into Tika in order to 
> manage and describe some geo-formats. These geo-formats include files, 
> services and databases.
> If anyone is interested in this issue or wants to collaborate, do not hesitate 
> to contact me. Any help is welcome.





[jira] [Assigned] (TIKA-443) Geographic Information Parser

2014-10-23 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned TIKA-443:
--

Assignee: Chris A. Mattmann



[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-23 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182204#comment-14182204
 ] 

Lewis John McGibbney commented on TIKA-1423:


Output looks fantastic. Can you please run 
{code}
mvn dependency:analyze-report
{code}
and see if you can resolve the slf4j-simple conflict between tika-app/pom.xml 
and tika-parsers/pom.xml when you add the netCDF library?
It is probably worth trying to exclude the logging dependency from the netCDF 
dependency, similar to what is done here:
https://github.com/apache/gora/blob/master/gora-accumulo/pom.xml#L144
hth, great work.
Lewis


> Build a parser to extract data from GRIB formats
> 
>
> Key: TIKA-1423
> URL: https://issues.apache.org/jira/browse/TIKA-1423
> Project: Tika
>  Issue Type: New Feature
>  Components: metadata, mime, parser
>Affects Versions: 1.6
>Reporter: Vineet Ghatge
>Assignee: Vineet Ghatge
>Priority: Critical
>  Labels: features, newbie
> Fix For: 1.7
>
> Attachments: GribParser.java, 
> NLDAS_FORA0125_H.A20130112.1200.002.grb, fileName.html, 
> gdas1.forecmwf.2014062612.grib2
>
>
> Arctic dataset contains a MIME format called GRIB - General 
> Regularly-distributed Information in Binary form 
> http://en.wikipedia.org/wiki/GRIB . GRIB is a well-known, concise data format 
> used in meteorology to store historical and weather data. There are 2 
> different types of the format: GRIB 0 and GRIB 2.  
> The focus will be on GRIB 2 which is the most prevalent. Each GRIB record 
> intended for either transmission or storage contains a single parameter with 
> values located at an array of grid points, or represented as a set of 
> spectral coefficients, for a single level (or layer), encoded as a continuous 
> bit stream. Logical divisions of the record are designated as "sections", 
> each of which provides control information and/or data. A GRIB record 
> consists of six sections, two of which are optional: 
>  
> (0) Indicator Section 
> (1) Product Definition Section (PDS) 
> (2) Grid Description Section (GDS) - optional 
> (3) Bit Map Section (BMS) - optional 
> (4) Binary Data Section (BDS) 
> (5) '' (ASCII Characters)
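
As a rough illustration of the Indicator Section listed above, the sketch below checks for the ASCII marker "GRIB" and reads the edition number. It assumes the edition is stored in the eighth octet of the section, which is not stated in the issue text; the class name GribIndicator is hypothetical and is not part of the attached GribParser.java.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads the edition number from a GRIB Indicator Section. */
final class GribIndicator {

    /**
     * Returns the GRIB edition number, or -1 if the bytes do not start with
     * the ASCII marker "GRIB". Assumes the edition is stored in octet 8.
     */
    static int edition(byte[] header) {
        if (header == null || header.length < 8) {
            return -1;
        }
        String marker = new String(header, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(marker)) {
            return -1;
        }
        return header[7] & 0xFF; // octet 8, read as an unsigned byte
    }
}
```

A check like this could be a cheap pre-filter before handing a file to a full GRIB library, since it only needs the first eight bytes of the stream.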





[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-23 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182208#comment-14182208
 ] 

Lewis John McGibbney commented on TIKA-1423:


p.s. do you have a patch against Tika trunk so that we can test? Thanks



[jira] [Updated] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-23 Thread Vineet Ghatge (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Ghatge updated TIKA-1423:

Attachment: fileName.html

Output in HTML



[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-23 Thread Vineet Ghatge (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182194#comment-14182194
 ] 

Vineet Ghatge commented on TIKA-1423:
-

Consumed the parser to get data in HTML format and it works. I have attached 
the output to this issue. However, the netcdfAll-4.5 jar keeps displaying 
these warnings:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/home/vineghlinux/Desktop/CoursesFall2014/CSCI572/DR/netcdfAll-4.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/home/vineghlinux/Desktop/CoursesFall2014/CSCI572/DR/tika-app-1.7-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/home/vineghlinux/Desktop/CoursesFall2014/CSCI572/DR/slf4j-simple-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]

I tried changing Tika's pom.xml, but that did not work either. I am trying to 
remedy this based on http://www.slf4j.org/codes.html#multiple_bindings and 
http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/reference/JarDependencies.html
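
One way to see which jars contribute a binding is to ask the class loader for every copy of the binder class, which is the resource named in the warnings above. This is only a diagnostic sketch using standard ClassLoader calls; the class name BindingFinder is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical diagnostic: lists every classpath location of a resource. */
final class BindingFinder {

    /** The binder class that the SLF4J warnings above report as duplicated. */
    static final String BINDER = "org/slf4j/impl/StaticLoggerBinder.class";

    /** Returns all URLs at which the given resource is visible to the loader. */
    static List<URL> find(ClassLoader loader, String resource) {
        try {
            return new ArrayList<>(Collections.list(loader.getResources(resource)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling BindingFinder.find(Thread.currentThread().getContextClassLoader(), BindingFinder.BINDER) with the same classpath as the failing run should return one URL per jar that bundles a binding. More than one entry corresponds to the "multiple SLF4J bindings" warning.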



[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182047#comment-14182047
 ] 

Tilman Hausherr commented on TIKA-1442:
---

A few files have less metadata than before:
019/019837.pdf
138/138155.pdf
221/221001.pdf
224/224644.pdf
308/308233.pdf
469/469387.pdf
490/490345.pdf
490/490344.pdf
597/597244.pdf
643/643910.pdf

Could you tell me what you get in Tika for the first one?
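
A comparison like this could be automated along the following lines. The sketch assumes each run has already been reduced to a map from file name to the number of metadata keys extracted; the class name MetadataRegression and that input shape are hypothetical, not how the attached spreadsheets were produced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: finds files whose metadata key count dropped between runs. */
final class MetadataRegression {

    /**
     * Returns, in sorted order, the files present in both runs whose metadata
     * key count is lower in the second run than in the first.
     */
    static List<String> filesWithFewerKeys(Map<String, Integer> before,
                                           Map<String, Integer> after) {
        List<String> regressed = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : new TreeMap<>(before).entrySet()) {
            Integer now = after.get(entry.getKey());
            if (now != null && now < entry.getValue()) {
                regressed.add(entry.getKey());
            }
        }
        return regressed;
    }
}
```

Files missing from the second run are skipped on purpose; a file that fails to parse at all is a different kind of regression and would need its own report.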

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip

I'm done now; the result is two new issues, PDFBOX-2448 and PDFBOX-2449. 
However PDFBOX-2448 isn't relevant to 1.8.8.

Many changes are positive ones: files that no longer throw an exception, or 
files that have better text extraction.




[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181813#comment-14181813
 ] 

Tilman Hausherr commented on TIKA-1442:
---

The directory structure isn't a problem for me; I've downloaded all PDF files 
locally into a flat directory. Currently I'm still checking the files by hand, 
but I'll probably write a small script to extract and render with the different 
versions.



[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181779#comment-14181779
 ] 

Tilman Hausherr edited comment on TIKA-1442 at 10/23/14 7:31 PM:
-

Thanks!

I'm slowly starting, and here's the first thing: 892/892848.pdf. This file is 
encrypted and has no text extraction permission. But the line in the Excel file 
does have tokens, which is, uh, surprising.

With the "old" parser, use this code, because files are sometimes encrypted 
with the empty password:
{code}
if( document.isEncrypted() )
{
    try
    {
        // Files are sometimes encrypted with the empty password.
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
        document.openProtection(sdm);
    }
    catch( InvalidPasswordException e )
    {
        System.err.println( "Error: The document is encrypted." );
    }
}
{code}
The nonSeq parser does this automatically.


Same for 892/892859.pdf


was (Author: tilman):
Thanks!

I'm slowly starting, and here's the first thing: 892/892848.pdf. This file is 
encrypted and has no text extraction permission. But the line in the Excel file 
does have tokens, which is, uh, surprising.

With the "old" parser, use this code, because files are sometimes encrypted 
with the empty password:
{code}
if( document.isEncrypted() )
{
    try
    {
        // Files are sometimes encrypted with the empty password.
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
        document.openProtection(sdm);
    }
    catch( InvalidPasswordException e )
    {
        System.err.println( "Error: The document is encrypted." );
    }
}
{code}
The nonSeq parser does this automatically.



[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181799#comment-14181799
 ] 

Tim Allison commented on TIKA-1442:
---

If it is any consolation, the Cyrillic is totally hosed. :)

I'm hoping to get a basic file server set up (thanks to Rackspace) so that I 
can create hyperlinks for the source doc and for the extracted text/metadata so 
that you don't have to go hunting through the directory structure, and so that 
you can see what's extracted without running the app yourself.

That is probably a few weeks off though.



[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181779#comment-14181779
 ] 

Tilman Hausherr commented on TIKA-1442:
---

Thanks!

I'm slowly starting, and here's the first thing: 892/892848.pdf. This file is 
encrypted and has no text extraction permission. But the line in the Excel file 
does have tokens, which is, uh, surprising.

With the "old" parser, use this code, because files are sometimes encrypted 
with the empty password:
{code}
if( document.isEncrypted() )
{
    try
    {
        // Files are sometimes encrypted with the empty password.
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
        document.openProtection(sdm);
    }
    catch( InvalidPasswordException e )
    {
        System.err.println( "Error: The document is encrypted." );
    }
}
{code}
The nonSeq parser does this automatically.



[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser

2014-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181630#comment-14181630
 ] 

Andreas Lehmkühler commented on TIKA-1098:
--

I've finally solved PDFBOX-1273. The fix will be part of the upcoming versions 
1.8.8 and 2.0.0.

Thanks for your patience :-)

> not able to parse pdfs/docs/ppts using 1.1 tika parser
> 
>
> Key: TIKA-1098
> URL: https://issues.apache.org/jira/browse/TIKA-1098
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.1
> Environment: linux redhat
>Reporter: Qian Diao
> Attachments: url_1763_approx-alg-notes.pdf
>
>
> Hi,
> I got some parsing problems when using Tika 1.1 for the attached pdf file.
> my code (Test.java):
> import java.io.File;
> import java.io.InputStream;
> import java.io.FileInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.parser.html.BoilerpipeContentHandler;
> import org.apache.tika.sax.BodyContentHandler;
> import org.apache.tika.parser.html.HtmlParser;
> import de.l3s.boilerpipe.extractors.ArticleExtractor;
> public class Test {
>     private static final String validBoilerpipeFilenameRegEx = 
>         ".*(\\.)(htm|html|shtml|php|asp|aspx)$";
>     public String parseFile(File inFile) {
>         if (inFile == null || !inFile.isFile() || !inFile.canRead()) 
>             return null;
>
>         InputStream is = null;
>         String outputText = "";
>         try {
>             // Open input stream
>             is = new FileInputStream(inFile);
>             // Prepare parser
>             BodyContentHandler contenthandler = new BodyContentHandler(-1);
>             Metadata metadata = new Metadata();
>             metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
>             ParseContext pc = new ParseContext();
>             // Call parse with boilerpipe if valid boilerpipe extension; 
>             // otherwise, call regular parse.
>             if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
>                 Parser parser = new AutoDetectParser();
>                 parser.parse(is, contenthandler, metadata, pc);
>             }
>             else {
>                 Parser parser = new HtmlParser();
>                 BoilerpipeContentHandler bh = new 
>                     BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
>                 parser.parse(is, bh, metadata, pc);
>             }
>             // Prepare text for write
>             outputText = contenthandler.toString();
>         } catch (Exception e) {
>             System.out.println(e);
>             return null;
>         } finally {
>             try { 
>                 if (is != null) 
>                     is.close(); 
>             } catch (Exception e) {}
>         }
>
>         return outputText;
>     }
> }
> =output
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> url_1763_approx-alg-notes.pdf





[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181530#comment-14181530
 ] 

Hong-Thai Nguyen commented on TIKA-1446:


Thanks a lot [~binhawking]. I've had a quick look at your fix. Indeed, there 
are quite a lot of changes. After cleanup and some minor fixes, I broke the CHM tests.

We really appreciate your contribution, and we should continue and finalize it. I've 
created a new pull request based on a branch with your fix plus my cleanup:
https://github.com/apache/tika/pull/21
https://github.com/thaichat04/tika.git, branch TIKA-1446

> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs chaotic text 
> or empty text for this file.
> I have fixed it myself by correcting decompressAlignedBlock() and its 
> preparation methods. Mostly this bug is due to misusing the main tree, the align 
> tree and the length tree. Some trees are also built incorrectly.





[GitHub] tika pull request: CHM Parser Improvement

2014-10-23 Thread thaichat04
GitHub user thaichat04 opened a pull request:

https://github.com/apache/tika/pull/21

CHM Parser Improvement

This pull request improves the Tika CHM Parser.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thaichat04/tika TIKA-1446

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/21.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21


commit ac354e4fe22daf60326d240190c5da32cded6443
Author: hong-thai.nguyen 
Date:   2014-10-23T16:12:10Z

TIKA-1446 - Apply fix of [~binhawking] and some cleanup




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181518#comment-14181518
 ] 

ASF GitHub Bot commented on TIKA-1446:
--

Github user thaichat04 closed the pull request at:

https://github.com/apache/tika/pull/20




[GitHub] tika pull request: TIKA-1446

2014-10-23 Thread thaichat04
Github user thaichat04 closed the pull request at:

https://github.com/apache/tika/pull/20




[jira] [Resolved] (TIKA-1455) Upgrade GSON dependency

2014-10-23 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1455.
---
Resolution: Fixed

r1633850

> Upgrade GSON dependency
> ---
>
> Key: TIKA-1455
> URL: https://issues.apache.org/jira/browse/TIKA-1455
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
>






[jira] [Created] (TIKA-1455) Upgrade GSON dependency

2014-10-23 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1455:
-

 Summary: Upgrade GSON dependency
 Key: TIKA-1455
 URL: https://issues.apache.org/jira/browse/TIKA-1455
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Trivial








[GitHub] tika pull request: TIKA-1446

2014-10-23 Thread thaichat04
GitHub user thaichat04 opened a pull request:

https://github.com/apache/tika/pull/20

TIKA-1446

TIKA-1430, TIKA-1446, TIKA-1447, TIKA-1448: CHM Parser improvement

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/tika 1.6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/20.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20
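
A minimal sketch of such a closing commit, run in a throwaway repository so that nothing real is touched. The summary line and the identity are made up for illustration; only the "This closes #20" line comes from the message above.

```shell
#!/bin/sh
# Sketch: create a commit whose message carries the closing keyword line.
set -e

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.email "committer@example.org"   # hypothetical identity
git config user.name "Committer"

# An empty commit is enough to show the message layout:
# summary line, blank line, then the closing line on its own.
git commit -q --allow-empty -m "TIKA-1446: hypothetical merge of the CHM parser fixes

This closes #20"

# Confirm the closing line is present in the stored commit message.
closing_line=$(git log -1 --format=%B | grep "This closes #20")
echo "$closing_line"
```

In practice the line would go into the message of the commit that actually lands the change on master/trunk rather than into an empty commit.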


commit 58a465391d128c2aa9b11c9f5a986f6bcd28abca
Author: Chris Mattmann 
Date:   2014-07-28T00:45:03Z

[maven-release-plugin]  copy for tag 1.6

git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1613865 
13f79535-47bb-0310-9956-ffa450edef68

commit c98da37a4b83bdad6aa86ccc6aaec6b0d647c59a
Author: David Meikle 
Date:   2014-07-31T18:29:32Z

TIKA-1381 - Added Lingo24Translator implementation

git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1614950 
13f79535-47bb-0310-9956-ffa450edef68

commit d831ac12be2fc3303f5dab45b00b53b53b6a67e9
Author: Nick Burch 
Date:   2014-08-04T15:41:54Z

Create a branch for 1.6, to backport the POI upgrade to

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615619 
13f79535-47bb-0310-9956-ffa450edef68

commit e2d10e633d38c52b0f490a09043fb43176d26fbe
Author: Nick Burch 
Date:   2014-08-04T15:54:55Z

Merge the POI 3.11 beta 1 upgrade from Trunk to the 1.6 branch (TIKA-1380), 
ready for inclusion in rc2

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615636 
13f79535-47bb-0310-9956-ffa450edef68

commit a5942c11cd6a3e75304ce0267c1fc4b5e979c66c
Author: Tim Allison 
Date:   2014-08-04T16:51:40Z

TIKA-1317 extract contents from SDTs within cells in tables in XWPF (docx) 
files

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615675 
13f79535-47bb-0310-9956-ffa450edef68

commit 68f9a11926946bdea29ab757a8275149d8d057e9
Author: Nick Burch 
Date:   2014-08-04T21:27:41Z

Merge r1615631 from Trunk to 1.6 - Upgrade the Commons Codec version to 
match that in Apache POI, upgraded in TIKA-1380

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615800 
13f79535-47bb-0310-9956-ffa450edef68

commit ee988d4daa5b451a51b799b0ec790b88ca7fc111
Author: Tim Allison 
Date:   2014-08-05T13:03:05Z

TIKA-1275 upgrade Commons Compress to 1.8.1; updated CHANGES.txt, too

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615923 
13f79535-47bb-0310-9956-ffa450edef68

commit 9d27e1379fba530def45b470a92ce5052078021c
Author: Tim Allison 
Date:   2014-08-05T18:17:39Z

TIKA-1380; fix for null ole.getLabel()

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615970 
13f79535-47bb-0310-9956-ffa450edef68

commit 2ee02d85aa703e65607a707ee171c166017916ab
Author: Nick Burch 
Date:   2014-08-20T14:16:06Z

Merge r1619108 from Trunk to the 1.6 branch ready for release - Bump the 
POI dependency to 3.11-beta2, and remove the Geronimo stax one which is no 
longer required by anything now we are on Java 1.6 TIKA-1380

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1619109 
13f79535-47bb-0310-9956-ffa450edef68

commit a3eac367cd560c20da4231f45eb18d638d4f91a1
Author: Chris Mattmann 
Date:   2014-08-31T19:36:36Z

Bring 1.6 branch up to date with trunk in prep for 1.6 RC #2.

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621623 
13f79535-47bb-0310-9956-ffa450edef68

commit dd2a2b5bad7e363c5ab74db69b89b6083f6fc8ff
Author: Chris Mattmann 
Date:   2014-08-31T19:44:11Z

[maven-release-plugin] prepare release 1.6-rc2

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621627 
13f79535-47bb-0310-9956-ffa450edef68

commit 5f9845759fb7839298ac5ee3abb11667035faac3
Author: Chris Mattmann 
Date:   2014-08-31T19:44:17Z

[maven-release-plugin] prepare for next development iteration

git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621629 
13f79535-47bb-0310-9956-ffa450edef68






[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181483#comment-14181483
 ] 

ASF GitHub Bot commented on TIKA-1446:
--

GitHub user thaichat04 opened a pull request:

https://github.com/apache/tika/pull/20

TIKA-1446

TIKA-1430, TIKA-1446, TIKA-1447, TIKA-1448: CHM Parser improvement

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/tika 1.6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tika/pull/20.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20






> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs garbled or
> empty text for that file.
> I have fixed it myself, corrected decompressA