[jira] [Created] (TIKA-1456) Visual Sentiment API parser
Chris A. Mattmann created TIKA-1456: --- Summary: Visual Sentiment API parser Key: TIKA-1456 URL: https://issues.apache.org/jira/browse/TIKA-1456 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 Integrate the Visual Sentibank API as a parser for images. We can use Aperture from CMU, it's released under the MIT license: https://github.com/d8w/aperture -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1451) Add Recursive Metadata Parser Wrapper output to tika-app and gui
[ https://issues.apache.org/jira/browse/TIKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182476#comment-14182476 ] Chris A. Mattmann commented on TIKA-1451: - great work Tim! > Add Recursive Metadata Parser Wrapper output to tika-app and gui > > > Key: TIKA-1451 > URL: https://issues.apache.org/jira/browse/TIKA-1451 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 1.7 > > Attachments: integrate_recursive_metadata_wrapper.patch > > > It would be helpful to expose the output of the recursive metadata parser > wrapper in the gui and in the command line for tika-app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: import (re)ordering?
Hey Tim, No big objections from me, but it will dilute things so glad we have it noted if it happens. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: , "Timothy B." Reply-To: "dev@tika.apache.org" Date: Tuesday, October 21, 2014 at 1:59 PM To: "dev@tika.apache.org" Subject: import (re)ordering? >All, > I have Intellij set to order imports by javax, java, then other. I >think this is the most common pattern in Tika. Is it ok if I make these >(meaningless/formatting) changes when I commit other changes? > Thank you. > > Best, > > Tim
[jira] [Commented] (TIKA-443) Geographic Information Parser
[ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182462#comment-14182462 ] Chris A. Mattmann commented on TIKA-443: Guys, I wonder if we should (now 4 years later) standardize on Apache SIS (http://sis.apache.org/) and incorporate its support for parsing ISO19115 metadata. It seems to have the same types of properties that FDO metadata XML has. I'm going to give a whirl at creating a GeoParser that extracts information from ISO 19115 XML files. [~desruisseaux] FYI [~adamestrada] FYI. > Geographic Information Parser > - > > Key: TIKA-443 > URL: https://issues.apache.org/jira/browse/TIKA-443 > Project: Tika > Issue Type: New Feature > Components: parser >Reporter: Arturo Beltran >Assignee: Chris A. Mattmann > Attachments: getFDOMetadata.xml > > > I'm working in the automatic description of geospatial resources, and I think > that might be interesting to incorporate new parser/s to Tika in order to > manage and describe some geo-formats. These geo-formats include files, > services and databases. > If anyone is interested in this issue or want to collaborate do not hesitate > to contact me. Any help is welcome. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (TIKA-443) Geographic Information Parser
[ https://issues.apache.org/jira/browse/TIKA-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned TIKA-443: Assignee: Chris A. Mattmann
[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182204#comment-14182204 ] Lewis John McGibbney commented on TIKA-1423: Output looks fantastic. Can you please run {code} mvn dependency:analyze-report {code} and see if you can resolve the slf4j-simple conflict between tika-app/pom.xml and tika-parsers/pom.xml when you add the netCDF library? It is probably worth trying to exclude the logging dependency from the netCDF dependency, similar to what is done here: https://github.com/apache/gora/blob/master/gora-accumulo/pom.xml#L144 hth, great work. Lewis > Build a parser to extract data from GRIB formats > > > Key: TIKA-1423 > URL: https://issues.apache.org/jira/browse/TIKA-1423 > Project: Tika > Issue Type: New Feature > Components: metadata, mime, parser >Affects Versions: 1.6 >Reporter: Vineet Ghatge >Assignee: Vineet Ghatge >Priority: Critical > Labels: features, newbie > Fix For: 1.7 > > Attachments: GribParser.java, > NLDAS_FORA0125_H.A20130112.1200.002.grb, fileName.html, > gdas1.forecmwf.2014062612.grib2 > > > Arctic dataset contains a MIME format called GRIB - General > Regularlydistributed information in Binary form > http://en.wikipedia.org/wiki/GRIB . GRIB is a well known data format which is > a concise data format used in meteorology to store historical and > weather data. There are 2 different types of the format GRIB 0, GRIB 2. > The focus will be on GRIB 2 which is the most prevalent. Each GRIB record > intended for either transmission or storage contains a single parameter with > values located at an array of grid points, or represented as a set of > spectral coefficients, for a single level (or layer), encoded as a continuous > bit stream. Logical divisions of the record are designated as "sections", > each of which provides control information and/or data. 
A GRIB record > consists of six sections, two of which are optional: > > (0) Indicator Section > (1) Product Definition Section (PDS) > (2) Grid Description Section (GDS) optional > (3) Bit Map Section (BMS) optional > (4) Binary Data Section (BDS) > (5) '' (ASCII Characters) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
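The Indicator Section listed above can be probed with a few lines of Java. This is only a sketch, not the attached GribParser.java: it assumes the layout from the WMO GRIB specification (which this thread does not spell out), where octets 1-4 hold the ASCII marker "GRIB" and octet 8 holds the edition number. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class GribIndicator {
    /**
     * Returns the GRIB edition number read from the first 8 octets of a
     * message, or -1 if the data does not start with the ASCII marker "GRIB".
     * Assumption: edition number sits in octet 8 (index 7), per the WMO spec.
     */
    static int edition(byte[] header) {
        if (header == null || header.length < 8) {
            return -1;
        }
        String magic = new String(header, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(magic)) {
            return -1;
        }
        // Mask to treat the byte as unsigned.
        return header[7] & 0xFF;
    }

    public static void main(String[] args) {
        // Hand-built 8-octet header for illustration, not a real GRIB file.
        byte[] sample = {'G', 'R', 'I', 'B', 0, 0, 0, 2};
        System.out.println("edition = " + edition(sample));
    }
}
```

A check like this would only help with MIME detection; decoding the remaining sections is what the netCDF library is being pulled in for.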
[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182208#comment-14182208 ] Lewis John McGibbney commented on TIKA-1423: p.s. do you have a patch against Tika trunk so that we can test? Thanks
[jira] [Updated] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vineet Ghatge updated TIKA-1423: Attachment: fileName.html Output in HTML
[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182194#comment-14182194 ] Vineet Ghatge commented on TIKA-1423: Consumed the parser to get data in HTML format, and it works. I have attached the output to the issue. There is one problem: the netcdfAll-4.5 jar keeps displaying these warnings:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/vineghlinux/Desktop/CoursesFall2014/CSCI572/DR/netcdfAll-4.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/vineghlinux/Desktop/CoursesFall2014/CSCI572/DR/tika-app-1.7-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/vineghlinux/Desktop/CoursesFall2014/CSCI572/DR/slf4j-simple-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
Tried to change Tika's pom.xml, but that did not work either. Trying to remedy this based on http://www.slf4j.org/codes.html#multiple_bindings and http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/reference/JarDependencies.html
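To see which jars contribute a binding before touching any pom.xml, the class path can be queried directly. The sketch below relies on the same resource name that appears in the warnings above (org/slf4j/impl/StaticLoggerBinder.class); the class name Slf4jBindingCheck is hypothetical, and this is a diagnostic aid only, not a fix for the conflict.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class Slf4jBindingCheck {
    // Resource path reported in the "Found binding in" warnings.
    static final String BINDER = "org/slf4j/impl/StaticLoggerBinder.class";

    /** Lists every copy of the binder class visible to the given class loader. */
    static List<URL> findBindings(ClassLoader loader) {
        try {
            return new ArrayList<>(Collections.list(loader.getResources(BINDER)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> bindings = findBindings(ClassLoader.getSystemClassLoader());
        System.out.println(bindings.size() + " SLF4J binding(s) on the class path");
        for (URL url : bindings) {
            System.out.println("  " + url);
        }
    }
}
```

Run with the same class path as the failing command; more than one URL in the list corresponds to the "multiple SLF4J bindings" warning.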
[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182047#comment-14182047 ] Tilman Hausherr commented on TIKA-1442: A few files have less metadata than before: 019/019837.pdf 138/138155.pdf 221/221001.pdf 224/224644.pdf 308/308233.pdf 469/469387.pdf 490/490345.pdf 490/490344.pdf 597/597244.pdf 643/643910.pdf Could you tell me what you get in Tika for the first one? > Upgrade to PDFBox 1.8.8 > --- > > Key: TIKA-1442 > URL: https://issues.apache.org/jira/browse/TIKA-1442 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison > Fix For: 1.7 > > Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip > > > Given the regressions we identified in PDFBox 1.8.7, we should upgrade to > 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika > 1.7. Let's use this issue to carry on the discussion of regression testing > (if any further discussion is necessary) or any other prep that needs to > happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip I'm done now; the result is two new issues, PDFBOX-2448 and PDFBOX-2449. However, PDFBOX-2448 isn't relevant to 1.8.8. Many changes are positive ones: files that no longer throw an exception, or files that have better text extraction.
[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181813#comment-14181813 ] Tilman Hausherr commented on TIKA-1442: The directory structure isn't a problem for me; I've downloaded all PDF files locally into a flat directory. Currently I'm still checking the files by hand, but I'll probably write a small script to extract and render with the different versions.
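The comparison step of such a script could look roughly like the following. This is a sketch under assumptions, not Tilman's script: it supposes the text extracted by each PDFBox version has already been collected into a map from file name to extracted text, and the class name ExtractionDiff is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

class ExtractionDiff {
    /**
     * Returns, in sorted order, the names of files whose extracted text
     * differs between two runs, including files present in only one run.
     */
    static List<String> changedFiles(Map<String, String> before, Map<String, String> after) {
        TreeSet<String> names = new TreeSet<>(before.keySet());
        names.addAll(after.keySet());
        List<String> changed = new ArrayList<>();
        for (String name : names) {
            // A missing entry yields null, which never equals real text.
            if (!Objects.equals(before.get(name), after.get(name))) {
                changed.add(name);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Placeholder strings, not real extraction results.
        Map<String, String> v186 = Map.of("019837.pdf", "old text", "138155.pdf", "same");
        Map<String, String> v188 = Map.of("019837.pdf", "new text", "138155.pdf", "same");
        System.out.println(changedFiles(v186, v188));
    }
}
```

Only the files reported here would then need checking by hand.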
[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181779#comment-14181779 ] Tilman Hausherr edited comment on TIKA-1442 at 10/23/14 7:31 PM: Thanks! I'm slowly starting, and here's the first thing: 892/892848.pdf, this file is encrypted and has no text extract permission. But the line in the excel file does have tokens, which is, uh, surprising. With the "old" parser, use this code, because files are sometimes encrypted with the empty password:
{code}
if( document.isEncrypted() )
{
    try
    {
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
        document.openProtection(sdm);
    }
    catch( InvalidPasswordException e )
    {
        System.err.println( "Error: The document is encrypted." );
    }
}
{code}
The nonSeq parser does this automatically. Same for 892/892859.pdf
[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181799#comment-14181799 ] Tim Allison commented on TIKA-1442: If it is any consolation, the Cyrillic is totally hosed. :) I'm hoping to get a basic file server set up (thanks to Rackspace) so that I can create hyperlinks for the source doc and for the extracted text/metadata so that you don't have to go hunting through the directory structure, and so that you can see what's extracted without running the app yourself. That is probably a few weeks off though.
[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181779#comment-14181779 ] Tilman Hausherr commented on TIKA-1442: Thanks! I'm slowly starting, and here's the first thing: 892/892848.pdf, this file is encrypted and has no text extract permission. But the line in the excel file does have tokens, which is, uh, surprising. With the "old" parser, use this code, because files are sometimes encrypted with the empty password:
{code}
if( document.isEncrypted() )
{
    try
    {
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
        document.openProtection(sdm);
    }
    catch( InvalidPasswordException e )
    {
        System.err.println( "Error: The document is encrypted." );
    }
}
{code}
The nonSeq parser does this automatically.
[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser
[ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181630#comment-14181630 ] Andreas Lehmkühler commented on TIKA-1098: I've finally solved PDFBOX-1273. The fix will be part of the upcoming versions 1.8.8 and 2.0.0. Thanks for your patience :-)
> not able to parse pdfs/docs/ppts using 1.1 tika parser
>
> Key: TIKA-1098
> URL: https://issues.apache.org/jira/browse/TIKA-1098
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.1
> Environment: linux redhat
> Reporter: Qian Diao
> Attachments: url_1763_approx-alg-notes.pdf
>
> Hi,
> I got some parsing problems when using Tika 1.1 for the attached pdf file.
> my code (Test.java):
> import java.io.File;
> import java.io.InputStream;
> import java.io.FileInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.parser.html.BoilerpipeContentHandler;
> import org.apache.tika.sax.BodyContentHandler;
> import org.apache.tika.parser.html.HtmlParser;
> import de.l3s.boilerpipe.extractors.ArticleExtractor;
> public class Test {
>     private static final String validBoilerpipeFilenameRegEx = ".*(\\.)(htm|html|shtml|php|asp|aspx)$";
>     public String parseFile(File inFile) {
>         if (inFile == null || !inFile.isFile() || !inFile.canRead())
>             return null;
>         InputStream is = null;
>         String outputText = "";
>         try {
>             // Open input stream
>             is = new FileInputStream(inFile);
>             // Prepare parser
>             BodyContentHandler contenthandler = new BodyContentHandler(-1);
>             Metadata metadata = new Metadata();
>             metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
>             ParseContext pc = new ParseContext();
>             // Call parse with boilerpipe if valid boilerpipe extension; otherwise, call regular parse.
>             if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
>                 Parser parser = new AutoDetectParser();
>                 parser.parse(is, contenthandler, metadata, pc);
>             }
>             else {
>                 Parser parser = new HtmlParser();
>                 BoilerpipeContentHandler bh = new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
>                 parser.parse(is, bh, metadata, pc);
>             }
>             // Prepare text for write
>             outputText = contenthandler.toString();
>         } catch (Exception e) {
>             System.out.println(e);
>             return null;
>         } finally {
>             try {
>                 if (is != null)
>                     is.close();
>             } catch (Exception e) {
>                 // Ignore failures while closing the stream
>             }
>         }
>         return outputText;
>     }
> }
> Output:
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> url_1763_approx-alg-notes.pdf
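The branch that decides between the two parsers depends only on the file name, so it can be exercised without Tika on the class path. The sketch below reuses the regular expression from the report; the class name BoilerpipeFilter and the sample names other than the attached PDF are hypothetical.

```java
import java.util.regex.Pattern;

class BoilerpipeFilter {
    // Same expression as validBoilerpipeFilenameRegEx in the reported Test.java.
    static final Pattern HTML_LIKE =
            Pattern.compile(".*(\\.)(htm|html|shtml|php|asp|aspx)$");

    /** True when the reported code would take the HtmlParser/boilerpipe branch. */
    static boolean usesBoilerpipe(String fileName) {
        return HTML_LIKE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"url_1763_approx-alg-notes.pdf", "index.html", "INDEX.HTML"};
        for (String name : names) {
            System.out.println(name + " -> " + usesBoilerpipe(name));
        }
    }
}
```

The attached PDF does not match, so it goes through AutoDetectParser, which fits the failure surfacing as a PDF extraction error. The match is case-sensitive, so an upper-case extension such as INDEX.HTML would also skip the boilerpipe branch.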
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181530#comment-14181530 ] Hong-Thai Nguyen commented on TIKA-1446: Thanks a lot [~binhawking]. I've had a quick look at your fix. Indeed, there are quite a lot of changes. After cleaning up and fixing some minor things, I broke the CHM tests. We really appreciate your contribution, and we should continue and finalize it. I've created a new pull request based on a branch with your fix plus my cleanup: https://github.com/apache/tika/pull/21 https://github.com/thaichat04/tika.git, branch TIKA-1446 > CHM parser : wrong decompression of aligned blocks > -- > > Key: TIKA-1446 > URL: https://issues.apache.org/jira/browse/TIKA-1446 > Project: Tika > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Bin Hawking >Priority: Critical > Attachments: chm.zip > > > If an embedded file contains aligned blocks, the parser outputs chaotic text > or empty text as to this file. > I have fixed it myself, corrected decompressAlignedBlock() and its > preparation methods. Mostly this bug is due to misusing main tree/align > tree/length tree. And some tree is built wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] tika pull request: CHM Parser Improvement
GitHub user thaichat04 opened a pull request: https://github.com/apache/tika/pull/21 CHM Parser Improvement This pull request to improve Tika CHM Parser. You can merge this pull request into a Git repository by running: $ git pull https://github.com/thaichat04/tika TIKA-1446 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/tika/pull/21.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21 commit ac354e4fe22daf60326d240190c5da32cded6443 Author: hong-thai.nguyen Date: 2014-10-23T16:12:10Z TIKA-1446 - Apply fix of [~binhawking] and some cleanup --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181518#comment-14181518 ] ASF GitHub Bot commented on TIKA-1446: Github user thaichat04 closed the pull request at: https://github.com/apache/tika/pull/20
[GitHub] tika pull request: TIKA-1446
Github user thaichat04 closed the pull request at: https://github.com/apache/tika/pull/20
[jira] [Resolved] (TIKA-1455) Upgrade GSON dependency
[ https://issues.apache.org/jira/browse/TIKA-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1455. --- Resolution: Fixed r1633850 > Upgrade GSON dependency > --- > > Key: TIKA-1455 > URL: https://issues.apache.org/jira/browse/TIKA-1455 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1455) Upgrade GSON dependency
Tim Allison created TIKA-1455: - Summary: Upgrade GSON dependency Key: TIKA-1455 URL: https://issues.apache.org/jira/browse/TIKA-1455 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] tika pull request: TIKA-1446
GitHub user thaichat04 opened a pull request:

    https://github.com/apache/tika/pull/20

    TIKA-1446

    TIKA-1430, TIKA-1446, TIKA-1447, TIKA-1448: CHM Parser improvement

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/tika 1.6

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/20.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20

commit 58a465391d128c2aa9b11c9f5a986f6bcd28abca
Author: Chris Mattmann
Date: 2014-07-28T00:45:03Z

    [maven-release-plugin] copy for tag 1.6

    git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1613865 13f79535-47bb-0310-9956-ffa450edef68

commit c98da37a4b83bdad6aa86ccc6aaec6b0d647c59a
Author: David Meikle
Date: 2014-07-31T18:29:32Z

    TIKA-1381 - Added Lingo24Translator implementation

    git-svn-id: https://svn.apache.org/repos/asf/tika/tags/1.6@1614950 13f79535-47bb-0310-9956-ffa450edef68

commit d831ac12be2fc3303f5dab45b00b53b53b6a67e9
Author: Nick Burch
Date: 2014-08-04T15:41:54Z

    Create a branch for 1.6, to backport the POI upgrade to

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615619 13f79535-47bb-0310-9956-ffa450edef68

commit e2d10e633d38c52b0f490a09043fb43176d26fbe
Author: Nick Burch
Date: 2014-08-04T15:54:55Z

    Merge the POI 3.11 beta 1 upgrade from Trunk to the 1.6 branch (TIKA-1380), ready for inclusion in rc2

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615636 13f79535-47bb-0310-9956-ffa450edef68

commit a5942c11cd6a3e75304ce0267c1fc4b5e979c66c
Author: Tim Allison
Date: 2014-08-04T16:51:40Z

    TIKA-1317 extract contents from SDTs within cells in tables in XWPF (docx) files

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615675 13f79535-47bb-0310-9956-ffa450edef68

commit 68f9a11926946bdea29ab757a8275149d8d057e9
Author: Nick Burch
Date: 2014-08-04T21:27:41Z

    Merge r1615631 from Trunk to 1.6 - Upgrade the Commons Codec version to match that in Apache POI, upgraded in TIKA-1380

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615800 13f79535-47bb-0310-9956-ffa450edef68

commit ee988d4daa5b451a51b799b0ec790b88ca7fc111
Author: Tim Allison
Date: 2014-08-05T13:03:05Z

    TIKA-1275 upgrade Commons Compress to 1.8.1; updated CHANGES.txt, too

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615923 13f79535-47bb-0310-9956-ffa450edef68

commit 9d27e1379fba530def45b470a92ce5052078021c
Author: Tim Allison
Date: 2014-08-05T18:17:39Z

    TIKA-1380; fix for null ole.getLabel()

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1615970 13f79535-47bb-0310-9956-ffa450edef68

commit 2ee02d85aa703e65607a707ee171c166017916ab
Author: Nick Burch
Date: 2014-08-20T14:16:06Z

    Merge r1619108 from Trunk to the 1.6 branch ready for release - Bump the POI dependency to 3.11-beta2, and remove the Geronimo stax one which is no longer required by anything now we are on Java 1.6 TIKA-1380

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1619109 13f79535-47bb-0310-9956-ffa450edef68

commit a3eac367cd560c20da4231f45eb18d638d4f91a1
Author: Chris Mattmann
Date: 2014-08-31T19:36:36Z

    Bring 1.6 branch up to date with trunk in prep for 1.6 RC #2.

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621623 13f79535-47bb-0310-9956-ffa450edef68

commit dd2a2b5bad7e363c5ab74db69b89b6083f6fc8ff
Author: Chris Mattmann
Date: 2014-08-31T19:44:11Z

    [maven-release-plugin] prepare release 1.6-rc2

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621627 13f79535-47bb-0310-9956-ffa450edef68

commit 5f9845759fb7839298ac5ee3abb11667035faac3
Author: Chris Mattmann
Date: 2014-08-31T19:44:17Z

    [maven-release-plugin] prepare for next development iteration

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/1.6@1621629 13f79535-47bb-0310-9956-ffa450edef68

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181483#comment-14181483 ]

ASF GitHub Bot commented on TIKA-1446:
--

GitHub user thaichat04 opened a pull request:

    https://github.com/apache/tika/pull/20

    TIKA-1446

    TIKA-1430, TIKA-1446, TIKA-1447, TIKA-1448: CHM Parser improvement

> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Bin Hawking
> Priority: Critical
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs chaotic text
> or empty text as to this file.
> I have fixed it myself, corrected decompressA