Trying To Index A Visio document
Hi, I have tried to index a document with the .vsdx extension, but it fails with an error. I can index PDF, Word, and plain-text documents, but .vsdx does not work. Why?
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157068#comment-14157068 ]

Tyler Palsulich commented on TIKA-1422:
---
[~chrismattmann], I believe that patch fails when Tesseract is not installed. When Tesseract is not installed, the ContentHandler in question is only invoked 4 times. But, when Tesseract is installed, it's invoked 5 times.

My first thought was that the Tesseract Parser always invoked the ContentHandler, even if no OCR text was found. But, there *is* OCR text to be found in this test -- several "Happy New Year!" messages.

So, there are a few ways I can see fixing this test:
1. Just remove the offending line in the test.
2. Allow either 4 or 5 invocations of the handler.
3. Check if Tesseract is installed, checking for 4 or 5 invocations based on the result.
4. Update the image used in the test to have no text and update the TesseractParser to only invoke the handler when it finds content.

I would like the third option the most, but I don't like the idea of checking for an external dependency in an otherwise unrelated test. On the other hand, it has the advantage of widening the scope of the test as little as possible. I'd like the fourth option the most, but it requires some funky logic in TesseractOCRParser.

Thoughts?

> org.apache.tika.parser.mail.RFC822ParserTest fails
> --
>
> Key: TIKA-1422
> URL: https://issues.apache.org/jira/browse/TIKA-1422
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Chris A. Mattmann
> Fix For: 1.7
>
> Attachments: TIKA-1422.Mattmann.100114.patch.txt
>
>
> I'm seeing test failures from:
> {noformat}
> Results :
> Failed tests: testMultipart(org.apache.tika.parser.mail.RFC822ParserTest):
> (..)
> Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
> {noformat}
> CentOS6 VM image, running:
> {noformat}
> [mattmann@memex tika]$ java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
> [mattmann@memex tika]$ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
> Maven home: /usr/share/apache-maven
> Java version: 1.7.0_65, vendor: Oracle Corporation
> Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: "amd64", family: "unix"
> [mattmann@memex tika]$
> {noformat}
> Here are the surefire reports - no clue what's up here:
> {noformat}
> [mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
> ---
> Test set: org.apache.tika.parser.mail.RFC822ParserTest
> ---
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< FAILURE!
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.152 sec <<< FAILURE!
> org.mockito.exceptions.verification.TooManyActualInvocations:
> xHTMLContentHandler.startElement(
> "http://www.w3.org/1999/xhtml",
> "div",
> "div",
> isA(org.xml.sax.Attributes)
> );
> Wanted 4 times but was 5
> at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
> Caused by: org.mockito.exceptions.cause.UndesiredInvocation:
> Undesired invocation:
> at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
> at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
> at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
> at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
> at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
> at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
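Options 2 and 3 from the TIKA-1422 comment could look roughly like the sketch below. It is not taken from the Tika code base: `TesseractCheck`, `isCommandAvailable`, `expectedDivCount`, and `isAcceptableDivCount` are hypothetical names, and probing the binary with `--version` is only one assumed way to detect Tesseract. The counts 4 and 5 are the ones reported in the comment.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class TesseractCheck {

    // Hypothetical helper (not part of Tika): true if the command can be
    // started and exits within five seconds.
    static boolean isCommandAvailable(String command) {
        try {
            Process process = new ProcessBuilder(command, "--version")
                    .redirectErrorStream(true)
                    .start();
            boolean finished = process.waitFor(5, TimeUnit.SECONDS);
            if (!finished) {
                process.destroy();
            }
            return finished;
        } catch (IOException e) {
            // Thrown when the executable cannot be found or started.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Option 3: pick the expected number of "div" startElement calls
    // based on whether Tesseract is installed (5 with OCR, 4 without).
    static int expectedDivCount(boolean tesseractInstalled) {
        return tesseractInstalled ? 5 : 4;
    }

    // Option 2: accept either count without probing the environment.
    static boolean isAcceptableDivCount(int actual) {
        return actual == 4 || actual == 5;
    }

    public static void main(String[] args) {
        boolean installed = isCommandAvailable("tesseract");
        System.out.println("tesseract available: " + installed);
        System.out.println("expected div count: " + expectedDivCount(installed));
    }
}
```

In the Mockito-based test, the value from `expectedDivCount` could be passed to `times(...)` in place of the hard-coded 4. Whether that is preferable to option 4 is the open question in the comment.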
[jira] [Updated] (TIKA-1434) Plain text file reported as binary
[ https://issues.apache.org/jira/browse/TIKA-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Quaranta updated TIKA-1434:
-
Attachment: ascii_file.txt

> Plain text file reported as binary
> --
>
> Key: TIKA-1434
> URL: https://issues.apache.org/jira/browse/TIKA-1434
> Project: Tika
> Issue Type: Bug
> Components: detector, mime
> Affects Versions: 1.6
> Reporter: Marco Quaranta
> Priority: Minor
> Attachments: ascii_file.txt
>
>
> MIME type detection of the attached file, an ASCII file, reports application/octet-stream. The incorrect result is caused by extended-ASCII characters in the file; calls to TextStatistics.isMostlyAscii() return false. We could try calling TextDetector.detect() twice: if the first call fails, we can try to convert the bytes to UTF-8 and call it again. What do you think?

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1434) Plain text file reported as binary
Marco Quaranta created TIKA-1434:

Summary: Plain text file reported as binary
Key: TIKA-1434
URL: https://issues.apache.org/jira/browse/TIKA-1434
Project: Tika
Issue Type: Bug
Components: detector, mime
Affects Versions: 1.6
Reporter: Marco Quaranta
Priority: Minor

MIME type detection of the attached file, an ASCII file, reports application/octet-stream. The incorrect result is caused by extended-ASCII characters in the file; calls to TextStatistics.isMostlyAscii() return false. We could try calling TextDetector.detect() twice: if the first call fails, we can try to convert the bytes to UTF-8 and call it again. What do you think?
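One possible reading of the two-pass idea in TIKA-1434 is sketched below using only the JDK. It does not use or reproduce Tika's TextDetector or TextStatistics: `Utf8Fallback`, `isPureAscii`, `isValidUtf8`, and `guessType` are hypothetical names, and treating "the bytes decode cleanly as UTF-8" as the second check is an assumption about what the proposal means.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Fallback {

    // First pass: every byte is in the 7-bit ASCII range.
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) { // bytes 0x80..0xFF are negative in Java
                return false;
            }
        }
        return true;
    }

    // Second pass: strict UTF-8 decoding that reports errors instead of
    // silently replacing malformed sequences.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical two-step detection: fall back to the UTF-8 check only
    // when the ASCII check fails.
    static String guessType(byte[] data) {
        if (isPureAscii(data) || isValidUtf8(data)) {
            return "text/plain";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(guessType(utf8));
    }
}
```

A detector that only sees a prefix of the file would also have to tolerate a multi-byte sequence cut off at the end of the buffer, which this sketch treats as invalid.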
[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156262#comment-14156262 ]

Vineet Ghatge commented on TIKA-1423:
-
UPDATE: I picked this up from a conversation between Annie and Christian Ward from NetCDF - http://www.unidata.ucar.edu/mailing_lists/archives/netcdf-java/2014/msg00091.html. A sample was provided there, which I ran, and it prints the GRIB2 data:

    import java.io.File;
    import java.io.IOException;

    import ucar.nc2.NetcdfFile;
    import ucar.nc2.dataset.NetcdfDataset;

    public class Foo {
        public static void main(String[] args) throws IOException {
            File gribFile = new File("gdas1.forecmwf.2014062612.grib2");
            // Open the GRIB2 file through the NetCDF-Java API.
            NetcdfFile ncFile = NetcdfDataset.openFile(gribFile.getAbsolutePath(), null);
            System.out.println("Success!");
            try {
                // Print the structure of the opened file.
                System.out.println(ncFile.toString());
            } finally {
                ncFile.close();
            }
        }
    }

This parses and loads the GRIB2 format. I am currently working on getting Annie's code and changing the class path references.

> Build a parser to extract data from GRIB formats
> --
>
> Key: TIKA-1423
> URL: https://issues.apache.org/jira/browse/TIKA-1423
> Project: Tika
> Issue Type: New Feature
> Components: metadata, mime, parser
> Affects Versions: 1.6
> Reporter: Vineet Ghatge
> Priority: Critical
> Labels: features, newbie
> Fix For: 1.7
>
> Attachments: GribParser.java, NLDAS_FORA0125_H.A20130112.1200.002.grb, gdas1.forecmwf.2014062612.grib2
>
>
> The Arctic dataset contains a MIME format called GRIB - General Regularly-distributed Information in Binary form http://en.wikipedia.org/wiki/GRIB . GRIB is a well-known, concise data format used in meteorology to store historical and forecast weather data. There are 2 different types of the format: GRIB 0 and GRIB 2. The focus will be on GRIB 2, which is the most prevalent. Each GRIB record intended for either transmission or storage contains a single parameter with values located at an array of grid points, or represented as a set of spectral coefficients, for a single level (or layer), encoded as a continuous bit stream. Logical divisions of the record are designated as "sections", each of which provides control information and/or data. A GRIB record consists of six sections, two of which are optional:
>
> (0) Indicator Section
> (1) Product Definition Section (PDS)
> (2) Grid Description Section (GDS) optional
> (3) Bit Map Section (BMS) optional
> (4) Binary Data Section (BDS)
> (5) '' (ASCII Characters)
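Since TIKA-1423 also lists the mime component, a check of the Indicator Section (section 0 in the list above) might be a starting point for detection. The sketch below rests on an assumption taken from the WMO GRIB layout rather than from the issue text: for editions 1 and 2 the record starts with the four ASCII characters "GRIB" and octet 8 holds the edition number. `GribIndicator` and `edition` are hypothetical names, not part of Tika or NetCDF-Java.

```java
import java.nio.charset.StandardCharsets;

public class GribIndicator {

    // Returns the GRIB edition number read from octet 8, or -1 if the
    // bytes do not start with an Indicator Section of at least 8 octets.
    static int edition(byte[] header) {
        if (header == null || header.length < 8) {
            return -1;
        }
        String magic = new String(header, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(magic)) {
            return -1;
        }
        return header[7] & 0xFF; // octet 8 is index 7; mask to unsigned
    }

    public static void main(String[] args) {
        // Hand-built 8-octet header for illustration, not a real file.
        byte[] sample = {'G', 'R', 'I', 'B', 0, 0, 0, 2};
        System.out.println("edition: " + edition(sample));
    }
}
```

A real detector would read these octets from the start of the stream and would still need the NetCDF-Java call shown in the comment to extract any data.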
[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156146#comment-14156146 ]

Vineet Ghatge commented on TIKA-1423:
-
Thanks [~lewismc]