Trying To Index A Visio document

2014-10-02 Thread mdemarco123
Hi, I have tried to index a document with the .vsdx extension, but it is
erroring. I can index PDF, Word, and plain text documents, but .vsdx is not
working. Why?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Trying-To-Index-A-Visio-document-tp4162413.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.


[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-02 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157068#comment-14157068
 ] 

Tyler Palsulich commented on TIKA-1422:
---

[~chrismattmann], I believe that patch fails when Tesseract is not installed. 
When Tesseract is not installed, the ContentHandler in question is only invoked 
4 times. But, when Tesseract is installed, it's invoked 5 times.

My first thought was that the Tesseract Parser always invoked the 
ContentHandler, even if no OCR text was found. But, there *is* OCR text to be 
found in this test -- several "Happy New Year!" messages. So, there are a few 
ways I can see fixing this test:

1. Just remove the offending line in the test.
2. Allow either 4 or 5 invocations of the handler.
3. Check if Tesseract is installed, checking for 4 or 5 invocations based on 
the result.
4. Update the image used in the test to have no text and update the 
TesseractOCRParser to only invoke the handler when it finds content.

The third option appeals to me, but I don't like the idea of checking 
for an external dependency in an otherwise unrelated test. On the other hand, 
it has the advantage of widening the scope of the test as little as possible.

I'd like the fourth option the most, but it requires some funky logic in 
TesseractOCRParser.

Thoughts?
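
For discussion, the third option might look roughly like the sketch below. It is only an illustration under assumptions: the class and method names (TesseractExpectation, isOnPath, expectedDivCount) are hypothetical and not part of Tika, the executable is assumed to be called "tesseract" (on Windows it would be "tesseract.exe"), and the 4/5 counts are simply the ones observed in this issue.

```java
import java.io.File;

// Hypothetical helper, not part of Tika: picks the expected number of
// startElement("div") calls depending on whether Tesseract can be found.
class TesseractExpectation {

    /** Returns true if an executable file with the given name exists in a directory of the PATH string. */
    static boolean isOnPath(String executable, String path) {
        if (path == null || path.isEmpty()) {
            return false;
        }
        for (String dir : path.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, executable);
            if (candidate.isFile() && candidate.canExecute()) {
                return true;
            }
        }
        return false;
    }

    /** Counts taken from this issue: 5 invocations with Tesseract, 4 without. */
    static int expectedDivCount(boolean tesseractInstalled) {
        return tesseractInstalled ? 5 : 4;
    }

    public static void main(String[] args) {
        boolean installed = isOnPath("tesseract", System.getenv("PATH"));
        System.out.println("tesseract found: " + installed
                + ", expected div count: " + expectedDivCount(installed));
    }
}
```

In the test itself, the computed count would replace the hard-coded 4 passed to Mockito's times(...). The downside noted above still applies: the mail parser test would then depend on probing for an external tool.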



> org.apache.tika.parser.mail.RFC822ParserTest fails
> --
>
> Key: TIKA-1422
> URL: https://issues.apache.org/jira/browse/TIKA-1422
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Chris A. Mattmann
> Fix For: 1.7
>
> Attachments: TIKA-1422.Mattmann.100114.patch.txt
>
>
> I'm seeing test failures from:
> {noformat}
> Results :
> Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
> (..)
> Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
> {noformat}
> CentOS6 VM image, running:
> {noformat}
> [mattmann@memex tika]$ java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
> [mattmann@memex tika]$ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
> 2014-02-14T09:37:52-08:00)
> Maven home: /usr/share/apache-maven
> Java version: 1.7.0_65, vendor: Oracle Corporation
> Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: 
> "amd64", family: "unix"
> [mattmann@memex tika]$ 
> {noformat}
> Here are the surefire reports - no clue what's up here:
> {noformat}
> [mattmann@memex tika]$ more 
> tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
>  
> ---
> Test set: org.apache.tika.parser.mail.RFC822ParserTest
> ---
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< 
> FAILURE!
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
> 0.152 sec  <<< FAILURE!
> org.mockito.exceptions.verification.TooManyActualInvocations: 
> xHTMLContentHandler.startElement(
> "http://www.w3.org/1999/xhtml",
> "div",
> "div",
> isA(org.xml.sax.Attributes)
> );
> Wanted 4 times but was 5
>   at 
> org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
> Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
> Undesired invocation:
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
> 

[jira] [Updated] (TIKA-1434) Plain text file reported as binary

2014-10-02 Thread Marco Quaranta (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Quaranta updated TIKA-1434:
-
Attachment: ascii_file.txt

> Plain text file reported as binary
> --
>
> Key: TIKA-1434
> URL: https://issues.apache.org/jira/browse/TIKA-1434
> Project: Tika
> Issue Type: Bug
> Components: detector, mime
> Affects Versions: 1.6
> Reporter: Marco Quaranta
> Priority: Minor
> Attachments: ascii_file.txt
>
>
> MIME type detection of the attached file, an ASCII file, reports 
> application/octet-stream. The negative result is caused by extended-ASCII 
> characters in the file; calls to TextStatistics.isMostlyAscii() return false. 
> We could try calling TextDetector.detect() twice: if the first call fails, 
> we can convert the bytes to UTF-8 and call it again. What do you 
> think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1434) Plain text file reported as binary

2014-10-02 Thread Marco Quaranta (JIRA)
Marco Quaranta created TIKA-1434:


 Summary: Plain text file reported as binary
 Key: TIKA-1434
 URL: https://issues.apache.org/jira/browse/TIKA-1434
 Project: Tika
 Issue Type: Bug
 Components: detector, mime
 Affects Versions: 1.6
 Reporter: Marco Quaranta
 Priority: Minor


MIME type detection of the attached file, an ASCII file, reports 
application/octet-stream. The negative result is caused by extended-ASCII 
characters in the file; calls to TextStatistics.isMostlyAscii() return false. 
We could try calling TextDetector.detect() twice: if the first call fails, 
we can convert the bytes to UTF-8 and call it again. What do you think?
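
To make the two-pass idea concrete, here is a minimal standalone sketch. It does not use Tika's TextDetector or TextStatistics; the class TwoPassTextCheck and its methods are hypothetical stand-ins, and it assumes the extended characters are ISO-8859-1. The first pass accepts data that has no binary control bytes and decodes strictly as UTF-8 (which covers plain ASCII). If that fails, the bytes are re-encoded from ISO-8859-1 to UTF-8 and the same check runs again.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the detection logic; not Tika code.
class TwoPassTextCheck {

    /** One pass: no control bytes other than tab/LF/CR, and the bytes decode strictly as UTF-8. */
    static boolean passes(byte[] data) {
        for (byte b : data) {
            int v = b & 0xFF;
            boolean whitespace = v == '\t' || v == '\n' || v == '\r';
            if ((v < 0x20 && !whitespace) || v == 0x7F) {
                return false;
            }
        }
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Two passes: the raw bytes first, then the bytes converted from ISO-8859-1 (assumed) to UTF-8. */
    static boolean looksLikeText(byte[] data) {
        if (passes(data)) {
            return true;
        }
        byte[] utf8 = new String(data, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.UTF_8);
        return passes(utf8);
    }

    public static void main(String[] args) {
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9}; // "café" in ISO-8859-1
        System.out.println("first pass: " + passes(latin1));
        System.out.println("two passes: " + looksLikeText(latin1));
    }
}
```

One caveat with this approach: any byte sequence re-encodes cleanly from ISO-8859-1, so the second pass only rejects data that contains control bytes. A real change would probably want a stricter heuristic there.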





[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-02 Thread Vineet Ghatge (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156262#comment-14156262
 ] 

Vineet Ghatge commented on TIKA-1423:
-

UPDATE:
I picked up on a conversation involving Annie and Christian Ward from the 
netCDF-Java mailing list - 
http://www.unidata.ucar.edu/mailing_lists/archives/netcdf-java/2014/msg00091.html
A sample was provided there; I ran it and it reads the GRIB2 data:

import java.io.File;
import java.io.IOException;

import ucar.nc2.NetcdfFile;
import ucar.nc2.dataset.NetcdfDataset;

public class Foo {
    public static void main(String[] args) throws IOException {
        File gribFile = new File("gdas1.forecmwf.2014062612.grib2");
        // Open the GRIB2 file through NetCDF-Java; null means no cancel task.
        NetcdfFile ncFile = NetcdfDataset.openFile(gribFile.getAbsolutePath(), null);
        System.out.println("Success!");
        try {
            // Print the file's structure (dimensions, variables, attributes).
            System.out.println(ncFile.toString());
        } finally {
            // Always release the underlying file handle.
            ncFile.close();
        }
    }
}

This parses and loads the GRIB2 format. I am currently working on getting 
Annie's code and changing the classpath references.

> Build a parser to extract data from GRIB formats
> 
>
> Key: TIKA-1423
> URL: https://issues.apache.org/jira/browse/TIKA-1423
> Project: Tika
> Issue Type: New Feature
> Components: metadata, mime, parser
> Affects Versions: 1.6
> Reporter: Vineet Ghatge
> Priority: Critical
> Labels: features, newbie
> Fix For: 1.7
>
> Attachments: GribParser.java, 
> NLDAS_FORA0125_H.A20130112.1200.002.grb, gdas1.forecmwf.2014062612.grib2
>
>
> The Arctic dataset contains a MIME format called GRIB (General 
> Regularly-distributed Information in Binary form, 
> http://en.wikipedia.org/wiki/GRIB ). GRIB is a well-known, concise data 
> format used in meteorology to store historical and forecast weather data. 
> There are 2 different types of the format: GRIB 0 and GRIB 2. 
> The focus will be on GRIB 2, which is the most prevalent. Each GRIB record 
> intended for either transmission or storage contains a single parameter with 
> values located at an array of grid points, or represented as a set of 
> spectral coefficients, for a single level (or layer), encoded as a continuous 
> bit stream. Logical divisions of the record are designated as "sections", 
> each of which provides control information and/or data. A GRIB record 
> consists of six sections, two of which are optional: 
>  
> (0) Indicator Section 
> (1) Product Definition Section (PDS) 
> (2) Grid Description Section (GDS) - optional 
> (3) Bit Map Section (BMS) - optional 
> (4) Binary Data Section (BDS) 
> (5) '' (ASCII Characters)
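
As a small illustration of the Indicator Section listed above, the sketch below checks for the "GRIB" marker at the start of a record and reads the edition number. The class GribIndicator is hypothetical and not part of Tika or NetCDF-Java. It assumes that, for editions 1 and 2, the eighth octet of the record holds the edition number; that offset should be verified against the WMO specification before relying on it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class GribIndicator {

    /**
     * Returns the edition number read from the indicator section, or -1 if
     * the data is too short or does not start with the ASCII marker "GRIB".
     * Assumption: octet 8 (index 7) carries the edition number.
     */
    static int edition(byte[] header) {
        if (header == null || header.length < 8) {
            return -1;
        }
        String magic = new String(header, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(magic)) {
            return -1;
        }
        return header[7] & 0xFF;
    }

    public static void main(String[] args) {
        byte[] sample = {'G', 'R', 'I', 'B', 0, 0, 0, 2};
        System.out.println("edition: " + edition(sample));
    }
}
```

A check like this could serve as a cheap pre-filter before handing the file to a full GRIB reader.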





[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-02 Thread Vineet Ghatge (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156146#comment-14156146
 ] 

Vineet Ghatge commented on TIKA-1423:
-

Thanks [~lewismc]



