[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-12 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274612#comment-14274612
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

Yes, there are native libs for Windows, Mac and Linux packed into the xerial 
sqlite-jdbc-3.8.7.jar, but there are other wrappers if that is a problem. The 
license for xerial-jdbc is Apache v2.
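
A quick way to check which native libraries a jar carries is to list its entries and filter by file extension. The sketch below uses only java.util.zip; the class name, the method name and the list of suffixes are illustrative assumptions, not part of sqlite-jdbc or Tika.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: lists jar entries that look like bundled native libraries.
class NativeLibLister {

    // Assumed native-library suffixes for Windows, Linux and Mac.
    private static final String[] NATIVE_SUFFIXES = {".dll", ".so", ".dylib", ".jnilib"};

    static List<String> listNativeLibs(Path jar) throws IOException {
        List<String> found = new ArrayList<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // Compare case-insensitively so that "demo.DLL" is also matched.
                String lower = entry.getName().toLowerCase(Locale.ROOT);
                for (String suffix : NATIVE_SUFFIXES) {
                    if (lower.endsWith(suffix)) {
                        found.add(entry.getName());
                        break;
                    }
                }
            }
        }
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java NativeLibLister path/to/sqlite-jdbc-3.8.7.jar
        for (String name : listNativeLibs(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

An empty result would suggest the jar relies on libraries installed elsewhere on the system; a non-empty one lists the platform-specific binaries it ships.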

> Create a parser for SQLite3
> ---
>
> Key: TIKA-1511
> URL: https://issues.apache.org/jira/browse/TIKA-1511
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.6
>Reporter: Luis Filipe Nassif
> Fix For: 1.8
>
>
> I think it would be very useful, as sqlite is used as data storage by a wide 
> range of applications. Opening the ticket to track it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1222) Tika does not extract attachments from RFC822 files

2015-01-12 Thread Luis Filipe Nassif (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Filipe Nassif updated TIKA-1222:
-
  Description: TikaApp --extract option does not extract attachments 
from RFC822 files. The issue happens because the MailContentHandler.body(...) 
method gets a Parser.class object from the context and calls parser.parse(). It 
should get an EmbeddedDocumentExtractor.class object from the ParseContext 
and call embeddedDocumentExtractor.parseEmbedded(), similar to other container 
parsers.  (was: TikaCli --extract option does not extract attachments from 
RFC822 files. The issue happens because MailContentHandler.body(...) method 
gets a Parser.class object from the context and calls parser.parse(). It should 
get a EmbeddedDocumentExtractor.class object from the ParseContext one and call 
embeddedDocumentExtractor.parseEmbedded(), similar to other Container parsers.)
Affects Version/s: 1.5
   1.6

> Tika does not extract attachments from RFC822 files
> ---
>
> Key: TIKA-1222
> URL: https://issues.apache.org/jira/browse/TIKA-1222
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4, 1.5, 1.6
>Reporter: Luis Filipe Nassif
> Attachments: Tika-1222.patch
>
>
> TikaApp --extract option does not extract attachments from RFC822 files. The 
> issue happens because the MailContentHandler.body(...) method gets a 
> Parser.class object from the context and calls parser.parse(). It should get 
> an EmbeddedDocumentExtractor.class object from the ParseContext and call 
> embeddedDocumentExtractor.parseEmbedded(), similar to other container parsers.





Re: ExternalParser isn't called

2015-01-12 Thread Konstantin Gribov
Hello.

I don't see this issue on current trunk (r1651112) with either the avi or the
mp4 test files when invoking via tika-app.

-- 
Best regards,
Konstantin Gribov

Mon Jan 12 2015 at 23:05:57, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov>:

> So..in trunk right now if you try and run tika against the only
> ExternalParser defined (for ffmpeg), the ExternalParser isn’t
> called. It’s not for lack of MIME type too (they look right), so
> I’m investigating.
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>


[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-12 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274159#comment-14274159
 ] 

Nick Burch commented on TIKA-1511:
--

Just to be sure, since SQLite doesn't show up in the [Apache Legal FAQ 
list|http://www.apache.org/legal/resolved.html], it'd probably be worth raising 
a legal jira (link from [the legal 
page|http://www.apache.org/legal/resolved.html]) just to get confirmation that 
it's fine to use + clarify what (if any) notice entry is needed for it.

> Create a parser for SQLite3
> ---
>
> Key: TIKA-1511
> URL: https://issues.apache.org/jira/browse/TIKA-1511
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.6
>Reporter: Luis Filipe Nassif
> Fix For: 1.8
>
>
> I think it would be very useful, as sqlite is used as data storage by a wide 
> range of applications. Opening the ticket to track it. 





[jira] [Updated] (TIKA-1511) Create a parser for SQLite3

2015-01-12 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1511:
--
Priority: Major  (was: Minor)

> Create a parser for SQLite3
> ---
>
> Key: TIKA-1511
> URL: https://issues.apache.org/jira/browse/TIKA-1511
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.6
>Reporter: Luis Filipe Nassif
> Fix For: 1.8
>
>
> I think it would be very useful, as sqlite is used as data storage by a wide 
> range of applications. Opening the ticket to track it. 





[jira] [Updated] (TIKA-1511) Create a parser for SQLite3

2015-01-12 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1511:
--
Fix Version/s: 1.8

> Create a parser for SQLite3
> ---
>
> Key: TIKA-1511
> URL: https://issues.apache.org/jira/browse/TIKA-1511
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.6
>Reporter: Luis Filipe Nassif
>Priority: Minor
> Fix For: 1.8
>
>
> I think it would be very useful, as sqlite is used as data storage by a wide 
> range of applications. Opening the ticket to track it. 





[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274135#comment-14274135
 ] 

Tim Allison commented on TIKA-1511:
---

Agreed on the license.

I'm able to create and write to a sqlite db with just the jar from maven:

{noformat}
<dependency>
  <groupId>org.xerial</groupId>
  <artifactId>sqlite-jdbc</artifactId>
  <version>3.8.7</version>
</dependency>
{noformat}

I don't think I have native libs kicking around my system somewhere, or do I? 

This will add another 4 MB to tika-app/tika-server, but I think that it is 
worth it...


> Create a parser for SQLite3
> ---
>
> Key: TIKA-1511
> URL: https://issues.apache.org/jira/browse/TIKA-1511
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.6
>Reporter: Luis Filipe Nassif
>Priority: Minor
>
> I think it would be very useful, as sqlite is used as data storage by a wide 
> range of applications. Opening the ticket to track it. 





ExternalParser isn't called

2015-01-12 Thread Mattmann, Chris A (3980)
So..in trunk right now if you try and run tika against the only
ExternalParser defined (for ffmpeg), the ExternalParser isn’t
called. It’s not for lack of MIME type too (they look right), so
I’m investigating. 

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-12 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273905#comment-14273905
 ] 

Luis Filipe Nassif commented on TIKA-1511:
--

I don't see any problems either. I think "public domain" is more liberal than 
Apache v2, because the authors gave up their copyright.

But sqlite needs native libs. Could that be a problem?

> Create a parser for SQLite3
> ---
>
> Key: TIKA-1511
> URL: https://issues.apache.org/jira/browse/TIKA-1511
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.6
>Reporter: Luis Filipe Nassif
>Priority: Minor
>
> I think it would be very useful, as sqlite is used as data storage by a wide 
> range of applications. Opening the ticket to track it. 





[jira] [Commented] (TIKA-1511) Create a parser for SQLite3

2015-01-12 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273869#comment-14273869
 ] 

Tim Allison commented on TIKA-1511:
---

Does anyone see any licensing problems with bundling the sqlite dependency?  It 
isn't Apache v2, but what we'd bundle isn't licensed at all 
([link|https://www.sqlite.org/copyright.html]).

I don't see a problem, but wanted to check to see if anyone has any issues. 

Thank you for opening this issue!

> Create a parser for SQLite3
> ---
>
> Key: TIKA-1511
> URL: https://issues.apache.org/jira/browse/TIKA-1511
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.6
>Reporter: Luis Filipe Nassif
>Priority: Minor
>
> I think it would be very useful, as sqlite is used as data storage by a wide 
> range of applications. Opening the ticket to track it. 





Re: TestMultiPart tests failing

2015-01-12 Thread Tyler Palsulich
Hi Chris,

I'm not getting any test failures from trunk on my Mac. So, I'm also
curious what revision you're on.

uname -a:
Darwin Tylers-MacBook-Pro.local 14.0.0 Darwin Kernel Version 14.0.0: Fri
Sep 19 00:26:44 PDT 2014; root:xnu-2782.1.97~2/RELEASE_X86_64 x86_64

tesseract --version:
tesseract 3.02.02
 leptonica-1.71
  libjpeg 8d : libpng 1.6.13 : libtiff 4.0.3 : zlib 1.2.5

java -version:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)

Tyler

On Mon, Jan 12, 2015 at 8:34 AM, Allison, Timothy B. 
wrote:

> Chris,
>
> Is this on an updated and/or reverted trunk or on a modified rc-3?
>
> I haven't gotten around to installing tesseract yet so I can't actually
> kick the tires, but the last time there was a test for 5 items on line 91
> of RFC822ParserTest was in r1552405...before the fixes for TIKA-1422.
>
> But r1552405 doesn't quite seem to fit the error message, which says that
> it can't find 5 "div" (if I understand correctly), and in r1552405 the test
> was for 5 "p".
>
> In r161 and 1633325, there is a path through the code to test for 5
> "div" if Tesseract is running, but that isn't occurring on line 91 in those
> revisions.
>
> -Original Message-
> From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Sunday, January 11, 2015 7:33 PM
> To: dev@tika.apache.org
> Subject: TestMultiPart tests failing
>
> Hey Guys,
>
> I’m on Mac OS X 10.9.4, Java version:
>
> [chipotle:~/src/tika] mattmann% uname -a
> Darwin chipotle.local 13.3.0 Darwin Kernel Version 13.3.0: Tue Jun  3
> 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64 x86_64
> [chipotle:~/src/tika] mattmann% java -version
> java version "1.7.0_60"
> Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
> Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
> [chipotle:~/src/tika] mattmann%
>
>
>
> With Tesseract installed:
>
> [chipotle:~/src/tika] mattmann% tesseract --version
> tesseract 3.02.02
>  leptonica-1.71
>   libjpeg 8d : libpng 1.6.13 : libtiff 4.0.3 : zlib 1.2.5
>
> [chipotle:~/src/tika] mattmann%
>
>
>
> And the following tests are failing:
>
> Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.007 sec
> Running org.apache.tika.parser.xml.EmptyAndDuplicateElementsXMLParserTest
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec
> Running org.apache.tika.parser.xml.FictionBookParserTest
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
> Running org.apache.tika.sax.PhoneExtractingContentHandlerTest
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec
> Running org.apache.tika.TestParsers
> Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.669 sec
>
> Results :
>
> Failed tests:
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..)
>
> Tests run: 572, Failures: 1, Errors: 0, Skipped: 2
>
> [INFO] -
>
>
>
> Test set: org.apache.tika.parser.mail.RFC822ParserTest
> ---
> 
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.903 sec <<< FAILURE!
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.289 sec  <<< FAILURE!
> org.mockito.exceptions.verification.TooLittleActualInvocations:
> xHTMLContentHandler.startElement(
>     "http://www.w3.org/1999/xhtml",
>     "div",
>     "div",
>     isA(org.xml.sax.Attributes)
> );
> Wanted 5 times but was 4
>     at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:91)
> Caused by: org.mockito.exceptions.cause.TooLittleInvocations:
> Too little invocations:
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:291)
>     at org.apache.tika.parser.mail.MailContentHandler.startBodyPart(MailContentHandler.java:242)
>     at org.apache.james.mime4j.parser.MimeStr
>
>
>
> Ideas?
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> +

RE: TestMultiPart tests failing

2015-01-12 Thread Allison, Timothy B.
Chris,

Is this on an updated and/or reverted trunk or on a modified rc-3?

I haven't gotten around to installing tesseract yet so I can't actually kick 
the tires, but the last time there was a test for 5 items on line 91 of 
RFC822ParserTest was in r1552405...before the fixes for TIKA-1422.  

But r1552405 doesn't quite seem to fit the error message, which says that it 
can't find 5 "div" (if I understand correctly), and in r1552405 the test was 
for 5 "p".

In r161 and 1633325, there is a path through the code to test for 5 "div" 
if Tesseract is running, but that isn't occurring on line 91 in those revisions.

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Sunday, January 11, 2015 7:33 PM
To: dev@tika.apache.org
Subject: TestMultiPart tests failing

Hey Guys,

I’m on Mac OS X 10.9.4, Java version:

[chipotle:~/src/tika] mattmann% uname -a
Darwin chipotle.local 13.3.0 Darwin Kernel Version 13.3.0: Tue Jun  3
21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64 x86_64
[chipotle:~/src/tika] mattmann% java -version
java version "1.7.0_60"
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
[chipotle:~/src/tika] mattmann%



With Tesseract installed:

[chipotle:~/src/tika] mattmann% tesseract --version
tesseract 3.02.02
 leptonica-1.71
  libjpeg 8d : libpng 1.6.13 : libtiff 4.0.3 : zlib 1.2.5

[chipotle:~/src/tika] mattmann%



And the following tests are failing:

Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.007 sec
Running org.apache.tika.parser.xml.EmptyAndDuplicateElementsXMLParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec
Running org.apache.tika.parser.xml.FictionBookParserTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running org.apache.tika.sax.PhoneExtractingContentHandlerTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec
Running org.apache.tika.TestParsers
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.669 sec

Results :

Failed tests:   
testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..)

Tests run: 572, Failures: 1, Errors: 0, Skipped: 2

[INFO] -



Test set: org.apache.tika.parser.mail.RFC822ParserTest
---

Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.903 sec <<< FAILURE!
testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.289 sec  <<< FAILURE!
org.mockito.exceptions.verification.TooLittleActualInvocations:
xHTMLContentHandler.startElement(
    "http://www.w3.org/1999/xhtml",
    "div",
    "div",
    isA(org.xml.sax.Attributes)
);
Wanted 5 times but was 4
    at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:91)
Caused by: org.mockito.exceptions.cause.TooLittleInvocations:
Too little invocations:
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:291)
    at org.apache.tika.parser.mail.MailContentHandler.startBodyPart(MailContentHandler.java:242)
    at org.apache.james.mime4j.parser.MimeStr



Ideas?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






[jira] [Commented] (TIKA-1512) WordParser fails on many Word files

2015-01-12 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273681#comment-14273681
 ] 

Nick Burch commented on TIKA-1512:
--

What about subsequent runs - I'm wondering where the closing quote is for the 
hyperlink?

Also, does the text contain the whole of the URL, or is it truncated at all?

If you open the file in Word and do a save-as, does the file then parse 
properly, or does the problem remain?

If you open the file in Word, does the hyperlink work properly in Word?

> WordParser fails on many Word files
> ---
>
> Key: TIKA-1512
> URL: https://issues.apache.org/jira/browse/TIKA-1512
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5, 1.6, 1.7, 1.8
> Environment: Linux 64bit
> OpenJDK Runtime Environment (IcedTea 2.4.4) (suse-24.13.5-x86_64)
> OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
> and
> java version "1.6.0"
> Java(TM) SE Runtime Environment
> IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 (JIT enabled, AOT 
> enabled)
>Reporter: F Seid
>Assignee: Jukka Zitting
>
> WordParser fails on some Word files. A negative value is passed to substring.





[jira] [Commented] (TIKA-1512) WordParser fails on many Word files

2015-01-12 Thread F Seid (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273632#comment-14273632
 ] 

F Seid commented on TIKA-1512:
--

text var is like this on error (shown via jdb):

text = " HYPERLINK "http://foo?bar=t&faa=http://fbb/abc.html&fbb=tata "

> WordParser fails on many Word files
> ---
>
> Key: TIKA-1512
> URL: https://issues.apache.org/jira/browse/TIKA-1512
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5, 1.6, 1.7, 1.8
> Environment: Linux 64bit
> OpenJDK Runtime Environment (IcedTea 2.4.4) (suse-24.13.5-x86_64)
> OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
> and
> java version "1.6.0"
> Java(TM) SE Runtime Environment
> IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 (JIT enabled, AOT 
> enabled)
>Reporter: F Seid
>Assignee: Jukka Zitting
>
> WordParser fails on some Word files. A negative value is passed to substring.





[jira] [Commented] (TIKA-1512) WordParser fails on many Word files

2015-01-12 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273586#comment-14273586
 ] 

Nick Burch commented on TIKA-1512:
--

I worry that might be solving the symptom not the problem

Any chance you could step into a problem file with a debugger, and let us know 
what the text of each character run in a problematic paragraph is?

> WordParser fails on many Word files
> ---
>
> Key: TIKA-1512
> URL: https://issues.apache.org/jira/browse/TIKA-1512
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5, 1.6, 1.7, 1.8
> Environment: Linux 64bit
> OpenJDK Runtime Environment (IcedTea 2.4.4) (suse-24.13.5-x86_64)
> OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
> and
> java version "1.6.0"
> Java(TM) SE Runtime Environment
> IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 (JIT enabled, AOT 
> enabled)
>Reporter: F Seid
>Assignee: Jukka Zitting
>
> WordParser fails on some Word files. A negative value is passed to substring.





[jira] [Commented] (TIKA-1510) FFMpeg installed but not parsing video files

2015-01-12 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273520#comment-14273520
 ] 

Ray Gauss II commented on TIKA-1510:


Yes.

The only reason I haven't myself is that I've been trying to find some time to 
refactor the vorbis stuff per the previous 
[conversation|http://mail-archives.apache.org/mod_mbox/tika-dev/201408.mbox/%3calpine.deb.2.02.1408221155450.8...@urchin.earth.li%3E]
 with [~gagravarr].

> FFMpeg installed but not parsing video files
> 
>
> Key: TIKA-1510
> URL: https://issues.apache.org/jira/browse/TIKA-1510
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Environment: FFMPEG, Mac OS X 10.9 with HomeBrew
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.7
>
>
> I have FFMPEG installed with homebrew:
> {noformat}
> # brew install ffmpeg
> {noformat}
> I've got some AVI files and have tried to parse them with Tika:
> {noformat}
> [chipotle:~/Desktop/drone-vids] mattmann% tika -m SPOT11_01\ 17.AVI
> Content-Length: 334917340
> Content-Type: video/x-msvideo
> X-Parsed-By: org.apache.tika.parser.EmptyParser
> resourceName: SPOT11_01 17.AVI
> {noformat}
> I took a look at the ExternalParser, which is configured for using ffmpeg if 
> it's installed. It seems it only works on:
> {code:xml}
>    <mime-types>
>      <mime-type>video/avi</mime-type>
>      <mime-type>video/mpeg</mime-type>
>    </mime-types>
> {code}
> I'll add video/x-msvideo and see if that fixes it. I also stumbled upon the 
> work by [~rgauss] at Github - Ray I noticed there is no parser in that work:
> https://github.com/AlfrescoLabs/tika-ffmpeg
> But there seems to be metadata extraction code, etc. Ray should I do 
> something with this?
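
The mismatch described in the quoted issue can be illustrated with a small standalone sketch. It is a deliberate simplification and not Tika's actual parser lookup: the class and method names are hypothetical, and the parser names are plain strings. It only shows why a file detected as video/x-msvideo would fall through when the configured list contains video/avi and video/mpeg alone.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical simplification of parser selection; not Tika's real lookup code.
class ParserSelectionSketch {

    // The two types listed in the quoted issue for the ffmpeg ExternalParser.
    static Set<String> configuredTypes() {
        return new HashSet<>(Arrays.asList("video/avi", "video/mpeg"));
    }

    // Picks the external parser only on an exact type match; otherwise falls back.
    static String selectParser(String detectedType, Set<String> supported) {
        return supported.contains(detectedType) ? "ExternalParser" : "EmptyParser";
    }

    public static void main(String[] args) {
        Set<String> types = configuredTypes();
        // Detected type from the quoted "tika -m" output.
        System.out.println(selectParser("video/x-msvideo", types));
        // The change proposed in the issue: register the detected type as well.
        types.add("video/x-msvideo");
        System.out.println(selectParser("video/x-msvideo", types));
    }
}
```

Under these assumptions the first lookup falls back to the empty parser and the second one selects the external parser, which matches the X-Parsed-By value shown in the issue.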





[jira] [Commented] (TIKA-1512) WordParser fails on many Word files

2015-01-12 Thread F Seid (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273515#comment-14273515
 ] 

F Seid commented on TIKA-1512:
--

Sorry, but I cannot share those files. 

It did help, though, to add another check to the if clause above
WordExtractor.java:407 that tests whether lastIndexOf('"') comes after
indexOf('"').
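
The guard described here can be sketched in isolation. The snippet below is not the actual WordExtractor code; the class and method names are hypothetical. It only illustrates why field text with a single quote character yields invalid substring bounds, and how comparing the two quote positions avoids that.

```java
// Hypothetical, standalone illustration; not the code in WordExtractor.java.
class HyperlinkFieldSketch {

    // Returns the quoted URL of a HYPERLINK field, or null when the field
    // does not contain both an opening and a closing quote.
    static String extractUrl(String text) {
        int first = text.indexOf('"');
        int last = text.lastIndexOf('"');
        // With a single quote, first == last, so substring(first + 1, last)
        // would be called with a start index greater than the end index.
        if (first < 0 || last <= first) {
            return null;
        }
        return text.substring(first + 1, last);
    }

    public static void main(String[] args) {
        // Field text with no closing quote, as reported on this issue.
        String broken = " HYPERLINK \"http://foo?bar=t&faa=http://fbb/abc.html&fbb=tata ";
        // Made-up well-formed field text for comparison.
        String complete = " HYPERLINK \"http://example.org/\" ";
        System.out.println(extractUrl(broken));
        System.out.println(extractUrl(complete));
    }
}
```

Returning null for the unterminated case is just one option for the sketch; it says nothing about where the closing quote went, which may sit in a later character run.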

> WordParser fails on many Word files
> ---
>
> Key: TIKA-1512
> URL: https://issues.apache.org/jira/browse/TIKA-1512
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5, 1.6, 1.7, 1.8
> Environment: Linux 64bit
> OpenJDK Runtime Environment (IcedTea 2.4.4) (suse-24.13.5-x86_64)
> OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
> and
> java version "1.6.0"
> Java(TM) SE Runtime Environment
> IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 (JIT enabled, AOT 
> enabled)
>Reporter: F Seid
>Assignee: Jukka Zitting
>
> WordParser fails on some Word files. A negative value is passed to substring.





[jira] [Commented] (TIKA-1512) WordParser fails on many Word files

2015-01-12 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273486#comment-14273486
 ] 

Nick Burch commented on TIKA-1512:
--

Do you have a very small sample file that triggers the problem?

> WordParser fails on many Word files
> ---
>
> Key: TIKA-1512
> URL: https://issues.apache.org/jira/browse/TIKA-1512
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5, 1.6, 1.7, 1.8
> Environment: Linux 64bit
> OpenJDK Runtime Environment (IcedTea 2.4.4) (suse-24.13.5-x86_64)
> OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
> and
> java version "1.6.0"
> Java(TM) SE Runtime Environment
> IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux amd64-64 (JIT enabled, AOT 
> enabled)
>Reporter: F Seid
>Assignee: Jukka Zitting
>
> WordParser fails on some Word files. A negative value is passed to substring.


