[jira] [Commented] (TIKA-3884) MarianTranslator blocks on Windows

2022-11-06 Thread Dave Meikle (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629472#comment-17629472
 ] 

Dave Meikle commented on TIKA-3884:
---

Hi [~tallison] - yes, it is now. Just marking it as such.

> MarianTranslator blocks on Windows
> --
>
> Key: TIKA-3884
> URL: https://issues.apache.org/jira/browse/TIKA-3884
> Project: Tika
>  Issue Type: Bug
>  Components: translation
>Affects Versions: 2.5.0
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.6.0
>
>
> MarianTranslator blocks on Windows when using a local Marian Decoder.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-3884) MarianTranslator blocks on Windows

2022-11-06 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-3884.
---
Resolution: Fixed

> MarianTranslator blocks on Windows
> --
>
> Key: TIKA-3884
> URL: https://issues.apache.org/jira/browse/TIKA-3884
> Project: Tika
>  Issue Type: Bug
>  Components: translation
>Affects Versions: 2.5.0
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.6.0
>
>
> MarianTranslator blocks on Windows when using a local Marian Decoder.





[jira] [Created] (TIKA-3884) MarianTranslator blocks on Windows

2022-10-17 Thread Dave Meikle (Jira)
Dave Meikle created TIKA-3884:
-

 Summary: MarianTranslator blocks on Windows
 Key: TIKA-3884
 URL: https://issues.apache.org/jira/browse/TIKA-3884
 Project: Tika
  Issue Type: Bug
  Components: translation
Affects Versions: 2.5.0
Reporter: Dave Meikle
Assignee: Dave Meikle
 Fix For: 2.5.1


MarianTranslator blocks on Windows when using a local Marian Decoder.





[jira] [Resolved] (TIKA-3660) Add parser for TMX Files

2022-01-22 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-3660.
---
Fix Version/s: 2.2.2
   Resolution: Done

Added in: 
https://github.com/apache/tika/commit/af51ea0d6b36cdee4d29f4dbb5eb8f193b3ca25c

> Add parser for TMX Files
> 
>
> Key: TIKA-3660
> URL: https://issues.apache.org/jira/browse/TIKA-3660
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.2.2
>
>
> Add a parser for Translation Memory eXchange (TMX) files.





[jira] [Created] (TIKA-3660) Add parser for TMX Files

2022-01-22 Thread Dave Meikle (Jira)
Dave Meikle created TIKA-3660:
-

 Summary: Add parser for TMX Files
 Key: TIKA-3660
 URL: https://issues.apache.org/jira/browse/TIKA-3660
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Dave Meikle
Assignee: Dave Meikle


Add a parser for Translation Memory eXchange (TMX) files.
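
As a rough illustration of what such a parser has to walk through, the sketch below collects segment text per language from a TMX document using the JDK's SAX parser. It assumes the usual TMX layout (tu elements containing tuv elements that carry an xml:lang attribute and a seg child). The class and method names (TmxSketch, segmentsByLanguage) and the "und" fallback key are hypothetical and are not taken from the Tika implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: groups TMX segment text by language.
public class TmxSketch {

    public static Map<String, List<String>> segmentsByLanguage(String tmx) {
        Map<String, List<String>> segments = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private String currentLang;       // language of the enclosing <tuv>
            private StringBuilder currentSeg; // non-null only while inside <seg>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("tuv".equals(qName)) {
                    // The default factory is not namespace-aware, so look up by qualified name.
                    currentLang = atts.getValue("xml:lang");
                } else if ("seg".equals(qName)) {
                    currentSeg = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (currentSeg != null) {
                    currentSeg.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("seg".equals(qName)) {
                    // "und" is an assumed placeholder for a missing language tag.
                    String lang = currentLang == null ? "und" : currentLang;
                    segments.computeIfAbsent(lang, k -> new ArrayList<>()).add(currentSeg.toString());
                    currentSeg = null;
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.newSAXParser().parse(new InputSource(new StringReader(tmx)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse TMX input", e);
        }
        return segments;
    }
}
```

Calling TmxSketch.segmentsByLanguage on a document with one English and one French tuv would yield one list of segments per language, in the order the languages first appear.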





[jira] [Resolved] (TIKA-3636) Add MarianTranslator to support Marian NMT Engines

2021-12-30 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-3636.
---
Resolution: Fixed

> Add MarianTranslator to support Marian NMT Engines
> --
>
> Key: TIKA-3636
> URL: https://issues.apache.org/jira/browse/TIKA-3636
> Project: Tika
>  Issue Type: New Feature
>  Components: translation
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.2.1
>
>
> [Marian NMT|https://marian-nmt.github.io] is a popular machine translation 
> engine framework which would be good to support in Apache Tika.





[jira] [Commented] (TIKA-3636) Add MarianTranslator to support Marian NMT Engines

2021-12-30 Thread Dave Meikle (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17467057#comment-17467057
 ] 

Dave Meikle commented on TIKA-3636:
---

Merged in 
https://github.com/apache/tika/commit/5fd2d011bf816745053c447ef92e5017681ae23b

> Add MarianTranslator to support Marian NMT Engines
> --
>
> Key: TIKA-3636
> URL: https://issues.apache.org/jira/browse/TIKA-3636
> Project: Tika
>  Issue Type: New Feature
>  Components: translation
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.2.1
>
>
> [Marian NMT|https://marian-nmt.github.io] is a popular machine translation 
> engine framework which would be good to support in Apache Tika.





[jira] [Updated] (TIKA-3636) Add MarianTranslator to support Marian NMT Engines

2021-12-30 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-3636:
--
Fix Version/s: 2.2.1

> Add MarianTranslator to support Marian NMT Engines
> --
>
> Key: TIKA-3636
> URL: https://issues.apache.org/jira/browse/TIKA-3636
> Project: Tika
>  Issue Type: New Feature
>  Components: translation
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.2.1
>
>
> [Marian NMT|https://marian-nmt.github.io] is a popular machine translation 
> engine framework which would be good to support in Apache Tika.





[jira] [Created] (TIKA-3636) Add MarianTranslator to support Marian NMT Engines

2021-12-30 Thread Dave Meikle (Jira)
Dave Meikle created TIKA-3636:
-

 Summary: Add MarianTranslator to support Marian NMT Engines
 Key: TIKA-3636
 URL: https://issues.apache.org/jira/browse/TIKA-3636
 Project: Tika
  Issue Type: New Feature
  Components: translation
Reporter: Dave Meikle
Assignee: Dave Meikle


[Marian NMT|https://marian-nmt.github.io] is a popular machine translation 
engine framework which would be good to support in Apache Tika.





[jira] [Commented] (TIKA-3453) SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder" Defaulting to no-operation (NOP) logger implementation for tika-docker 2.0.0-BETA and 2.1.0

2021-10-10 Thread Dave Meikle (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17426851#comment-17426851
 ] 

Dave Meikle commented on TIKA-3453:
---

Good spot [~lewismc]. I came to the same conclusion as you. Merging the PR now.

> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder" Defaulting to 
> no-operation (NOP) logger implementation for tika-docker 2.0.0-BETA and 2.1.0
> ---
>
> Key: TIKA-3453
> URL: https://issues.apache.org/jira/browse/TIKA-3453
> Project: Tika
>  Issue Type: Bug
>  Components: docker, server
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.1.1
>
>
> It looks like logging libraries are not being interpreted correctly from Java 
> classpath.
> We need logging turned on so we can intercept anomalies.
> Investigating...





[jira] [Created] (TIKA-3227) Allow Tika Server to skip embedded files through HTTP Header

2020-11-11 Thread Dave Meikle (Jira)
Dave Meikle created TIKA-3227:
-

 Summary: Allow Tika Server to skip embedded files through HTTP 
Header
 Key: TIKA-3227
 URL: https://issues.apache.org/jira/browse/TIKA-3227
 Project: Tika
  Issue Type: Improvement
  Components: server
Reporter: Dave Meikle
Assignee: Dave Meikle








[jira] [Resolved] (TIKA-3191) Issue with GrobidJournalParser

2020-11-09 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-3191.
---
Fix Version/s: 1.25
   Resolution: Fixed

Added to branch_1x in 
[https://github.com/apache/tika/commit/175766713ec404418f349206dc43ffb9730994e2]

And main branch in
[https://github.com/apache/tika/commit/1957a60575075fe60e367a506bdbf0136d653547]

 

> Issue with GrobidJournalParser
> --
>
> Key: TIKA-3191
> URL: https://issues.apache.org/jira/browse/TIKA-3191
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.24.1
> Environment: tika-server-1.24.1.jar
> Grobid 0.6.2
> On Windows 10
>Reporter: Nav
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 1.25
>
> Attachments: StackTraceTikaGrobidException.txt
>
>
> I followed the GrobidJournalParser instructions as per [this 
> link.|[https://cwiki.apache.org/confluence/display/TIKA/GrobidJournalParser]]
> I am getting an error when I submit a PDF. Stack trace attached.
> I also noticed a similar open issue on 
> [Stackoverflow|[https://stackoverflow.com/questions/62932722/tika-with-grobid-throwing-error-when-parsing-pdf-document]]





[jira] [Assigned] (TIKA-3191) Issue with GrobidJournalParser

2020-11-09 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-3191:
-

Assignee: Dave Meikle

> Issue with GrobidJournalParser
> --
>
> Key: TIKA-3191
> URL: https://issues.apache.org/jira/browse/TIKA-3191
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.24.1
> Environment: tika-server-1.24.1.jar
> Grobid 0.6.2
> On Windows 10
>Reporter: Nav
>Assignee: Dave Meikle
>Priority: Major
> Attachments: StackTraceTikaGrobidException.txt
>
>
> I followed the GrobidJournalParser instructions as per [this 
> link.|[https://cwiki.apache.org/confluence/display/TIKA/GrobidJournalParser]]
> I am getting an error when I submit a PDF. Stack trace attached.
> I also noticed a similar open issue on 
> [Stackoverflow|[https://stackoverflow.com/questions/62932722/tika-with-grobid-throwing-error-when-parsing-pdf-document]]





[jira] [Resolved] (TIKA-3156) Missing content from .odt file with hyperlinked image

2020-11-09 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-3156.
---
Fix Version/s: 1.25
   Resolution: Fixed

Resolved in main in:
[https://github.com/apache/tika/commit/2b456679200bd8b2e86864ae6db847923d2bc134]

Resolved in branch_1x in:
[https://github.com/apache/tika/commit/38d226801725ce3742bbc29ca62400cee115927a]
[https://github.com/apache/tika/commit/4778eded858bdbc23ac6156085eb19e13e8a77cf]

 

> Missing content from .odt file with hyperlinked image
> -
>
> Key: TIKA-3156
> URL: https://issues.apache.org/jira/browse/TIKA-3156
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.24.1
>Reporter: Robert Kaulbach
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.25
>
> Attachments: link-gdocs.odt
>
>
> The attached file was created in Google Docs with an image inside and saved 
> as an .odt file. After saving, I opened the file with LibreOffice and added a 
> hyperlink to the image.
>  
> When I parse the file with Tika, neither LinkContentHandler nor 
> ToXMLContentHandler shows any trace of the hyperlink.
>  
> The link is clickable when I open the document, and inside content.xml as :
> _http://example.test/];>_
>  
> I tried enabling all options in OfficeParserConfig and OOXMLParser but the 
> link is still not extracted.





[jira] [Assigned] (TIKA-3156) Missing content from .odt file with hyperlinked image

2020-11-08 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-3156:
-

Assignee: Dave Meikle

> Missing content from .odt file with hyperlinked image
> -
>
> Key: TIKA-3156
> URL: https://issues.apache.org/jira/browse/TIKA-3156
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.24.1
>Reporter: Robert Kaulbach
>Assignee: Dave Meikle
>Priority: Minor
> Attachments: link-gdocs.odt
>
>
> The attached file was created in Google Docs with an image inside and saved 
> as an .odt file. After saving, I opened the file with LibreOffice and added a 
> hyperlink to the image.
>  
> When I parse the file with Tika, neither LinkContentHandler nor 
> ToXMLContentHandler shows any trace of the hyperlink.
>  
> The link is clickable when I open the document, and inside content.xml as :
> _http://example.test/];>_
>  
> I tried enabling all options in OfficeParserConfig and OOXMLParser but the 
> link is still not extracted.





[jira] [Commented] (TIKA-3189) Add FrameMaker MIF Parser

2020-09-21 Thread Dave Meikle (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199597#comment-17199597
 ] 

Dave Meikle commented on TIKA-3189:
---

No worries. It was good getting up to speed with all the awesome updates you've 
done in the main branch.

Yes, the existing one is a very old version of the format. It's been through a 
large number of changes/iterations across the years but is still handled by 
Adobe FrameMaker. So it's the same family of file type, but the structure inside 
has changed significantly.

I have no strong preference on which module it goes in to be honest.

 

 

 

> Add FrameMaker MIF Parser
> -
>
> Key: TIKA-3189
> URL: https://issues.apache.org/jira/browse/TIKA-3189
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Major
>






[jira] [Commented] (TIKA-3189) Add FrameMaker MIF Parser

2020-09-21 Thread Dave Meikle (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199584#comment-17199584
 ] 

Dave Meikle commented on TIKA-3189:
---

Hi [~tallison],

Looks like we were off trying to do a similar thing.

I had ended up creating a new module (tika-parser-adobe-module) and then 
sharing some stuff with the text module (tika-parser-text-commons) but I'm not 
sure I like the name as people may expect PDF to be in there too.

I've committed here on my fork:
[https://github.com/apache/tika/commit/a73d4b0e159dfd1cecbd00d0b96afacbb0922a87]

Not sure what you think?

Yeah, I hadn't noticed that file. The older MakerFiles as well as MIFFiles 
below version 7 are pretty funky - I really should have thought about that. 
They have their own specific encoding in points, which we'd need to build a 
custom encoder for to support! We've seen this in Okapi Framework for 
translating MIFFiles but no one has ever finished the job of the custom encoder.

I've added a version check now, and will move to branch_1x as well.

Cheers,
Dave

> Add FrameMaker MIF Parser
> -
>
> Key: TIKA-3189
> URL: https://issues.apache.org/jira/browse/TIKA-3189
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Major
>






[jira] [Closed] (TIKA-2976) Add an XLZ parser

2020-09-20 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle closed TIKA-2976.
-
Resolution: Implemented

Implemented in:
[https://github.com/apache/tika/commit/003da648b52829fb0f19201bd6acda3687d83d31]

> Add an XLZ parser
> -
>
> Key: TIKA-2976
> URL: https://issues.apache.org/jira/browse/TIKA-2976
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.23
>
>
> Add an XLZ parser that processes the embedded XLF content.





[jira] [Updated] (TIKA-2976) Add an XLZ parser

2020-09-20 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-2976:
--
Fix Version/s: 1.23

> Add an XLZ parser
> -
>
> Key: TIKA-2976
> URL: https://issues.apache.org/jira/browse/TIKA-2976
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.23
>
>
> Add an XLZ parser that processes the embedded XLF content.





[jira] [Updated] (TIKA-3188) Add IDML Parser

2020-09-20 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-3188:
--
Description: Add a basic IDML parser to get content, XMP metadata and 
spread counts.

> Add IDML Parser
> ---
>
> Key: TIKA-3188
> URL: https://issues.apache.org/jira/browse/TIKA-3188
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 1.25
>
>
> Add a basic IDML parser to get content, XMP metadata and spread counts.





[jira] [Resolved] (TIKA-3188) Add IDML Parser

2020-09-20 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-3188.
---
Resolution: Implemented

Implemented in commit:
[https://github.com/apache/tika/commit/56ad41892036dbd75e5fe8ebb34100c8aafde757]

 

> Add IDML Parser
> ---
>
> Key: TIKA-3188
> URL: https://issues.apache.org/jira/browse/TIKA-3188
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 1.25
>
>






[jira] [Updated] (TIKA-3188) Add IDML Parser

2020-09-20 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-3188:
--
Fix Version/s: 1.25

> Add IDML Parser
> ---
>
> Key: TIKA-3188
> URL: https://issues.apache.org/jira/browse/TIKA-3188
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 1.25
>
>






[jira] [Created] (TIKA-3189) Add FrameMaker MIF Parser

2020-09-01 Thread Dave Meikle (Jira)
Dave Meikle created TIKA-3189:
-

 Summary: Add FrameMaker MIF Parser
 Key: TIKA-3189
 URL: https://issues.apache.org/jira/browse/TIKA-3189
 Project: Tika
  Issue Type: Task
  Components: parser
Reporter: Dave Meikle
Assignee: Dave Meikle








[jira] [Created] (TIKA-3188) Add IDML Parser

2020-08-31 Thread Dave Meikle (Jira)
Dave Meikle created TIKA-3188:
-

 Summary: Add IDML Parser
 Key: TIKA-3188
 URL: https://issues.apache.org/jira/browse/TIKA-3188
 Project: Tika
  Issue Type: Task
  Components: parser
Reporter: Dave Meikle
Assignee: Dave Meikle








[jira] [Commented] (TIKA-3121) Rename master branch

2020-07-13 Thread Dave Meikle (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156951#comment-17156951
 ] 

Dave Meikle commented on TIKA-3121:
---

Hi [~tallison],

Just pushed a main branch there and let Drew know on INFRA-20500.

Cheers,
Dave

> Rename master branch
> 
>
> Key: TIKA-3121
> URL: https://issues.apache.org/jira/browse/TIKA-3121
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I started a discussion on the dev list for this here:
> http://mail-archives.us.apache.org/mod_mbox/tika-dev/202006.mbox/%3CCAC1dCwW9FuK%2BkSzokmweeYwLFiED9g0W-43J1TNhMwnv7rdp8g%40mail.gmail.com%3E
> One committer would prefer that we not make this change, but seems ok with it.
> Recommendations:
> * main
> * trunk
> * development
> * stable





[jira] [Commented] (TIKA-3014) XLIFF12Parser fails with ToXMLHandler

2019-12-18 Thread Dave Meikle (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999571#comment-16999571
 ] 

Dave Meikle commented on TIKA-3014:
---

Scratch that, easier just to map lang over to XHTML one as no need for other 
attributes. Commit coming right up.
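
A minimal sketch of that mapping, using only the JDK's SAX classes: it reads the xml:lang value from an element's attributes and emits a plain lang attribute with no namespace, dropping everything else. The class and method names (LangAttributeSketch, toXhtmlLang) are hypothetical, and the actual commit may well differ.

```java
import javax.xml.XMLConstants;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not the Tika implementation.
public class LangAttributeSketch {

    /** Copies an xml:lang value, if present, to a namespace-free "lang" attribute. */
    public static Attributes toXhtmlLang(Attributes source) {
        AttributesImpl result = new AttributesImpl();
        // Namespace-aware lookup first, then fall back to the qualified name.
        String lang = source.getValue(XMLConstants.XML_NS_URI, "lang");
        if (lang == null) {
            lang = source.getValue("xml:lang");
        }
        if (lang != null) {
            // Empty namespace URI, so a serializer needs no "xml" prefix mapping.
            result.addAttribute("", "lang", "lang", "CDATA", lang);
        }
        return result;
    }
}
```

Because the resulting attribute carries no namespace, a handler that serializes it does not need an explicit declaration for the xml prefix.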

> XLIFF12Parser fails with ToXMLHandler 
> --
>
> Key: TIKA-3014
> URL: https://issues.apache.org/jira/browse/TIKA-3014
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> XLIFF12Parser fails with ToXMLHandler because xml namespace isn't set, but is 
> needed for "xml:lang".
> One option would be to remove the namespace on the lang attribute?
> [~dmeikle], any recommendations?
> To see the problem:
> 1) Make XLIFF12ParserTest extend TikaTest
> 2) add this test:
> {noformat}
> @Test
> public void testToXMLHandler() throws Exception {
>     String xml = getXML("testXLIFF12.xlf").xml;
>     assertContains("Another trans-unit", xml);
>     assertContains("Un autre trans-unit", xml);
> }
> {noformat}





[jira] [Commented] (TIKA-3014) XLIFF12Parser fails with ToXMLHandler

2019-12-18 Thread Dave Meikle (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999507#comment-16999507
 ] 

Dave Meikle commented on TIKA-3014:
---

Good spot. I think we need to add the explicit declaration in the 
_XHTMLContentHandler_.

I'll put a fix in tonight.

> XLIFF12Parser fails with ToXMLHandler 
> --
>
> Key: TIKA-3014
> URL: https://issues.apache.org/jira/browse/TIKA-3014
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> XLIFF12Parser fails with ToXMLHandler because xml namespace isn't set, but is 
> needed for "xml:lang".
> One option would be to remove the namespace on the lang attribute?
> [~dmeikle], any recommendations?
> To see the problem:
> 1) Make XLIFF12ParserTest extend TikaTest
> 2) add this test:
> {noformat}
> @Test
> public void testToXMLHandler() throws Exception {
>     String xml = getXML("testXLIFF12.xlf").xml;
>     assertContains("Another trans-unit", xml);
>     assertContains("Un autre trans-unit", xml);
> }
> {noformat}





[jira] [Created] (TIKA-2976) Add an XLZ parser

2019-10-29 Thread Dave Meikle (Jira)
Dave Meikle created TIKA-2976:
-

 Summary: Add an XLZ parser
 Key: TIKA-2976
 URL: https://issues.apache.org/jira/browse/TIKA-2976
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Dave Meikle
Assignee: Dave Meikle


Add an XLZ parser that processes the embedded XLF content.
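
A rough sketch of the container-walking part, assuming an XLZ file is a ZIP archive whose XLIFF payload sits in entries with an .xlf extension. The class name, method names, and sample entry names (content.xlf, skeleton.skl) are hypothetical; a real parser would pass each matching entry's stream to an XLIFF parser rather than just listing names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not the Tika implementation.
public class XlzSketch {

    /** Returns the names of archive entries that look like XLIFF files. */
    public static List<String> xlfEntryNames(byte[] archive) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()
                        && entry.getName().toLowerCase(Locale.ROOT).endsWith(".xlf")) {
                    // A real parser would read the embedded XLF content from "zip" here.
                    names.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a tiny in-memory archive with made-up entry names, for demonstration only. */
    public static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("content.xlf"));
            zip.write("<xliff version=\"1.2\"/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("skeleton.skl"));
            zip.write("skeleton".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

For example, XlzSketch.xlfEntryNames(XlzSketch.sampleArchive()) lists only the .xlf entry and skips the other file.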





[jira] [Resolved] (TIKA-2894) Add support for WebAssembly (Content-Type application/wasm, or .wasm extension)

2019-10-28 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-2894.
---
Fix Version/s: 1.23
   Resolution: Fixed

Added to master in 31304f8ba1aad16d76c3381d25efc245c32743e8

Added to branch_1x in 593e29d62347e6352d8ec01ad82573e8ad1f8dab

> Add support for WebAssembly (Content-Type application/wasm, or .wasm 
> extension)
> ---
>
> Key: TIKA-2894
> URL: https://issues.apache.org/jira/browse/TIKA-2894
> Project: Tika
>  Issue Type: Improvement
>  Components: detector
>Affects Versions: 1.21
>Reporter: Fredrik Söderström
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 1.23
>
>
> Right now I cannot find any support for wasm (WebAssembly) files, so I need to 
> add an external if statement in my Spring Boot project.
> {quote}String path = resource.getFile().getPath();
> if (path.endsWith(".wasm")) {
>   servletResponse.setContentType("application/wasm");
> } else {
>   servletResponse.setContentType(tika.detect(path));
> }
> {quote}
> It would be nice to add support for this format as well.





[jira] [Assigned] (TIKA-2894) Add support for WebAssembly (Content-Type application/wasm, or .wasm extension)

2019-10-28 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-2894:
-

Assignee: Dave Meikle

> Add support for WebAssembly (Content-Type application/wasm, or .wasm 
> extension)
> ---
>
> Key: TIKA-2894
> URL: https://issues.apache.org/jira/browse/TIKA-2894
> Project: Tika
>  Issue Type: Improvement
>  Components: detector
>Affects Versions: 1.21
>Reporter: Fredrik Söderström
>Assignee: Dave Meikle
>Priority: Major
>
> Right now I cannot find any support for wasm (WebAssembly) files, so I need to 
> add an external if statement in my Spring Boot project.
> {quote}String path = resource.getFile().getPath();
> if (path.endsWith(".wasm")) {
>   servletResponse.setContentType("application/wasm");
> } else {
>   servletResponse.setContentType(tika.detect(path));
> }
> {quote}
> It would be nice to add support for this format as well.





[jira] [Assigned] (TIKA-2900) Removing comments from *.docx, *.pdf files

2019-10-26 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-2900:
-

Assignee: Dave Meikle

> Removing comments from *.docx, *.pdf files
> --
>
> Key: TIKA-2900
> URL: https://issues.apache.org/jira/browse/TIKA-2900
> Project: Tika
>  Issue Type: Wish
>  Components: app, example
>Affects Versions: 1.21
>Reporter: Md
>Assignee: Dave Meikle
>Priority: Major
> Attachments: Document_with_Comments_Text_extarction_Tika_APP.docx, 
> Document_with_Comments_Text_extarction_Tika_APP.docx.txt
>
>
> Hello,
> I use Apache Tika to extract text, mostly from *.doc, *.docx and *.pdf files. 
> Sometimes there are comments in the file, and Tika extracts them and 
> adds them at the end of the extracted text. Is there a way to 
> exclude comments when extracting text? 
> Here is the code I am using:
> {code:java}
>  StringBuilder fileContent = new StringBuilder();
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(
>         BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}





[jira] [Resolved] (TIKA-2975) XLIFF 1.2 Parser

2019-10-26 Thread Dave Meikle (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-2975.
---
Fix Version/s: 1.23
   Resolution: Fixed

> XLIFF 1.2 Parser
> 
>
> Key: TIKA-2975
> URL: https://issues.apache.org/jira/browse/TIKA-2975
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.23
>
>
> Basic parser for XLIFF 1.2 files





[jira] [Commented] (TIKA-2975) XLIFF 1.2 Parser

2019-10-26 Thread Dave Meikle (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960484#comment-16960484
 ] 

Dave Meikle commented on TIKA-2975:
---

Merged into master in 80b533b86cf8e4c8e090e95479ef60f2a641194f

Merged into branch_1x in 7ac9eabe97f97ef33ac9a610b1f0b614d4d0f9b8

> XLIFF 1.2 Parser
> 
>
> Key: TIKA-2975
> URL: https://issues.apache.org/jira/browse/TIKA-2975
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Minor
>
> Basic parser for XLIFF 1.2 files





[jira] [Created] (TIKA-2975) XLIFF 1.2 Parser

2019-10-26 Thread Dave Meikle (Jira)
Dave Meikle created TIKA-2975:
-

 Summary: XLIFF 1.2 Parser
 Key: TIKA-2975
 URL: https://issues.apache.org/jira/browse/TIKA-2975
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Dave Meikle
Assignee: Dave Meikle


Basic parser for XLIFF 1.2 files





[jira] [Commented] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-11-01 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671189#comment-16671189
 ] 

Dave Meikle commented on TIKA-2760:
---

Hi [~markus17],

Looking at the Nutch code I can see that TikaParser has logic to honour the 
setting in the robots metadata. As this page sets _nofollow_, the parser 
doesn't add the links found by Tika's LinkContentHandler to the outlinks.

If you remove the nofollow from the HTML file's metadata you'll see it all flow 
through into Nutch.

{{}}

to

{{}}

It should all flow through as normal.

Cheers,
Dave
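
The behaviour described in this comment can be sketched roughly as follows. This is not Nutch's actual code: the class and method names are hypothetical, and the sketch only illustrates the idea that links reported by a content handler are discarded when the page's robots meta value contains nofollow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration, not taken from Nutch or Tika.
public class RobotsMetaSketch {

    /** Returns true when a robots meta "content" value includes the nofollow directive. */
    public static boolean isNoFollow(String robotsContent) {
        if (robotsContent == null) {
            return false;
        }
        for (String directive : robotsContent.split(",")) {
            if (directive.trim().toLowerCase(Locale.ROOT).equals("nofollow")) {
                return true;
            }
        }
        return false;
    }

    /** Keeps the discovered links unless the page asked crawlers not to follow them. */
    public static List<String> outlinks(List<String> discoveredLinks, String robotsContent) {
        if (isNoFollow(robotsContent)) {
            return new ArrayList<>();
        }
        return new ArrayList<>(discoveredLinks);
    }
}
```

With this shape, the link handler can report every anchor correctly and the caller can still end up with an empty outlink list, which matches the symptom reported in the issue.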

 

> LinkContentHandler does not report hyperlinks
> -
>
> Key: TIKA-2760
> URL: https://issues.apache.org/jira/browse/TIKA-2760
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: TIKA-2760 - Test for Outlinks.diff, TIKA-2760.patch, 
> ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and does not report 
> any hyperlink for http://www.ronaldmcdonaldhouse.co.uk/ which I'll also 
> attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals 
> only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many 
> elements, including hyperlinks.





[jira] [Commented] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-10-31 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671148#comment-16671148
 ] 

Dave Meikle commented on TIKA-2760:
---

Hi [~markus17],

I used your test but moved it into the tika-parsers project so the HtmlParser is 
registered; in tika-core there is just the MockParser, so I get the same results 
as you there.

Here's a diff based on your patch:
[^TIKA-2760 - Test for Outlinks.diff]

I've just forked Nutch and will have a wee look at the parse-tika and 
parse-html modules.

Cheers,
Dave

 

> LinkContentHandler does not report hyperlinks
> -
>
> Key: TIKA-2760
> URL: https://issues.apache.org/jira/browse/TIKA-2760
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: TIKA-2760 - Test for Outlinks.diff, TIKA-2760.patch, 
> ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and does not report 
> any hyperlink for http://www.ronaldmcdonaldhouse.co.uk/ which I'll also 
> attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals 
> that only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many 
> elements, including hyperlinks.





[jira] [Updated] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-10-31 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-2760:
--
Attachment: TIKA-2760 - Test for Outlinks.diff

> LinkContentHandler does not report hyperlinks
> -
>
> Key: TIKA-2760
> URL: https://issues.apache.org/jira/browse/TIKA-2760
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: TIKA-2760 - Test for Outlinks.diff, TIKA-2760.patch, 
> ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and does not report 
> any hyperlink for http://www.ronaldmcdonaldhouse.co.uk/ which I'll also 
> attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals 
> that only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many 
> elements, including hyperlinks.





[jira] [Commented] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-10-29 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667966#comment-16667966
 ] 

Dave Meikle commented on TIKA-2760:
---

[~markus17] - is it typically the HTML parser being used in Nutch? Using your 
test with the HTML parser registered gives me 94 links.

> LinkContentHandler does not report hyperlinks
> -
>
> Key: TIKA-2760
> URL: https://issues.apache.org/jira/browse/TIKA-2760
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: TIKA-2760.patch, ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and does not report 
> any hyperlink for http://www.ronaldmcdonaldhouse.co.uk/ which I'll also 
> attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals 
> that only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many 
> elements, including hyperlinks.





[jira] [Commented] (TIKA-2630) Wrong height and width metadata for JPEG images

2018-10-29 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667918#comment-16667918
 ] 

Dave Meikle commented on TIKA-2630:
---

After writing it, I know it really won't, given the clash of metadata keys 
between the Exif directories.

Wondering if we could, short term, just add the directory name in as a key 
qualifier for Exif information only, given that is where this is an issue 
just now.

Will create a proposed pull request and see what others think.
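
As a rough sketch of that idea (not the eventual pull request): qualify the key with the directory name for Exif directories only and leave everything else untouched. The class and method names are hypothetical, the ":" separator is an assumption, and the values are the ones from the sample image in the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: not the Tika implementation.
final class QualifiedKeySketch {

    /** Prefixes the tag name with the directory name, but only for Exif directories. */
    static String keyFor(String directory, String tag) {
        return directory.startsWith("Exif") ? directory + ":" + tag : tag;
    }

    public static void main(String[] args) {
        // {directory, tag name, value} as reported by metadata-extractor for the sample image.
        String[][] tags = {
            {"JPEG", "Image Width", "1535 pixels"},
            {"Exif IFD0", "Image Width", "2173 pixels"},
        };
        Map<String, String> unqualified = new LinkedHashMap<>();
        Map<String, String> qualified = new LinkedHashMap<>();
        for (String[] t : tags) {
            unqualified.put(t[1], t[2]);            // same tag name: the later value overwrites the earlier
            qualified.put(keyFor(t[0], t[1]), t[2]); // Exif value gets its own key, so both survive
        }
        System.out.println(unqualified);
        System.out.println(qualified);
    }
}
```

The unqualified map ends up with a single "Image Width" entry holding the Exif IFD0 value, which is the clash seen in this issue; the qualified map keeps both values.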

> Wrong height and width metadata for JPEG images
> ---
>
> Key: TIKA-2630
> URL: https://issues.apache.org/jira/browse/TIKA-2630
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Ancuta Morarasu
>Assignee: Dave Meikle
>Priority: Major
> Attachments: Tika-metadata.txt, metadata-exctractor-metadata.txt, 
> sizesampleissue.jpg
>
>
> According to [Exif 
> specs|http://www.exif.org/Exif2-2.PDF#page=73=auto,-176,103], for 
> compressed images the values for width and height should come from the tags:
> * *PixelXDimension* mapped in metadata-extractor to 
> {{com.drew.metadata.Directory.ExifDirectoryBase.TAG_EXIF_IMAGE_WIDTH}} and
> * *PixelYDimension* mapped to {{ExifDirectoryBase.TAG_EXIF_IMAGE_HEIGHT}}.
> {{ImageMetadataExtractor$ExifHandler.[handlePhotoTags(...)|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java#L487]}}
>  should extract and set these in the metadata:
> {code:java}
>  // Width from PixelXDimension (Exif SubIFD)
>  if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)) {
>      metadata.set(Metadata.IMAGE_WIDTH,
>          trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)));
>  }
>  // Height from PixelYDimension (Exif SubIFD)
>  if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)) {
>      metadata.set(Metadata.IMAGE_LENGTH,
>          trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)));
>  }
> {code}
> Also the {{CopyUnknownFieldsHandler}} overrides the values for "Image Width" 
> ({{JpegDirectory.TAG_IMAGE_WIDTH}}) and "Image Height" 
> ({{JpegDirectory.TAG_IMAGE_HEIGHT}}) with the values from 
> {{ExifIFD0Descriptor.TAG_IMAGE_WIDTH}} and 
> {{ExifIFD0Descriptor.TAG_IMAGE_HEIGHT}} because they have the same tag name.
> I attached a sample image, these are the metadata values:
> * extracted by metadata-extractor:
> [JPEG] Image Height = 367 pixels
> [JPEG] Image Width = 1535 pixels
> [Exif IFD0] Image Width = 2173 pixels
> [Exif IFD0] Image Height = 520 pixels
> [Exif SubIFD] Exif Image Width = 1535 pixels
> [Exif SubIFD] Exif Image Height = 367 pixels
> * Tika metadata:
> Image Height: 520 pixels
> Image Width: 2173 pixels
> tiff:ImageLength: 520
> tiff:ImageWidth: 2173
> Exif Image Height: 367 pixels
> Exif Image Width: 1535 pixels





[jira] [Commented] (TIKA-2630) Wrong height and width metadata for JPEG images

2018-10-29 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667903#comment-16667903
 ] 

Dave Meikle commented on TIKA-2630:
---

Thanks for raising this one. Short term we can add in the reading from these 
fields for compressed images, knowing it will set tiff:ImageWidth and 
tiff:ImageLength to the correct values.

Longer term we need to address the metadata clashes as part of the 2.x series. 
Whilst we could add in the directory name as a key to the metadata (e.g. 
Exif IFD0:Image Height: 520 pixels), I would be worried about the impact on 
downstream code that has got used to what we do. This means we can also build 
upon the Metadata proposals for 2.x.

Will this work for you?

 

> Wrong height and width metadata for JPEG images
> ---
>
> Key: TIKA-2630
> URL: https://issues.apache.org/jira/browse/TIKA-2630
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Ancuta Morarasu
>Assignee: Dave Meikle
>Priority: Major
> Attachments: Tika-metadata.txt, metadata-exctractor-metadata.txt, 
> sizesampleissue.jpg
>
>
> According to [Exif 
> specs|http://www.exif.org/Exif2-2.PDF#page=73=auto,-176,103], for 
> compressed images the values for width and height should come from the tags:
> * *PixelXDimension* mapped in metadata-extractor to 
> {{com.drew.metadata.Directory.ExifDirectoryBase.TAG_EXIF_IMAGE_WIDTH}} and
> * *PixelYDimension* mapped to {{ExifDirectoryBase.TAG_EXIF_IMAGE_HEIGHT}}.
> {{ImageMetadataExtractor$ExifHandler.[handlePhotoTags(...)|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java#L487]}}
>  should extract and set these in the metadata:
> {code:java}
>  // Width from PixelXDimension (Exif SubIFD)
>  if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)) {
>      metadata.set(Metadata.IMAGE_WIDTH,
>          trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)));
>  }
>  // Height from PixelYDimension (Exif SubIFD)
>  if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)) {
>      metadata.set(Metadata.IMAGE_LENGTH,
>          trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)));
>  }
> {code}
> Also the {{CopyUnknownFieldsHandler}} overrides the values for "Image Width" 
> ({{JpegDirectory.TAG_IMAGE_WIDTH}}) and "Image Height" 
> ({{JpegDirectory.TAG_IMAGE_HEIGHT}}) with the values from 
> {{ExifIFD0Descriptor.TAG_IMAGE_WIDTH}} and 
> {{ExifIFD0Descriptor.TAG_IMAGE_HEIGHT}} because they have the same tag name.
> I attached a sample image, these are the metadata values:
> * extracted by metadata-extractor:
> [JPEG] Image Height = 367 pixels
> [JPEG] Image Width = 1535 pixels
> [Exif IFD0] Image Width = 2173 pixels
> [Exif IFD0] Image Height = 520 pixels
> [Exif SubIFD] Exif Image Width = 1535 pixels
> [Exif SubIFD] Exif Image Height = 367 pixels
> * Tika metadata:
> Image Height: 520 pixels
> Image Width: 2173 pixels
> tiff:ImageLength: 520
> tiff:ImageWidth: 2173
> Exif Image Height: 367 pixels
> Exif Image Width: 1535 pixels





[jira] [Assigned] (TIKA-2630) Wrong height and width metadata for JPEG images

2018-10-29 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-2630:
-

Assignee: Dave Meikle

> Wrong height and width metadata for JPEG images
> ---
>
> Key: TIKA-2630
> URL: https://issues.apache.org/jira/browse/TIKA-2630
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Ancuta Morarasu
>Assignee: Dave Meikle
>Priority: Major
> Attachments: Tika-metadata.txt, metadata-exctractor-metadata.txt, 
> sizesampleissue.jpg
>
>
> According to [Exif 
> specs|http://www.exif.org/Exif2-2.PDF#page=73=auto,-176,103], for 
> compressed images the values for width and height should come from the tags:
> * *PixelXDimension* mapped in metadata-extractor to 
> {{com.drew.metadata.Directory.ExifDirectoryBase.TAG_EXIF_IMAGE_WIDTH}} and
> * *PixelYDimension* mapped to {{ExifDirectoryBase.TAG_EXIF_IMAGE_HEIGHT}}.
> {{ImageMetadataExtractor$ExifHandler.[handlePhotoTags(...)|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java#L487]}}
>  should extract and set these in the metadata:
> {code:java}
>  // Width from PixelXDimension (Exif SubIFD)
>  if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)) {
>      metadata.set(Metadata.IMAGE_WIDTH,
>          trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)));
>  }
>  // Height from PixelYDimension (Exif SubIFD)
>  if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)) {
>      metadata.set(Metadata.IMAGE_LENGTH,
>          trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)));
>  }
> {code}
> Also the {{CopyUnknownFieldsHandler}} overrides the values for "Image Width" 
> ({{JpegDirectory.TAG_IMAGE_WIDTH}}) and "Image Height" 
> ({{JpegDirectory.TAG_IMAGE_HEIGHT}}) with the values from 
> {{ExifIFD0Descriptor.TAG_IMAGE_WIDTH}} and 
> {{ExifIFD0Descriptor.TAG_IMAGE_HEIGHT}} because they have the same tag name.
> I attached a sample image, these are the metadata values:
> * extracted by metadata-extractor:
> [JPEG] Image Height = 367 pixels
> [JPEG] Image Width = 1535 pixels
> [Exif IFD0] Image Width = 2173 pixels
> [Exif IFD0] Image Height = 520 pixels
> [Exif SubIFD] Exif Image Width = 1535 pixels
> [Exif SubIFD] Exif Image Height = 367 pixels
> * Tika metadata:
> Image Height: 520 pixels
> Image Width: 2173 pixels
> tiff:ImageLength: 520
> tiff:ImageWidth: 2173
> Exif Image Height: 367 pixels
> Exif Image Width: 1535 pixels





[jira] [Commented] (TIKA-2767) Problem with import xlsx with null cells

2018-10-29 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667840#comment-16667840
 ] 

Dave Meikle commented on TIKA-2767:
---

Hi [~iodor] - I've tried to recreate this by building my own Excel file but 
don't get the issue with the latest build. Do you have an example file for this?

TIKA-2479 should have fixed this.

> Problem with import xlsx with null cells
> 
>
> Key: TIKA-2767
> URL: https://issues.apache.org/jira/browse/TIKA-2767
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.18
>Reporter: ionut hodor
>Priority: Major
> Attachments: example.png
>
>
> I have a problem with xlsx when there are cells without a value. The cells are 
> not considered and the next cells on the same row are translated.
>  In the example the cells with value "value4" are combined with header2.
> I'm using Tika 1.18 but I met the same problem with Tika 1.19.
> I have this problem only with xlsx; with xls everything is OK.





[jira] [Resolved] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-2599.
---
Resolution: Fixed

Committed to branch_1x in 324cbd2eb4d64f1e34aba9789ee8b06cbf4d991e and master in 
6ccedbadd4f79d7888eabfcd3a74ab85e168.

> Hyperlink surrounded by Italics not closed Properly
> ---
>
> Key: TIKA-2599
> URL: https://issues.apache.org/jira/browse/TIKA-2599
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14, 1.15, 1.16, 1.17
> Environment: Any
>Reporter: Ronan O'Sullivan
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.20
>
> Attachments: diff-TIKA-2599.txt, 
> testWord_italicsSurroundingHyperlink.doc
>
>
> If a Word document contains a hyperlink surrounded by italicized text, the 
> resulting xhtml is:
>  
> Italic Test before link  href="http://www.google.com"/>hyperlink italics 
> Italic text after hyperlink
>  
> The opening italics tag is not closed which is not valid XHTML.





[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667823#comment-16667823
 ] 

Dave Meikle commented on TIKA-2599:
---

Committed to branch_1x in 324cbd2eb4d64f1e34aba9789ee8b06cbf4d991e and master in 
6ccedbadd4f79d7888eabfcd3a74ab85e168.

Thanks [~ronanos]!

> Hyperlink surrounded by Italics not closed Properly
> ---
>
> Key: TIKA-2599
> URL: https://issues.apache.org/jira/browse/TIKA-2599
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14, 1.15, 1.16, 1.17
> Environment: Any
>Reporter: Ronan O'Sullivan
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.20
>
> Attachments: diff-TIKA-2599.txt, 
> testWord_italicsSurroundingHyperlink.doc
>
>
> If a Word document contains a hyperlink surrounded by italicized text, the 
> resulting xhtml is:
>  
> Italic Test before link  href="http://www.google.com"/>hyperlink italics 
> Italic text after hyperlink
>  
> The opening italics tag is not closed which is not valid XHTML.





[jira] [Updated] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-2599:
--
Fix Version/s: 1.20

> Hyperlink surrounded by Italics not closed Properly
> ---
>
> Key: TIKA-2599
> URL: https://issues.apache.org/jira/browse/TIKA-2599
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14, 1.15, 1.16, 1.17
> Environment: Any
>Reporter: Ronan O'Sullivan
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.20
>
> Attachments: diff-TIKA-2599.txt, 
> testWord_italicsSurroundingHyperlink.doc
>
>
> If a Word document contains a hyperlink surrounded by italicized text, the 
> resulting xhtml is:
>  
> Italic Test before link  href="http://www.google.com"/>hyperlink italics 
> Italic text after hyperlink
>  
> The opening italics tag is not closed which is not valid XHTML.





[jira] [Assigned] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-2599:
-

Assignee: Dave Meikle

> Hyperlink surrounded by Italics not closed Properly
> ---
>
> Key: TIKA-2599
> URL: https://issues.apache.org/jira/browse/TIKA-2599
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14, 1.15, 1.16, 1.17
> Environment: Any
>Reporter: Ronan O'Sullivan
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.20
>
> Attachments: diff-TIKA-2599.txt, 
> testWord_italicsSurroundingHyperlink.doc
>
>
> If a Word document contains a hyperlink surrounded by italicized text, the 
> resulting xhtml is:
>  
> Italic Test before link  href="http://www.google.com"/>hyperlink italics 
> Italic text after hyperlink
>  
> The opening italics tag is not closed which is not valid XHTML.





[jira] [Created] (TIKA-2740) Update Python dependency check for TesseractOCR Parser rotation.py script

2018-09-28 Thread Dave Meikle (JIRA)
Dave Meikle created TIKA-2740:
-

 Summary: Update Python dependency check for TesseractOCR Parser 
rotation.py script
 Key: TIKA-2740
 URL: https://issues.apache.org/jira/browse/TIKA-2740
 Project: Tika
  Issue Type: Bug
  Components: ocr
Affects Versions: 1.19
Reporter: Dave Meikle
Assignee: Dave Meikle


TesseractOCRParserTest.testRotatedOCR fails when the TkInter module is not 
available but Tesseract and the other named Python dependencies are installed.

To address this we should:
 * Update the _hasPython_ method to include this in the check
 * Update the Wiki to list explicitly the dependencies and how to install them 
on major platforms/variants.
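
A minimal sketch of the kind of check the first bullet describes, assuming the interpreter is asked to import the module and the exit code is inspected. This is not the existing _hasPython_ code; the class and method names are hypothetical, and the module name passed in is an assumption because it differs between Python versions (for example "Tkinter" versus "tkinter").

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only: not the TesseractOCRParser implementation.
final class PythonModuleCheckSketch {

    /** Builds the command that asks the interpreter to import a single module. */
    static List<String> importCommand(String pythonExe, String module) {
        return List.of(pythonExe, "-c", "import " + module);
    }

    /** Returns true only if the interpreter starts and the import exits with status 0. */
    static boolean hasModule(String pythonExe, String module) {
        try {
            Process p = new ProcessBuilder(importCommand(pythonExe, module))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!p.waitFor(10, TimeUnit.SECONDS)) {
                p.destroyForcibly(); // treat a hung interpreter as "not available"
                return false;
            }
            return p.exitValue() == 0;
        } catch (IOException e) {
            return false; // interpreter could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // The result depends on the local Python installation.
        System.out.println(hasModule("python", "tkinter"));
    }
}
```

The test could then be skipped rather than failed whenever such a check returns false for any required module.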





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390317#comment-16390317
 ] 

Dave Meikle commented on TIKA-1518:
---

[~talli...@mitre.org] - ah it looks like the proxy settings aren't being passed 
into the Docker container.

Normally I've passed proxy settings via buildArgs to docker but I am not sure 
how this is handled by the Maven plugin.  I've not done docker behind a proxy 
for a while.

Can you try adding -X to the Maven command to see what is being set?

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390284#comment-16390284
 ] 

Dave Meikle edited comment on TIKA-1518 at 3/7/18 9:41 PM:
---

It is a choice we have to make. There are three main routes to Docker 
packaging that I have used:
 # Automated builds that pull in pre-packaged artefacts and bundle them into an 
image on any change in the repository - like what we are doing in the 
docker-tikaserver approach, where it goes and downloads the signed JARs
 # Automated builds that compile the code in the image (e.g. using the Maven 
Docker image) and then package it
 # Building a release image and then distributing that - which is what this 
does, but it requires us to decide when an official release is available and 
push it somewhere

The first and second are really good for leveraging things like Docker Hub to 
automatically build from your repository, whereas the third means you have to 
have Docker on your machine when you want to build an image.

I have never really liked number two, as it means the code is recompiled 
each time a change is triggered, so you can easily be packaging up 
different code as the same version without realising it.

The challenge with the approach in docker-tikaserver is keeping up when assets 
that are being pulled in move - i.e. when a release JAR is moved from 
dist.apache.org - but that could easily be solved by going to Nexus for the 
JARs based on the release packages.

I personally quite like the third approach as it means you explicitly create an 
image that has its own life, and I was thinking that we could potentially add 
this to the release process, pushing the image from the release build to Docker 
Hub/Nexus/another repo so it is an official build.  So, just like when we do a 
mvn release, we can go to tika-server and do a mvn dockerfile:build and, if 
happy, mvn dockerfile:push (once we bottom out where).

Not sure what others think?


was (Author: davemeikle):
It is a choice we have to make. There are three main routes to Docker 
packaging that I have used:
 # Automated builds that pull in pre-packaged artefacts and bundle them into an 
image on any change in the repository - like what we are doing in the 
docker-tikaserver approach, where it goes and downloads the signed JARs
 # Automated builds that compile the code in the image (e.g. using the Maven 
Docker image) and then package it
 # Building a release image and then distributing that - which is what this 
does, but it requires us to decide when an official release is available and 
push it somewhere

The first and second are really good for leveraging things like Docker Hub to 
automatically build from your repository, whereas the third means you have to 
have Docker on your machine when you want to build an image.

I have never really liked number two, as it means the code is recompiled 
each time a change is triggered, so you can easily be packaging up 
different code as the same version without realising it.

The challenge with the approach in docker-tikaserver is keeping up when assets 
that are being pulled in move - i.e. when a release JAR is moved from 
dist.apache.org - but that could easily be solved by going to Nexus for the 
JARs based on the release packages.

I personally quite like the third approach as it means you explicitly create an 
image that has its own life, and I was thinking that we could potentially add 
this to the release process, pushing the image from the release build to Docker 
Hub/Nexus/another repo so it is an official build.

Not sure what others think?

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390284#comment-16390284
 ] 

Dave Meikle commented on TIKA-1518:
---

It is a choice we have to make. There are three main routes to Docker 
packaging that I have used:
 # Automated builds that pull in pre-packaged artefacts and bundle them into an 
image on any change in the repository - like what we are doing in the 
docker-tikaserver approach, where it goes and downloads the signed JARs
 # Automated builds that compile the code in the image (e.g. using the Maven 
Docker image) and then package it
 # Building a release image and then distributing that - which is what this 
does, but it requires us to decide when an official release is available and 
push it somewhere

The first and second are really good for leveraging things like Docker Hub to 
automatically build from your repository, whereas the third means you have to 
have Docker on your machine when you want to build an image.

I have never really liked number two, as it means the code is recompiled 
each time a change is triggered, so you can easily be packaging up 
different code as the same version without realising it.

The challenge with the approach in docker-tikaserver is keeping up when assets 
that are being pulled in move - i.e. when a release JAR is moved from 
dist.apache.org - but that could easily be solved by going to Nexus for the 
JARs based on the release packages.

I personally quite like the third approach as it means you explicitly create an 
image that has its own life, and I was thinking that we could potentially add 
this to the release process, pushing the image from the release build to Docker 
Hub/Nexus/another repo so it is an official build.

Not sure what others think?

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
> Attachments: tika-server-docker-err-msg.txt
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390241#comment-16390241
 ] 

Dave Meikle commented on TIKA-1518:
---

{quote}I do have Docker installed, [0] but it is Windows, and I've noticed 
some, um, areas for improvement in Docker on Windows.
{quote}
On Windows I've found that I have to enable the "Expose daemon on 
tcp://localhost:2375 without TLS" setting in Docker for Windows before many of 
the clients out there can talk to it. Does this solve it for you?

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Comment Edited] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390202#comment-16390202
 ] 

Dave Meikle edited comment on TIKA-1518 at 3/7/18 8:51 PM:
---

Sorry [~talli...@mitre.org] - this is me getting too excited. I'll need to 
remove it from being hooked on the "build" phase so those without Docker can 
build without this!

Will do this just now.


was (Author: davemeikle):
Sorry [~talli...@mitre.org] - this is me getting too excited. I'll need to 
remove it from being hooked on the "build" phase so those without Docker can 
build without this!

Will do this just now.

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390202#comment-16390202
 ] 

Dave Meikle commented on TIKA-1518:
---

Sorry [~talli...@mitre.org] - this is me getting too excited. I'll need to 
remove it from being hooked on the "build" phase so those without Docker can 
build without this!

Will do this just now.

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.





[jira] [Commented] (TIKA-1518) Docker with Tika Server

2018-03-04 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385167#comment-16385167
 ] 

Dave Meikle commented on TIKA-1518:
---

As the current Dockerfile was out of date, I've updated it to use the build 
artefacts to create the Docker image. This means you can run the following in 
the tika-server project:

{{mvn package dockerfile:build}}

We can set up the POM to push to Docker Hub on the deploy stage, so that at 
release time we always publish a tagged version that can be used.

Will speak to INFRA about getting access to an account owned by the Apache 
organisation.

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (TIKA-1518) Docker with Tika Server

2018-03-04 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-1518:
-

Assignee: Dave Meikle

> Docker with Tika Server
> ---
>
> Key: TIKA-1518
> URL: https://issues.apache.org/jira/browse/TIKA-1518
> Project: Tika
>  Issue Type: New Feature
>Reporter: Paul Ramirez
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 2.0, 1.17
>
>
> This version should be able to demonstrate as many of Apache Tika's 
> capabilities as possible. For instance with GDAL, Tesseract, and FFmpeg to 
> show parsers which require installation of other dependencies. In addition, 
> this should help move TIKA-1301 forward and should leverage the suggestion 
> made by [~lewismc] of a script which can pull down the latest version of 
> Apache Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2509) TesseractOCRParser ignores configured ImageMagickPath in processImage method

2018-01-15 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326160#comment-16326160
 ] 

Dave Meikle commented on TIKA-2509:
---

Created a new improvement ticket (TIKA-2548) for the Python path configuration
noted in this ticket.

> TesseractOCRParser ignores configured ImageMagickPath in processImage method
> 
>
> Key: TIKA-2509
> URL: https://issues.apache.org/jira/browse/TIKA-2509
> Project: Tika
>  Issue Type: Bug
>  Components: ocr
>Affects Versions: 1.16, 1.17
>Reporter: Richard Jones
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 1.18
>
>
> The TesseractOCRParser class uses the configured ImageMagickPath in method 
> hasImageMagick to determine whether ImageMagick is present.  Ref:
> String ImageMagick = config.getImageMagickPath() + getImageMagickProg();
> BUT then completely ignores the configured path in the processImage method 
> meaning ImageMagick has to be present on system path (so what's the point of 
> the ImageMagickPath config setting).
> The doOCR method on the other hand DOES use the configured tesseractPath.
> Incidentally I notice there is no equivalent PythonPath config setting even 
> though Python is attempted to be found/used.
> Some consistency would be appreciated so that ImageMagick and Python don't 
> have to be present on the system path.  i.e. follow the model already in 
> place for finding/using Tesseract.
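
The inconsistency described above (the configured path is used to detect a tool but not to run it) can be sketched as follows. This is a minimal illustration only, not the code in TesseractOCRParser: the class and method names are hypothetical, and the idea is simply that one helper resolves the executable for both the "is it present?" check and the actual invocation.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Tika's actual code: resolves an external tool
// against a configured directory so that detection and execution agree.
final class ExternalToolPathSketch {

    // Joins a configured directory and a program name. A null or empty
    // directory means "rely on the system path".
    static String resolve(String configuredDir, String program) {
        if (configuredDir == null || configuredDir.isEmpty()) {
            return program;
        }
        String sep = File.separator;
        boolean hasSep = configuredDir.endsWith("/") || configuredDir.endsWith(sep);
        return hasSep ? configuredDir + program : configuredDir + sep + program;
    }

    // Builds the full command line with the resolved executable first.
    static List<String> command(String configuredDir, String program, String... args) {
        List<String> cmd = new ArrayList<>();
        cmd.add(resolve(configuredDir, program));
        for (String arg : args) {
            cmd.add(arg);
        }
        return cmd;
    }
}
```

The same helper could serve ImageMagick, Python and Tesseract alike, which is the consistency the reporter asks for; the resulting list is in the form that java.lang.ProcessBuilder accepts.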



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TIKA-2548) Add Python Path configuration to TesseractOCRParser

2018-01-15 Thread Dave Meikle (JIRA)
Dave Meikle created TIKA-2548:
-

 Summary: Add Python Path configuration to TesseractOCRParser
 Key: TIKA-2548
 URL: https://issues.apache.org/jira/browse/TIKA-2548
 Project: Tika
  Issue Type: Improvement
  Components: ocr
 Environment: Add Python Path configuration setting to 
TesseractOCRParser to allow for different python environments to be used 
similar to Tesseract and ImageMagick settings
Reporter: Dave Meikle
Assignee: Dave Meikle






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (TIKA-2509) TesseractOCRParser ignores configured ImageMagickPath in processImage method

2018-01-15 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-2509.
---
   Resolution: Fixed
Fix Version/s: 1.18

Updated in 
[0b9aa9b5efde795f6b863c987abff5be07530a41|https://github.com/apache/tika/commit/0b9aa9b5efde795f6b863c987abff5be07530a41]
 on master and 
[2922511b5d1662654921a2e02599324aae4a84f4|https://github.com/apache/tika/commit/2922511b5d1662654921a2e02599324aae4a84f4]
 on branch_1x. Thank you!

> TesseractOCRParser ignores configured ImageMagickPath in processImage method
> 
>
> Key: TIKA-2509
> URL: https://issues.apache.org/jira/browse/TIKA-2509
> Project: Tika
>  Issue Type: Bug
>  Components: ocr
>Affects Versions: 1.16, 1.17
>Reporter: Richard Jones
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 1.18
>
>
> The TesseractOCRParser class uses the configured ImageMagickPath in method 
> hasImageMagick to determine whether ImageMagick is present.  Ref:
> String ImageMagick = config.getImageMagickPath() + getImageMagickProg();
> BUT then completely ignores the configured path in the processImage method 
> meaning ImageMagick has to be present on system path (so what's the point of 
> the ImageMagickPath config setting).
> The doOCR method on the other hand DOES use the configured tesseractPath.
> Incidentally I notice there is no equivalent PythonPath config setting even 
> though Python is attempted to be found/used.
> Some consistency would be appreciated so that ImageMagick and Python don't 
> have to be present on the system path.  i.e. follow the model already in 
> place for finding/using Tesseract.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (TIKA-2509) TesseractOCRParser ignores configured ImageMagickPath in processImage method

2018-01-15 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-2509:
-

Assignee: Dave Meikle

> TesseractOCRParser ignores configured ImageMagickPath in processImage method
> 
>
> Key: TIKA-2509
> URL: https://issues.apache.org/jira/browse/TIKA-2509
> Project: Tika
>  Issue Type: Bug
>  Components: ocr
>Affects Versions: 1.16, 1.17
>Reporter: Richard Jones
>Assignee: Dave Meikle
>Priority: Major
>
> The TesseractOCRParser class uses the configured ImageMagickPath in method 
> hasImageMagick to determine whether ImageMagick is present.  Ref:
> String ImageMagick = config.getImageMagickPath() + getImageMagickProg();
> BUT then completely ignores the configured path in the processImage method 
> meaning ImageMagick has to be present on system path (so what's the point of 
> the ImageMagickPath config setting).
> The doOCR method on the other hand DOES use the configured tesseractPath.
> Incidentally I notice there is no equivalent PythonPath config setting even 
> though Python is attempted to be found/used.
> Some consistency would be appreciated so that ImageMagick and Python don't 
> have to be present on the system path.  i.e. follow the model already in 
> place for finding/using Tesseract.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (TIKA-2385) Tesseract OCR rotation.py not run

2017-11-24 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-2385:
-

Assignee: Dave Meikle

> Tesseract OCR rotation.py not run
> -
>
> Key: TIKA-2385
> URL: https://issues.apache.org/jira/browse/TIKA-2385
> Project: Tika
>  Issue Type: Bug
>  Components: ocr
>Affects Versions: 1.15
>Reporter: Peter Weiss
>Assignee: Dave Meikle
> Fix For: 1.17
>
>
> It appears that even if Python is installed, the rotation.py that calculates 
> rotation angle of the image does not run because of indentation/spacing 
> errors in the Python script.
> Also recommend making this a configurable parameter since it does add time 
> and can produce unexpected results if the supplied image contains more than 
> just plain text.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (TIKA-2347) Underlined text is not decorated as such when extracting from word documents

2017-11-23 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-2347.
---
   Resolution: Fixed
Fix Version/s: 1.17

Committed in 
[639f3bf361a08210da8fae68e3eeb4e12df6c4de|https://github.com/apache/tika/commit/639f3bf361a08210da8fae68e3eeb4e12df6c4de].
 Thanks Stuart!

> Underlined text is not decorated as such when extracting from word documents
> 
>
> Key: TIKA-2347
> URL: https://issues.apache.org/jira/browse/TIKA-2347
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.0, 1.14
>Reporter: Stuart Hendren
>Assignee: Dave Meikle
> Fix For: 1.17
>
>
> When extracting from doc and docx bold and italic text decoration is 
> extracted, however underlining is not.  Can be demonstrated in WordParserTest 
> or OOXMLParserTest (change to docx) with the following test case.
> {code:title=WordParserTest.java|borderStyle=solid}
> @Test
> public void testTextDecoration() throws Exception {
>   XMLResult result = getXML("testWORD_various.doc");
>   String xml = result.xml;
>   assertTrue(xml.contains("Bold"));
>   assertTrue(xml.contains("italic"));
>   assertTrue(xml.contains("underline"));
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (TIKA-2347) Underlined text is not decorated as such when extracting from word documents

2017-11-23 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-2347:
-

Assignee: Dave Meikle

> Underlined text is not decorated as such when extracting from word documents
> 
>
> Key: TIKA-2347
> URL: https://issues.apache.org/jira/browse/TIKA-2347
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.0, 1.14
>Reporter: Stuart Hendren
>Assignee: Dave Meikle
>
> When extracting from doc and docx bold and italic text decoration is 
> extracted, however underlining is not.  Can be demonstrated in WordParserTest 
> or OOXMLParserTest (change to docx) with the following test case.
> {code:title=WordParserTest.java|borderStyle=solid}
> @Test
> public void testTextDecoration() throws Exception {
>   XMLResult result = getXML("testWORD_various.doc");
>   String xml = result.xml;
>   assertTrue(xml.contains("Bold"));
>   assertTrue(xml.contains("italic"));
>   assertTrue(xml.contains("underline"));
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (TIKA-2357) Allow Tesseract PSM up to 13

2017-05-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-2357.
---
Resolution: Fixed
  Assignee: Dave Meikle

Merged in 
[0aaa121|https://github.com/apache/tika/commit/0aaa1215fd11632c349e9bdebac9829578276cb1].
 Thanks Rafael!

> Allow Tesseract PSM up to 13
> 
>
> Key: TIKA-2357
> URL: https://issues.apache.org/jira/browse/TIKA-2357
> Project: Tika
>  Issue Type: Improvement
>  Components: ocr
>Affects Versions: 1.14
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.15
>
>
> From https://github.com/apache/tika/pull/177 by Rafael Ferreira  
> Extend support for increased PSM options up to 13 for modern versions of 
> Tesseract.
> {code}
> $ tesseract --version
> tesseract 3.05.00
>  leptonica-1.74.1
>   libjpeg 8d : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.8
> $ tesseract --help-psm
> Page segmentation modes:
>   0    Orientation and script detection (OSD) only.
>   1    Automatic page segmentation with OSD.
>   2    Automatic page segmentation, but no OSD, or OCR.
>   3    Fully automatic page segmentation, but no OSD. (Default)
>   4    Assume a single column of text of variable sizes.
>   5    Assume a single uniform block of vertically aligned text.
>   6    Assume a single uniform block of text.
>   7    Treat the image as a single text line.
>   8    Treat the image as a single word.
>   9    Treat the image as a single word in a circle.
>  10    Treat the image as a single character.
>  11    Sparse text. Find as much text as possible in no particular order.
>  12    Sparse text with OSD.
>  13    Raw line. Treat the image as a single text line,
>        bypassing hacks that are Tesseract-specific.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (TIKA-2357) Allow Tesseract PSM up to 13

2017-05-08 Thread Dave Meikle (JIRA)
Dave Meikle created TIKA-2357:
-

 Summary: Allow Tesseract PSM up to 13
 Key: TIKA-2357
 URL: https://issues.apache.org/jira/browse/TIKA-2357
 Project: Tika
  Issue Type: Improvement
  Components: ocr
Affects Versions: 1.14
Reporter: Dave Meikle
Priority: Minor
 Fix For: 1.15


From https://github.com/apache/tika/pull/177 by Rafael Ferreira

Extend support for increased PSM options up to 13 for modern versions of 
Tesseract.

{code}
$ tesseract --version
tesseract 3.05.00
 leptonica-1.74.1
  libjpeg 8d : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.8

$ tesseract --help-psm
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
{code}
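
As a rough sketch of what "allow PSM up to 13" amounts to on the Java side, a range check along the following lines would accept the modes listed above and reject anything else. The class and method names are hypothetical and this is not the code merged from the pull request; only the 0 to 13 range is taken from the listing.

```java
// Hypothetical validator, not Tika's actual code.
final class PageSegModeSketch {
    static final int MIN_PSM = 0;
    static final int MAX_PSM = 13; // upper bound taken from the --help-psm listing

    // Returns the normalised mode as a string, or throws if it is not
    // an integer in the supported range.
    static String validate(String psm) {
        int value;
        try {
            value = Integer.parseInt(psm.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    "Page segmentation mode must be an integer: " + psm, e);
        }
        if (value < MIN_PSM || value > MAX_PSM) {
            throw new IllegalArgumentException(
                    "Page segmentation mode must be between " + MIN_PSM
                            + " and " + MAX_PSM + ": " + psm);
        }
        return Integer.toString(value);
    }
}
```

Older Tesseract builds may not support the higher modes, so a caller might still want to handle a failure from the tesseract process itself.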




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (TIKA-2297) Add Lingo24 Language Detector

2017-03-13 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-2297.
---
Resolution: Fixed

Added in commit 64652824f4fd7e9bbbd0c66701c6a814d3739157
(https://github.com/apache/tika/commit/64652824f4fd7e9bbbd0c66701c6a814d3739157)

> Add Lingo24 Language Detector
> -
>
> Key: TIKA-2297
> URL: https://issues.apache.org/jira/browse/TIKA-2297
> Project: Tika
>  Issue Type: Improvement
>  Components: languageidentifier
>Reporter: Dave Meikle
>Assignee: Dave Meikle
> Fix For: 1.15
>
>
> Add LanguageDetector for the Lingo24 Premium MT API's /langid resource



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (TIKA-2297) Add Lingo24 Language Detector

2017-03-13 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907117#comment-15907117
 ] 

Dave Meikle commented on TIKA-2297:
---

Failure due to issue communicating with https://repository.apache.org. Build 
was successful in re-triggered build (See 
https://builds.apache.org/job/Tika-trunk/1218/).

> Add Lingo24 Language Detector
> -
>
> Key: TIKA-2297
> URL: https://issues.apache.org/jira/browse/TIKA-2297
> Project: Tika
>  Issue Type: Improvement
>  Components: languageidentifier
>Reporter: Dave Meikle
>Assignee: Dave Meikle
> Fix For: 1.15
>
>
> Add LanguageDetector for the Lingo24 Premium MT API's /langid resource



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (TIKA-2292) Update CXF version to 3.0.12

2017-03-12 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-2292.
---
Resolution: Fixed
  Assignee: Dave Meikle  (was: Sergey Beryozkin)

Committed in 
https://github.com/apache/tika/commit/79b6c15edddb5d98a68dc4d4fc31025ae034dd5e



> Update CXF version to 3.0.12
> 
>
> Key: TIKA-2292
> URL: https://issues.apache.org/jira/browse/TIKA-2292
> Project: Tika
>  Issue Type: Task
>  Components: server
>Reporter: Sergey Beryozkin
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.15
>
>
> This is the last version in the CXF 3.0.x line which supports Java 6



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (TIKA-2297) Add Lingo24 Language Detector

2017-03-11 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-2297:
--
Fix Version/s: 1.15

> Add Lingo24 Language Detector
> -
>
> Key: TIKA-2297
> URL: https://issues.apache.org/jira/browse/TIKA-2297
> Project: Tika
>  Issue Type: Improvement
>  Components: languageidentifier
>Reporter: Dave Meikle
>Assignee: Dave Meikle
> Fix For: 1.15
>
>
> Add LanguageDetector for the Lingo24 Premium MT API's /langid resource



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (TIKA-2297) Add Lingo24 Language Detector

2017-03-11 Thread Dave Meikle (JIRA)
Dave Meikle created TIKA-2297:
-

 Summary: Add Lingo24 Language Detector
 Key: TIKA-2297
 URL: https://issues.apache.org/jira/browse/TIKA-2297
 Project: Tika
  Issue Type: Improvement
  Components: languageidentifier
Reporter: Dave Meikle
Assignee: Dave Meikle






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (TIKA-2003) Tika 1.13 gpg signature not validating.

2016-06-15 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-2003:
-

Assignee: Dave Meikle

> Tika 1.13 gpg signature not validating.
> ---
>
> Key: TIKA-2003
> URL: https://issues.apache.org/jira/browse/TIKA-2003
> Project: Tika
>  Issue Type: Bug
>Reporter: Stephen Durham
>Assignee: Dave Meikle
>
> I am using Tika via the logicalspark/docker-tikaserver instance and I noticed 
> that the latest update to 1.13 failed the build process for the docker 
> instance due to a bad signature. I took the steps outlined below to make 
> sure that this was actually an issue before submitting the ticket.
> There is a related issue from a few years back, same RSA key 0EB30B07. The 
> ticket is 1345.
> Thanks in advance for any assistance with this issue.
> -Stephen
> First I tested with the Docker instance. I cloned the 
> logicalspark/docker-tikaserver repo and attempted the docker build locally. 
> The build encountered the following error:
> {noformat}
> gpg: Signature made Mon May  9 17:34:48 2016 UTC using RSA key ID 0EB30B07
> gpg: Can't check signature: public key not found
> {noformat}
> I then tested locally. With no keys other than those contained in tika.asc
> {noformat}
> wget https://people.apache.org/keys/group/tika.asc
> wget http://apache.mirrors.tds.net/tika/tika-server-1.13.jar
> wget https://www.apache.org/dist/tika/tika-server-1.13.jar.asc
> {noformat}
> Then I verified the MD5 sum matches the download page.
> {noformat}
> md5 tika-server-1.13.jar
> MD5 (tika-server-1.13.jar) = 155bec7b7cb25b22effa99db1fb8e233
> {noformat}
> Next I verified the signature following the steps on the download page.
> 1. Import the Keys.
> {noformat}
> gpg --import tika.asc
> gpg: /Users/stephen/.gnupg/trustdb.gpg: trustdb created
> gpg: key B876884A: public key "Chris Mattmann (CODE SIGNING KEY)" imported
> gpg: key 6ED9BE21: public key "Bob Paulin (CODE SIGNING KEY)" imported
> gpg: key 0890B1AB: public key "Konstantin Gribov (gross)" imported
> gpg: key 6E68DA61: public key "Michael McCandless (CODE SIGNING KEY)" imported
> gpg: key A355A63E: public key "Jukka Zitting" imported
> gpg: key 8A26D9A6: public key "Jukka Zitting" imported
> gpg: key 42CFAE07: public key "Jukka Zitting (CODE SIGNING KEY)" imported
> gpg: key 95D21F2E: public key "Ray Gauss II (CODE SIGNING KEY)" imported
> gpg: key D4F10117: public key "Tyler Palsulich" imported
> gpg: key DEDEAB92: public key "Sergey Beryozkin (Release Management)" imported
> gpg: key 97EDDE66: public key "tallison (apache_distro_keys)" imported
> gpg: key 48BAEBF6: public key "Lewis John McGibbney (CODE SIGNING KEY)" 
> imported
> gpg: key D84E41AE: public key "Nick Burch" imported
> gpg: Total number processed: 13
> gpg:   imported: 13  (RSA: 8)
> gpg: no ultimately trusted keys found
> {noformat}
> 2. Verify the signature.
> {noformat}
> gpg --verify tika-server-1.13.jar.asc
> gpg: assuming signed data in `tika-server-1.13.jar'
> gpg: Signature made Mon May  9 12:34:48 2016 CDT using RSA key ID 0EB30B07
> gpg: Can't check signature: public key not found
> {noformat}
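
For readers who want to repeat the checksum step from Java rather than with the md5 command, a minimal sketch using only the JDK follows. The class name is hypothetical. It covers the MD5 comparison only; the signature check still needs gpg with the signer's public key, which is the actual subject of this ticket.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: prints the MD5 of a file as lowercase hex so it can
// be compared by eye with the value on the download page.
final class Md5Sketch {

    static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        int n;
        // Stream the content so large jars are not loaded into memory at once.
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java Md5Sketch tika-server-1.13.jar
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(md5Hex(in));
        }
    }
}
```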



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1972) Download page points to 1.12 which is not on the ASF mirror hosts anymore

2016-05-16 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-1972.
---
Resolution: Fixed
  Assignee: Dave Meikle

The website is up to date now (pushed after mirror started).

> Download page points to 1.12 which is not on the ASF mirror hosts anymore
> -
>
> Key: TIKA-1972
> URL: https://issues.apache.org/jira/browse/TIKA-1972
> Project: Tika
>  Issue Type: Bug
>Reporter: Sebb
>Assignee: Dave Meikle
>
> Further to INFRA-11869, release 1.13 is now on the ASF mirror hosts. However 
> the download page still refers to 1.12, so its hash and sig links are broken, 
> and the archive links will break when the mirrors start removing old links.
> I don't know if this is the first TIKA release to use svnpubsub, but if so, 
> please note that the ASF mirror hosts will contain only what is in the 
> dist/release/tika directory - whatever is put there is copied (published) to 
> www.a.o/dist/tika.
> Any existing files on www.a.o are wiped (but will still be on the archive 
> server).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1972) Download page points to 1.12 which is not on the ASF mirror hosts anymore

2016-05-16 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284782#comment-15284782
 ] 

Dave Meikle commented on TIKA-1972:
---

Hi [~s...@apache.org] 

No, we have done a few like this. I just hadn't pushed the site update until I 
had confirmed the sync was working, to avoid breaking it completely. It should 
all be up to date now.

Cheers,
Dave

> Download page points to 1.12 which is not on the ASF mirror hosts anymore
> -
>
> Key: TIKA-1972
> URL: https://issues.apache.org/jira/browse/TIKA-1972
> Project: Tika
>  Issue Type: Bug
>Reporter: Sebb
>
> Further to INFRA-11869, release 1.13 is now on the ASF mirror hosts. However 
> the download page still refers to 1.12, so its hash and sig links are broken, 
> and the archive links will break when the mirrors start removing old links.
> I don't know if this is the first TIKA release to use svnpubsub, but if so, 
> please note that the ASF mirror hosts will contain only what is in the 
> dist/release/tika directory - whatever is put there is copied (published) to 
> www.a.o/dist/tika.
> Any existing files on www.a.o are wiped (but will still be on the archive 
> server).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1885) Tika MIME updates for *.cdf and *.xar and custom zero length file detector based on TREC-DD-Polar

2016-05-08 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275763#comment-15275763
 ] 

Dave Meikle commented on TIKA-1885:
---

Good point re Stream. Checking for -1 from read() will be more accurate.

Re mimetype, not sure if this matters to you and the team from USC 
[~chrismattmann]?
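
The point about read() can be illustrated with a small sketch: a stream is empty exactly when the first read() returns -1, and mark/reset lets the check leave the stream untouched for later detectors. This is an assumption-laden illustration, not the committed ZeroSizeFileDetector; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch, not Tika's actual detector.
final class ZeroSizeSketch {

    // Returns true if the stream has no bytes at all. The stream must
    // support mark/reset so the probed byte can be given back.
    static boolean isEmpty(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(1);
        try {
            // read() returns -1 only at end of stream.
            return in.read() == -1;
        } finally {
            in.reset();
        }
    }
}
```

Streams that do not support mark/reset can be wrapped in java.io.BufferedInputStream before the check.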



> Tika MIME updates for *.cdf and *.xar and custom zero length file detector 
> based on TREC-DD-Polar
> -
>
> Key: TIKA-1885
> URL: https://issues.apache.org/jira/browse/TIKA-1885
> Project: Tika
>  Issue Type: Sub-task
>  Components: core, detector, mime
>Affects Versions: 1.11
> Environment: Windows OS X64 , Java
>Reporter: Adesh Gupta
>Assignee: Chris A. Mattmann
>Priority: Critical
>  Labels: memex, nsfpolar
> Fix For: 1.13
>
>
> Updated tika-mimetypes.xml and detector to identify new file types in TREC DD 
> Polar dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1939) Preparation for Tika 1.13 release

2016-05-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-1939.
---
Resolution: Fixed

> Preparation for Tika 1.13 release
> -
>
> Key: TIKA-1939
> URL: https://issues.apache.org/jira/browse/TIKA-1939
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
> Fix For: 1.13
>
>
> Let's use this to track tasks/discussion/links for release of Tika 1.13.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1955) MIME types updates and additions for Scientific Data based on TREC-DD-Polar

2016-05-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-1955.
---
Resolution: Fixed

> MIME types updates and additions for Scientific Data based on TREC-DD-Polar
> ---
>
> Key: TIKA-1955
> URL: https://issues.apache.org/jira/browse/TIKA-1955
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>  Labels: memex, nsfpolar
> Fix For: 1.13
>
>
> We used http://github.com/chrismattmann/trec-dd-polar/ and submitted several 
> PRs that update MIME type info and/or add it to better support scientific 
> data files. I'll link all the PRs and relevant issues here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1885) Tika MIME updates for *.cdf and *.xar and custom zero length file detector based on TREC-DD-Polar

2016-05-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-1885.
---
Resolution: Fixed

Code committed in d447193f29531df3022f5137b8f0ec1c73e58cc8

> Tika MIME updates for *.cdf and *.xar and custom zero length file detector 
> based on TREC-DD-Polar
> -
>
> Key: TIKA-1885
> URL: https://issues.apache.org/jira/browse/TIKA-1885
> Project: Tika
>  Issue Type: Sub-task
>  Components: core, detector, mime
>Affects Versions: 1.11
> Environment: Windows OS X64 , Java
>Reporter: Adesh Gupta
>Assignee: Chris A. Mattmann
>Priority: Critical
>  Labels: memex, nsfpolar
> Fix For: 1.13
>
>
> Updated tika-mimetypes.xml and detector to identify new file types in TREC DD 
> Polar dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1885) Tika MIME updates for *.cdf and *.xar and custom zero length file detector based on TREC-DD-Polar

2016-05-08 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275549#comment-15275549
 ] 

Dave Meikle edited comment on TIKA-1885 at 5/8/16 10:31 AM:


Have incorporated this code so as not to block TIKA-1955.

Ended up making the following changes:
* Renamed the detector to ZeroSizeFileDetector
* Moved it from tika-parsers into tika-core under the org.apache.tika.detect package
* Added a test class in tika-core
* Changed mime type to application/x-zerosize
* Added the ASF header to all files.

[~adeshgup] - there were no tika-mimetypes.xml updates in the PR. Has this been 
done elsewhere?




was (Author: davemeikle):
Have incorporated this code so as not to block TIKA-1955.

Ended up making the following changes:
* Renamed the detector to ZeroSizeFileDetector
* Moved it from tika-parsers into tika-core under the org.apache.tika.detect package
* Added a test class in tika-core
* Added the ASF header to all files.

[~adeshgup] - there were no tika-mimetypes.xml updates in the PR. Has this been 
done elsewhere?



> Tika MIME updates for *.cdf and *.xar and custom zero length file detector 
> based on TREC-DD-Polar
> -
>
> Key: TIKA-1885
> URL: https://issues.apache.org/jira/browse/TIKA-1885
> Project: Tika
>  Issue Type: Sub-task
>  Components: core, detector, mime
>Affects Versions: 1.11
> Environment: Windows OS X64 , Java
>Reporter: Adesh Gupta
>Assignee: Chris A. Mattmann
>Priority: Critical
>  Labels: memex, nsfpolar
> Fix For: 1.13
>
>
> Updated tika-mimetypes.xml and detector to identify new file types in TREC DD 
> Polar dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1885) Tika MIME updates for *.cdf and *.xar and custom zero length file detector based on TREC-DD-Polar

2016-05-08 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275549#comment-15275549
 ] 

Dave Meikle commented on TIKA-1885:
---

Have incorporated this code so as not to block TIKA-1955.

Ended up making the following changes:
* Renamed the detector to ZeroSizeFileDetector
* Moved it from tika-parsers into tika-core under the org.apache.tika.detect package
* Added a test class in tika-core
* Added the ASF header to all files.

[~adeshgup] - there were no tika-mimetypes.xml updates in the PR. Has this been 
done elsewhere?



> Tika MIME updates for *.cdf and *.xar and custom zero length file detector 
> based on TREC-DD-Polar
> -
>
> Key: TIKA-1885
> URL: https://issues.apache.org/jira/browse/TIKA-1885
> Project: Tika
>  Issue Type: Sub-task
>  Components: core, detector, mime
>Affects Versions: 1.11
> Environment: Windows OS X64 , Java
>Reporter: Adesh Gupta
>Assignee: Chris A. Mattmann
>Priority: Critical
>  Labels: memex, nsfpolar
> Fix For: 1.13
>
>
> Updated tika-mimetypes.xml and detector to identify new file types in TREC DD 
> Polar dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1965) Added types to Grobid quantities parser

2016-05-07 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-1965.
---
Resolution: Fixed

Tested locally and added in 8e4c3ff0a37fa7a64f5f675ffb7c0f7a8322cfc4. Thanks!

> Added types to Grobid quantities parser
> ---
>
> Key: TIKA-1965
> URL: https://issues.apache.org/jira/browse/TIKA-1965
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.13
>Reporter: Can Menekse
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.13
>
>
> Grobid Quantities returns information about the measurement("type"), one 
> example could be : length



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1885) Tika MIME updates for *.cdf and *.xar and custom zero length file detector based on TREC-DD-Polar

2016-05-07 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275267#comment-15275267
 ] 

Dave Meikle commented on TIKA-1885:
---

Hi [~adeshgup] - Just reviewing the pull request. Do you have any tests with 
this change?  If not, I can drop some in and, assuming they pass, include this in 
1.13.

> Tika MIME updates for *.cdf and *.xar and custom zero length file detector 
> based on TREC-DD-Polar
> -
>
> Key: TIKA-1885
> URL: https://issues.apache.org/jira/browse/TIKA-1885
> Project: Tika
>  Issue Type: Sub-task
>  Components: core, detector, mime
>Affects Versions: 1.11
> Environment: Windows OS X64 , Java
>Reporter: Adesh Gupta
>Assignee: Chris A. Mattmann
>Priority: Critical
>  Labels: memex, nsfpolar
> Fix For: 1.13
>
>
> Updated tika-mimetypes.xml and detector to identify new file types in TREC DD 
> Polar dataset.





[jira] [Commented] (TIKA-1966) Issue in parsing iWorksDocument with Apache Tika

2016-05-04 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270642#comment-15270642
 ] 

Dave Meikle commented on TIKA-1966:
---

Yes, the iWorks 13 formats are very different. I have done some work on a 
filter for translation, which I can try to port over to Tika.

> Issue in parsing iWorksDocument with Apache Tika
> 
>
> Key: TIKA-1966
> URL: https://issues.apache.org/jira/browse/TIKA-1966
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.12
> Environment: Ubuntu 15
>Reporter: Sachin Shaju
> Attachments: budget.numbers, connors_20040127.key, pages.pages, 
> sample code
>
>
> I was trying to parse iWorks documents with Apache Tika, but instead of the parsed 
> content I am getting some other output from the content handler. 
> The code snippet that I used is attached.
> Output :-
> Contents of the file :
> Index/Document.iwa
> Index/ViewState.iwa
> Index/CalculationEngine.iwa
> Index/Tables/HeaderStorageBucket-2.iwa
> Index/Tables/Tile.iwa
> Index/Metadata.iwa
> Metadata/Properties.plist
> I'm able to detect the file type correctly using the Detector API, but I am not 
> getting the useful content out of the document.
> I'm attaching the iWorks docs that I tested with (made with the latest version 
> of iOS). Parsing worked when I tested with older versions. Thanks





[jira] [Commented] (TIKA-1939) Preparation for Tika 1.13 release

2016-05-04 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270316#comment-15270316
 ] 

Dave Meikle commented on TIKA-1939:
---

Just reviewing the two remaining items (TIKA-1885 and TIKA-1955) to see if they 
can be pulled in, and will start cutting.

> Preparation for Tika 1.13 release
> -
>
> Key: TIKA-1939
> URL: https://issues.apache.org/jira/browse/TIKA-1939
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
> Fix For: 1.13
>
>
> Let's use this to track tasks/discussion/links for release of Tika 1.13.





[jira] [Commented] (TIKA-1705) Update ASM dependency to 5.0.4

2015-08-11 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681530#comment-14681530
 ] 

Dave Meikle commented on TIKA-1705:
---

Thanks [~thetaphi]. Have made the change and adjusted the tika-bundle as well.

[~gagravarr] - that is a good shout re tests. Will add some today.

 Update ASM dependency to 5.0.4
 --

 Key: TIKA-1705
 URL: https://issues.apache.org/jira/browse/TIKA-1705
 Project: Tika
  Issue Type: Task
Affects Versions: 1.7
Reporter: Uwe Schindler
Assignee: Dave Meikle
 Fix For: 1.11

 Attachments: TIKA-1705-2.patch, TIKA-1705.patch


 Currently the Class file parser uses ASM 4.1. This older version cannot read 
 Java 8 / Java 9 class files (fails with Exception).
 The upgrade to ASM 5.0.4 is very simple, just a Maven dependency change. The 
 only code change is to update the visitor version so that new Java 8 
 features like lambdas get reported; this is not strictly required, but should 
 be done for full support.
 FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 
 5, too.
 You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no 
 problem with Lucene using a newer version). Since ASM 4.x, updates are 
 easier (no visitor interfaces anymore, abstract classes instead), so it 
 does not break if you just replace the JAR file. So just see this as a 
 recommendation, not urgent! Solr/Lucene will also work without this patch 
 (it just replaces the shipped ASM by newer version in our packaging).
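
The failure on newer class files comes down to the version number stored in every class file header. As a minimal sketch using only the JDK (this is neither ASM code nor Tika's Class parser), the following reads that header and checks it against the highest version a reader understands. The class name ClassFileVersion and its methods are hypothetical names introduced for illustration; the version numbers (51 for Java 7, 52 for Java 8, 53 for Java 9) come from the JVM class file format.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not part of ASM or Tika.
final class ClassFileVersion {

    // Returns the major version from a class file header.
    // Layout: 4 bytes magic (0xCAFEBABE), 2 bytes minor, 2 bytes major.
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("Too short to be a class file");
        }
        ByteBuffer header = ByteBuffer.wrap(classBytes); // big-endian by default
        if (header.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("Not a class file");
        }
        header.getShort();                  // minor version, ignored here
        return header.getShort() & 0xFFFF;  // major version as unsigned value
    }

    // True if a reader that understands versions up to maxSupportedMajor
    // can be expected to handle this class file.
    static boolean isSupported(byte[] classBytes, int maxSupportedMajor) {
        return majorVersion(classBytes) <= maxSupportedMajor;
    }

    public static void main(String[] args) {
        // Hand-built header of a Java 8 class file (major version 52)
        byte[] java8 = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        System.out.println("major version: " + majorVersion(java8));
        System.out.println("readable with a Java 7 limit (51): " + isSupported(java8, 51));
    }
}
```

A check like this could be used to skip or flag class files up front rather than failing inside the parser; whether that is preferable to simply upgrading the dependency is a separate question.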





[jira] [Resolved] (TIKA-1705) Update ASM dependency to 5.0.4

2015-08-10 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-1705.
---
   Resolution: Fixed
 Assignee: Dave Meikle
Fix Version/s: 1.11

Fix committed in r1695177.

 Update ASM dependency to 5.0.4
 --

 Key: TIKA-1705
 URL: https://issues.apache.org/jira/browse/TIKA-1705
 Project: Tika
  Issue Type: Task
Affects Versions: 1.7
Reporter: Uwe Schindler
Assignee: Dave Meikle
 Fix For: 1.11

 Attachments: TIKA-1705.patch


 Currently the Class file parser uses ASM 4.1. This older version cannot read 
 Java 8 / Java 9 class files (fails with Exception).
 The upgrade to ASM 5.0.4 is very simple, just a Maven dependency change. The 
 only code change is to update the visitor version so that new Java 8 
 features like lambdas get reported; this is not strictly required, but should 
 be done for full support.
 FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 
 5, too.
 You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no 
 problem with Lucene using a newer version). Since ASM 4.x, updates are 
 easier (no visitor interfaces anymore, abstract classes instead), so it 
 does not break if you just replace the JAR file. So just see this as a 
 recommendation, not urgent! Solr/Lucene will also work without this patch 
 (it just replaces the shipped ASM by newer version in our packaging).





[jira] [Commented] (TIKA-1705) Update ASM dependency to 5.0.4

2015-08-10 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680853#comment-14680853
 ] 

Dave Meikle commented on TIKA-1705:
---

Committed in r1695177. Thanks [~thetaphi]!

 Update ASM dependency to 5.0.4
 --

 Key: TIKA-1705
 URL: https://issues.apache.org/jira/browse/TIKA-1705
 Project: Tika
  Issue Type: Task
Affects Versions: 1.7
Reporter: Uwe Schindler
 Attachments: TIKA-1705.patch


 Currently the Class file parser uses ASM 4.1. This older version cannot read 
 Java 8 / Java 9 class files (fails with Exception).
 The upgrade to ASM 5.0.4 is very simple, just a Maven dependency change. The 
 only code change is to update the visitor version so that new Java 8 
 features like lambdas get reported; this is not strictly required, but should 
 be done for full support.
 FYI, in LUCENE-6729 we want to upgrade the Lucene Expressions module to ASM 
 5, too.
 You can hot-swap ASM 4.1 with ASM 5.0.4 without recompilation (so we have no 
 problem with Lucene using a newer version). Since ASM 4.x, updates are 
 easier (no visitor interfaces anymore, abstract classes instead), so it 
 does not break if you just replace the JAR file. So just see this as a 
 recommendation, not urgent! Solr/Lucene will also work without this patch 
 (it just replaces the shipped ASM by newer version in our packaging).





[jira] [Updated] (TIKA-776) ExifTool Embedder

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-776:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 ExifTool Embedder
 -

 Key: TIKA-776
 URL: https://issues.apache.org/jira/browse/TIKA-776
 Project: Tika
  Issue Type: New Feature
  Components: metadata
Affects Versions: 1.0
 Environment: ExifTool is required 
 (http://www.sno.phy.queensu.ca/~phil/exiftool/)
Reporter: Ray Gauss II
  Labels: embed, exiftool, patch
 Fix For: 1.11

 Attachments: tika-parsers-exiftool-embed-patch.txt


 This patch adds an ExifTool ExternalEmbedder which builds upon the work in 
 issue TIKA-774 and TIKA-775.
 In the tika-parsers an ExiftoolExternalEmbedder is added which extends 
 ExternalEmbedder to programmatically create an Embedder which calls the 
 ExifTool command line to embed tika metadata into a file stream and an 
 ExiftoolExternalEmbedderTest unit test is added which embeds several IPTC and 
 XMP fields then parses the resulting file stream to verify the operation.





[jira] [Updated] (TIKA-1435) Update rome dependency to 1.5

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1435:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Update rome dependency to 1.5
 -

 Key: TIKA-1435
 URL: https://issues.apache.org/jira/browse/TIKA-1435
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Johannes Mockenhaupt
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.11

 Attachments: netcdf-deps-changes.diff


 Rome 1.5 has been released to Sonatype 
 (https://github.com/rometools/rome/issues/183). Though the website 
 (http://rometools.github.io/rome/) is blissfully ignorant of that. The update 
 is mostly maintenance, adopting slf4j and generics as well as moving the 
 namespace from _com.sun.syndication_ to _com.rometools_. PR upcoming.





[jira] [Updated] (TIKA-1106) CLAVIN Integration

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1106:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 CLAVIN Integration
 --

 Key: TIKA-1106
 URL: https://issues.apache.org/jira/browse/TIKA-1106
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.3
 Environment: All
Reporter: Adam Estrada
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: entity, geospatial, new-parser
 Fix For: 1.11


 I've been evaluating CLAVIN as a way to extract location information from 
 unstructured text. It seems like meshing it with Tika in some way would make 
 a lot of sense. From CLAVIN website...
 {quote}
 CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source 
 software package for document geotagging and geoparsing that employs 
 context-based geographic entity resolution. It combines a variety of open 
 source tools with natural language processing techniques to extract location 
 names from unstructured text documents and resolve them against gazetteer 
 records. Importantly, CLAVIN does not simply look up location names; 
 rather, it uses intelligent heuristics in an attempt to identify precisely 
 which Springfield (for example) was intended by the author, based on the 
 context of the document. CLAVIN also employs fuzzy search to handle 
 incorrectly-spelled location names, and it recognizes alternative names 
 (e.g., Ivory Coast and Côte d'Ivoire) as referring to the same geographic 
 entity. By enriching text documents with structured geo data, CLAVIN enables 
 hierarchical geospatial search and advanced geospatial analytics on 
 unstructured data.
 {quote}
 There was only one other instance of the word clavin mentioned in the ASF 
 jira site so I thought it was definitely worth posting here.
 https://github.com/Berico-Technologies/CLAVIN





[jira] [Updated] (TIKA-987) Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-987:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
 

 Key: TIKA-987
 URL: https://issues.apache.org/jira/browse/TIKA-987
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
 Fix For: 1.11

 Attachments: picture.doc, picture_3.doc


 I have two Word docs, both containing the same drawing, but one has
 text added.
 In one case (picture.doc) the extraction is correct: it contains only
 an embedded image.wmf; when I view the image it's correct.
 In the second case (picture_3.doc) the picture is extracted as image
 (no extension), and is 0 bytes, and there is an invalid character
 (mapped to unicode replacement char) inserted before the image:
 {noformat}
 <title/>
 </head>
 <body><p>�<img src="embedded:image1" alt="image1"/></p>
 <p/>
 <p/>
 <p>vehicle
 </p>
 {noformat}
 (Though the text "vehicle" is extracted correctly.)
 I dug a bit, and with the 2nd doc there is an embedded {SHAPE *
 MERGEFORMAT} field, which we invoke
 WordExtractor.handleSpecialCharacterRuns on, and somehow it extracts
 the 0-byte no-extension image as well as the invalid character.  With
 the first doc there is no field (at least not one that's handled by
 handleSpecialCharacterRuns...).  Otherwise I'm not sure how to
 fix this... it could be that something is going wrong in how POI parses the
 Pictures from PictureSource.





[jira] [Updated] (TIKA-1672) Integrate tika-java7 component

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1672:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Integrate tika-java7 component
 --

 Key: TIKA-1672
 URL: https://issues.apache.org/jira/browse/TIKA-1672
 Project: Tika
  Issue Type: Improvement
Reporter: Tyler Palsulich
 Fix For: 1.11


 Code requiring Java 7 doesn't need to be in a separate module now that 
 TIKA-1536 (upgrade to Java 7) is done.





[jira] [Updated] (TIKA-1379) error in Tika().detect for xml files with xades signature

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1379:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 error in Tika().detect for xml files with xades signature
 -

 Key: TIKA-1379
 URL: https://issues.apache.org/jira/browse/TIKA-1379
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 1.4
Reporter: Alessandro De Angelis
  Labels: new-parser
 Fix For: 1.11


 We tried to get the MIME type of an XML file with an embedded XAdES signature. 
 The result is text/html and not the expected text/xml or 
 application/xml.
 Here is an example of the XML file:
 {code}
 <VERBALI ad_cod="D69017" batch_id="0" cds_cod="D69" data_app="2013-09-23">
 <VERBALE Id="1" tipologia="Verbale esame">
   <VERB_NUM>00094853 0003 2</VERB_NUM>
   <DATA_APP>2013-09-23</DATA_APP>
   <DATA_ESA>2013-09-23</DATA_ESA>
   <AD_COD>D69017</AD_COD>
   <AD>FILOSOFIA DELLA SCIENZA</AD>
   <CDS_COD>D69</CDS_COD>
   <CDS>TEATRO E ARTI VISIVE</CDS>
   <TIPO_ESA></TIPO_ESA>
   <MAT>1233456</MAT>
   <NOME>PAOLINO</NOME>
   <COGNOME>PAPERINO</COGNOME>
   <VOTO>23.0</VOTO>
   <VOTODECOD>23</VOTODECOD>
   <CAUSALE></CAUSALE>
   <TIPO_MODULO></TIPO_MODULO>
   <IMG_PATH></IMG_PATH>
   <AA_SES_ID>2012</AA_SES_ID>
   <AD_CFU>6.0</AD_CFU>
   <NOTA></NOTA>
   <ATENEO>9</ATENEO>
   <ATENEO_DES>جامعة البندقية - TEST</ATENEO_DES>
   <TIPO_DOCUMENTO>Verbale_3</TIPO_DOCUMENTO>
   <TITOLARE_PROCEDIMENTO>QUI QUO QUA</TITOLARE_PROCEDIMENTO>
   <AD_STU_COD>D69017</AD_STU_COD>
   <AD_STU>FILOSOFIA DELLA SCIENZA</AD_STU>
   <CDS_STU_COD>D69</CDS_STU_COD>
   <CDS_STU>TEATRO E ARTI VISIVE</CDS_STU>
   <DOCENTE>QUI QUO QUA</DOCENTE>
 <DATA_DOCUMENTO>26-09-2013 09:55:53 CEST(+0200)</DATA_DOCUMENTO>
 <SOFTWARE_DI_CREAZIONE>
   <NOME>3</NOME>
   <VERSIONE>11.09.03</VERSIONE>
 </SOFTWARE_DI_CREAZIONE>
 </VERBALE><ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#" 
 Id="sig08744308748201048377">
 <ds:SignedInfo>
 <ds:CanonicalizationMethod 
 Algorithm="http://www.w3.org/2006/12/xml-c14n11"></ds:CanonicalizationMethod>
 <ds:SignatureMethod 
 Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256"></ds:SignatureMethod>
 <ds:Reference URI="">
 <ds:Transforms>
 <ds:Transform Algorithm="http://www.w3.org/2002/06/xmldsig-filter2">
 <dsig-xpath:XPath 
 xmlns:dsig-xpath="http://www.w3.org/2002/06/xmldsig-filter2" 
 Filter="subtract">/descendant::ds:Signature</dsig-xpath:XPath>
 </ds:Transform>
 <ds:Transform Algorithm="http://www.w3.org/TR/1999/REC-xslt-19991116">
 <xsl:stylesheet xmlns:kion="http://www.kion.it/webesse3/multilingua" 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
 exclude-result-prefixes="kion" version="1.0">
   <kion:ml module="FirmaDigitale" target="kion"></kion:ml>
   <xsl:output method="xml"></xsl:output>
   <xsl:variable name="mostra_ad_figlie" select="1"></xsl:variable>
   <xsl:variable name="verbale_root" 
 select="/VERBALI/VERBALE"></xsl:variable>
   <xsl:variable name="sostituzione_root" 
 select="/VERBALI/VERBALE/SOSTITUZIONE_DOCUMENTO"></xsl:variable>
   <xsl:variable name="RAGG_ROOT" 
 select="/VERBALI/VERBALE/RAGGRUPPAMENTO"></xsl:variable>
   <xsl:variable name="COMM_ROOT" 
 select="/VERBALI/VERBALE/COMMISSIONE"></xsl:variable>
   
   <xsl:template match="/">
   <html>
   <head>
   <meta content="text/html;charset=UTF-8" 
 http-equiv="Content-Type"></meta>
   <xsl:choose> 
   <xsl:when 
 test="$sostituzione_root">
   <title>Dichiarazione 
 conformità Verbale Esame</title>
   </xsl:when>
   <xsl:otherwise>
   <title>Verbalizzazione 
 esame</title>
   </xsl:otherwise>
   </xsl:choose>
   <style type="text/css">
td  {font-family: Arial; font-size:10pt;} 
div {font-family: Arial; font-size:10pt;}
pre {font-family: Arial; font-size:10pt;} 
   </style>
   </head>
   <body>
   <table>
   <xsl:choose> 
   <xsl:when 
 test="$sostituzione_root">
   <tr><td align="center" 
 colspan="2"><big><strong><xsl:value-of 
 select="$verbale_root/ATENEO_DES"></xsl:value-of></strong></big><br></br></td></tr>
   <tr><td align="center" 
 

[jira] [Updated] (TIKA-1308) Support in memory parse mode(don't create temp file): to support run Tika in GAE

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1308:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Support in memory parse mode(don't create temp file): to support run Tika in 
 GAE
 

 Key: TIKA-1308
 URL: https://issues.apache.org/jira/browse/TIKA-1308
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: jefferyyuan
  Labels: gae
 Fix For: 1.11


 I am trying to use Tika in GAE and write a simple servlet to extract meta 
 data info from jpeg:
 {code}
 String urlStr = req.getParameter("imageUrl");
 byte[] oldImageData = IOUtils.toByteArray(new URL(urlStr));
 ByteArrayInputStream bais = new ByteArrayInputStream(oldImageData);
 Metadata metadata = new Metadata();
 BodyContentHandler ch = new BodyContentHandler();
 AutoDetectParser parser = new AutoDetectParser();
 parser.parse(bais, ch, metadata, new ParseContext());
 bais.close();
 {code}
 This fails with exception:
 {code}
 Caused by: java.lang.SecurityException: Unable to create temporary file
   at java.io.File.createTempFile(File.java:1986)
   at 
 org.apache.tika.io.TemporaryResources.createTemporaryFile(TemporaryResources.java:66)
   at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:533)
   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 {code}
 Checking the code, 
 org.apache.tika.parser.jpeg.JpegParser.parse(InputStream, ContentHandler, 
 Metadata, ParseContext) creates a temp file from the input stream.
 I can understand why Tika creates a temp file from the stream: so that it can parse 
 the content multiple times.
 But as GAE and other cloud servers are getting more popular, is it possible 
 to avoid creating the temp file? Instead we could copy the original stream to a 
 byte array stream, so Tika can still parse it multiple times.
 -- This puts a limit on the file size, as Tika keeps the whole file in 
 memory, but it would make Tika work in GAE and maybe on other cloud servers.
 We could add a parameter to parser.parse to indicate whether to parse in memory 
 only.
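
The in-memory alternative proposed here can be sketched with the JDK alone. This is only an illustration of the idea under stated assumptions, not how Tika or TikaInputStream is implemented: the class InMemoryBuffer, its methods, and the maxBytes limit are all hypothetical names. The stream is copied once into a byte array, subject to a size cap, and each parse pass then gets its own fresh stream over the same bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not part of Tika.
final class InMemoryBuffer {
    private final byte[] data;

    private InMemoryBuffer(byte[] data) {
        this.data = data;
    }

    // Copies the whole stream into memory, refusing input larger than maxBytes.
    static InMemoryBuffer copyOf(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                if (out.size() + n > maxBytes) {
                    throw new IllegalStateException(
                            "Input exceeds in-memory limit of " + maxBytes + " bytes");
                }
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new InMemoryBuffer(out.toByteArray());
    }

    // Each call returns an independent stream positioned at the start,
    // so the content can be read as many times as needed without a temp file.
    ByteArrayInputStream newStream() {
        return new ByteArrayInputStream(data);
    }

    int size() {
        return data.length;
    }

    public static void main(String[] args) {
        InMemoryBuffer buffer = copyOf(new ByteArrayInputStream(new byte[] {1, 2, 3}), 1024);
        System.out.println("first pass reads:  " + buffer.newStream().read());
        System.out.println("second pass reads: " + buffer.newStream().read());
    }
}
```

The size cap makes the trade-off mentioned in the description explicit: large inputs are rejected up front instead of exhausting memory.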
  





[jira] [Updated] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-894:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Add webapp mode for Tika Server, simplifies deployment
 --

 Key: TIKA-894
 URL: https://issues.apache.org/jira/browse/TIKA-894
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.1, 1.2
Reporter: Chris Wilson
  Labels: maven, newbie, patch
 Fix For: 1.11

 Attachments: tika-server-webapp.patch


 For use in production services, Tika Server should really be deployed as a 
 WAR file, under a reliable servlet container that knows how to run as a 
 system service, for example Tomcat or JBoss.
 This is especially important on Windows, where I wasted an entire day trying 
 to make TikaServerCli run as some kind of a service. 
 Maven makes building a webapp pretty trivial. With the attached patch 
 applied, mvn war:war should work. It seems to run fine in Tomcat, which 
 makes Windows deployment much simpler. Just install Tomcat and drop the WAR 
 file into tomcat's webapps directory and you're away.





[jira] [Updated] (TIKA-1108) Represent individual slides in pptx

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1108:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Represent individual slides in pptx
 ---

 Key: TIKA-1108
 URL: https://issues.apache.org/jira/browse/TIKA-1108
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Daniel Bonniot de Ruisselet
 Fix For: 1.11


 When parsing ppt, tika produces for each slide:
 <div class="slide">
 However for pptx these seem to be missing, all the text is directly under 
 body.





[jira] [Updated] (TIKA-1688) Tika Version in Metadata

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1688:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Tika Version in Metadata
 

 Key: TIKA-1688
 URL: https://issues.apache.org/jira/browse/TIKA-1688
 Project: Tika
  Issue Type: Improvement
Reporter: Paul Ramirez
Priority: Minor
 Fix For: 1.11


 Could this be added as X-Tika:version? That way there would be downstream 
 traceability of extraction based on version.





[jira] [Updated] (TIKA-1696) Language Identification with Text Processing Toolkit from MITLL

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1696:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Language Identification with Text Processing Toolkit from MITLL
 ---

 Key: TIKA-1696
 URL: https://issues.apache.org/jira/browse/TIKA-1696
 Project: Tika
  Issue Type: New Feature
  Components: languageidentifier
Reporter: Paul Ramirez
 Fix For: 1.11


 The aim here is to extend the methods for language identification within 
 text. MIT Lincoln Labs has an open source library [1] written in Julia. 
 Having spoken with the MITLL guys, there is a possibility that there is a 
 Scala version of this library, which would make it easier to package with 
 Tika. 
 At this point I'm not quite sure how many languages this library supports by 
 default but it can be extended when provided some training data.
 [1] https://github.com/mit-nlp/Text.jl





[jira] [Updated] (TIKA-1616) Tika Parser for GIBS Metadata

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1616:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Tika Parser for GIBS Metadata
 -

 Key: TIKA-1616
 URL: https://issues.apache.org/jira/browse/TIKA-1616
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11


 [GIBS|https://earthdata.nasa.gov/about-eosdis/science-system-description/eosdis-components/global-imagery-browse-services-gibs]
  metadata currently consists of simple stuff in the WMTS GetCapabilities 
 request (e.g. 
 http://map1.vis.earthdata.nasa.gov/wmts-arctic/1.0.0/WMTSCapabilities.xml) 
 which includes available layers, extents, time ranges, map projections, color 
 maps, etc. We will eventually have more detailed visualization metadata 
 available in ECHO/CMR which will include linkages to data products, 
 provenance, etc. 
 Some investigation and a Tika parser would be excellent to extract and 
 assimilate GIBS Metadata.





[jira] [Updated] (TIKA-1366) Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1366:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Update some of Tika Server services to support JAX-RS 2.0 AsyncResponse 
 

 Key: TIKA-1366
 URL: https://issues.apache.org/jira/browse/TIKA-1366
 Project: Tika
  Issue Type: Improvement
  Components: server
Reporter: Sergey Beryozkin
Priority: Minor
 Fix For: 1.11


 Some of Tika Server services will benefit from optionally supporting JAX-RS 
 2.0 AsyncResponse





[jira] [Updated] (TIKA-891) Use POST in addition to PUT on method calls in tika-server

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-891:
-
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Use POST in addition to PUT on method calls in tika-server
 --

 Key: TIKA-891
 URL: https://issues.apache.org/jira/browse/TIKA-891
 Project: Tika
  Issue Type: Improvement
  Components: general
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Trivial
  Labels: newbie
 Fix For: 1.11


 Per Jukka's email:
 http://s.apache.org/uR
 It would be a better use of REST/HTTP verbs to use POST to put content to a 
 resource where we don't intend to store that content (which is the 
 implication of PUT). Max suggested adding:
 {code}
 @POST
 {code}
 annotations to the methods we are currently exposing using PUT to take care 
 of this. 





[jira] [Updated] (TIKA-1425) Automatic batching of Microsoft service calls

2015-08-08 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-1425:
--
Fix Version/s: (was: 1.10)
   1.11

* Pushed to 1.11 following 1.10 release

 Automatic batching of Microsoft service calls
 -

 Key: TIKA-1425
 URL: https://issues.apache.org/jira/browse/TIKA-1425
 Project: Tika
  Issue Type: Improvement
  Components: translation
Affects Versions: 1.6
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11


 Right now when I use the following code I get the stack trace at the bottom 
 of this description. This seems to be because the Request URI is too large to 
 make the service request. We need a mechanism within the call to 
 Tika.translate which will, on a service-by-service basis, determine the 
 maximum Request URI which can be sent. I believe that this should be on the 
 Tika side, as how else am I meant to know the maximum request size?
 {code:title=translator.java|borderStyle=solid}
 +Translator translate = new MicrosoftTranslator();
 +((MicrosoftTranslator) translate).setId(...);
 +((MicrosoftTranslator) translate).setSecret(...);
   for (java.util.Map.Entry<Text, Parse> entry : parseResult) {
     Parse parse = entry.getValue();
     LOG.info("-\nUrl\n---\n");
 @@ -201,7 +207,7 @@
     System.out.print(parse.getData().toString());
     if (dumpText) {
       LOG.info("-\nParseText\n-\n");
 -System.out.print(parse.getText());
 +System.out.print(translate.translate(parse.getText(), "fr"));
}
 {code}
 {code:title=stacktrace.log|borderStyle=solid}
 Exception in thread "main" java.lang.Exception: [microsoft-translator-api] 
 Error retrieving translation : Server returned HTTP response code: 414 for 
 URL: 
 http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=&to=fr&text=%D0%A4%D0...
 ...
   at 
 com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:202)
   at com.memetix.mst.translate.Translate.execute(Translate.java:61)
   at com.memetix.mst.translate.Translate.execute(Translate.java:76)
   at 
 org.apache.tika.language.translate.MicrosoftTranslator.translate(MicrosoftTranslator.java:104)
   at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:210)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:228)
 Caused by: java.io.IOException: Server returned HTTP response code: 414 for 
 URL: 
 http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=&to=fr&text=%D0%A4%D0%BE%D1%80%D1%83%D0%B...
 ...
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at 
 sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1675)
   at 
 sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1673)
   at java.security.AccessController.doPrivileged(Native Method)
   at 
 sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1671)
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1244)
   at 
 com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:178)
   at 
 com.memetix.mst.MicrosoftTranslatorAPI.retrieveString(MicrosoftTranslatorAPI.java:199)
   ... 6 more
 Caused by: java.io.IOException: Server returned HTTP response code: 414 for 
 URL: 
 http://api.microsofttranslator.com/V2/Ajax.svc/Translate?from=&to=fr&text=%D0%A4%D0%BE...
 ...
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626)
   at 
 java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
   at 
 com.memetix.mst.MicrosoftTranslatorAPI.retrieveResponse(MicrosoftTranslatorAPI.java:177)
   ... 7 more
 {code}
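
One way the batching requested in this issue might look is sketched below with the JDK only. It is an assumption-laden illustration, not the fix that went into Tika: the class TranslationBatcher, its methods, and the maxEncodedLength parameter are hypothetical, and the actual size limit of the translation service is not stated in this issue. The text is split on whitespace into batches whose URL-encoded length stays within the limit, so each batch could be sent as its own request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika.
final class TranslationBatcher {

    // Length of the text once URL-encoded as UTF-8 (e.g. one Cyrillic
    // letter becomes six characters such as "%D0%A4").
    static int encodedLength(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").length();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Greedily packs whitespace-separated words into batches whose encoded
    // length does not exceed maxEncodedLength. Limitation: a single word that
    // is longer than the limit is still emitted as its own oversized batch.
    static List<String> split(String text, int maxEncodedLength) {
        List<String> batches = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentLength = 0;
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            int wordLength = encodedLength(word);
            // +1 for the separating space, which URLEncoder encodes as "+"
            if (current.length() > 0 && currentLength + 1 + wordLength > maxEncodedLength) {
                batches.add(current.toString());
                current.setLength(0);
                currentLength = 0;
            }
            if (current.length() > 0) {
                current.append(' ');
                currentLength++;
            }
            current.append(word);
            currentLength += wordLength;
        }
        if (current.length() > 0) {
            batches.add(current.toString());
        }
        return batches;
    }

    public static void main(String[] args) {
        for (String batch : split("aa bb cc", 5)) {
            System.out.println(batch);
        }
    }
}
```

Measuring the encoded length rather than the character count matters for non-Latin text like the Cyrillic input in the stack trace above, where every character expands to several bytes in the request URI. Splitting on whitespace alone can also cut sentences apart, which may hurt translation quality; splitting on sentence boundaries would be a natural refinement.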




