[jira] [Closed] (TIKA-496) Language identifier profile comparison favors large profiles

2024-03-06 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Høydahl closed TIKA-496. Resolution: Won't Do Closing as this is only a problem for the original TIKA langid which is superced

[jira] [Updated] (TIKA-4242) Tika depends on non-existing plexus-utils version

2024-04-17 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Björn Kautler updated TIKA-4242: Description: In [https://github.com/apache/tika/pull/1461] [~tallison] moved the versions to Maven

[jira] [Created] (TIKA-4242) Tika depends on non-existing plexus-utils version

2024-04-17 Thread Jira
Björn Kautler created TIKA-4242: --- Summary: Tika depends on non-existing plexus-utils version Key: TIKA-4242 URL: https://issues.apache.org/jira/browse/TIKA-4242 Project: Tika Issue Type: Bug

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843355#comment-17843355 ] Luís Filipe Nassif commented on TIKA-4250: -- Hi [~tallison], I would like to

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843359#comment-17843359 ] Luís Filipe Nassif commented on TIKA-4250: -- One drawback of our libpff u

[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843359#comment-17843359 ] Luís Filipe Nassif edited comment on TIKA-4250 at 5/3/24 8:5

[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843359#comment-17843359 ] Luís Filipe Nassif edited comment on TIKA-4250 at 5/3/24 8:5

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843364#comment-17843364 ] Luís Filipe Nassif commented on TIKA-4250: -- We can improve our wrapper, for

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843381#comment-17843381 ] Luís Filipe Nassif commented on TIKA-4250: -- If our wrapper, or part of it, i

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-04 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843437#comment-17843437 ] Luís Filipe Nassif commented on TIKA-4250: -- PS: I have never used libpst,

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-04 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843509#comment-17843509 ] Luís Filipe Nassif commented on TIKA-4250: -- I'm running a compariso

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-05 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843593#comment-17843593 ] Luís Filipe Nassif commented on TIKA-4250: -- I included a patched version of

[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-05 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843593#comment-17843593 ] Luís Filipe Nassif edited comment on TIKA-4250 at 5/5/24 11:2

[jira] [Comment Edited] (TIKA-4250) Add a libpst-based parser

2024-05-05 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843593#comment-17843593 ] Luís Filipe Nassif edited comment on TIKA-4250 at 5/5/24 11:3

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-05 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843604#comment-17843604 ] Luís Filipe Nassif commented on TIKA-4250: -- Updating results with Li

[jira] [Commented] (TIKA-4250) Add a libpst-based parser

2024-05-06 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844097#comment-17844097 ] Luís Filipe Nassif commented on TIKA-4250: -- Just pushed the quick and dirty

[jira] [Created] (TIKA-4255) TextAndCSVParser ignores Metadata.CONTENT_ENCODING

2024-05-16 Thread Jira
Axel Dörfler created TIKA-4255: -- Summary: TextAndCSVParser ignores Metadata.CONTENT_ENCODING Key: TIKA-4255 URL: https://issues.apache.org/jira/browse/TIKA-4255 Project: Tika Issue Type: Bug

[jira] [Created] (TIKA-4300) Add audio/aac as an alias for the audio/x-aac MIME type

2024-08-27 Thread Jira
Jøger hansegård created TIKA-4300: - Summary: Add audio/aac as an alias for the audio/x-aac MIME type Key: TIKA-4300 URL: https://issues.apache.org/jira/browse/TIKA-4300 Project: Tika Issue

[jira] [Created] (TIKA-4304) Add audio/flac MIME type

2024-08-29 Thread Jira
Jøger hansegård created TIKA-4304: - Summary: Add audio/flac MIME type Key: TIKA-4304 URL: https://issues.apache.org/jira/browse/TIKA-4304 Project: Tika Issue Type: Improvement

[jira] [Commented] (TIKA-2986) Edge case (?) in file type detection

2019-11-18 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977050#comment-16977050 ] Luís Filipe Nassif commented on TIKA-2986: -- Hi [~tallison], It is not an

[jira] [Assigned] (TIKA-2892) ForkParser deadlock when InputStreamResource catches/returns IOException

2019-11-18 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif reassigned TIKA-2892: Assignee: Luís Filipe Nassif > ForkParser deadlock when InputStreamResou

[jira] [Resolved] (TIKA-2892) ForkParser deadlock when InputStreamResource catches/returns IOException

2019-11-18 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif resolved TIKA-2892. -- Fix Version/s: 1.23 2.0 Resolution: Fixed I am back

[jira] [Comment Edited] (TIKA-2986) Edge case (?) in file type detection

2019-11-18 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977050#comment-16977050 ] Luís Filipe Nassif edited comment on TIKA-2986 at 11/19/19 4:1

[jira] [Commented] (TIKA-2986) Edge case (?) in file type detection

2019-11-19 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977510#comment-16977510 ] Luís Filipe Nassif commented on TIKA-2986: -- Another idea: create a spec

[jira] [Commented] (TIKA-2988) Add mime for alternative fdf format

2019-11-19 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977537#comment-16977537 ] Luís Filipe Nassif commented on TIKA-2988: -- Seems strict enough to me.

[jira] [Commented] (TIKA-2986) Edge case (?) in file type detection

2019-11-19 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977649#comment-16977649 ] Luís Filipe Nassif commented on TIKA-2986: -- {quote}How do we know which ones

[jira] [Created] (TIKA-3003) Remove unused dependencies

2019-11-28 Thread Jira
César Soto Valero created TIKA-3003: --- Summary: Remove unused dependencies Key: TIKA-3003 URL: https://issues.apache.org/jira/browse/TIKA-3003 Project: Tika Issue Type: Improvement

[jira] [Updated] (TIKA-3003) Remove unused dependencies

2019-11-28 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] César Soto Valero updated TIKA-3003: Description: I noticed that dependency *org.jsoup:jsoup:1.12.1* is declared in module

[jira] [Updated] (TIKA-3003) Remove unused dependencies

2019-11-28 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] César Soto Valero updated TIKA-3003: Description: I noticed that dependency *org.jsoup:jsoup:1.12.1* is declared in module

[jira] [Updated] (TIKA-3003) Remove unused dependencies

2019-11-28 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] César Soto Valero updated TIKA-3003: Description: I noticed that dependency *org.jsoup:jsoup:1.12.1* is declared in module

[jira] [Updated] (TIKA-3003) Remove unused dependencies

2019-11-28 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] César Soto Valero updated TIKA-3003: Description: I noticed that dependency *org.jsoup:jsoup:1.12.1* is declared in module

[jira] [Commented] (TIKA-2925) General dependency/plugin upgrades for 1.23

2019-12-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987242#comment-16987242 ] Luís Filipe Nassif commented on TIKA-2925: -- Hi [~tallison], Just tested

[jira] [Commented] (TIKA-2925) General dependency/plugin upgrades for 1.23

2019-12-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987245#comment-16987245 ] Luís Filipe Nassif commented on TIKA-2925: -- Currently I have an alterna

[jira] [Commented] (TIKA-2925) General dependency/plugin upgrades for 1.23

2019-12-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987258#comment-16987258 ] Luís Filipe Nassif commented on TIKA-2925: -- No, our current version just th

[jira] [Assigned] (TIKA-2415) Upgrade libpst to 0.9.3

2019-12-04 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif reassigned TIKA-2415: Assignee: Luís Filipe Nassif > Upgrade libpst to 0.

[jira] [Assigned] (TIKA-2546) com.pff:java-libpst is branch EOL

2019-12-04 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif reassigned TIKA-2546: Assignee: Luís Filipe Nassif > com.pff:java-libpst is branch

[jira] [Resolved] (TIKA-2546) com.pff:java-libpst is branch EOL

2019-12-04 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif resolved TIKA-2546. -- Fix Version/s: 1.24 Resolution: Fixed > com.pff:java-libpst is branch

[jira] [Commented] (TIKA-2546) com.pff:java-libpst is branch EOL

2019-12-04 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988098#comment-16988098 ] Luís Filipe Nassif commented on TIKA-2546: -- Upgraded the lib and added a c

[jira] [Created] (TIKA-3004) OutlookPSTParser missing emails attached to other emails

2019-12-04 Thread Jira
Luís Filipe Nassif created TIKA-3004: Summary: OutlookPSTParser missing emails attached to other emails Key: TIKA-3004 URL: https://issues.apache.org/jira/browse/TIKA-3004 Project: Tika

[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

2020-04-09 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080195#comment-17080195 ] Luís Filipe Nassif commented on TIKA-2849: -- Hi [~boris-petrov], There a

[jira] [Comment Edited] (TIKA-2849) TikaInputStream copies the input stream locally

2020-04-09 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080195#comment-17080195 ] Luís Filipe Nassif edited comment on TIKA-2849 at 4/10/20, 3:5

[jira] [Comment Edited] (TIKA-2849) TikaInputStream copies the input stream locally

2020-04-10 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080195#comment-17080195 ] Luís Filipe Nassif edited comment on TIKA-2849 at 4/10/20, 1:5

[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

2020-04-11 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081294#comment-17081294 ] Luís Filipe Nassif commented on TIKA-2849: -- [~tallison] actually parse tim

[jira] [Comment Edited] (TIKA-2849) TikaInputStream copies the input stream locally

2020-04-11 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081294#comment-17081294 ] Luís Filipe Nassif edited comment on TIKA-2849 at 4/11/20, 2:0

[jira] [Created] (TIKA-3091) java.lang.NullPointerException when calling hashCode after instantiating PDFParserConfig

2020-04-13 Thread Jira
Eduardo Guimarães created TIKA-3091: --- Summary: java.lang.NullPointerException when calling hashCode after instantiating PDFParserConfig Key: TIKA-3091 URL: https://issues.apache.org/jira/browse/TIKA-3091

[jira] [Updated] (TIKA-3091) java.lang.NullPointerException when calling hashCode after instantiating PDFParserConfig

2020-04-13 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eduardo Guimarães updated TIKA-3091: Description: _averageCharTolerance_ and _spacingTolerance_ are null after instantiating

[jira] [Created] (TIKA-3100) RFC822Parser ignore charset when extractAllAlternatives set to true

2020-05-11 Thread Jira
Mariusz Cieślukowski created TIKA-3100: -- Summary: RFC822Parser ignore charset when extractAllAlternatives set to true Key: TIKA-3100 URL: https://issues.apache.org/jira/browse/TIKA-3100 Project

[jira] [Updated] (TIKA-3100) RFC822Parser ignore charset when extractAllAlternatives set to true

2020-05-11 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mariusz Cieślukowski updated TIKA-3100: --- Labels: rfc822parser (was: ) > RFC822Parser ignore charset w

[jira] [Created] (TIKA-3105) OFT format detection based on file content

2020-06-03 Thread Jira
Ondřej Duchoň created TIKA-3105: --- Summary: OFT format detection based on file content Key: TIKA-3105 URL: https://issues.apache.org/jira/browse/TIKA-3105 Project: Tika Issue Type: Bug

[jira] [Updated] (TIKA-3105) OFT format detection based on file name (extension) instead of file content

2020-06-03 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ondřej Duchoň updated TIKA-3105: Summary: OFT format detection based on file name (extension) instead of file content (was: OFT

[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134345#comment-17134345 ] Andreas Lehmkühler commented on TIKA-3111: -- [~tilman] Yes, you're

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134381#comment-17134381 ] Christoph Läubrich commented on TIKA-3110: -- [~tallison] from an API poin

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134389#comment-17134389 ] Christoph Läubrich commented on TIKA-3110: -- BTW: Commons.io has a foreM

[jira] [Comment Edited] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-12 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134345#comment-17134345 ] Andreas Lehmkühler edited comment on TIKA-3111 at 6/12/20, 5:4

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

2020-06-12 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134426#comment-17134426 ] Christoph Läubrich commented on TIKA-3110: -- If you are only concerned a

[jira] [Updated] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-13 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated TIKA-3111: - Attachment: Patch_PDFStreamEngine.txt > Upgrade to PDFBox 2.0

[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-13 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134727#comment-17134727 ] Andreas Lehmkühler commented on TIKA-3111: -- I guess I've reinstat

[jira] [Comment Edited] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-13 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134727#comment-17134727 ] Andreas Lehmkühler edited comment on TIKA-3111 at 6/13/20, 10:2

[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-13 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134780#comment-17134780 ] Andreas Lehmkühler commented on TIKA-3111: -- Thanks for the fast feedback and

[jira] [Comment Edited] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-14 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134780#comment-17134780 ] Andreas Lehmkühler edited comment on TIKA-3111 at 6/14/20, 10:4

[jira] [Updated] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-14 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated TIKA-3111: - Attachment: (was: Patch_PDFStreamEngine.txt) > Upgrade to PDFBox 2.0

[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-14 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135145#comment-17135145 ] Andreas Lehmkühler commented on TIKA-3111: -- I've extended my patch

[jira] [Commented] (TIKA-3111) Upgrade to PDFBox 2.0.20

2020-06-14 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135405#comment-17135405 ] Andreas Lehmkühler commented on TIKA-3111: -- Thanks for the prompt feed

[jira] [Created] (TIKA-3123) request to parse Chinese, but return Russian

2020-06-23 Thread Jira
阿里木 created TIKA-3123: - Summary: request to parse Chinese, but return Russian Key: TIKA-3123 URL: https://issues.apache.org/jira/browse/TIKA-3123 Project: Tika Issue Type: Bug Affects Versions

[jira] [Updated] (TIKA-3123) request to parse Chinese, but return Russian

2020-06-23 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 阿里木 updated TIKA-3123: -- Description: Try to parse html text containing Chinese: {code:java}  被{code} tika-server return Russian: {code:java

[jira] [Commented] (TIKA-3123) request to parse Chinese, but return Russian

2020-06-23 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143467#comment-17143467 ] 阿里木 commented on TIKA-3123: --- complete HTML was: {code:java}  被 {code}   > request t

[jira] [Created] (TIKA-3127) When using html parser any empty attribute sets value to attribute name e.g. link gives href="href"

2020-06-30 Thread Jira
Milan Vereščák created TIKA-3127: Summary: When using html parser any empty attribute sets value to attribute name e.g. link gives href="href" Key: TIKA-3127 URL: https://issues.apache.org/jira/browse

[jira] [Updated] (TIKA-3127) When using html parser any empty attribute sets value to attribute name e.g. link gives href="href"

2020-06-30 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Milan Vereščák updated TIKA-3127: - Description: Shouldn't it be rather empty string? It was present in 1.16 version but al

[jira] [Updated] (TIKA-3127) When using html parser any empty attribute sets value to attribute name e.g. link gives href="href"

2020-06-30 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Milan Vereščák updated TIKA-3127: - Description: Shouldn't it be rather empty string? It was present in 1.16 version but al

[jira] [Created] (TIKA-3134) totalCharsPerPage and unmappedUnicodeCharsPerPage configuration

2020-07-15 Thread Jira
Dávid Tóth created TIKA-3134: Summary: totalCharsPerPage and unmappedUnicodeCharsPerPage configuration Key: TIKA-3134 URL: https://issues.apache.org/jira/browse/TIKA-3134 Project: Tika Issue

[jira] [Commented] (TIKA-3153) Text File identified as message/rfc822

2020-08-10 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175037#comment-17175037 ] Luís Filipe Nassif commented on TIKA-3153: -- Well, if we could add another wa

[jira] [Comment Edited] (TIKA-3153) Text File identified as message/rfc822

2020-08-10 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175037#comment-17175037 ] Luís Filipe Nassif edited comment on TIKA-3153 at 8/10/20, 8:1

[jira] [Commented] (TIKA-3153) Text File identified as message/rfc822

2020-08-11 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175523#comment-17175523 ] Luís Filipe Nassif commented on TIKA-3153: -- +1 > Text File identi

[jira] [Created] (TIKA-3174) When Tika parses an OFD document, extra numbers appear in addition to the body text

2020-08-19 Thread Jira
天空 created TIKA-3174: Summary: When Tika parses an OFD document, extra numbers appear in addition to the body text. Key: TIKA-3174 URL: https://issues.apache.org/jira/browse/TIKA-3174 Project: Tika Issue Type: Bug Reporter: 天空 Body text in the OFD document: 各地各

[jira] [Commented] (TIKA-3173) Tika server with spawnChild - server does not recover from OOM until an additional file comes in

2020-08-24 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183542#comment-17183542 ] Luís Filipe Nassif commented on TIKA-3173: -- I agree with [~tallison], we sh

[jira] [Comment Edited] (TIKA-3173) Tika server with spawnChild - server does not recover from OOM until an additional file comes in

2020-08-24 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183542#comment-17183542 ] Luís Filipe Nassif edited comment on TIKA-3173 at 8/24/20, 6:5

[jira] [Commented] (TIKA-3221) /rmeta/text endpoint - allow a "max parse time" parameter where after exceeded, return bytes/metadata managed to get up to that point

2020-11-05 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227095#comment-17227095 ] Luís Filipe Nassif commented on TIKA-3221: -- My 2 cents, in the past I

[jira] [Comment Edited] (TIKA-3221) /rmeta/text endpoint - allow a "max parse time" parameter where after exceeded, return bytes/metadata managed to get up to that point

2020-11-05 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227095#comment-17227095 ] Luís Filipe Nassif edited comment on TIKA-3221 at 11/6/20, 1:0

[jira] [Created] (TIKA-3236) Upgrade cxf-core to 3.3.8

2020-11-24 Thread Jira
Jesper Håsteen created TIKA-3236: Summary: Upgrade cxf-core to 3.3.8 Key: TIKA-3236 URL: https://issues.apache.org/jira/browse/TIKA-3236 Project: Tika Issue Type: Task Components

[jira] [Commented] (TIKA-3221) /rmeta/text endpoint - allow a "max parse time" parameter where after exceeded, return bytes/metadata managed to get up to that point

2020-11-25 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239012#comment-17239012 ] Luís Filipe Nassif commented on TIKA-3221: -- Sorry! Misunderstood the reques

[jira] [Created] (TIKA-3237) Great optimization in ForkParser

2020-11-26 Thread Jira
Luís Filipe Nassif created TIKA-3237: Summary: Great optimization in ForkParser Key: TIKA-3237 URL: https://issues.apache.org/jira/browse/TIKA-3237 Project: Tika Issue Type: Improvement

[jira] [Updated] (TIKA-3237) Great optimization in ForkParser

2020-11-26 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif updated TIKA-3237: - Description: There is a huge overhead in ForkParser ContentHandlerProxy and

[jira] [Resolved] (TIKA-3237) Great optimization in ForkParser

2020-11-26 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif resolved TIKA-3237. -- Fix Version/s: 2.0 Resolution: Fixed closed by

[jira] [Commented] (TIKA-3237) Great optimization in ForkParser

2020-11-26 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239412#comment-17239412 ] Luís Filipe Nassif commented on TIKA-3237: -- Should I open another issue for

[jira] [Resolved] (TIKA-3004) OutlookPSTParser missing emails attached to other emails

2020-11-26 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luís Filipe Nassif resolved TIKA-3004. -- Fix Version/s: 2.0 Resolution: Fixed resolved by

[jira] [Commented] (TIKA-3258) Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0

2021-01-07 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260846#comment-17260846 ] Luís Filipe Nassif commented on TIKA-3258: -- Regarding  [~tilman] concern, PD

[jira] [Comment Edited] (TIKA-3258) Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0

2021-01-07 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260846#comment-17260846 ] Luís Filipe Nassif edited comment on TIKA-3258 at 1/7/21, 9:5

[jira] [Commented] (TIKA-3270) Render non-text in PDFs for OCR

2021-01-13 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264228#comment-17264228 ] Luís Filipe Nassif commented on TIKA-3270: -- Is checking for missing ToUni

[jira] [Commented] (TIKA-3290) Extension reading it as eml instead of txt

2021-02-07 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280660#comment-17280660 ] Luís Filipe Nassif commented on TIKA-3290: -- I agree with [~nick] this sa

[jira] [Comment Edited] (TIKA-3290) Extension reading it as eml instead of txt

2021-02-07 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280660#comment-17280660 ] Luís Filipe Nassif edited comment on TIKA-3290 at 2/7/21, 9:1

[jira] [Commented] (TIKA-3290) Extension reading it as eml instead of txt

2021-02-07 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280662#comment-17280662 ] Luís Filipe Nassif commented on TIKA-3290: -- Looking into this, seems

[jira] [Comment Edited] (TIKA-3290) Extension reading it as eml instead of txt

2021-02-07 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280662#comment-17280662 ] Luís Filipe Nassif edited comment on TIKA-3290 at 2/7/21, 10:2

[jira] [Commented] (TIKA-3290) Extension reading it as eml instead of txt

2021-02-13 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284297#comment-17284297 ] Luís Filipe Nassif commented on TIKA-3290: -- Did you try the approach I sugge

[jira] [Commented] (TIKA-3300) Figure out if we can improve tesseract parallelization

2021-02-14 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284419#comment-17284419 ] Luís Filipe Nassif commented on TIKA-3300: -- I also set OMP_THREAD_LIMIT

[jira] [Commented] (TIKA-3290) Extension reading it as eml instead of txt

2021-02-17 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285829#comment-17285829 ] Luís Filipe Nassif commented on TIKA-3290: -- Currently I think it cannot be

[jira] [Comment Edited] (TIKA-3290) Extension reading it as eml instead of txt

2021-02-17 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285829#comment-17285829 ] Luís Filipe Nassif edited comment on TIKA-3290 at 2/17/21, 1:5

[jira] [Comment Edited] (TIKA-3300) Figure out if we can improve tesseract parallelization

2021-02-18 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284419#comment-17284419 ] Luís Filipe Nassif edited comment on TIKA-3300 at 2/18/21, 1:1

[jira] [Commented] (TIKA-3300) Figure out if we can improve tesseract parallelization

2021-02-18 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286543#comment-17286543 ] Luís Filipe Nassif commented on TIKA-3300: -- Hi [~tallison]! Just teste

[jira] [Commented] (TIKA-3300) Figure out if we can improve tesseract parallelization

2021-02-18 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286766#comment-17286766 ] Luís Filipe Nassif commented on TIKA-3300: -- So, as last tesseract version

[jira] [Commented] (TIKA-94) Speech recognition

2021-02-19 Thread Jira
[ https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17287356#comment-17287356 ] Luís Filipe Nassif commented on TIKA-94: maybe [https://github.com/moz
