[jira] [Comment Edited] (TIKA-3154) Exception while extracting msg files

2020-08-19 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180782#comment-17180782 ] Akash edited comment on TIKA-3154 at 8/19/20, 7:57 PM: --- Tried with b

[jira] [Commented] (TIKA-3154) Exception while extracting msg files

2020-08-19 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180782#comment-17180782 ] Akash commented on TIKA-3154: - Tried with the config below. It did not help. {code:java} /

[jira] [Comment Edited] (TIKA-3154) Exception while extracting msg files

2020-08-19 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180782#comment-17180782 ] Akash edited comment on TIKA-3154 at 8/19/20, 7:56 PM: --- Tried with b

[jira] [Commented] (TIKA-3154) Exception while extracting msg files

2020-08-19 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180512#comment-17180512 ] Akash commented on TIKA-3154: - Can we make this value configurable? > Exception while extrac

[jira] [Commented] (TIKA-3172) PDF Parser configuration enable auto space using tika config file

2020-08-19 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180449#comment-17180449 ] Akash commented on TIKA-3172: - Thanks [~tallison] for the clarification. > PDF Parser configu

[jira] [Comment Edited] (TIKA-3172) PDF Parser configuration enable auto space using tika config file

2020-08-19 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180343#comment-17180343 ] Akash edited comment on TIKA-3172 at 8/19/20, 7:46 AM: --- [~tallison] 

[jira] [Commented] (TIKA-3172) PDF Parser configuration enable auto space using tika config file

2020-08-19 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180343#comment-17180343 ] Akash commented on TIKA-3172: - [~tallison] If we use the above-mentioned tika config file to ext

[jira] [Commented] (TIKA-3172) PDF Parser configuration enable auto space using tika config file

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179993#comment-17179993 ] Akash commented on TIKA-3172: - Ok [~tilman]. If possible we can have that option. It will be a

[jira] [Closed] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash closed TIKA-3170. --- Fix Version/s: 1.25 Resolution: Duplicate Duplicate of TIKA-3131 > PDF extraction space issue > --

[jira] [Comment Edited] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179990#comment-17179990 ] Akash edited comment on TIKA-3170 at 8/18/20, 6:07 PM: --- Seems issue

[jira] [Commented] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179990#comment-17179990 ] Akash commented on TIKA-3170: - It seems the issue was already fixed as part of this commit - [https:/

[jira] [Commented] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179679#comment-17179679 ] Akash commented on TIKA-3170: - Difference because of    !image-2020-08-18-20-23-16-159.png!

[jira] [Updated] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash updated TIKA-3170: Attachment: image-2020-08-18-20-23-16-159.png > PDF extraction space issue > -- > >

[jira] [Commented] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179666#comment-17179666 ] Akash commented on TIKA-3170: - [https://github.com/apache/tika/compare/35a2cd35129db3aae58fd65

[jira] [Commented] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179656#comment-17179656 ] Akash commented on TIKA-3170: - One more observation. The extracted output remains the same from tika app

[jira] [Commented] (TIKA-3170) PDF extraction space issue

2020-08-18 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179469#comment-17179469 ] Akash commented on TIKA-3170: - Tried extracting using pdfbox-app jar for both versions. Observ

[jira] [Commented] (TIKA-3170) PDF extraction space issue

2020-08-17 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179392#comment-17179392 ] Akash commented on TIKA-3170: - Command - java -jar tika-app-1.24.1.jar --config=tika-config.xm

[jira] [Commented] (TIKA-3170) PDF extraction space issue

2020-08-17 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179221#comment-17179221 ] Akash commented on TIKA-3170: - If we set enableAutoSpace as false, results are looking much be

[jira] [Created] (TIKA-3172) PDF Parser configuration enable auto space using tika config file

2020-08-17 Thread Akash (Jira)
Akash created TIKA-3172: --- Summary: PDF Parser configuration enable auto space using tika config file Key: TIKA-3172 URL: https://issues.apache.org/jira/browse/TIKA-3172 Project: Tika Issue Type: Wish

[jira] [Created] (TIKA-3170) PDF extraction space issue

2020-08-17 Thread Akash (Jira)
Akash created TIKA-3170: --- Summary: PDF extraction space issue Key: TIKA-3170 URL: https://issues.apache.org/jira/browse/TIKA-3170 Project: Tika Issue Type: Bug Components: parser Affects

[jira] [Comment Edited] (TIKA-3154) Exception while extracting msg files

2020-08-17 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178797#comment-17178797 ] Akash edited comment on TIKA-3154 at 8/17/20, 7:33 AM: --- [~tallison] 

[jira] [Commented] (TIKA-3154) Exception while extracting msg files

2020-08-17 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178797#comment-17178797 ] Akash commented on TIKA-3154: - [~tallison] Can we make this a configuration parameter rathe

[jira] [Commented] (TIKA-3155) Parse Error while extracting CSV files

2020-08-10 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175289#comment-17175289 ] Akash commented on TIKA-3155: - [https://github.com/apache/tika/blob/main/tika-parsers/src/main

[jira] [Created] (TIKA-3155) Parse Error while extracting CSV files

2020-08-10 Thread Akash (Jira)
Akash created TIKA-3155: --- Summary: Parse Error while extracting CSV files Key: TIKA-3155 URL: https://issues.apache.org/jira/browse/TIKA-3155 Project: Tika Issue Type: Bug Components: parser

[jira] [Commented] (TIKA-3154) Exception while extracting msg files

2020-08-10 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175035#comment-17175035 ] Akash commented on TIKA-3154: - Thanks [~tallison] > Exception while extracting msg files > --

[jira] [Updated] (TIKA-3154) Exception while extracting msg files

2020-08-07 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash updated TIKA-3154: Description: While parsing an msg file containing some HTML text, we get an exception from Tika. Command

[jira] [Created] (TIKA-3154) Exception while extracting msg files

2020-08-07 Thread Akash (Jira)
Akash created TIKA-3154: --- Summary: Exception while extracting msg files Key: TIKA-3154 URL: https://issues.apache.org/jira/browse/TIKA-3154 Project: Tika Issue Type: Bug Components: parser

[jira] [Updated] (TIKA-3153) Text File identified as message/rfc822

2020-08-07 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash updated TIKA-3153: Description: Text file containing the word Received: is identified as message/rfc822. We were earlier using version

[jira] [Created] (TIKA-3153) Text File identified as message/rfc822

2020-08-07 Thread Akash (Jira)
Akash created TIKA-3153: --- Summary: Text File identified as message/rfc822 Key: TIKA-3153 URL: https://issues.apache.org/jira/browse/TIKA-3153 Project: Tika Issue Type: Bug Components: detecto

[jira] [Updated] (TIKA-3153) Text File identified as message/rfc822

2020-08-07 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash updated TIKA-3153: Description: Text file containing the word Received: is identified as message/rfc822. We were earlier using version

[jira] [Commented] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-26 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045537#comment-17045537 ] Akash commented on TIKA-3048: - Can someone please update the fix version? > Tika unable to parse

[jira] [Commented] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-26 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045533#comment-17045533 ] Akash commented on TIKA-3048: - It worked with tika 1.23. It does not work with 1.9 > Tika una

[jira] [Updated] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-26 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash updated TIKA-3048: Fix Version/s: 1.9 > Tika unable to parse html files with non UTF-8 charset > --

[jira] [Closed] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-26 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash closed TIKA-3048. --- > Tika unable to parse html files with non UTF-8 charset > -- > >

[jira] [Resolved] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-26 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash resolved TIKA-3048. - Resolution: Fixed > Tika unable to parse html files with non UTF-8 charset > -

[jira] [Commented] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-26 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045503#comment-17045503 ] Akash commented on TIKA-3048: - With the change [~tallison] provided in this comment  https://i

[jira] [Commented] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-24 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044199#comment-17044199 ] Akash commented on TIKA-3048: - Tested on different languages again. Here is the observation. 

[jira] [Commented] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-23 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042825#comment-17042825 ] Akash commented on TIKA-3048: - [~tallison] Let me try the above option once. https://issues.a

[jira] [Commented] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-20 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041176#comment-17041176 ] Akash commented on TIKA-3048: - Attached [^ChineseFile.html] which contains some Chinese text u

[jira] [Updated] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-20 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash updated TIKA-3048: Attachment: ChineseFile.html > Tika unable to parse html files with non UTF-8 charset >

[jira] [Commented] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-20 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040850#comment-17040850 ] Akash commented on TIKA-3048: - If we remove the charset header from html, tika parsing works f

[jira] [Updated] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-20 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash updated TIKA-3048: Description: Tika returns junk characters when parsing Chinese characters present inside an HTML file. Html file

[jira] [Updated] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

2020-02-20 Thread Akash (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash updated TIKA-3048: Summary: Tika unable to parse html files with non UTF-8 charset (was: Tika unable to parse html files with GB2312 c

[jira] [Created] (TIKA-3048) Tika unable to parse html files with GB2312 charset

2020-02-19 Thread Akash (Jira)
Akash created TIKA-3048: --- Summary: Tika unable to parse html files with GB2312 charset Key: TIKA-3048 URL: https://issues.apache.org/jira/browse/TIKA-3048 Project: Tika Issue Type: Bug Compon