[
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180782#comment-17180782
]
Akash edited comment on TIKA-3154 at 8/19/20, 7:57 PM:
---
Tried with b
[
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180782#comment-17180782
]
Akash commented on TIKA-3154:
-
Tried with the config below. It did not help.
{code:java}
/
[
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180782#comment-17180782
]
Akash edited comment on TIKA-3154 at 8/19/20, 7:56 PM:
---
Tried with b
[
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180512#comment-17180512
]
Akash commented on TIKA-3154:
-
Can we make this value configurable?
> Exception while extrac
[
https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180449#comment-17180449
]
Akash commented on TIKA-3172:
-
Thanks [~tallison] for the clarification.
> PDF Parser configu
[
https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180343#comment-17180343
]
Akash edited comment on TIKA-3172 at 8/19/20, 7:46 AM:
---
[~tallison]
[
https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17180343#comment-17180343
]
Akash commented on TIKA-3172:
-
[~tallison]
If we use above mentioned tika config file to ext
[
https://issues.apache.org/jira/browse/TIKA-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179993#comment-17179993
]
Akash commented on TIKA-3172:
-
Ok [~tilman]. If possible we can have that option. It will be a
[
https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akash closed TIKA-3170.
---
Fix Version/s: 1.25
Resolution: Duplicate
Duplicate of TIKA-3131
> PDF extraction space issue
> --
[
https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179990#comment-17179990
]
Akash edited comment on TIKA-3170 at 8/18/20, 6:07 PM:
---
Seems issue
[
https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179990#comment-17179990
]
Akash commented on TIKA-3170:
-
It seems the issue is already fixed as part of this commit -
[https:/
[
https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179679#comment-17179679
]
Akash commented on TIKA-3170:
-
Difference because of
!image-2020-08-18-20-23-16-159.png!
[
https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akash updated TIKA-3170:
Attachment: image-2020-08-18-20-23-16-159.png
> PDF extraction space issue
> --
>
>
[
https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179666#comment-17179666
]
Akash commented on TIKA-3170:
-
[https://github.com/apache/tika/compare/35a2cd35129db3aae58fd65
[
https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179656#comment-17179656
]
Akash commented on TIKA-3170:
-
1 more observation. Extracted output remains same from tika app
[
https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179469#comment-17179469
]
Akash commented on TIKA-3170:
-
Tried extracting using pdfbox-app jar for both versions. Observ
[
https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179392#comment-17179392
]
Akash commented on TIKA-3170:
-
Command - java -jar tika-app-1.24.1.jar --config=tika-config.xm
[
https://issues.apache.org/jira/browse/TIKA-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179221#comment-17179221
]
Akash commented on TIKA-3170:
-
If we set enableAutoSpace as false, results are looking much be
Akash created TIKA-3172:
---
Summary: PDF Parser configuration enable auto space using tika
config file
Key: TIKA-3172
URL: https://issues.apache.org/jira/browse/TIKA-3172
Project: Tika
Issue Type: Wish
Akash created TIKA-3170:
---
Summary: PDF extraction space issue
Key: TIKA-3170
URL: https://issues.apache.org/jira/browse/TIKA-3170
Project: Tika
Issue Type: Bug
Components: parser
Affects
[
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178797#comment-17178797
]
Akash edited comment on TIKA-3154 at 8/17/20, 7:33 AM:
---
[~tallison]
[
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178797#comment-17178797
]
Akash commented on TIKA-3154:
-
[~tallison] Can we make this as a configuration parameter rathe
[
https://issues.apache.org/jira/browse/TIKA-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175289#comment-17175289
]
Akash commented on TIKA-3155:
-
[https://github.com/apache/tika/blob/main/tika-parsers/src/main
Akash created TIKA-3155:
---
Summary: Parse Error while extracting CSV files
Key: TIKA-3155
URL: https://issues.apache.org/jira/browse/TIKA-3155
Project: Tika
Issue Type: Bug
Components: parser
[
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175035#comment-17175035
]
Akash commented on TIKA-3154:
-
Thanks [~tallison]
> Exception while extracting msg files
> --
[
https://issues.apache.org/jira/browse/TIKA-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akash updated TIKA-3154:
Description:
While parsing a msg file containing some HTML text, we get an
exception from Tika.
Command
Akash created TIKA-3154:
---
Summary: Exception while extracting msg files
Key: TIKA-3154
URL: https://issues.apache.org/jira/browse/TIKA-3154
Project: Tika
Issue Type: Bug
Components: parser
[
https://issues.apache.org/jira/browse/TIKA-3153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akash updated TIKA-3153:
Description:
A text file containing the word Received: is identified as message/rfc822.
We were earlier using version
Akash created TIKA-3153:
---
Summary: Text File identified as message/rfc822
Key: TIKA-3153
URL: https://issues.apache.org/jira/browse/TIKA-3153
Project: Tika
Issue Type: Bug
Components: detecto
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045537#comment-17045537
]
Akash commented on TIKA-3048:
-
Can someone please update the fix version?
> Tika unable to parse
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045533#comment-17045533
]
Akash commented on TIKA-3048:
-
It worked with Tika 1.23. It does not work with 1.9
> Tika una
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akash updated TIKA-3048:
Fix Version/s: 1.9
> Tika unable to parse html files with non UTF-8 charset
> --
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akash closed TIKA-3048.
---
> Tika unable to parse html files with non UTF-8 charset
> --
>
>
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akash resolved TIKA-3048.
-
Resolution: Fixed
> Tika unable to parse html files with non UTF-8 charset
> -
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045503#comment-17045503
]
Akash commented on TIKA-3048:
-
With the change [~tallison] provided in this comment
https://i
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044199#comment-17044199
]
Akash commented on TIKA-3048:
-
Tested on different languages again. Here is the observation.
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042825#comment-17042825
]
Akash commented on TIKA-3048:
-
[~tallison] Let me try the above option once.
https://issues.a
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041176#comment-17041176
]
Akash commented on TIKA-3048:
-
Attached [^ChineseFile.html] which contains some Chinese text u
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akash updated TIKA-3048:
Attachment: ChineseFile.html
> Tika unable to parse html files with non UTF-8 charset
>
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040850#comment-17040850
]
Akash commented on TIKA-3048:
-
If we remove the charset header from html, tika parsing works f
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akash updated TIKA-3048:
Description:
Tika returns junk characters when parsing Chinese characters present
inside an HTML file. HTML file
[
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Akash updated TIKA-3048:
Summary: Tika unable to parse html files with non UTF-8 charset (was: Tika
unable to parse html files with GB2312 c
Akash created TIKA-3048:
---
Summary: Tika unable to parse html files with GB2312 charset
Key: TIKA-3048
URL: https://issues.apache.org/jira/browse/TIKA-3048
Project: Tika
Issue Type: Bug
Compon