nt operating systems you could have seen the same
results as myself.
I've no idea why it thinks "a d" is IBM500 in French with 98% confidence...
If you think it is wrong, raise a defect, but with a file of so few
characters I would expect some strange detection.
John
On Fri, 1
the issue, if my example tika-app works for
you but your tika-parser still doesn't work then it helps identify
where to look next
>
> Can you try adding this debug line;
> System.out.println("file.encoding=" + System.getProperty("file.encoding"));
> I will try it.
>
>
tor.setText(TikaInputStream.get(is));
> CharsetMatch detected = detector.detect();
>
> On Sat, Sep 5, 2020 at 12:13 AM John Patrick
> wrote:
>>
>> Have you tried 1.24.1?
>> Did it detect as a different type on an older version?
>> Have you tried it on another mach
Have you tried 1.24.1?
Did it detect as a different type on an older version?
Have you tried it on another machine...
Are other files being detected as expected?
What OS are you using and what Java version are you using?
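
A minimal sketch for gathering the details asked about above in one go. The class name EnvReport is only illustrative; the system property keys are the standard JVM ones.

```java
import java.nio.charset.Charset;

class EnvReport {
    // Builds a one-property-per-line report of the OS, JVM and default encoding.
    static String describe() {
        String[] keys = {"os.name", "os.version", "java.version", "java.vendor", "file.encoding"};
        StringBuilder sb = new StringBuilder();
        for (String key : keys) {
            sb.append(key).append('=').append(System.getProperty(key)).append('\n');
        }
        // The charset the JVM actually uses when none is given explicitly.
        sb.append("defaultCharset=").append(Charset.defaultCharset().name()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Pasting this output into a reply makes it easier to compare results between machines.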
As I've just done it with 1.24 and I'm getting ISO-8859-1. Here is my
This contains a POST example plus others, mainly string content rather than binary content:
https://openjdk.java.net/groups/net/httpclient/recipes.html
This contains examples of reading files:
https://docs.oracle.com/javase/tutorial/essential/io/file.html
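
A rough sketch combining the two links above, assuming Java 11+ and its java.net.http client: it builds (but does not send) a POST request whose body is streamed from a file, so binary content works as well as strings. The URL http://localhost:8080/upload and the class name FileUploadRequest are placeholders, not anything Tika defines.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileUploadRequest {
    // Builds a POST request that streams the file's bytes as the request body.
    static HttpRequest build(URI endpoint, Path file) {
        try {
            return HttpRequest.newBuilder(endpoint)
                    .header("Accept", "text/plain")
                    .POST(HttpRequest.BodyPublishers.ofFile(file))
                    .build();
        } catch (IOException e) {
            // ofFile throws FileNotFoundException when the file is missing.
            throw new UncheckedIOException(e);
        }
    }

    // Creates a small temporary file so the example runs without external input.
    static Path sampleFile() {
        try {
            Path sample = Files.createTempFile("sample", ".txt");
            Files.write(sample, "a d".getBytes(StandardCharsets.UTF_8));
            return sample;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical endpoint; replace with the real server address.
        HttpRequest request = build(URI.create("http://localhost:8080/upload"), sampleFile());
        System.out.println(request.method() + " " + request.uri());
        // To send it: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    }
}
```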
I'll look at doing a pull request to the tika
Could you recreate a test PDF with a similar page count and file size
that behaves the same as your real PDF? It's probably the only way
people can help unless you investigate it more yourself.
Also, as people have mentioned, PDFs are compressed, so when decompressed
they could be much larger, 100MB pdf
What jdk are you using?
Java 8? 11? 13? i.e. a version that is currently in active support
Are you using the latest release of that version?
Have you switched on GC logging and seen if that is the issue?
Is it constantly doing GC? You might need to tweak the arguments depending
on which GC you are using.
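
GC logging itself is switched on with JVM flags (-Xlog:gc* on Java 9 and later, -verbose:gc on Java 8). As a rough in-process alternative, the sketch below reads the collector counters through the standard java.lang.management API before and after some work. The class name GcSnapshot and the throwaway allocation loop are only illustrative.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcSnapshot {
    // Total number of collections across all collectors; undefined (-1) values are skipped.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) total += count;
        }
        return total;
    }

    // Approximate accumulated collection time in milliseconds across all collectors.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) total += millis;
        }
        return total;
    }

    public static void main(String[] args) {
        long countBefore = totalCollections();
        long millisBefore = totalCollectionMillis();
        // Placeholder workload: allocate short-lived arrays. Put the real parsing call here.
        long checksum = 0;
        for (int i = 0; i < 20_000; i++) {
            byte[] junk = new byte[64 * 1024];
            checksum += junk.length;
        }
        System.out.println("allocated bytes: " + checksum);
        System.out.println("collections during work: " + (totalCollections() - countBefore));
        System.out.println("gc millis during work: " + (totalCollectionMillis() - millisBefore));
    }
}
```

If the GC time is a large share of the wall-clock time for the parse, heap size or collector settings are worth a look.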
What java version are you using?
What size documents are you using?
Do you have sample files?
How frequently are you doing the conversion? Sometimes performance
improves after the 1st document but is always slow for the 1st
document.
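
One way to check for that first-document effect is to time several runs in the same JVM and compare the first with the rest. This is only a sketch: WarmupTimer is a made-up helper and the workload in main is a stand-in for the real conversion call.

```java
import java.util.ArrayList;
import java.util.List;

class WarmupTimer {
    // Runs the task the given number of times and returns each run's elapsed nanoseconds.
    static List<Long> time(Runnable task, int runs) {
        List<Long> elapsed = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed.add(System.nanoTime() - start);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Placeholder workload; substitute the real document conversion here.
        Runnable conversion = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i % 10);
            }
            if (sb.length() != 100_000) throw new IllegalStateException("unexpected length");
        };
        List<Long> timings = time(conversion, 5);
        for (int i = 0; i < timings.size(); i++) {
            System.out.println("run " + (i + 1) + ": " + timings.get(i) / 1_000 + " us");
        }
    }
}
```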
I had issues myself previously and either upgraded the java
Just to confirm the misspelt words: if you open the Word doc, do
you see them spelt the same way?
e.g.
1) is the Word doc wrong?
2) is Tika renaming something incorrectly?
If it's (2), then I would patch Tika to correct the parser.
If it's (1), then I would extend the current parser being
com> wrote:
> The Breaks which i am trying to parse are those line present before
> Experience or Skills & Expertise (in attached pdf) but there is no
> indication of these lines when i am parsing the pdf through tika.
>
> On Thu, Jan 5, 2017 at 4:50 PM, John Patrick <n
Tika is a jar file.
Tesseract is a set of native libraries installed into the operating system,
which Tika picks up.
If you're referring to "Tesseract needs to be a jar file that can be installed
on our Power Server", you'll need to go to the Tesseract project and see if
they will implement those
is up - I'd run into the same issue when working in
> Tika 2.0, and had made a note to discuss on the list.
>
> SLF4J would be my preferred option as well.
>
> Wondering if we could configure the build to check for use of other
> logging frameworks besides SLF4J?
>
> -- Ken
>
> --
Tika appears to use two logging frameworks, Commons Logging and SLF4J.
Is that correct?
Commons Logging is used by;
tika-app
tika-parsers
tika-server
SLF4J is used by;
tika-batch
tika-core
tika-parsers
tika-translate
If I do a patch which way should I refactor? My personal preference is to
use
at 13:30, John Patrick <nhoj.patr...@gmail.com> wrote:
> Cheers for replies, I now understand how tika developers intended
> tika-server should be used but for the custom code we have written we need
> to use a few classes that only live in tika-server.
>
> For Jackson I've do
://issues.apache.org/jira/browse/TIKA-1868
https://github.com/apache/tika/pull/75
Jackson upgrade;
https://issues.apache.org/jira/browse/TIKA-1869
https://github.com/apache/tika/pull/76
Cheers,
John
On 23 February 2016 at 19:41, Nick Burch <apa...@gagravarr.org> wrote:
> On Tue, 23 Feb 2016, John Patr
hiya,
I'm working with an existing code base that is using Jackson 2.6.3. I'm now
adding Tika, but because the tika-server jar contains Jackson 2.4.0 I'm having
lots of compile issues.
1) Was it intentional to have a bloated/fat tika-server jar containing all
dependencies?
2) Can tika be upgraded to