Re: tika parser detecting "IBM500" for small files

2020-09-12 Thread John Patrick
different operating systems you could have seen the same results as me. I've no idea why it thinks "a d" is IBM500 in French with 98% confidence... If you think it is wrong, raise a defect, but with a file of so few characters I would expect some strange detection. John On Fri, 1
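
For illustration (a sketch, not from the original mail; assumes Tika 1.x with tika-parsers on the classpath), the candidate charsets and their confidences for a tiny input such as "a d" can be listed like this:

    import java.nio.charset.StandardCharsets;
    import org.apache.tika.parser.txt.CharsetDetector;
    import org.apache.tika.parser.txt.CharsetMatch;

    public class TinyInputDetection {
        public static void main(String[] args) {
            CharsetDetector detector = new CharsetDetector();
            detector.setText("a d".getBytes(StandardCharsets.US_ASCII));
            // detectAll() returns every candidate; with so little data several
            // unrelated encodings can score surprisingly high confidences.
            for (CharsetMatch match : detector.detectAll()) {
                System.out.println(match.getName() + " / " + match.getLanguage()
                        + " (confidence " + match.getConfidence() + ")");
            }
        }
    }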

Re: tika parser detecting "IBM500" for small files

2020-09-11 Thread John Patrick
the issue, if my example with tika-app works for you but your tika-parsers code still doesn't work then it helps identify where to look next > > Can you try adding this debug line: > System.out.println("file.encoding=" + System.getProperty("file.encoding")); > I will try it. > >

Re: tika parser detecting "IBM500" for small files

2020-09-08 Thread John Patrick
detector.setText(TikaInputStream.get(is)); > CharsetMatch detected = detector.detect(); > > On Sat, Sep 5, 2020 at 12:13 AM John Patrick > wrote: >> >> Have you tried 1.24.1? >> Did it detect as a different type on an older version? >> Have you tried it on another machine...

Re: tika parser detecting "IBM500" for small files

2020-09-04 Thread John Patrick
Have you tried 1.24.1? Did it detect as a different type on an older version? Have you tried it on another machine... Are other files being detected as expected? What OS are you using and what Java version are you using? I've just tried it with 1.24 and I'm getting ISO-8859-1. Here is my

Re: Anyone can share an example of Java code POSTing a file to Tika-Server?

2020-02-04 Thread John Patrick
This contains POST examples plus others, mainly string content rather than binary content: https://openjdk.java.net/groups/net/httpclient/recipes.html This contains examples of reading files: https://docs.oracle.com/javase/tutorial/essential/io/file.html I'll look at doing a pull request to the tika
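
A sketch of the request itself (not from the original mail; assumes Java 11+ java.net.http, a tika-server on the default port 9998, and "example.pdf" as a placeholder file name). The /tika endpoint takes the raw file bytes as the request body, the same thing curl -T does, and returns extracted text when Accept: text/plain is sent:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Paths;

    public class TikaServerClient {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9998/tika"))
                    .header("Accept", "text/plain")
                    // the raw (binary) file content goes straight into the request body
                    .PUT(HttpRequest.BodyPublishers.ofFile(Paths.get("example.pdf")))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }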

Re: Parsing huge PDF (400Mb, 2700 pages)

2019-11-14 Thread John Patrick
Could you recreate a test PDF with a similar page count and file size that behaves the same as your real PDF? It's probably the only way people can help unless you investigate it further yourself. Also, as people have mentioned, PDFs are compressed, so when decompressed they could be much larger; a 100MB pdf
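
One way to build such a synthetic PDF (a sketch, not from the original thread; assumes PDFBox 2.x, with the page count, filler text and output name as placeholders):

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDPageContentStream;
    import org.apache.pdfbox.pdmodel.font.PDType1Font;

    public class SyntheticPdf {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = new PDDocument()) {
                for (int i = 1; i <= 2700; i++) {        // match the real page count
                    PDPage page = new PDPage();
                    doc.addPage(page);
                    try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                        cs.beginText();
                        cs.setFont(PDType1Font.HELVETICA, 12);
                        cs.newLineAtOffset(50, 700);
                        cs.showText("Filler text for page " + i);
                        cs.endText();
                    }
                }
                doc.save("synthetic-test.pdf");
            }
        }
    }

Getting the file size up towards 400MB would also need embedded images or fonts; pages of plain text compress very well.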

Re: Parsing huge PDF (400Mb, 2700 pages)

2019-11-14 Thread John Patrick
What JDK are you using? Java 8? 11? 13? i.e. a version that is currently in active support. Are you using the latest release of that version? Have you switched on GC logging and seen if that is the issue, i.e. constantly doing GC? You might need to tweak the arguments depending on which GC you are using.
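
For reference (not from the original mail): GC logging is typically switched on with -Xlog:gc*:file=gc.log on Java 9 and later, or -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log on Java 8, for example (heap size, jar and file names are placeholders):

    java -Xmx4g -Xlog:gc*:file=gc.log -jar tika-app.jar --text huge.pdf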

Re: Tika Performance in 1.9

2018-05-16 Thread John Patrick
What Java version are you using? What size documents are you using? Do you have sample files? How frequently are you doing the conversion? Sometimes performance improves after the 1st document but it is always slow for the 1st document. I had issues myself previously and either upgraded the Java
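
A minimal timing harness (not from the original mail) that makes the first-document effect visible: the first parse in a fresh JVM pays class loading and JIT warm-up, so later documents are usually faster.

    import java.io.File;
    import org.apache.tika.Tika;

    public class ReuseTika {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();      // create once and reuse; the first call is typically the slowest
            for (String path : args) {
                long start = System.nanoTime();
                String text = tika.parseToString(new File(path));
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.println(path + ": " + text.length() + " chars in " + ms + " ms");
            }
        }
    }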

Re: . Extending Tika

2017-08-26 Thread John Patrick
Just to confirm the misspelt words: if you open the Word doc, do you see them spelt the same way? i.e. 1) is the Word doc wrong, or 2) is Tika renaming something incorrectly? If it's (2), then I would patch Tika to correct the parser. If it's (1), then I would extend the current parser being

Re: Fwd: Tika not parsing underlines

2017-01-05 Thread John Patrick
com> wrote: > The breaks which I am trying to parse are those lines present before > Experience or Skills & Expertise (in the attached PDF), but there is no > indication of these lines when I am parsing the PDF through Tika. > > On Thu, Jan 5, 2017 at 4:50 PM, John Patrick <n

Re: Extract Text from a TIFF image

2016-07-21 Thread John Patrick
Tika is a jar file. Tesseract is a set of native libraries installed into the operating system that are picked up by Tika. If you're saying that Tesseract needs to be a jar file that can be installed on your Power Server, you'll need to go to the Tesseract project and see if they will implement those
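
If Tesseract is installed natively but not on the default search path, its location can be passed to Tika explicitly. A sketch (not from the original mail; assumes Tika 1.x with tika-parsers, and the binary directory and file name are placeholders):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.sax.BodyContentHandler;

    public class TiffOcr {
        public static void main(String[] args) throws Exception {
            TesseractOCRConfig config = new TesseractOCRConfig();
            config.setTesseractPath("/usr/local/bin/");  // placeholder: directory holding the tesseract binary
            ParseContext context = new ParseContext();
            context.set(TesseractOCRConfig.class, config);

            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get("scan.tiff"))) {
                parser.parse(in, handler, metadata, context);
            }
            System.out.println(handler.toString());
        }
    }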

Re: Logging

2016-03-03 Thread John Patrick
is up - I'd run into the same issue when working in > Tika 2.0, and had made a note to discuss on the list. > > SLF4J would be my preferred option as well. > > Wondering if we could configure the build to check for use of other > logging frameworks besides SLF4J? > > -- Ken > > --

Logging

2016-03-02 Thread John Patrick
Tika appears to use two logging frameworks, Commons Logging and SLF4J. Is that correct? Commons Logging is used by: tika-app, tika-parsers, tika-server. SLF4J is used by: tika-batch, tika-core, tika-parsers, tika-translate. If I do a patch, which way should I refactor? My personal preference is to use
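
If the refactor goes the SLF4J way, the target idiom in each class would be roughly the following (illustrative only; SomeParser is a made-up name):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class SomeParser {
        private static final Logger LOG = LoggerFactory.getLogger(SomeParser.class);

        void parse() {
            LOG.debug("parsing started");
        }
    }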

Re: Jackson & Fat tika-server jar question

2016-02-25 Thread John Patrick
at 13:30, John Patrick <nhoj.patr...@gmail.com> wrote: > Cheers for the replies, I now understand how the Tika developers intended > tika-server to be used, but for the custom code we have written we need > to use a few classes that only live in tika-server. > > For Jackson I've do

Re: Jackson & Fat tika-server jar question

2016-02-24 Thread John Patrick
https://issues.apache.org/jira/browse/TIKA-1868 https://github.com/apache/tika/pull/75 Jackson upgrade: https://issues.apache.org/jira/browse/TIKA-1869 https://github.com/apache/tika/pull/76 Cheers, John On 23 February 2016 at 19:41, Nick Burch <apa...@gagravarr.org> wrote: > On Tue, 23 Feb 2016, John Patr

Jackson & Fat tika-server jar question

2016-02-23 Thread John Patrick
Hiya, I'm working with an existing code base that is using Jackson 2.6.3. Now adding Tika, but because the tika-server jar contains Jackson 2.4.0 I'm having lots of compile issues. 1) Was it intentional to have a bloated/fat tika-server jar containing all dependencies? 2) Can Tika be upgraded to
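
A quick way to check which Jackson actually wins on the runtime classpath when the fat tika-server jar is present (a sketch, not from the original mail):

    import com.fasterxml.jackson.core.json.PackageVersion;

    public class JacksonVersionCheck {
        public static void main(String[] args) {
            // the jackson-core version baked into whichever jar is loaded first
            System.out.println(PackageVersion.VERSION);
            // and the jar it was loaded from
            System.out.println(PackageVersion.class.getProtectionDomain()
                    .getCodeSource().getLocation());
        }
    }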