Re: Selectively skipping parts of huge pages

2024-09-02 Thread Markus Jelsma
Just for reference, here's another nasty example. Different site, same
disorder:
https://www.bautzenerbote.de/250-jahre-dachrinnen-in-tschechien/

On Fri, Aug 30, 2024 at 11:40, Markus Jelsma wrote:

> Hello,
>
> Using Tika and a custom ContentHandler we're parsing messy HTML into
> readable text. We have a limit on the number of block-type elements, and
> some line elements (li in this case), that we are willing to parse. This
> causes a document [1] to be skipped entirely, with no useful text extracted
> from it. This page has thousands of articles neatly listed in li's in its
> header, so the limit of 2k is reached and everything else is skipped.
>
> Does anyone know of some clever tricks to deal with it? Semantically there
> is nothing wrong with the page having a huge article listing, but of course
> it is not a very smart move to deliver such HTML; even my browsers get
> bogged down by it.
>
> Thanks,
> Markus
>
> [1]
> https://lobjectif.net/la-pratique-deliberee-au-dela-du-mythe-de-la-maitrise/
>


Selectively skipping parts of huge pages

2024-08-30 Thread Markus Jelsma
Hello,

Using Tika and a custom ContentHandler we're parsing messy HTML into
readable text. We have a limit on the number of block-type elements, and
some line elements (li in this case), that we are willing to parse. This
causes a document [1] to be skipped entirely, with no useful text extracted
from it. This page has thousands of articles neatly listed in li's in its
header, so the limit of 2k is reached and everything else is skipped.

Does anyone know of some clever tricks to deal with it? Semantically there
is nothing wrong with the page having a huge article listing, but of course
it is not a very smart move to deliver such HTML; even my browsers get
bogged down by it.

Thanks,
Markus

[1]
https://lobjectif.net/la-pratique-deliberee-au-dela-du-mythe-de-la-maitrise/
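One workaround sketch (not from the thread; class name and limits are hypothetical): rather than aborting the whole parse once the element budget is hit, a ContentHandler can stop forwarding character events for over-budget li elements and keep extracting the rest of the page.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: mute text inside li elements past a budget instead of
// skipping the whole document.
public class LiBudgetHandler extends DefaultHandler {
    private final int maxLi;
    private int liSeen = 0;
    private boolean muted = false;
    private final StringBuilder text = new StringBuilder();

    public LiBudgetHandler(int maxLi) {
        this.maxLi = maxLi;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("li".equals(qName) && ++liSeen > maxLi) {
            muted = true; // over budget: swallow this item's text
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("li".equals(qName)) {
            muted = false; // resume once the over-budget item closes
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!muted) {
            text.append(ch, start, length);
        }
    }

    public String getText() {
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><ul><li>a</li><li>b</li><li>c</li></ul><p>body</p></html>";
        LiBudgetHandler handler = new LiBudgetHandler(2);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(html)), handler);
        System.out.println(handler.getText()); // text of the third li is dropped
    }
}
```

The same idea could apply to the custom ReadableContentHandler mentioned above: track a per-element-type budget and suppress output for the overflow instead of bailing out of the document.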


Re: Script tag contents not always reported in ContentHandler

2024-05-30 Thread Markus Jelsma
Hello Tim,

Nothing to apologize for. It is the embedded type="ld+json" script containing
Microdata in which we are not interested. Tika has no problem reporting it
in some cases, but in many cases we don't get the characters. We do get the
startElement reported. A product page on ritel [1] is one example. I'll
send you the actual HTML file just in case the online version has changed
since we last downloaded it.

If we enable schema.elementType("script", HTMLSchema.M_ANY, 255, 0); we do
get the JSON blob reported. But then the order of the elements in the head,
as received by startElement, is suddenly different.

Many thanks already!
Markus

[1] https://www.ritel.nl/samsung/galaxy-a15/
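For reference (not part of the original mail), the kind of tika-config.xml Tim's link points at might look roughly like this; the extractScripts parameter name is an assumption mirroring HtmlParser.setExtractScripts() and is not confirmed in this thread:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- swap in the JsoupParser class for 3.x, as suggested in the reply below -->
    <parser class="org.apache.tika.parser.html.HtmlParser">
      <params>
        <!-- assumed param name, mirroring HtmlParser#setExtractScripts -->
        <param name="extractScripts" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```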

On Thu, May 30, 2024 at 14:53, Tim Allison wrote:

> Markus,
>   I'm sorry for my delay. We're migrating to jsoup in 3.x. I realize that
> 3.x isn't out yet, but I wanted to give you a heads up.
>
>   To extract scripts in 3.x, you'd do something like this:
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/resources/org/apache/tika/parser/html/tika-config.xml
>
>   You should be able to swap in the HtmlParser for the JsoupParser in that
> config and be good to go.
>
>   Are you able to share an example html with me, even if only privately? I
> _think_ we have a unit test for script handling in 2.x and 3.x, and it
> _should_ work.
>
>   Best,
>
> Tim
>
> On Wed, May 29, 2024 at 9:37 AM Markus Jelsma 
> wrote:
>
>> So I found HtmlParser.setExtractScripts(); this sounds very promising!
>> I changed the code to use HtmlParser instead of AutoDetectParser and set
>> the flag to true. Unfortunately, the script's contents were still not
>> reported in the characters method. No idea why.
>>
>> I also found TagSoup's Parser.CDATAElementsFeature
>> <https://javadoc.io/static/org.ccil.cowan.tagsoup/tagsoup/1.2.1/org/ccil/cowan/tagsoup/Parser.html#CDATAElementsFeature>
>> constant. It seems to be the same as
>> http://www.ccil.org/~cowan/tagsoup/features/cdata-elements: "A value of
>> 'true' indicates that the parser will process the script and style
>> elements (or any elements with type='cdata' in the TSSL schema) as SGML
>> CDATA elements (that is, no markup is recognized except the matching
>> end-tag)."
>>
>> Sounds promising, or at least something to try. But how exactly do we set
>> that parameter from code, or in tika-config.xml if that is better? It
>> isn't really obvious at the moment.
>>
>> Many thanks,
>> Markus
>>
>>
>>
>> On Tue, May 28, 2024 at 12:19, Markus Jelsma <
>> markus.jel...@openindex.io> wrote:
>>
>>> Hello,
>>>
>>> We're using Tika to parse HTML via a custom ContentHandler. This works
>>> really well. Except that in some cases we do not get the contents of script
>>> tags in the head reported in the characters() method in the ContentHandler.
>>>
>>> We're using this code:
>>> TikaConfig tikaConfig = new
>>> TikaConfig(SAXTestCase.class.getResourceAsStream("/tika-config.xml"));
>>> Schema schema = new HTMLSchema();
>>> ParseContext context = new ParseContext();
>>> context.set(Schema.class, schema);
>>> context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
>>> Metadata metadata = new Metadata();
>>> ReadableContentHandler handler = new ReadableContentHandler(url, config);
>>> AutoDetectParser parser = new AutoDetectParser(tikaConfig);
>>> InputStream stream = SAXTestCase.class.getResourceAsStream(path);
>>> parser.parse(stream, handler, metadata, context);
>>>
>>> If we fiddle with TagSoup's Schema we do see some bad examples suddenly
>>> report the characters of the script tag. But, as in good tradition, other
>>> stuff breaks and things like meta fields in some other HTML examples no
>>> longer get reported.
>>>
>>> schema.elementType("script", HTMLSchema.M_ANY, 255, 0);
>>>
>>> Now, I don't even know if changing the schema is a good idea, or if
>>> there is some other setting in Tika I do not know of or forgot about.
>>>
>>> Anyone here having some ideas?
>>>
>>> Thanks,
>>> Markus
>>>
>>


Re: Script tag contents not always reported in ContentHandler

2024-05-29 Thread Markus Jelsma
So I found HtmlParser.setExtractScripts(); this sounds very promising!
I changed the code to use HtmlParser instead of AutoDetectParser and set
the flag to true. Unfortunately, the script's contents were still not
reported in the characters method. No idea why.

I also found TagSoup's Parser.CDATAElementsFeature
<https://javadoc.io/static/org.ccil.cowan.tagsoup/tagsoup/1.2.1/org/ccil/cowan/tagsoup/Parser.html#CDATAElementsFeature>
constant. It seems to be the same as
http://www.ccil.org/~cowan/tagsoup/features/cdata-elements: "A value of
'true' indicates that the parser will process the script and style
elements (or any elements with type='cdata' in the TSSL schema) as SGML
CDATA elements (that is, no markup is recognized except the matching
end-tag)."

Sounds promising, or at least something to try. But how exactly do we set
that parameter from code, or in tika-config.xml if that is better? It
isn't really obvious at the moment.

Many thanks,
Markus
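For what it's worth, setting that TagSoup feature from code might look like this (a sketch; it assumes the org.ccil.cowan.tagsoup 1.2.x jar is on the classpath, and it is not confirmed here that Tika's HtmlParser will actually use a reader configured this way):

```java
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.XMLReader;

public class CdataFeatureExample {
    public static void main(String[] args) throws Exception {
        XMLReader reader = new Parser();
        // via the constant from the javadoc linked above...
        reader.setFeature(Parser.CDATAElementsFeature, true);
        // ...or via the feature URI directly
        reader.setFeature("http://www.ccil.org/~cowan/tagsoup/features/cdata-elements", true);
    }
}
```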



On Tue, May 28, 2024 at 12:19, Markus Jelsma wrote:

> Hello,
>
> We're using Tika to parse HTML via a custom ContentHandler. This works
> really well. Except that in some cases we do not get the contents of script
> tags in the head reported in the characters() method in the ContentHandler.
>
> We're using this code:
> TikaConfig tikaConfig = new
> TikaConfig(SAXTestCase.class.getResourceAsStream("/tika-config.xml"));
> Schema schema = new HTMLSchema();
> ParseContext context = new ParseContext();
> context.set(Schema.class, schema);
> context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
> Metadata metadata = new Metadata();
> ReadableContentHandler handler = new ReadableContentHandler(url, config);
> AutoDetectParser parser = new AutoDetectParser(tikaConfig);
> InputStream stream = SAXTestCase.class.getResourceAsStream(path);
> parser.parse(stream, handler, metadata, context);
>
> If we fiddle with TagSoup's Schema we do see some bad examples suddenly
> report the characters of the script tag. But, as in good tradition, other
> stuff breaks and things like meta fields in some other HTML examples no
> longer get reported.
>
> schema.elementType("script", HTMLSchema.M_ANY, 255, 0);
>
> Now, I don't even know if changing the schema is a good idea, or if there
> is some other setting in Tika I do not know of or forgot about.
>
> Anyone here having some ideas?
>
> Thanks,
> Markus
>


Script tag contents not always reported in ContentHandler

2024-05-28 Thread Markus Jelsma
Hello,

We're using Tika to parse HTML via a custom ContentHandler. This works
really well. Except that in some cases we do not get the contents of script
tags in the head reported in the characters() method in the ContentHandler.

We're using this code:
TikaConfig tikaConfig = new
TikaConfig(SAXTestCase.class.getResourceAsStream("/tika-config.xml"));
Schema schema = new HTMLSchema();
ParseContext context = new ParseContext();
context.set(Schema.class, schema);
context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
Metadata metadata = new Metadata();
ReadableContentHandler handler = new ReadableContentHandler(url, config);
AutoDetectParser parser = new AutoDetectParser(tikaConfig);
InputStream stream = SAXTestCase.class.getResourceAsStream(path);
parser.parse(stream, handler, metadata, context);

If we fiddle with TagSoup's Schema we do see some bad examples suddenly
report the characters of the script tag. But, as in good tradition, other
stuff breaks and things like meta fields in some other HTML examples no
longer get reported.

schema.elementType("script", HTMLSchema.M_ANY, 255, 0);

Now, I don't even know if changing the schema is a good idea, or if there
is some other setting in Tika I do not know of or forgot about.

Anyone here having some ideas?

Thanks,
Markus


Re: metadata keys

2022-10-07 Thread Markus Jelsma
Ah, there are some differences this time, except for MboxParser, of course :)

Very nice to see this happening; it wasn't present/noticed in the other set:
tiff:ImageWidth,727519
tiff:ImageLength,727512

This time there are also quite a few with whitespace in the keys:
Dimension HorizontalPixelSize,166272
Dimension VerticalPixelSize,166272

Attempts to do some Javascript:

Re: Strange exif and tesseract exceptions since 2.x

2022-10-07 Thread Markus Jelsma
Hello Tim,

You are right, we can raise that class's log level. But in general I don't
believe it is great to write 4 to 5 big stack traces at the debug logging
level, as that level usually already produces a large number of log lines.

If the trace is not necessary for ExternalParser users to fix a problem,
maybe a single log line would suffice instead.

Thanks!
Markus
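As a stopgap, Tim's suggestion maps to a per-logger override; in a Log4j2 config it would look something like this (a sketch; adjust to whatever logging backend is actually in use):

```xml
<Loggers>
  <!-- keep DEBUG everywhere else, but hide ExternalParser's check() traces -->
  <Logger name="org.apache.tika.parser.external" level="error"/>
  <Root level="debug">
    <AppenderRef ref="Console"/>
  </Root>
</Loggers>
```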

On Wed, Oct 5, 2022 at 14:53, Tim Allison wrote:

> If you need DEBUG elsewhere, can you selectively turn logging for the
> ExternalParser to ERROR?  Or is there a fix you'd recommend on the
> Tika side?
>
> On Wed, Oct 5, 2022 at 7:22 AM Markus Jelsma 
> wrote:
> >
> > Hello,
> >
> > We use Tika embedded in our Java programs and recently upgraded from one
> of the last 1.x to 2.x, currently 2.4.1.
> >
> > Since then, with debug logging on, Tika spews out a few pretty big and
> partially repeating exceptions. This is not a real runtime problem, but
> just a distracting nuisance as my attention triggers when seeing stack
> traces.
> >
> > Is there something to do about it?
> >
> > This is the exif related trace:
> > 2022-10-05 13:16:42,136 DEBUG
> [TEST-SequenceBlockMarkerTest.testDierenforum-seed#[5F443E2359FE59DA]]
> external.ExternalParser (ExternalParser.java:172) - exit
> > value for ffmpeg: 0
> > 2022-10-05 13:16:42,140 DEBUG
> [TEST-SequenceBlockMarkerTest.testDierenforum-seed#[5F443E2359FE59DA]]
> external.ExternalParser (ExternalParser.java:180) - exception trying to run exiftool
> > java.io.IOException: Cannot run program "exiftool": error=2, No such
> file or directory
> >at java.lang.ProcessBuilder.start(ProcessBuilder.java:1128) ~[?:?]
> >at java.lang.ProcessBuilder.start(ProcessBuilder.java:1071) ~[?:?]
> >at java.lang.Runtime.exec(Runtime.java:592) ~[?:?]
> >at java.lang.Runtime.exec(Runtime.java:451) ~[?:?]
> >at
> org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:161)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.external.ExternalParsersConfigReader.readCheckTagAndCheck(ExternalParsersConfigReader.java:203)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.external.ExternalParsersConfigReader.readParser(ExternalParsersConfigReader.java:110)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:80)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:67)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:60)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:67)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:60)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:49)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:44)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:42)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:37)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method) ~[?:?]
> >at
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> ~[?:?]
> >at
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> ~[?:?]
> >at
> java.lang.reflect.Constructor.newInstance(Constructor.java:490) ~[?:?]
> >at java.lang.Class.newInstance(Class.java:584) ~[?:?]
> >at
> org.apache.tika.utils.ServiceLoaderUtils.newInstance(ServiceLoaderUtils.java:80)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:345)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.DefaultParser.getDefaultParsers(DefaultParser.java:105)
> ~[tika-core-2.4.1.jar:2.4.1]
> >at
> org.apache.tika.parser.DefaultParser.<init>(DefaultParse

Strange exif and tesseract exceptions since 2.x

2022-10-05 Thread Markus Jelsma
Hello,

We use Tika embedded in our Java programs and recently upgraded from one of
the last 1.x to 2.x, currently 2.4.1.

Since then, with debug logging on, Tika spews out a few pretty big and
partially repeating exceptions. This is not a real runtime problem, but
just a distracting nuisance, as my attention triggers when seeing stack
traces.

Is there something to do about it?

This is the exif related trace:
2022-10-05 13:16:42,136 DEBUG
[TEST-SequenceBlockMarkerTest.testDierenforum-seed#[5F443E2359FE59DA]]
external.ExternalParser (ExternalParser.java:172) - exit
value for ffmpeg: 0
2022-10-05 13:16:42,140 DEBUG
[TEST-SequenceBlockMarkerTest.testDierenforum-seed#[5F443E2359FE59DA]]
external.ExternalParser (ExternalParser.java:180) - exception trying to run exiftool
java.io.IOException: Cannot run program "exiftool": error=2, No such file
or directory
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1128) ~[?:?]
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1071) ~[?:?]
   at java.lang.Runtime.exec(Runtime.java:592) ~[?:?]
   at java.lang.Runtime.exec(Runtime.java:451) ~[?:?]
   at
org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:161)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.external.ExternalParsersConfigReader.readCheckTagAndCheck(ExternalParsersConfigReader.java:203)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.external.ExternalParsersConfigReader.readParser(ExternalParsersConfigReader.java:110)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:80)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:67)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.external.ExternalParsersConfigReader.read(ExternalParsersConfigReader.java:60)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:67)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:60)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:49)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.external.ExternalParsersFactory.create(ExternalParsersFactory.java:44)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:42)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.external.CompositeExternalParser.<init>(CompositeExternalParser.java:37)
~[tika-core-2.4.1.jar:2.4.1]
   at
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method) ~[?:?]
   at
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
~[?:?]
   at
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
~[?:?]
   at java.lang.reflect.Constructor.newInstance(Constructor.java:490)
~[?:?]
   at java.lang.Class.newInstance(Class.java:584) ~[?:?]
   at
org.apache.tika.utils.ServiceLoaderUtils.newInstance(ServiceLoaderUtils.java:80)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:345)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.DefaultParser.getDefaultParsers(DefaultParser.java:105)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:66)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.config.TikaConfig.getDefaultParser(TikaConfig.java:291)
~[tika-core-2.4.1.jar:2.4.1]
   at org.apache.tika.config.TikaConfig.access$900(TikaConfig.java:87)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.config.TikaConfig$ParserXmlLoader.createDefault(TikaConfig.java:878)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.config.TikaConfig$ParserXmlLoader.createDefault(TikaConfig.java:824)
~[tika-core-2.4.1.jar:2.4.1]
   at
org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:648)
~[tika-core-2.4.1.jar:2.4.1]
   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:170)
~[tika-core-2.4.1.jar:2.4.1]
   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:150)
~[tika-core-2.4.1.jar:2.4.1]
   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:142)
~[tika-core-2.4.1.jar:2.4.1]
   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:138)
~[tika-core-2.4.1.jar:2.4.1]
   at io.openindex.sax.SAXTestCase.getHandler(SAXTestCase.java:119)
~[test-classes/:?]
   at io.openindex.sax.SAXTestCase.getHandler(SAXTestCase.java:112)
~[test-classes/:?]
   at io.openindex.sax.SAXTestCase.ge

Re: metadata keys

2022-10-03 Thread Markus Jelsma
Hi Tim,

I would expect that many strange keys are actually present in the source
data, and are not due to an error somewhere in Tika or its dependencies.
Although mboxparser could have an issue somewhere.

But it might be an idea to map some bad keys to their proper counterparts,
such as keywords, content-type and friends.

Regards,
Markus
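The mapping idea could be a simple normalization pass over the raw keys; a sketch (the variants below are hypothetical examples, the real list would have to come from the aggregated sheets):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical normalizer: strip quoting noise and fold known variants
// onto one canonical metadata key.
public class KeyNormalizer {
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("content-type", "Content-Type");
        CANONICAL.put("content_type", "Content-Type");
        CANONICAL.put("keywords", "Keywords");
    }

    public static String normalize(String rawKey) {
        // drop escaped/smart quotes seen in the wild: \"keywords\" and ”keywords”
        String cleaned = rawKey.replaceAll("^[\\\\\"”“]+|[\\\\\"”“]+$", "");
        return CANONICAL.getOrDefault(cleaned.toLowerCase(Locale.ROOT), cleaned);
    }
}
```

Unknown keys pass through unchanged, so the mapping stays safe to apply across a whole corpus.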

On Mon, Oct 3, 2022 at 17:10, Tim Allison wrote:

> Thank you, Markus, for looking through these sheets.  There's a chance
> I botched the encodings in transferring data from one location to
> another.  Let me take another look, and yes, we've got to make some
> improvements to the mbox parser.
>
> More digging for me to do on the data and your findings!
>
> Thank you!
>
> On Mon, Oct 3, 2022 at 10:56 AM Markus Jelsma
>  wrote:
> >
> > Hi,
> >
> > These aggregations of large real world sets are always interesting to
> look through. Especially because they are bound to have a lot of garbage
> and peculiarities. There are probably some badly chosen key names, and very
> likely many programming errors.
> >
> > Some interesting examples:
> >
> > what is this:
> > Выберите_расширение_для_паковки
> >
> > the usual mixing of double-colon variants, there are also many escaped
> quotes:
> > ”keywords” and \"keywords\"
> >
> > these two are identical, but given a large enough set, they might not be:
> > height 512205
> > width 512205
> >
> > mboxparser spews out a lot of garbage, incredible:
> > MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
> > MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
> > MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3
> >
> > really, it does:
> > MboxParser-_blank">http 3
> >
> MboxParser-a-aa-azzz-aaz-azzzazazazaz 3
> > MboxParser-a-aa-azzz-a-aazazzazzz 3
> >
> > non-Latin scripts are expected, this is simplified Chinese:
> > if:头像和分页采用圆形样式 (translation: Avatars and pagination in a circular style
> (?))
> >
> > perhaps shortest possible key name:
> > T 4
> >
> > mboxparser, again, this time with XML tags:
> > MboxParser-ype>state
> > MboxParser-ype>university 4
> >
> > the set seems to contain stuff from adult sites:
> > xhamster-site-verification
> >
> > for some reason, the Dutch government always pops up in large sets:
> > custom:OVERHEID.Informatietype/DC.type  13
> > custom:OVERHEID.Organisatietype/OVERHEID.organisationType   13
> >
> > there are 18 different ways to spell/use Content-Type, of which four
> are, of course, with mboxparser:
> > Content-Type 6612729
> > content_type 14
> > \"Content-Type\" 9
> > \"content-type\" 5
> >
> > the inevitable encoding error:
> > pdf:docinfo:custom:-ý§ Q 10
> > pagerankâ„¢ 50
> >
> > what.is.this:
> > Laisv371DiskusijuIrK363rybosForumas 4
> >
> > hey, another contender for the shortest key name:
> > M 4
> >
> > there are 67 unique dcterms key names, but their counts are not very
> high:
> > DCTERMS.title   44
> > dcterms.title   26
> > dcterms:title   13
> > dcterms.Title   3
> >
> > there is also a Content-Type in Russian:
> > Тип-содержимое 3
> >
> > someone wants to remove your dust:
> > Dust_Removal_Data 339
> >
> > there are 908 unique unknown tags, no idea what that is:
> > Exif_IFD0:Unknown_tag_(0x8482)  36
> > Unknown_tag_(0x00bf)36
> > Exif_SubIFD:Unknown_tag_(0x9009)35
> > Unknown_tag_(0x00a0)35
> > Unknown_tag_(0x050e)35
> >
> > ah, the winner of the shortest key name (line 2235):
> > 71
> >
> > longest key, guess who:
> > MboxParser-
> http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps
>   3
> >
> > Besides Latin, Japanese and Chinese, Cyrillic is also present. But the
> > six most frequently used Arabic symbols are not present, I wonder why.
> > There is one RTL script present, Hebrew. It is always strange to meet
> > terms/words of RTL scripts in an otherwise generally LTR world.
> >
> > I was a bit disappointed not to find any obscene terms. The set seemed
> to be large enough for at least some ge

Re: metadata keys

2022-10-03 Thread Markus Jelsma
Hi,

These aggregations of large real world sets are always interesting to look
through. Especially because they are bound to have a lot of garbage and
peculiarities. There are probably some badly chosen key names, and very
likely many programming errors.

Some interesting examples:

what is this:
Выберите_расширение_для_паковки

the usual mixing of double-colon variants, there are also many escaped
quotes:
”keywords” and \"keywords\"

these two are identical, but given a large enough set, they might not be:
height 512205
width 512205

mboxparser spews out a lot of garbage, incredible:
MboxParser- $b!zf|!!;~![#1#07n#2#2f|!j6b!k#1#4;~h>!a#1#7;~h>!r$=$n8e 3
MboxParser- $b"($3$n%a!<%k$o!"4x@>#i#t6&f1bn$x$4;22cd 3
MboxParser- $b"(?=$79~$_!&%"%/%;%9ey!">\ 3

really, it does:
MboxParser-_blank">http 3
MboxParser-a-aa-azzz-aaz-azzzazazazaz 3
MboxParser-a-aa-azzz-a-aazazzazzz 3

non-Latin scripts are expected, this is simplified Chinese:
if:头像和分页采用圆形样式 (translation: Avatars and pagination in a circular style (?))

perhaps shortest possible key name:
T 4

mboxparser, again, this time with XML tags:
MboxParser-ype>state
MboxParser-ype>university 4

the set seems to contain stuff from adult sites:
xhamster-site-verification

for some reason, the Dutch government always pops up in large sets:
custom:OVERHEID.Informatietype/DC.type  13
custom:OVERHEID.Organisatietype/OVERHEID.organisationType   13

there are 18 different ways to spell/use Content-Type, of which four are,
of course, with mboxparser:
Content-Type 6612729
content_type 14
\"Content-Type\" 9
\"content-type\" 5

the inevitable encoding error:
pdf:docinfo:custom:-ý§ Q 10
pagerankâ„¢ 50

what.is.this:
Laisv371DiskusijuIrK363rybosForumas 4

hey, another contender for the shortest key name:
M 4

there are 67 unique dcterms key names, but their counts are not very high:
DCTERMS.title   44
dcterms.title   26
dcterms:title   13
dcterms.Title   3

there is also a Content-Type in Russian:
Тип-содержимое 3

someone wants to remove your dust:
Dust_Removal_Data 339

there are 908 unique unknown tags, no idea what that is:
Exif_IFD0:Unknown_tag_(0x8482)  36
Unknown_tag_(0x00bf)    36
Exif_SubIFD:Unknown_tag_(0x9009)    35
Unknown_tag_(0x00a0)    35
Unknown_tag_(0x050e)    35

ah, the winner of the shortest key name (line 2235):
71

longest key, guess who:
MboxParser-http://www.facebook.com/donnakuhnarthttps://www.flickr.com/photos/donnakuhnhttp://picassogirl.tumblr.comhttps://twitter.com/digitalaardvarkhttps://plus.google.com/+digitalaardvarkshttps://www.linkedin.com/in/donnakuhnhttp://www.saatchionline.com/donnakuhnhttp://pinterest.com/sarcasthttps 3

Besides Latin, Japanese and Chinese, Cyrillic is also present. But the six
most frequently used Arabic symbols are not present, I wonder why. There
is one RTL script present, Hebrew. It is always strange to meet terms/words
of RTL scripts in an otherwise generally LTR world.

I was a bit disappointed not to find any obscene terms. The set seemed to
be large enough for at least some general curse words.

MboxParser is the real winner with 1763 unique keys, this is really absurd!

Thanks, this was fun!
Markus

On Mon, Oct 3, 2022 at 15:26, Tim Allison wrote:

> All,
>
>   I recently extracted metadata keys from 1 million files in our
> regression corpus and did a group by.  This allows insight into common
> metadata keys.
>
>   I've included two views, one looks at overall counts, and the other
> breaks down metadata keys by mime type.
>
>   Please let us know if you find anything interesting or have any
> questions.
>
> https://corpora.tika.apache.org/base/share/metadata-keys-overall-1m.txt.gz
> https://corpora.tika.apache.org/base/share/metadata-keys-by-mime-1m.txt.gz
>
>Best,
>
> Tim
>


Re: Upgrading to 2.x, ClassNotFoundException: o.a.t.io.CloseShieldInputStream

2022-02-09 Thread Markus Jelsma
Yes, these are the imports:


<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>2.3.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.3.0</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-scientific-package</artifactId>
  <version>2.3.0</version>
</dependency>

On Wed, Feb 9, 2022 at 14:13, Tim Allison wrote:

> Is tika-core on your classpath?
>
> On Wed, Feb 9, 2022 at 8:03 AM Markus Jelsma 
> wrote:
> >
> > Hi again,
> >
> > I am resuming the upgrade from 1.26 to 2.3.0 and removed the
> tika-parsers dependency from my pom, and instead added two new
> dependencies: tika-parsers-standard-package and
> tika-parser-scientific-package.
> >
> > It compiles without issues, but unit tests won't run and exit with:
> >
> > java.lang.RuntimeException: Unable to load
> org.apache.tika.parser.pkg.ZipContainerDetector
> > at
> __randomizedtesting.SeedInfo.seed([F582ED2E2896A6A8:C5114DED19A8367F]:0)
> > at
> org.apache.tika.config.LoadErrorHandler$3.handleLoadError(LoadErrorHandler.java:65)
> > at
> org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:358)
> > at
> org.apache.tika.detect.DefaultDetector.getDefaultDetectors(DefaultDetector.java:90)
> > at
> org.apache.tika.detect.DefaultDetector.<init>(DefaultDetector.java:50)
> > at
> org.apache.tika.detect.DefaultDetector.<init>(DefaultDetector.java:55)
> > at
> org.apache.tika.config.TikaConfig.getDefaultDetector(TikaConfig.java:264)
> > at
> org.apache.tika.config.TikaConfig$DetectorXmlLoader.createDefault(TikaConfig.java:1017)
> > at
> org.apache.tika.config.TikaConfig$DetectorXmlLoader.createDefault(TikaConfig.java:975)
> > at
> org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:630)
> > at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:155)
> > at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:141)
> > at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:133)
> > at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:129)
> > .
> > Caused by: java.lang.NoClassDefFoundError:
> org/apache/tika/io/CloseShieldInputStream
> > at
> org.apache.tika.parser.pkg.ZipContainerDetector.<init>(ZipContainerDetector.java:99)
> > at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> > at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> > at
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > at
> java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
> > at java.base/java.lang.Class.newInstance(Class.java:584)
> > at
> org.apache.tika.utils.ServiceLoaderUtils.newInstance(ServiceLoaderUtils.java:80)
> > at
> org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:345)
> > ... 36 more
> > Caused by: java.lang.ClassNotFoundException:
> org.apache.tika.io.CloseShieldInputStream
> > at
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
> > at
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
> > at
> java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
> > ... 44 more
> >
> > Any hints to share on this one?
> >
> > Many thanks!
> > Markus
> >
>


Upgrading to 2.x, ClassNotFoundException: o.a.t.io.CloseShieldInputStream

2022-02-09 Thread Markus Jelsma
Hi again,

I am resuming the upgrade from 1.26 to 2.3.0 and removed the tika-parsers
dependency from my pom, and instead added two new dependencies:
tika-parsers-standard-package and tika-parser-scientific-package.

It compiles without issues, but unit tests won't run and exit with:

java.lang.RuntimeException: Unable to load
org.apache.tika.parser.pkg.ZipContainerDetector
at
__randomizedtesting.SeedInfo.seed([F582ED2E2896A6A8:C5114DED19A8367F]:0)
at
org.apache.tika.config.LoadErrorHandler$3.handleLoadError(LoadErrorHandler.java:65)
at
org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:358)
at
org.apache.tika.detect.DefaultDetector.getDefaultDetectors(DefaultDetector.java:90)
at
org.apache.tika.detect.DefaultDetector.<init>(DefaultDetector.java:50)
at
org.apache.tika.detect.DefaultDetector.<init>(DefaultDetector.java:55)
at
org.apache.tika.config.TikaConfig.getDefaultDetector(TikaConfig.java:264)
at
org.apache.tika.config.TikaConfig$DetectorXmlLoader.createDefault(TikaConfig.java:1017)
at
org.apache.tika.config.TikaConfig$DetectorXmlLoader.createDefault(TikaConfig.java:975)
at
org.apache.tika.config.TikaConfig$XmlLoader.loadOverall(TikaConfig.java:630)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:155)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:141)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:133)
at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:129)
.
Caused by: java.lang.NoClassDefFoundError:
org/apache/tika/io/CloseShieldInputStream
at
org.apache.tika.parser.pkg.ZipContainerDetector.<init>(ZipContainerDetector.java:99)
at
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at
java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at java.base/java.lang.Class.newInstance(Class.java:584)
at
org.apache.tika.utils.ServiceLoaderUtils.newInstance(ServiceLoaderUtils.java:80)
at
org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:345)
... 36 more
Caused by: java.lang.ClassNotFoundException:
org.apache.tika.io.CloseShieldInputStream
at
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 44 more

Any hints to share on this one?

Many thanks!
Markus


Re: [ANNOUNCE] Apache Tika 2.0.0-BETA released

2021-05-28 Thread Markus Jelsma
Hello Tim,

I wanted to try building one of our projects against Tika 2.0.0-BETA, but the
standard and extended jars did not exist yet last Wednesday; I had no
trouble finding the core jar. Now, they are still missing at:

https://repo1.maven.org/maven2/org/apache/tika/tika-parsers-extended/2.0.0-BETA/
https://repo1.maven.org/maven2/org/apache/tika/tika-parsers-standard/2.0.0-BETA/

Any ideas?

Thanks,
Markus


Op wo 26 mei 2021 om 12:54 schreef Tim Allison :

> The Apache Tika project is pleased to announce the release of Apache
> Tika 2.0.0-BETA. The release contents have been pushed out to the main
> Apache release site and to the Maven Central sync, so the releases
> should be available as soon as the mirrors get the syncs.
>
> Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser
> libraries.
>
> Apache Tika 2.0.0-BETA contains a number of improvements and bug fixes.
> Details can be found in the changes file:
> https://www.apache.org/dist/tika/CHANGES-2.0.0-BETA.txt
>
> Apache Tika is available on the download page:
> https://tika.apache.org/download.html
>
> Apache Tika is also available in binary form or for use using Maven 2
> from the Central Repository:
> https://repo1.maven.org/maven2/org/apache/tika/
>
> In the initial 48 hours, the release may not be available on all mirrors.
>
> When downloading from a mirror site, please remember to verify the
> downloads using signatures found:
> https://www.apache.org/dist/tika/KEYS
>
> For more information on Apache Tika, visit the project home page:
> https://tika.apache.org/
>
> -- Tim Allison, on behalf of the Apache Tika community
>


Re: Title extract logic

2021-04-23 Thread Markus Jelsma
If you are doing web crawling, you can collect the anchor texts of the
hyperlinks that point to the PDF. That text is usually very descriptive and
can be used as the title for the PDF.

Op do 22 apr. 2021 om 16:36 schreef Nicholas DiPiazza <
nicholas.dipia...@gmail.com>:

> Yeah you're right! Thanks for pointing that out I sent a bad example.
>
> So my results after parsing I try to show *Title*
> *(filename) *
>
> It makes for a much better document in a search result. But unfortunately,
> it's all too-often set to something like
>
> *- 4 - (myfilename.pdf)*
>
> > Another trick is to use the most common hyperlink anchor,
>
> Can you elaborate on this one?
>
>
> On Thu, Apr 22, 2021 at 5:44 AM Markus Jelsma 
> wrote:
>
>> Hello Nicholas,
>>
>> The PDF you link to has a decent title in its metadata, but if it isn't
>> there, i would not rely on the first N characters of the content, as it is
>> very unreliable. You can find all kinds of bad markup right at the start of
>> PDFs.
>>
>> But there is a choice, you can still use the raw filename, which is fine
>> in most cases, and usually prettier to read than the first N characters.
>> Another trick is to use the most common hyperlink anchor, which is most of
>> the times very readable and descriptive.
>>
>> Regards,
>> Markus
>>
>> Op wo 21 apr. 2021 om 18:02 schreef Nicholas DiPiazza <
>> nicholas.dipia...@gmail.com>:
>>
>>> Hi Tika Users:
>>>
>>> Does Tika have any built-in Title extract logic?
>>>
>>> I am currently using a simple algorithm that:
>>>
>>> 1) Checks metadata for a title. Use that if there.
>>> 2) If no title metadata, then use the body text. Extract the first line
>>> of the body text and use that as the title.
>>>
>>> Let's take this PDF for example:
>>> https://www.fdic.gov/regulations/reform/resplans/plans/icicibank-165-1612.pdf
>>>
>>> That results in
>>>
>>> - 4 -
>>>
>>> as a title. Not great, right? Ha!
>>>
>>> So then I add something like:
>>>
>>> 3) If the first line has < 5 alpha num characters, go to the next line
>>> until you find a title.
>>>
>>> That works in this case but doesn't work for many other cases.
>>>
>>> What are others doing for title extraction? I would imagine there's no
>>> perfect solution here. Just curious what ya'll are doing to troubleshoot
>>> this stuff.
>>>
>>


Re: Title extract logic

2021-04-22 Thread Markus Jelsma
Hello Nicholas,

The PDF you link to has a decent title in its metadata, but when that is
missing, I would not rely on the first N characters of the content, as it is
very unreliable. You can find all kinds of bad markup right at the start of
PDFs.

There are alternatives, though: you can still use the raw filename, which is fine
in most cases and usually prettier to read than the first N characters.
Another trick is to use the most common hyperlink anchor, which is most of
the time very readable and descriptive.
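The anchor-counting trick can be sketched as a small helper that tallies the anchor texts of the inlinks and returns the most frequent non-generic one. This is only an illustration of the idea, not code from Tika; the "click here" stop phrase is an assumed example of a generic anchor you would want to skip.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class AnchorTitle {
    // Returns the most frequent usable anchor text, if any.
    public static Optional<String> mostCommonAnchor(Collection<String> anchors) {
        Map<String, Integer> counts = new HashMap<>();
        for (String anchor : anchors) {
            String norm = anchor == null ? "" : anchor.trim();
            if (norm.isEmpty() || norm.equalsIgnoreCase("click here")) {
                continue; // skip empty and generic anchors
            }
            counts.merge(norm, 1, Integer::sum);
        }
        // pick the anchor text with the highest count
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

In a crawler you would feed this the anchor texts gathered from all pages linking to the PDF, and fall back to the filename when it returns empty.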

Regards,
Markus

Op wo 21 apr. 2021 om 18:02 schreef Nicholas DiPiazza <
nicholas.dipia...@gmail.com>:

> Hi Tika Users:
>
> Does Tika have any built-in Title extract logic?
>
> I am currently using a simple algorithm that:
>
> 1) Checks metadata for a title. Use that if there.
> 2) If no title metadata, then use the body text. Extract the first line of
> the body text and use that as the title.
>
> Let's take this PDF for example:
> https://www.fdic.gov/regulations/reform/resplans/plans/icicibank-165-1612.pdf
>
> That results in
>
> - 4 -
>
> as a title. Not great, right? Ha!
>
> So then I add something like:
>
> 3) If the first line has < 5 alpha num characters, go to the next line
> until you find a title.
>
> That works in this case but doesn't work for many other cases.
>
> What are others doing for title extraction? I would imagine there's no
> perfect solution here. Just curious what ya'll are doing to troubleshoot
> this stuff.
>


RE: Extract URLs from a document

2020-11-12 Thread Markus Jelsma
Hello,

Tika already comes with a handler for collecting links: see the 
LinkContentHandler [1]. Hyperlinks in PDFs are reported as anchors and can be 
picked up by this handler. We use it to collect links from any file type as if 
they were all HTML files.

Regards,
Markus

[1] https://tika.apache.org/1.19/api/org/apache/tika/sax/LinkContentHandler.html
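For reference, a minimal sketch of wiring this up with the Tika 1.x API; the file name is a placeholder, and as Nick notes below, a TeeContentHandler can combine it with a normal text handler.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.Link;
import org.apache.tika.sax.LinkContentHandler;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        LinkContentHandler links = new LinkContentHandler();
        try (InputStream stream = Files.newInputStream(Paths.get("document.pdf"))) {
            // Works for any supported type; PDF hyperlinks come through as anchors
            new AutoDetectParser().parse(stream, links, new Metadata(), new ParseContext());
        }
        for (Link link : links.getLinks()) {
            System.out.println(link.getUri() + "\t" + link.getText());
        }
    }
}
```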
 
-Original message-
> From:Nick Burch 
> Sent: Thursday 12th November 2020 12:55
> To: nensick 
> Cc: user@tika.apache.org
> Subject: Re: Extract URLs from a document
> 
> On Wed, 11 Nov 2020, nensick wrote:
> > I am exploring the available features and I managed also to extract 
> > Office macros but I still don't find a way to get the links.
> >
> > Imagine to have a PDF, a DOCX in which you have a "click here" text as a 
> > link pointing
> > to a website (let's say example[.]com). How can I get example[.].com?
> 
> If you were calling the Java directly, it would be fairly easy - just 
> provide your own content handler that only captures the  tags and 
> records the href attributes of those. You can use the Tee content handler 
> to have a normal text-extraction handler called as well as your 
> link-capturing one
> 
> From the Tika Server, it's not quite so simple. I'd probably just say ask 
> the Tika Server for the xhtml version of your document (instead of the 
> plain text one), then use the xml parsing in your calling language to grab 
> the links from the a tags. Depending on your needs, either call the Tika 
> Server twice, once for xhtml to get tags and once for plain text, or just 
> once for xhtml and process the results twice
> 
> Nick
> 


RE: Unable to parse PDF due to NoSuchFieldError: HAS_XMP

2020-03-02 Thread Markus Jelsma
Hello Tim,

Good find. I had left one module somewhere on tika-core 1.22. With that 
fixed, I can parse PDFs again.

Many thanks,
Markus


 
-Original message-
> From:Tim Allison 
> Sent: Monday 2nd March 2020 16:50
> To: user@tika.apache.org
> Subject: Re: Unable to parse PDF due to NoSuchFieldError: HAS_XMP
> 
> Y, that's a Tika field.  Is there a chance that your tika-parsers version does 
> not match your tika-core version?  Which versions of each are you using?  
> 
> If this is a problem with Tika, we'll have time to fix it before the 1.24 
> release...coming soon... 
> 
> Cheers, 
> 
>     Tim
> 
> On Mon, Mar 2, 2020 at 9:44 AM Markus Jelsma  <mailto:markus.jel...@openindex.io>> wrote:
> Hello,
 
> 
 
> I recently upgraded to the latest Tika and am no longer able to parse PDF, at 
> least the 6 files i just tested, due to:
 
> 
 
> Caused by: java.lang.NoSuchFieldError: HAS_XMP
 
>         at 
> org.apache.tika.parser.pdf.PDMetadataExtractor.extract(PDMetadataExtractor.java:60)
 
>         at 
> org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:227)
 
>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:147)
 
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
 
>         at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
 
>         at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
 
> 
 
> Trying to work-around the problem i upgraded PDFBox from 2.0.17 to 2.0.19, 
> but this did not help.
 
> 
 
> There are no other PDFBox libraries anywhere on the classpath.
 
> 
 
> Any suggestions?
 
> 
 
> Many thanks,
 
> Markus
 


Unable to parse PDF due to NoSuchFieldError: HAS_XMP

2020-03-02 Thread Markus Jelsma
Hello,

I recently upgraded to the latest Tika and am no longer able to parse PDFs, at 
least not the 6 files I just tested, due to:

Caused by: java.lang.NoSuchFieldError: HAS_XMP
at 
org.apache.tika.parser.pdf.PDMetadataExtractor.extract(PDMetadataExtractor.java:60)
at 
org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:227)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:147)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)

Trying to work around the problem, I upgraded PDFBox from 2.0.17 to 2.0.19, but 
this did not help.

There are no other PDFBox libraries anywhere on the classpath.

Any suggestions?

Many thanks,
Markus


RE: [VOTE] Release Apache Tika 1.23 Candidate #1

2019-11-28 Thread Markus Jelsma
+1!

All tests pass and i can seamlessly update our internal software to 1.23.

Thanks!
 
-Original message-
> From:Tim Allison 
> Sent: Tuesday 26th November 2019 22:34
> To:  ; user@tika.apache.org
> Subject: [VOTE] Release Apache Tika 1.23 Candidate #1
> 
> All,  
> 
> A candidate for the Tika 1.23 release is available at:
>   https://dist.apache.org/repos/dist/dev/tika/ 
> 
> 
> The release candidate is a zip archive of the sources in:
>   https://github.com/apache/tika/tree/1.23-rc1/ 
> 
> 
> The SHA-512 checksum of the archive is
>   
> b0c277216e05c90f3cc40f591ef5d92707e94b47b54da0503bd54c0a3bdc1df41c63b0f996529206bca87afa28f6b62300113514959ac2470405b764094f9f8b.
> 
> In addition, a staged maven repository is available here:
>   
> https://repository.apache.org/content/repositories/orgapachetika-1056/org/apache/tika
>  
> 
> 
> Please vote on releasing this package as Apache Tika 1.23.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Tika 1.23
> [ ] -1 Do not release this package because...
> 
> This is my first time building on Ubuntu...please do look carefully! 
> 
> Here's my +1. 
> 
> Cheers, 
> 
>          Tim 


RE: How to increase ZIP bomb maximum depth

2019-08-26 Thread Markus Jelsma
Hello Jukka,

This is a customer's WYSIWYG output, and in my opinion it is an error that it 
generated such a deeply nested structure. So no, this is not a valid document. 
A depth of 100 could well occur in valid real documents, but I have never seen 
that before, and I have seen thousands of unique sites.

I think 100 is fine; I just looked for a way to work around it by 
configuration, without bothering the customer's customer with it.

Thanks,
Markus
 
-Original message-
> From:Jukka Zitting 
> Sent: Monday 26th August 2019 19:48
> To: Tika Users ; talli...@apache.org
> Subject: Re: How to increase ZIP bomb maximum depth
> 
> Hi, 
> 
> I wonder if we should just increase the default thresholds to allow deeper 
> nesting before the exception gets thrown. The defaults should be tuned to 
> make the false-positive rate as low as possible without opening the door for 
> false negatives that could result in denial-of-service attacks. 
> 
> The package-entry depth limit added in 
> https://issues.apache.org/jira/browse/TIKA-741 
> <https://issues.apache.org/jira/browse/TIKA-741> should make it OK to 
> increase the default maxDepth from 100 to say 200 if people are hitting this 
> limit with valid documents. 
> 
> Markus, what kind of documents are triggering the exception for you? What 
> would be a good maxDepth setting for your case? 
> 
> Best, 
> 
> Jukka 
> 
> 
> On Mon, Aug 26, 2019 at 1:40 PM Tim Allison  <mailto:talli...@apache.org>> wrote:
> Oh, ok.  This is helpful.  Got it.  The AutoDetectParser automatically
 
> wraps the incoming handler in a SecureContentHandler.  Some options...
 
> 
 
> 1) We could have the AutoDetectParser skip wrapping a
 
> SecureContentHandler around the incoming handler if the user calls
 
> parse with a SecureContentHandler...
 
> 2) We could add SecureContentHandler parameter settings to the
 
> AutoDetectParser, and it would configure the SecureContentHandler
 
> accordingly...I think there are a few subtleties, but this might get
 
> you configurability via tika-config.xml.
 
> 
 
> I'm not offering static thresholds on the SecureContentHandler. :D
 
> 
 
> Fellow devs, how else might we make this work and make it configurable
 
> via tika-config.xml?
 
> 
 
> Cheers,
 
> 
 
>            Tim
 
> 
 
> 
 
> On Mon, Aug 26, 2019 at 1:24 PM Markus Jelsma
 
> mailto:markus.jel...@openindex.io>> wrote:
 
> >
 
> > Hello Tim,
 
> >
 
> > I use Tika embedded in another Java application. passing it a custom 
> > ContentHandler which collects interesting stuff, which we, after the parse, 
> > use to construct meaningful text.
 
> >
 
> >     ReadableContentHandler handler = new ReadableContentHandler(url, 
> >config);
 
> >
 
> >     AutoDetectParser parser = new AutoDetectParser(tikaConfig);
 
> >     parser.parse(stream, handler,  new Metadata(), context);
 
> >
 
> > My ContentHandler does not extend SecureContentHandler so i never have a 
> > chance to pass some different value for the nesting limit check.
 
> >
 
> > Many thanks,
 
> > Markus
 
> >
 
> > -Original message-
 
> > > From:Tim Allison mailto:talli...@apache.org>>
 
> > > Sent: Monday 26th August 2019 19:11
 
> > > To: user@tika.apache.org <mailto:user@tika.apache.org>
 
> > > Subject: Re: How to increase ZIP bomb maximum depth
 
> > >
 
> > > Hi Markus,
 
> > >
 
> > >   This requires some work...the zip bomb protections are currently
 
> > > handled by the handler.  We allow for configuration of the parsers,
 
> > > detectors, charset detectors, but not yet the handlers.  IIRC, we've
 
> > > talked a bit about specifying a custom handler via the commandline at
 
> > > least in tika-server.  I wonder if we should allow for a default
 
> > > handler configuration that would specify a handler to be used by the
 
> > > facade Tika.parse(inputStream)?
 
> > >
 
> > >   Fellow devs have any recommendations?
 
> > >
 
> > >   How are you currently calling Tika?  Via tika-server, Solrs DIH or
 
> > > something else?
 
> > >
 
> > >           Best,
 
> > >
 
> > >                 Tim
 
> > >
 
> > > On Mon, Aug 26, 2019 at 11:20 AM Markus Jelsma
 
> > > mailto:markus.jel...@openindex.io>> wrote:
 
> > > >
 
> > > > Hello,
 
> > > >
 
> > > > I've been looking around to increase the limit, but i don't seem to be 
> > > > able to find how. I know there is a setter for it, but using 
> > > > AutoDetectParser, i'd like to set it via tika-config. I haven't seen a 
> > > > parameter for tika-config that would set that value and the manual on 
> > > > Configuring Tika doesn't mention it.
 
> > > >
 
> > > > Many thanks,
 
> > > > Markus
 
> > > >
 
> > > >
 
> > >
 


RE: How to increase ZIP bomb maximum depth

2019-08-26 Thread Markus Jelsma
Thanks! And indeed, no statics please.

Markus
 
-Original message-
> From:Tim Allison 
> Sent: Monday 26th August 2019 19:40
> To: user@tika.apache.org
> Subject: Re: How to increase ZIP bomb maximum depth
> 
> Oh, ok.  This is helpful.  Got it.  The AutoDetectParser automatically
> wraps the incoming handler in a SecureContentHandler.  Some options...
> 
> 1) We could have the AutoDetectParser skip wrapping a
> SecureContentHandler around the incoming handler if the user calls
> parse with a SecureContentHandler...
> 2) We could add SecureContentHandler parameter settings to the
> AutoDetectParser, and it would configure the SecureContentHandler
> accordingly...I think there are a few subtleties, but this might get
> you configurability via tika-config.xml.
> 
> I'm not offering static thresholds on the SecureContentHandler. :D
> 
> Fellow devs, how else might we make this work and make it configurable
> via tika-config.xml?
> 
> Cheers,
> 
>Tim
> 
> 
> On Mon, Aug 26, 2019 at 1:24 PM Markus Jelsma
>  wrote:
> >
> > Hello Tim,
> >
> > I use Tika embedded in another Java application. passing it a custom 
> > ContentHandler which collects interesting stuff, which we, after the parse, 
> > use to construct meaningful text.
> >
> > ReadableContentHandler handler = new ReadableContentHandler(url, 
> > config);
> >
> > AutoDetectParser parser = new AutoDetectParser(tikaConfig);
> > parser.parse(stream, handler,  new Metadata(), context);
> >
> > My ContentHandler does not extend SecureContentHandler so i never have a 
> > chance to pass some different value for the nesting limit check.
> >
> > Many thanks,
> > Markus
> >
> > -Original message-
> > > From:Tim Allison 
> > > Sent: Monday 26th August 2019 19:11
> > > To: user@tika.apache.org
> > > Subject: Re: How to increase ZIP bomb maximum depth
> > >
> > > Hi Markus,
> > >
> > >   This requires some work...the zip bomb protections are currently
> > > handled by the handler.  We allow for configuration of the parsers,
> > > detectors, charset detectors, but not yet the handlers.  IIRC, we've
> > > talked a bit about specifying a custom handler via the commandline at
> > > least in tika-server.  I wonder if we should allow for a default
> > > handler configuration that would specify a handler to be used by the
> > > facade Tika.parse(inputStream)?
> > >
> > >   Fellow devs have any recommendations?
> > >
> > >   How are you currently calling Tika?  Via tika-server, Solr's DIH or
> > > something else?
> > >
> > >   Best,
> > >
> > > Tim
> > >
> > > On Mon, Aug 26, 2019 at 11:20 AM Markus Jelsma
> > >  wrote:
> > > >
> > > > Hello,
> > > >
> > > > I've been looking around to increase the limit, but i don't seem to be 
> > > > able to find how. I know there the setter for it, but using 
> > > > AutoDetectParser, i'd like to set it via tika-config. I haven't seen a 
> > > > parameter for tika-config that would set that value and the manual on 
> > > > Configuring Tika doesn't mention it.
> > > >
> > > > Many thanks,
> > > > Markus
> > > >
> > > >
> > >
> 


RE: How to increase ZIP bomb maximum depth

2019-08-26 Thread Markus Jelsma
Hello Tim,

I use Tika embedded in another Java application, passing it a custom 
ContentHandler that collects the interesting parts, which we use after the 
parse to construct meaningful text.

ReadableContentHandler handler = new ReadableContentHandler(url, config);

AutoDetectParser parser = new AutoDetectParser(tikaConfig);
parser.parse(stream, handler,  new Metadata(), context);

My ContentHandler does not extend SecureContentHandler so i never have a chance 
to pass some different value for the nesting limit check. 
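As discussed in this thread, there was no tika-config knob for this at the time. As a sketch only (not the official route), one can bypass AutoDetectParser's automatic SecureContentHandler wrapping by using the composite parser from TikaConfig directly and supplying one's own SecureContentHandler with a raised limit; the file name, content type, and depth value below are placeholder assumptions.

```java
import java.nio.file.Paths;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.SecureContentHandler;

public class DeepNestingParse {
    public static void main(String[] args) throws Exception {
        TikaConfig tikaConfig = new TikaConfig();
        // Unlike AutoDetectParser.parse(), the composite parser from TikaConfig
        // does not wrap the handler itself, so our SecureContentHandler applies.
        Parser parser = tikaConfig.getParser();
        try (TikaInputStream stream = TikaInputStream.get(Paths.get("nested.html"))) {
            Metadata metadata = new Metadata();
            metadata.set(Metadata.CONTENT_TYPE, "text/html"); // no auto-detection here
            SecureContentHandler handler =
                    new SecureContentHandler(new BodyContentHandler(-1), stream);
            handler.setMaximumDepth(500); // raise the XML nesting limit (default 100)
            ParseContext context = new ParseContext();
            context.set(Parser.class, parser);
            parser.parse(stream, handler, metadata, context);
        }
    }
}
```

The trade-off is losing content-type auto-detection, so the type has to be set (or detected separately) before parsing.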

Many thanks,
Markus

-Original message-
> From:Tim Allison 
> Sent: Monday 26th August 2019 19:11
> To: user@tika.apache.org
> Subject: Re: How to increase ZIP bomb maximum depth
> 
> Hi Markus,
> 
>   This requires some work...the zip bomb protections are currently
> handled by the handler.  We allow for configuration of the parsers,
> detectors, charset detectors, but not yet the handlers.  IIRC, we've
> talked a bit about specifying a custom handler via the commandline at
> least in tika-server.  I wonder if we should allow for a default
> handler configuration that would specify a handler to be used by the
> facade Tika.parse(inputStream)?
> 
>   Fellow devs have any recommendations?
> 
>   How are you currently calling Tika?  Via tika-server, Solr's DIH or
> something else?
> 
>   Best,
> 
> Tim
> 
> On Mon, Aug 26, 2019 at 11:20 AM Markus Jelsma
>  wrote:
> >
> > Hello,
> >
> > I've been looking around to increase the limit, but i don't seem to be able 
> > to find how. I know there is a setter for it, but using AutoDetectParser, 
> > i'd like to set it via tika-config. I haven't seen a parameter for 
> > tika-config that would set that value and the manual on Configuring Tika 
> > doesn't mention it.
> >
> > Many thanks,
> > Markus
> >
> >
> 


How to increase ZIP bomb maximum depth

2019-08-26 Thread Markus Jelsma
Hello,

I've been looking around for a way to increase the limit, but I don't seem to be 
able to find it. I know there is a setter for it, but since we use AutoDetectParser, 
I'd like to set it via tika-config. I haven't seen a tika-config parameter that 
would set that value, and the manual on Configuring Tika doesn't mention it.

Many thanks,
Markus




RE: [VOTE] Release Apache Tika 1.22 Candidate #4

2019-07-30 Thread Markus Jelsma
Hello,

Yes, everything builds fine and all tests pass. I did notice all HTTP lookups 
to https://opennlp.sourceforge.net/ failing with Connection Refused.

+1

Regards,
Markus




-Original message-
> From:Oleg Tikhonov 
> Sent: Tuesday 30th July 2019 12:24
> To: d...@tika.apache.org
> Cc: user@tika.apache.org
> Subject: Re: [VOTE] Release Apache Tika 1.22 Candidate #4
> 
> Hi Tim, 
> thanks for the release !!! 
> Here is my +1, tested on Ubuntu 18.04.2 LTS, x86 arch. 
> 
> Best wishes, 
> Oleg
> 
> On Mon, Jul 29, 2019 at 8:50 PM Tim Allison  > wrote:
> A candidate for the Tika 1.22 release is available at:
 
> 
 
>   https://dist.apache.org/repos/dist/dev/tika/ 
> 
 
> 
 
> 
 
> The release candidate is a zip archive of the sources in:
 
> 
 
>   https://github.com/apache/tika/tree/1.22-rc4/ 
> 
 
> 
 
> 
 
> The SHA-512 checksum of the archive is
 
> 
 
>   
> bbdf2683a63a0e5fbe66f10eb88c29cd14128c3dd8c680bf1c86352c8068cd6d61358eb506f728f494c0dcd084af48f4312f832f6467863f58c3b90ab59e9966.
 
> 
 
> 
 
> In addition, a staged maven repository is available here:
 
> 
 
>   
> https://repository.apache.org/content/repositories/orgapachetika-1055/org/apache/tika
>  
> 
 
> 
 
> 
 
> Please vote on releasing this package as Apache Tika 1.22.
 
> 
 
> The vote is open for the next 72 hours and passes if a majority of at
 
> least three +1 Tika PMC votes are cast.
 
> 
 
> 
 
> [ ] +1 Release this package as Apache Tika 1.22
 
> 
 
> [ ] -1 Do not release this package because...
 
> 
 
> Here's my +1.  I've built this on Windows and Mac w and w/out spaces
 
> in the path. :P . Thank you for your patience.
 
> 
 
> Cheers,
 
> 
 
>        Tim
 


RE: [ANNOUNCE] Apache Tika 1.21 released

2019-05-21 Thread Markus Jelsma
Thanks Tim, and all contributors to 1.21!

Regards,
Markus
 
-Original message-
> From:Tim Allison 
> Sent: Monday 20th May 2019 4:20
> To: annou...@apache.org; d...@tika.apache.org; user@tika.apache.org
> Subject: [ANNOUNCE] Apache Tika 1.21 released
> 
> The Apache Tika project is pleased to announce the release of Apache Tika
> 1.21. The release contents have been pushed out to the main Apache
> release site and to the Maven Central sync, so the releases should be
> available as soon as the mirrors get the syncs.
> 
> Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser
> libraries.
> 
> Apache Tika 1.21 contains a number of improvements and bug fixes.
> Details can be found in the changes file:
> https://www.apache.org/dist/tika/CHANGES-1.21.txt
> 
> Apache Tika is available on the download page:
> https://tika.apache.org/download.html
> 
> Apache Tika is also available in binary form or for use using Maven 2
> from the Central Repository:
> https://repo1.maven.org/maven2/org/apache/tika/
> 
> In the initial 48 hours, the release may not be available on all mirrors.
> When downloading from a mirror site, please remember to verify the
> downloads using signatures found:
> https://www.apache.org/dist/tika/KEYS
> 
> For more information on Apache Tika, visit the project home page:
> https://tika.apache.org/
> 
> -- Tim Allison, on behalf of the Apache Tika community
> 


RE: [VOTE] Release Apache Tika 1.21 Candidate #1

2019-05-14 Thread Markus Jelsma
+1

 
 
-Original message-
> From:Giovanni De Stefano 
> Sent: Tuesday 14th May 2019 18:42
> To: user@tika.apache.org; d...@tika.apache.org
> Subject: Re: [VOTE] Release Apache Tika 1.21 Candidate #1
> 
> 
> From: Oleg Tikhonov 
 
> Reply-To: "user@tika.apache.org" 
 
> Date: Tuesday, 14 May 2019 at 15:59
 
> To: "d...@tika.apache.org" 
 
> Cc: "user@tika.apache.org" 
 
> Subject: Re: [VOTE] Release Apache Tika 1.21 Candidate #1 
> :-) 
> I'm good with any option. RC1 seems to be good from my point of view. 
> Cheers, 
> Oleg 
> On Tue, May 14, 2019 at 3:56 PM Tim Allison  > wrote: 
> All,
 
>   I'm happy to close rc1 and respin an rc2 after Oleg's findings
 
> (TIKA-2871 and TIKA-2872)...many thanks, Oleg!  I'm also happy to
 
> proceed with rc1 as is...Let me know your preferences.
 
> 
 
>           Cheers,
 
> 
 
>                        Tim
 
> 
 
> On Mon, May 13, 2019 at 1:32 PM Tim Allison  > wrote:
 
> >
 
> > A candidate for the Tika 1.21 release is available at:
 
> >
 
> >   https://dist.apache.org/repos/dist/dev/tika/ 
> >
 
> >
 
> > The release candidate is a zip archive of the sources in:
 
> >   https://github.com/apache/tika/tree/1.21-rc1/ 
> >
 
> >
 
> > The SHA-512 checksum of the archive is:
 
> > 4bc861f3b9ba37df14726d8acf173185a5414b88774c0b00c1f82140e290ebdac1a146952a0dd3755a29e7281cb45f55dceb96c7d7de5aef55fa5923f1164ac2.
 
> >
 
> >
 
> > In addition, a staged maven repository is available here:
 
> > 
 
> https://repository.apache.org/content/repositories/orgapachetika-1047/org/apache/tika
>  
> 
 
> >
 
> >
 
> > Please vote on releasing this package as Apache Tika 1.21.
 
> >
 
> > The vote is open for the next 72 hours and passes if a majority of at
 
> > least three +1 Tika PMC votes are cast.
 
> >
 
> > [ ] +1 Release this package as Apache Tika 1.21
 
> > [ ] -1 Do not release this package because...
 
> >
 
> > Here's my +1.
 
> >
 
> > Cheers,
 
> >
 
> >       Tim 


RE: TikaServer - extract only a specific part of HTML page

2019-01-09 Thread Markus Jelsma
Hello Harinder,

You could try Boilerpipe, which is integrated in Tika; it tries to solve this 
problem automatically. If that doesn't work for you, you can create a custom 
ContentHandler and collect text only inside the div that has the ID you want.

We do something similar to Boilerpipe, and both extend ContentHandler. In the 
overridden methods you can check for the div element and the ID attribute value 
in startElement(), and, if the conditions are right, collect text in 
the characters() method.
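A minimal sketch of such a handler, built on the plain SAX DefaultHandler so it is self-contained; the target id is a placeholder, and a Tika ContentHandler would be wired up the same way.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Collects character data only inside the div with a given id. */
public class DivTextHandler extends DefaultHandler {
    private final String targetId;
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // nesting level inside the target div; 0 means outside

    public DivTextHandler(String targetId) {
        this.targetId = targetId;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        String name = local.isEmpty() ? qName : local; // qName when not namespace-aware
        if (depth > 0) {
            depth++; // an element nested inside the target div
        } else if ("div".equals(name) && targetId.equals(atts.getValue("id"))) {
            depth = 1; // entered the target div
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (depth > 0) {
            depth--; // leaving a nested element, or the target div itself
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length); // only text inside the target div
        }
    }

    public String getText() {
        return text.toString();
    }
}
```

The depth counter handles nested divs and other elements inside the target, so text anywhere below the matching div is kept and everything else is ignored.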

Regards,
Markus

 
 
-Original message-
> From:Hanjan, Harinder 
> Sent: Wednesday 9th January 2019 22:06
> To: 'user@tika.apache.org' 
> Subject: TikaServer - extract only a specific part of HTML page
> 
> Hello! 
> I was wondering if there is a way to instruct Tika Server to extract content 
> only with in a div tag.
 
> I am extracting a Sharepoint site and do not want to see text from header, 
> footer etc. The important text is always inside a particular content div. I 
> only want text from inside that div. 
> Previously, I had switched to using the /tika/main endpoint. While this has 
> definitely given us some improvement, there are still many cases where text 
> from the header is also extracted. 
> Thanks! 
> Harinder 
> 
 
> --- 
> NOTICE -
 
> This communication is intended ONLY for the use of the person or entity named 
> above and may contain information that is confidential or legally privileged. 
> If you are not the intended recipient named above or a person responsible for 
> delivering messages or
 
>  communications to the intended recipient, YOU ARE HEREBY NOTIFIED that any 
> use, distribution, or copying of this communication or any of the information 
> contained in it is strictly prohibited. If you have received this 
> communication in error, please notify
 
>  us immediately by telephone and then destroy or delete this communication, 
> or return it to us by mail if requested by us. The City of Calgary thanks you 
> for your attention and co-operation.
 


RE: Encoding issues when upgrading Tika 1.17 to 1.19.1

2018-10-18 Thread Markus Jelsma
Hello Tim,

Opened two issues to track the problems:
https://issues.apache.org/jira/browse/TIKA-2758
https://issues.apache.org/jira/browse/TIKA-2759

Many thanks,
Markus
 
-Original message-
> From:Tim Allison 
> Sent: Wednesday 17th October 2018 16:53
> To: user@tika.apache.org
> Subject: Re: Encoding issues when upgrading Tika 1.17 to 1.19.1
> 
> Hi Markus,
> 
>   On the scripts...we added an "extractScripts" option, but the
> default is false, and the idea is that the scripts should be extracted
> as embedded documents, which with xhtml, would be inlined.  But, with
> the default as false, you shouldn't be seeing anything from scripts.
> 
>   On charset detection, that was likely caused by our "upgrade" to a
> more recent copy of icu4j's charset detector.
> 
>   Thank you for letting us know about these.  Please do open issues
> and share files.
> 
>Cheers,
> 
>   Tim
> On Wed, Oct 17, 2018 at 10:24 AM Markus Jelsma
>  wrote:
> >
> > Hello,
> >
> > I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran 
> > all 995 unit tests and observed three failures, two encoding issues and one 
> > other weird thing. The tests use real HTML.
> >
> > Where we previously extracted text  such as 'Spokane, Wash. [— The solar' 
> > we now got 'Spokane, Wash. [â€" The solar' in one test. The other had 
> > 'could take ["weeks, or' but we not get 'could take [“weeks, or' 
> > extracted. Our tests pass with 1.17 but fail with 1.18 and 1.19.1.
> >
> > The other test fails because we suddenly extracted a bunch of Javascript as 
> > text content while instead it is actually a script tag with base64 inline. 
> > This inline code is decoded and reported in the characters() method of our 
> > custom ContentHandler, and ends up as text being extracted, but it seems 
> > the Javascript start tag itself is never reported to startElement(). The 
> > Javascript is reported to characters() after we left the head and entered 
> > the body.
> >
> > Any idea on how to fix this encoding issue and the weird inline base64 
> > Javascript? Are there any Tika options that i am unaware of? Are these bugs?
> >
> > Of course, i can share the HTML files if needed.
> >
> > Many thanks,
> > Markus
> 


Encoding issues when upgrading Tika 1.17 to 1.19.1

2018-10-17 Thread Markus Jelsma
Hello,

I started to upgrade our SAX parser's Tika dependency from 1.17 to 1.19, ran all 
995 unit tests and observed three failures: two encoding issues and one other 
oddity. The tests use real HTML.

Where we previously extracted text such as 'Spokane, Wash. [— The solar' we 
now got 'Spokane, Wash. [â€" The solar' in one test. The other had 'could take 
["weeks, or' but we now get 'could take [“weeks, or' extracted. Our tests 
pass with 1.17 but fail with 1.18 and 1.19.1. 

The other test fails because we suddenly extracted a bunch of Javascript as 
text content, while it is actually a script tag containing inline base64. This 
inline code is decoded and reported in the characters() method of our custom 
ContentHandler, and ends up as text being extracted, but it seems the 
Javascript start tag itself is never reported to startElement(). The Javascript 
is reported to characters() after we left the head and entered the body.

Any idea on how to fix this encoding issue and the weird inline base64 
Javascript? Are there any Tika options that i am unaware of? Are these bugs? 

Of course, i can share the HTML files if needed.

Many thanks,
Markus
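
For anyone hitting the same script leakage: it can also be defended against in
the ContentHandler itself, by counting script depth and dropping characters()
events inside it. A minimal sketch with plain JDK SAX (SkipScriptHandler is a
made-up name; real Tika parsing delivers XHTML SAX events to your handler, not
this toy input, but the callbacks are the same):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SkipScriptHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int scriptDepth = 0; // > 0 while inside a <script> element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("script".equalsIgnoreCase(qName)) {
            scriptDepth++;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("script".equalsIgnoreCase(qName)) {
            scriptDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // drop character events that originate inside <script>
        if (scriptDepth == 0) {
            text.append(ch, start, length);
        }
    }

    public String getText() {
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><script>var s = atob('aGVsbG8=');</script>"
                + "</head><body>visible text</body></html>";
        SkipScriptHandler handler = new SkipScriptHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(html)), handler);
        System.out.println(handler.getText()); // prints: visible text
    }
}
```

This catches decoded script content regardless of where the parser chooses to
report it, as long as the script start element is reported at all.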


RE: [ANNOUNCE] Apache Tika 1.19.1 released

2018-10-09 Thread Markus Jelsma
Thanks Tim, and of course all contributors!

 
 
-Original message-
> From:Tim Allison 
> Sent: Tuesday 9th October 2018 21:58
> To: annou...@apache.org; d...@tika.apache.org; user@tika.apache.org
> Subject: [ANNOUNCE] Apache Tika 1.19.1 released
> 
> The Apache Tika project is pleased to announce the release of Apache
> Tika 1.19.1. The release contents have been pushed out to the main
> Apache release site and to the Maven Central sync, so the releases
> should be available as soon as the mirrors get the syncs.
> 
> Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser
> libraries.
> 
> Apache Tika 1.19.1 contains two critical bug fixes to the
> MP3Parser and the handling of SAX parsing.  Details can be found in
> the changes file:
> https://www.apache.org/dist/tika/CHANGES-1.19.1.txt
> 
> Apache Tika is available on the download page:
> https://tika.apache.org/download.html
> 
> Apache Tika is also available in binary form or for use using Maven 2
> from the Central Repository:
> https://repo1.maven.org/maven2/org/apache/tika/
> 
> In the initial 48 hours, the release may not be available on all mirrors.
> 
> When downloading from a mirror site, please remember to verify the
> downloads using signatures found on the Apache site:
> https://www.apache.org/dist/tika/KEYS
> 
> For more information on Apache Tika, visit the project home page:
> https://tika.apache.org/
> 
> -- Tim Allison, on behalf of the Apache Tika community
> 


Attributes of HTML element not reported in ContentHandler

2018-08-29 Thread Markus Jelsma
Hello,

We parse HTML using a ContentHandler. Tika uses TagSoup, which does not support 
modern HTML, but we work around the problem by fiddling with its HTMLSchema. Now 
we have access to HTML5 elements, and other curiosities such as allowing META 
anywhere in the body.

What we never managed to get to work is reading the attributes of the HTML 
element itself. So, any ideas on how to get those attributes reported reliably?

Many thanks,
Markus
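
For reference, this is where the attributes would arrive if the parser reports
them at all: the Attributes argument of startElement(). With TagSoup underneath,
whether the root element's attributes survive depends on the schema fiddling
described above; this plain-JDK sketch (RootAttributeHandler is a hypothetical
name) only shows the receiving side:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class RootAttributeHandler extends DefaultHandler {
    public final Map<String, String> htmlAttributes = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // copy the attributes of the root <html> element when it is reported
        if ("html".equalsIgnoreCase(qName)) {
            for (int i = 0; i < atts.getLength(); i++) {
                htmlAttributes.put(atts.getQName(i), atts.getValue(i));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<html lang=\"en\" dir=\"ltr\"><body>hi</body></html>";
        RootAttributeHandler handler = new RootAttributeHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(html)), handler);
        System.out.println(handler.htmlAttributes); // prints: {lang=en, dir=ltr}
    }
}
```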


RE: Tika detects short Japanese sentences as Chinese

2018-04-06 Thread Markus Jelsma
Hi - We see this too with Japanese where just a few kanji can spoil the 
detection. The only solution i see is creating a better model.

Markus

 
 
-Original message-
> From:ar...@codec.ai 
> Sent: Friday 6th April 2018 12:51
> To: user@tika.apache.org
> Subject: Re: Tika detects short Japanese sentences as Chinese
> 
> Hi Ken, yes it's OptimaizeLangDetector.
> Should I post it to optimaize mailing list?
> 
> On 2018/04/05 18:42:25, Ken Krugler  wrote: 
> > Hi Artur,
> > 
> > Is the detector that you get back from getDefaultLanguageDetector the 
> > OptimaizeLangDetector?
> > 
> > — Ken
> > 
> > 
> > > On Apr 3, 2018, at 2:55 AM, Artur Rashitov  wrote:
> > > 
> > > Given the following code:
> > > 
> > > val japanese = "私はガラスを食べられます。それは私を傷つけません。"
> > > LanguageDetector.getDefaultLanguageDetector.loadModels().detectAll(japanese)
> > > 
> > > it produces [zh-CN: MEDIUM (0.579961), zh-TW: MEDIUM (0.405015)]
> > > And the same thing for many short Japanese sentences.
> > > 
> > > Apache Tika 1.17
> > 
> > 
> > http://about.me/kkrugler
> > +1 530-210-6378
> > 
> > 


RE: [VOTE] Release Apache Tika 1.17 Candidate #1

2017-12-08 Thread Markus Jelsma
+1

 
 
-Original message-
> From:Tim Allison 
> Sent: Friday 8th December 2017 20:52
> To: d...@tika.apache.org; user@tika.apache.org
> Subject: [VOTE] Release Apache Tika 1.17 Candidate #1
> 
> A candidate for the Tika 1.17 release is available at: 
>   https://dist.apache.org/repos/dist/dev/tika/ 
> 
> The release candidate is a zip archive of the sources in: 
>   https://github.com/apache/tika/tree/1.17-rc1 
> 
> The SHA1 checksum of the archive is 
>   37f3cd19051160a8c488b1aa7ff25c3ae515c359. 
> 
> In addition, a staged maven repository is available here: 
>   https://repository.apache.org/content/repositories/orgapachetika-1027 
> 
> Please vote on releasing this package as Apache Tika 1.17. 
> The vote is open for the next 72 hours and passes if a majority of at 
> least three +1 Tika PMC votes are cast. 
> 
> [ ] +1 Release this package as Apache Tika 1.17 
> [ ] -1 Do not release this package because... 
> 
> +1 from me 
> 
> Please note that this will be the last major version that will run with Java 
> 7. 
> The next major release will require Java 8. 


RE: Using TikaConfig troubles

2017-11-03 Thread Markus Jelsma
Awesome!!

This works very well!

Thanks,
Markus
 
-Original message-
> From:Nick Burch 
> Sent: Friday 3rd November 2017 18:13
> To: user@tika.apache.org
> Subject: Re: Using TikaConfig troubles
> 
> On Fri, 3 Nov 2017, Markus Jelsma wrote:
> > This is how Nutch gets the parser:
> > Parser parser = tikaConfig.getParser(MediaType.parse(mimeType));
> >
> > When no custom config is specified config is:
> > new TikaConfig(this.getClass().getClassLoader());
> >
> > When i specify a custom config, it is:
> > tikaConfig = new TikaConfig(conf.getResource(customConfFile));
> 
> I think you need to give both the classloader and the config file for your 
> setup
> 
> Can you try this constructor:
> https://tika.apache.org/1.16/api/org/apache/tika/config/TikaConfig.html#TikaConfig-java.net.URL-java.lang.ClassLoader-
> 
> With something like
>new TikaConfig(conf.getResource(customConfFile),
>   this.getClass().getClassLoader());
> 
> Nick
> 


RE: Using TikaConfig troubles

2017-11-03 Thread Markus Jelsma
Added: https://issues.apache.org/jira/browse/TIKA-2491

To be clear, it really works well outside of Nutch.

Thanks again!
Markus

 
 
-Original message-
> From:Allison, Timothy B. 
> Sent: Friday 3rd November 2017 16:40
> To: user@tika.apache.org
> Subject: RE: Using TikaConfig troubles
> 
> Ugh.  Sorry.  I'll take a look.  Can you share your custom config file?  This 
> sounds like a bug, so please hang it on a new issue. ☹
> 
> -----Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Friday, November 3, 2017 11:12 AM
> To: user@tika.apache.org
> Subject: Using TikaConfig troubles
> 
> Hello,
> 
> I need to use a custom tika-config.xml in Nutch, which has support for it but 
> i can't get it to work. 
> 
> This is how Nutch gets the parser:
> Parser parser = tikaConfig.getParser(MediaType.parse(mimeType));
> 
> When no custom config is specified config is:
> new TikaConfig(this.getClass().getClassLoader());
> 
> When i specify a custom config, it is:
> tikaConfig = new TikaConfig(conf.getResource(customConfFile));
> 
> getParser always returns null with a custom config file. There are no errors 
> or exceptions. The config is fine, it fixed the encoding problem in a parser 
> outside of Nutch (thanks again Timothy) but i need to get it to work in Nutch 
> too.
> 
> Our external project does:
> AutoDetectParser parser = new AutoDetectParser(tikaConfig); parser.parse(..);
> 
> and it just works! If i do this in Nutch, however, nothing is passed through 
> the content handlers, the parser result is completely empty? HUH?!?
> 
> Any tips would be great!
> 
> Many thanks,
> Markus 
> 
> 
> 


Using TikaConfig troubles

2017-11-03 Thread Markus Jelsma
Hello,

I need to use a custom tika-config.xml in Nutch, which has support for it but i 
can't get it to work. 

This is how Nutch gets the parser:
Parser parser = tikaConfig.getParser(MediaType.parse(mimeType));

When no custom config is specified config is:
new TikaConfig(this.getClass().getClassLoader());

When i specify a custom config, it is:
tikaConfig = new TikaConfig(conf.getResource(customConfFile));

getParser always returns null with a custom config file. There are no errors or 
exceptions. The config is fine, it fixed the encoding problem in a parser 
outside of Nutch (thanks again Timothy) but i need to get it to work in Nutch 
too.

Our external project does:
AutoDetectParser parser = new AutoDetectParser(tikaConfig);
parser.parse(..);

and it just works! If i do this in Nutch, however, nothing is passed through 
the content handlers, the parser result is completely empty? HUH?!?

Any tips would be great!

Many thanks,
Markus 



RE: [jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-11-03 Thread Markus Jelsma
No worries! Any idea when 1.17 gets voted on? Instead of upgrading Nutch to 
1.16, i would prefer to go straight to 1.17.

Thanks,
Markus

-Original message-
> From:Allison, Timothy B. 
> Sent: Friday 3rd November 2017 15:29
> To: user@tika.apache.org; Markus Jelsma 
> Subject: RE: [jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16
> 
> Markus,
>   Sorry for my delay on this.  See TIKA-2490.
> 
> -----Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Tuesday, October 17, 2017 5:01 PM
> To: user@tika.apache.org
> Subject: FW: [jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16
> 
> Hello,
> 
> I tried to update Nutch to 1.16, it works but we get these messages on 
> stderr. It works, all is well. But we would like to get rid of the messages 
> if possible.
> 
> Ideas?
> 
> Many thanks,
> Markus
> 
> fetching: https://www.sitesearch.io/
> robots.txt whitelist not configured.
> Oct 11, 2017 5:50:50 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored See 
> https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> TIFFImageWriter not loaded. tiff files will not be processed See 
> https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> 
> Oct 11, 2017 5:50:50 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> parsing: https://www.sitesearch.io/
> 
>  
>  
> -Original message-
> > From:Sebastian Nagel (JIRA) 
> > Sent: Tuesday 17th October 2017 22:48
> > To: d...@nutch.apache.org
> > Subject: [jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16
> > 
> > 
> > [ 
> >https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jir
> >a.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=162083
> >09#comment-16208309 ]
> > 
> > Sebastian Nagel commented on NUTCH-2439:
> > 
> > 
> > +1   Tika-core 1.16 already slipped in as a dependency of crawler-commons 0.8.
> > 
> > The Tika warnings to stderr are annoying. Looks like they cannot be 
> > suppressed via Nutch's log4j.properties. Or is there a way?
> > 
> > > Upgrade to Apache Tika 1.16
> > > ---
> > >
> > > Key: NUTCH-2439
> > > URL: 
> > >https://issues.apache.org/jira/browse/NUTCH-2439
> > > Project: Nutch
> > >  Issue Type: Improvement
> > >    Affects Versions: 1.13
> > >    Reporter: Markus Jelsma
> > >    Assignee: Markus Jelsma
> > > Fix For: 1.14
> > >
> > > Attachments: NUTCH-2439.patch, NUTCH-2439.patch
> > >
> > >
> > 
> > 
> > 
> > 
> > --
> > This message was sent by Atlassian JIRA
> > (v6.4.14#64029)
> > 
> 


RE: Incorrect encoding detected

2017-11-02 Thread Markus Jelsma
Hello Tim,

Thanks! I will try the nightly build tomorrow!

Nutch probably already has support for tika-config. I couldn't find it in the 
config, but in the code i spotted support for tika.config.file. 

Many, many thanks!
Markus
 
 
-Original message-
> From:Allison, Timothy B. 
> Sent: Thursday 2nd November 2017 14:56
> To: user@tika.apache.org
> Subject: RE: Incorrect encoding detected
> 
> Hi Markus,
>   I just committed TIKA-2485.  See the issue for the commit, if you have to 
>make these changes on your local Tika build.
> 
> Looks like tika-config.xml was not added here: 
> https://issues.apache.org/jira/browse/NUTCH-577.  
> I wonder if it was added to Nutch later.  If it wasn't, I'd highly encourage 
> re-opening this issue and adding it back in!
> 
> To build an AutoDetectParser from a tika-config.xml file, do something like 
> this (but with correct exception handling/closing!!!):
> 
> TikaConfig tikaConfig = new TikaConfig( 
> getResourceAsStream("/org/apache/tika/config/TIKA-2485-encoding-detector-mark-limits.xml"));
> 
> AutoDetectParser p = new AutoDetectParser(tikaConfig);
> 
> Note that the order of the encoding detectors matters!  The first one that 
> returns a non-null result is the one that Tika uses.  The default encoding 
> detector order is as I specified it in 
> "TIKA-2485-encoding-detector-mark-limits.xml": HTML, Universal, ICU4j.  The 
> default order is specified via SPI here in 
> tika-parsers/src/main/resources/META-INF/services/o.a.t.detect.EncodingDetector
> 
> Let us know if there's anything else we can do.
> 
> Best,
> 
>    Tim
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Wednesday, November 1, 2017 5:32 PM
> To: user@tika.apache.org
> Subject: RE: Incorrect encoding detected
> 
> Alright, the Nutch list could not provide an answer, and i don't know myself. 
> But, if Nutch can't, we can make it happen. Can you direct me to a page that 
> explains how tika-config has to be passed to Tika? We have full control over 
> what we put into the Parser, e.g. ContentHandler, Context, etc. 
> 
> If we can do that, i just need to know what to set to increase the limit. I 
> am unaware of Tika having @Field config methods, its new to me.
> 
> But you said it was not supported yet, so that would mean the content limit 
> would not adhere to @Field config?
> 
> That is fine too, but i really need a short-term solution. If needed i can 
> manually patch Tika and have our (i am not speaking as a Nutch committer 
> right now) parser use an in-house compiled Tika.
> 
> I checked the encoding package in tika-core. There are many detection classes 
> there but i really have no idea which detector Nutch (or Tika by default) 
> uses under the hood. I could not easily find the file in which i could 
> increase the limit.
> 
> I am happy with this hack for a brief time, until it is supported by a new 
> Tika version. Can you direct me to the class i should modify?
> 
> Many, many thanks!
> Markus
> 
> -Original message-
> > From:Allison, Timothy B. 
> > Sent: Tuesday 31st October 2017 13:11
> > To: user@tika.apache.org
> > Subject: RE: Incorrect encoding detected
> > 
> > For 1.17, the simplest solution, I think, is to allow users to configure 
> > extending the detection limit via our @Field config methods, that is, via 
> > tika-config.xml.
> > 
> > To confirm, Nutch will allow users to specify a tika-config file?  Will 
> > this work for you and Nutch?
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> > Sent: Tuesday, October 31, 2017 5:47 AM
> > To: user@tika.apache.org
> > Subject: RE: Incorrect encoding detected
> > 
> > Hello Timothy - what would be your preferred solution? Increase detection 
> > limit or skip inline styles and possibly other useless head information?
> > 
> > Thanks,
> > Markus
> > 
> >  
> >  
> > -Original message-
> > > From:Markus Jelsma 
> > > Sent: Friday 27th October 2017 15:37
> > > To: user@tika.apache.org
> > > Subject: RE: Incorrect encoding detected
> > > 
> > > Hi Tim,
> > > 
> > > I have opened TIKA-2485 to track the problem. 
> > > 
> > > Thank you very very much!
> > > Markus
> > > 
> > >  
> > >  
> > > -Original message-
> > > > From:Allison, Timothy B. 
> > > > Sent: Friday 27th October 2017 15:33
> > > > To: user@tika.apache.org
&

RE: Incorrect encoding detected

2017-11-01 Thread Markus Jelsma
Alright, the Nutch list could not provide an answer, and i don't know myself. 
But, if Nutch can't, we can make it happen. Can you direct me to a page that 
explains how tika-config has to be passed to Tika? We have full control over 
what we put into the Parser, e.g. ContentHandler, Context, etc. 

If we can do that, i just need to know what to set to increase the limit. I am 
unaware of Tika having @Field config methods; it's new to me.

But you said it was not supported yet, so that would mean the content limit 
would not adhere to @Field config?

That is fine too, but i really need a short-term solution. If needed i can 
manually patch Tika and have our (i am not speaking as a Nutch committer right 
now) parser use an in-house compiled Tika.

I checked the encoding package in tika-core. There are many detection classes 
there but i really have no idea which detector Nutch (or Tika by default) uses 
under the hood. I could not easily find the file in which i could increase the 
limit.

I am happy with this hack for a brief time, until it is supported by a new Tika 
version. Can you direct me to the class i should modify?

Many, many thanks!
Markus

-Original message-
> From:Allison, Timothy B. 
> Sent: Tuesday 31st October 2017 13:11
> To: user@tika.apache.org
> Subject: RE: Incorrect encoding detected
> 
> For 1.17, the simplest solution, I think, is to allow users to configure 
> extending the detection limit via our @Field config methods, that is, via 
> tika-config.xml.
> 
> To confirm, Nutch will allow users to specify a tika-config file?  Will this 
> work for you and Nutch?
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Tuesday, October 31, 2017 5:47 AM
> To: user@tika.apache.org
> Subject: RE: Incorrect encoding detected
> 
> Hello Timothy - what would be your preferred solution? Increase detection 
> limit or skip inline styles and possibly other useless head information?
> 
> Thanks,
> Markus
> 
>  
>  
> -Original message-
> > From:Markus Jelsma 
> > Sent: Friday 27th October 2017 15:37
> > To: user@tika.apache.org
> > Subject: RE: Incorrect encoding detected
> > 
> > Hi Tim,
> > 
> > I have opened TIKA-2485 to track the problem. 
> > 
> > Thank you very very much!
> > Markus
> > 
> >  
> >  
> > -Original message-
> > > From:Allison, Timothy B. 
> > > Sent: Friday 27th October 2017 15:33
> > > To: user@tika.apache.org
> > > Subject: RE: Incorrect encoding detected
> > > 
> > > Unfortunately there is no way to do this now.  _I think_ we could make 
> > > this configurable, though, fairly easily.  Please open a ticket.
> > > 
> > > The next RC for PDFBox might be out next week, and we'll try to release 
> > > Tika 1.17 shortly after that...so there should be time to get this in.
> > > 
> > > -Original Message-
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> > > Sent: Friday, October 27, 2017 9:12 AM
> > > To: user@tika.apache.org
> > > Subject: RE: Incorrect encoding detected
> > > 
> > > Hello Tim,
> > > 
> > > Getting rid of script and style contents sounds plausible indeed. But to 
> > > work around the problem for now, can i instruct HTMLEncodingDetector from 
> > > within Nutch to look beyond the limit?
> > > 
> > > Thanks!
> > > Markus
> > > 
> > >  
> > >  
> > > -Original message-
> > > > From:Allison, Timothy B. 
> > > > Sent: Friday 27th October 2017 14:53
> > > > To: user@tika.apache.org
> > > > Subject: RE: Incorrect encoding detected
> > > > 
> > > > Hi Markus,
> > > >   
> > > > My guess is that the ~32,000 characters of mostly ascii-ish  
> > > > are what is actually being used for encoding detection.  The 
> > > > HTMLEncodingDetector only looks in the first 8,192 characters, and the 
> > > > other encoding detectors have similar (but longer?) restrictions.
> > > >  
> > > > At some point, I had a dev version of a stripper that removed contents 
> > > > of  and  before trying to detect the 
> > > > encoding[0]...perhaps it is time to resurrect that code and integrate 
> > > > it?
> > > > 
> > > > Or, given that HTML has been, um, blossoming, perhaps, more simply, we 
> > > > should expand how far we look into a stream for detection?
> > > > 
> > > > Cheers,
> > > > 
> >

RE: Incorrect encoding detected

2017-10-31 Thread Markus Jelsma
Hello Timothy - what would be your preferred solution? Increase detection limit 
or skip inline styles and possibly other useless head information?

Thanks,
Markus

 
 
-Original message-
> From:Markus Jelsma 
> Sent: Friday 27th October 2017 15:37
> To: user@tika.apache.org
> Subject: RE: Incorrect encoding detected
> 
> Hi Tim,
> 
> I have opened TIKA-2485 to track the problem. 
> 
> Thank you very very much!
> Markus
> 
>  
>  
> -Original message-
> > From:Allison, Timothy B. 
> > Sent: Friday 27th October 2017 15:33
> > To: user@tika.apache.org
> > Subject: RE: Incorrect encoding detected
> > 
> > Unfortunately there is no way to do this now.  _I think_ we could make this 
> > configurable, though, fairly easily.  Please open a ticket.
> > 
> > The next RC for PDFBox might be out next week, and we'll try to release 
> > Tika 1.17 shortly after that...so there should be time to get this in.
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> > Sent: Friday, October 27, 2017 9:12 AM
> > To: user@tika.apache.org
> > Subject: RE: Incorrect encoding detected
> > 
> > Hello Tim,
> > 
> > Getting rid of script and style contents sounds plausible indeed. But to 
> > work around the problem for now, can i instruct HTMLEncodingDetector from 
> > within Nutch to look beyond the limit?
> > 
> > Thanks!
> > Markus
> > 
> >  
> >  
> > -Original message-
> > > From:Allison, Timothy B. 
> > > Sent: Friday 27th October 2017 14:53
> > > To: user@tika.apache.org
> > > Subject: RE: Incorrect encoding detected
> > > 
> > > Hi Markus,
> > >   
> > > My guess is that the ~32,000 characters of mostly ascii-ish  are 
> > > what is actually being used for encoding detection.  The 
> > > HTMLEncodingDetector only looks in the first 8,192 characters, and the 
> > > other encoding detectors have similar (but longer?) restrictions.
> > >  
> > > At some point, I had a dev version of a stripper that removed contents of 
> > >  and  before trying to detect the encoding[0]...perhaps 
> > > it is time to resurrect that code and integrate it?
> > > 
> > > Or, given that HTML has been, um, blossoming, perhaps, more simply, we 
> > > should expand how far we look into a stream for detection?
> > > 
> > > Cheers,
> > > 
> > >Tim
> > > 
> > > [0] https://issues.apache.org/jira/browse/TIKA-2038
> > >
> > > 
> > > -Original Message-
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> > > Sent: Friday, October 27, 2017 8:39 AM
> > > To: user@tika.apache.org
> > > Subject: Incorrect encoding detected
> > > 
> > > Hello,
> > > 
> > > We have a problem with Tika, encoding and pages on this website: 
> > > https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> > > 
> > > Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that 
> > > the regular HTML parser does a fine job, but our TikaParser has a tough 
> > > job dealing with this HTML. For some reason Tika thinks 
> > > Content-Encoding=windows-1252 is what this webpage says it is, instead 
> > > the page identifies itself properly as UTF-8.
> > > 
> > > Of all websites we index, this is so far the only one giving trouble 
> > > indexing accents, getting fÃ¥ instead of a regular få.
> > > 
> > > Any tips to spare? 
> > > 
> > > Many many thanks!
> > > Markus
> > > 
> > 
> 


RE: Incorrect encoding detected

2017-10-27 Thread Markus Jelsma
Hi Tim,

I have opened TIKA-2485 to track the problem. 

Thank you very very much!
Markus

 
 
-Original message-
> From:Allison, Timothy B. 
> Sent: Friday 27th October 2017 15:33
> To: user@tika.apache.org
> Subject: RE: Incorrect encoding detected
> 
> Unfortunately there is no way to do this now.  _I think_ we could make this 
> configurable, though, fairly easily.  Please open a ticket.
> 
> The next RC for PDFBox might be out next week, and we'll try to release Tika 
> 1.17 shortly after that...so there should be time to get this in.
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Friday, October 27, 2017 9:12 AM
> To: user@tika.apache.org
> Subject: RE: Incorrect encoding detected
> 
> Hello Tim,
> 
> Getting rid of script and style contents sounds plausible indeed. But to work 
> around the problem for now, can i instruct HTMLEncodingDetector from within 
> Nutch to look beyond the limit?
> 
> Thanks!
> Markus
> 
>  
>  
> -Original message-
> > From:Allison, Timothy B. 
> > Sent: Friday 27th October 2017 14:53
> > To: user@tika.apache.org
> > Subject: RE: Incorrect encoding detected
> > 
> > Hi Markus,
> >   
> > My guess is that the ~32,000 characters of mostly ascii-ish  are 
> > what is actually being used for encoding detection.  The 
> > HTMLEncodingDetector only looks in the first 8,192 characters, and the 
> > other encoding detectors have similar (but longer?) restrictions.
> >  
> > At some point, I had a dev version of a stripper that removed contents of 
> >  and  before trying to detect the encoding[0]...perhaps it 
> > is time to resurrect that code and integrate it?
> > 
> > Or, given that HTML has been, um, blossoming, perhaps, more simply, we 
> > should expand how far we look into a stream for detection?
> > 
> > Cheers,
> > 
> >Tim
> > 
> > [0] https://issues.apache.org/jira/browse/TIKA-2038
> >
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> > Sent: Friday, October 27, 2017 8:39 AM
> > To: user@tika.apache.org
> > Subject: Incorrect encoding detected
> > 
> > Hello,
> > 
> > We have a problem with Tika, encoding and pages on this website: 
> > https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> > 
> > Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that the 
> > regular HTML parser does a fine job, but our TikaParser has a tough job 
> > dealing with this HTML. For some reason Tika thinks 
> > Content-Encoding=windows-1252 is what this webpage says it is, instead the 
> > page identifies itself properly as UTF-8.
> > 
> > Of all websites we index, this is so far the only one giving trouble 
> > indexing accents, getting fÃ¥ instead of a regular få.
> > 
> > Any tips to spare? 
> > 
> > Many many thanks!
> > Markus
> > 
> 


RE: Incorrect encoding detected

2017-10-27 Thread Markus Jelsma
Hello Tim,

Getting rid of script and style contents sounds plausible indeed. But to work 
around the problem for now, can i instruct HTMLEncodingDetector from within 
Nutch to look beyond the limit?

Thanks!
Markus

 
 
-Original message-
> From:Allison, Timothy B. 
> Sent: Friday 27th October 2017 14:53
> To: user@tika.apache.org
> Subject: RE: Incorrect encoding detected
> 
> Hi Markus,
>   
> My guess is that the ~32,000 characters of mostly ascii-ish  are 
> what is actually being used for encoding detection.  The HTMLEncodingDetector 
> only looks in the first 8,192 characters, and the other encoding detectors 
> have similar (but longer?) restrictions.
>  
> At some point, I had a dev version of a stripper that removed contents of 
>  and  before trying to detect the encoding[0]...perhaps it 
> is time to resurrect that code and integrate it?
> 
> Or, given that HTML has been, um, blossoming, perhaps, more simply, we should 
> expand how far we look into a stream for detection?
> 
> Cheers,
> 
>Tim
> 
> [0] https://issues.apache.org/jira/browse/TIKA-2038
>
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Friday, October 27, 2017 8:39 AM
> To: user@tika.apache.org
> Subject: Incorrect encoding detected
> 
> Hello,
> 
> We have a problem with Tika, encoding and pages on this website: 
> https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> 
> Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that the 
> regular HTML parser does a fine job, but our TikaParser has a tough job 
> dealing with this HTML. For some reason Tika thinks 
> Content-Encoding=windows-1252 is what this webpage says it is, instead the 
> page identifies itself properly as UTF-8.
> 
> Of all websites we index, this is so far the only one giving trouble indexing 
> accents, getting fÃ¥ instead of a regular få.
> 
> Any tips to spare? 
> 
> Many many thanks!
> Markus
> 


Incorrect encoding detected

2017-10-27 Thread Markus Jelsma
Hello,

We have a problem with Tika, encoding and pages on this website: 
https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser

Using Nutch with Tika 1.12, but also with Tika 1.16, we found that the regular 
HTML parser does a fine job, but our TikaParser has a tough time dealing with 
this HTML. For some reason Tika detects Content-Encoding=windows-1252 for this 
page, even though the page properly identifies itself as UTF-8.

Of all websites we index, this is so far the only one giving trouble indexing 
accents, getting fÃ¥ instead of a regular få.

Any tips to spare? 

Many many thanks!
Markus


FW: [jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16

2017-10-17 Thread Markus Jelsma
Hello,

I tried to update Nutch to Tika 1.16. It works, all is well, but we get these 
messages on stderr and would like to get rid of them if possible.

Ideas?

Many thanks,
Markus

fetching: https://www.sitesearch.io/
robots.txt whitelist not configured.
Oct 11, 2017 5:50:50 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Oct 11, 2017 5:50:50 PM org.apache.tika.config.InitializableProblemHandler$3 
handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
parsing: https://www.sitesearch.io/

 
 
-Original message-
> From:Sebastian Nagel (JIRA) 
> Sent: Tuesday 17th October 2017 22:48
> To: d...@nutch.apache.org
> Subject: [jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.16
> 
> 
> [ 
>https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208309#comment-16208309
> ] 
> 
> Sebastian Nagel commented on NUTCH-2439:
> 
> 
> +1   Tika-core 1.16 already slipped in as a dependency of crawler-commons 0.8.
> 
> The Tika warnings to stderr are annoying. Looks like they cannot be suppressed 
> via Nutch's log4j.properties. Or is there a way?
> 
> > Upgrade to Apache Tika 1.16
> > ---
> >
> > Key: NUTCH-2439
> > URL: https://issues.apache.org/jira/browse/NUTCH-2439
> > Project: Nutch
> >  Issue Type: Improvement
> >    Affects Versions: 1.13
> >    Reporter: Markus Jelsma
> >    Assignee: Markus Jelsma
> > Fix For: 1.14
> >
> > Attachments: NUTCH-2439.patch, NUTCH-2439.patch
> >
> >
> 
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.4.14#64029)
> 


ContentHandlers and CSS parsing

2017-09-19 Thread Markus Jelsma
Hello,

I am parsing HTML using Tika and collecting interesting pieces of it using a 
custom ContentHandler. In the ContentHandler i keep track of parent elements, 
classes etc. using stacks, so i know where i am in the mess.

Now, for each element in startElement i need to look-up the visibility, 
specifically check if display isn't none, and ignore it if so.

How? I have checked CSSParser, CSSBox and several other CSS parsing projects, 
but i haven't seen a simple straightforward API yet that lets me look-up its 
computed style for a given location.

Any suggestions?

Many thanks!
Markus
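Short of a full CSS engine, a common stopgap is to track inline style attributes on the same stacks the handler already keeps. A JDK-only sketch (the class name is hypothetical, and it only sees inline styles, not external stylesheets, which is exactly the gap the question is about):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Skips all character content inside subtrees whose inline style says
// display:none. External stylesheets are NOT consulted.
public class VisibilityHandler extends DefaultHandler {
    private int hiddenDepth = 0;            // >0 while inside a hidden subtree
    private final StringBuilder visible = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        String style = atts.getValue("style");
        boolean hidden = style != null
                && style.replace(" ", "").contains("display:none");
        if (hidden || hiddenDepth > 0) {
            hiddenDepth++;                  // propagate hidden state downwards
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (hiddenDepth > 0) {
            hiddenDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (hiddenDepth == 0) {
            visible.append(ch, start, length);
        }
    }

    public String getVisibleText() {
        return visible.toString();
    }

    public static void main(String[] args) throws Exception {
        VisibilityHandler h = new VisibilityHandler();
        String xml = "<div>shown<span style=\"display: none\">hidden</span></div>";
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
        System.out.println(h.getVisibleText());   // prints "shown"
    }
}
```

Running a SAX parse through this handler yields only the text outside display:none subtrees; a real solution would additionally resolve stylesheet rules at startElement time, which is what the question is asking for.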


RE: extract from URL text

2017-09-06 Thread Markus Jelsma
You can use Boilerpipe which is supported by Tika. Check out Nutch' TikaParser 
as an example.

Regards,
Markus

https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
 
-Original message-
> From:Francesco Viscomi 
> Sent: Wednesday 6th September 2017 14:12
> To: user@tika.apache.org
> Subject: extract from URL text
> 
> Hi all,
> I'm new to Tika and I'm trying to extract text from a web page, but I want 
> only the text inside the body; every other content I want to strip off. 
> I've been looking at some examples on the internet, but every example I 
> found so far isn't good because it does not strip off some tags inside the 
> menu, for example. Can someone help me?
> 
> thanks really much
> -- 
> Ing. Viscomi Francesco 


RE: HTML parsing, script tags,

2017-06-30 Thread Markus Jelsma
TagSoup is notorious for being utterly unmaintained, but it can be forced to do 
what, at least, i needed:

    // We'll change the schema to allow tables inside anchors!
    Schema schema = new HTMLSchema();

    // Have meta reported everywhere, also in the body
    schema.elementType("meta", HTMLSchema.M_EMPTY, 65535, 0);

    // https://issues.apache.org/jira/browse/TIKA-985
    String html5Elements[] = { "article", "aside", "audio", "bdi",
        "command", "datalist", "details", "embed", "summary", "figure",
        "figcaption", "footer", "header", "hgroup", "keygen", "mark",
        "meter", "nav", "output", "progress", "section", "source", "time",
        "track", "video", "figurecaption" };

    for (String html5Element : html5Elements) {
        schema.elementType(html5Element, HTMLSchema.M_ANY, 255, 0);
    }

    schema.elementType("a", HTMLSchema.M_ANY, 65535, 0);

    // Set up a parse context
    ParseContext context = new ParseContext();
    context.set(Schema.class, schema);
    context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

The changed HTMLSchema and the usage of IdentityHtmlMapper make it possible to 
return stuff that non-default TagSoup cannot.

Regards,
Markus
 
 
-Original message-
> From:Allison, Timothy B. 
> Sent: Friday 30th June 2017 17:13
> To: user@tika.apache.org
> Subject: RE: HTML parsing, script tags, 
> 
> Wait, TagSoup is not returning the start element events in the same order as 
> the HTML?  I don't think we can fix that or your other points, but would 
> you be willing to share triggering documents and open an issue for each 
> problem? 
> We should include those issues in our ongoing conversation about swapping out 
> the underlying html parser for something more modern. 
> Sorry Tika isn’t working for you on this, and thank you! 
> From: Jim Idle [mailto:ji...@proofpoint.com] 
 
> Sent: Friday, June 30, 2017 1:23 AM
 
> To: user@tika.apache.org
 
> Subject: RE: HTML parsing, script tags,  
> Well I got a long way with the Tika wrapper around TagSoup, but then while 
> chasing down a bug I realized that I was not getting the startElement events 
> in the order that they are seen in the HTML file. It also ignores  and 
> unknown elements. 
> I can't see any way to change that, and as knowing the structure of the 
> document is very important, I will have to stop using Tika for HTML I guess 
> and go back to validator.nu. 
> Just posting this here for posterity really. 
> Jim 
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: Wednesday, June 28, 2017 23:06
> To: user@tika.apache.org
> Subject: Re: HTML parsing, script tags, 
> Hi Jim, 
> On Jun 28, 2017, at 12:07am, Jim Idle wrote: 
> So right now it looks the HTML parser only sends through script tags if the 
> hay a src attribute. Is this likely to change or should I use another parser 
> for HTML? I could submit a patch for this of course. 
> You can use a custom mapper if you want to alter which tags get passed 
> through. 
> E.g. check out IdentityHtmlMapper in Tika for a mapper that passes through 
> everything. 
> Also, does anyone have an opinion on whether the underlying TagSoup stuff is 
> tolerant of HTML in a similar manner to browsers (which will try to render 
> anything) or is expecting well-formed HTML? I can go look at the TagSoup 
> stuff directly of course, but just wondered if anyone has experience of 
> using Tika to parse HTML. 
> TagSoup (and JSoup and NekoHTML) are all Java libraries that try to fix up 
> broken HTML, with varying degrees of success, depending on the way that HTML 
> is broken. 
> — Ken 
> -- 
> Ken Krugler 
> 1 530-210-6378 
> http://www.scaleunlimited.com 
> custom big data solutions & training 
> Hadoop, Cascading, Cassandra & Solr 


RE: CRC ContentHandler

2017-02-15 Thread Markus Jelsma
Hello - streaming hash functions are, from a cryptographic point of view, 
generally a bad idea, but if you are just interested in checking data integrity 
it might work for you. You will either have to collect all bits of data and 
hash them in the end, or feed them to a hashing function that allows for 
streaming data. The algorithm is up to you.

But, on the other hand, are the files you receive that large? Does your process 
at some point buffer the entire file? If so, hashing it is easy. I don't know 
if Tika supports ingesting streaming data, but in Apache Nutch we buffer the 
entire file at some point before sending it to Apache Tika; hashing the data 
is, in this case, not a problem.

Markus
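The streaming option described above can be sketched with a plain SAX handler feeding a JDK MessageDigest, which hashes incrementally without buffering the document (the class name is hypothetical; it also assumes the parser does not split a surrogate pair across character chunks):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.xml.sax.helpers.DefaultHandler;

// Feeds character events into a streaming SHA-256 digest; nothing is buffered.
public class DigestHandler extends DefaultHandler {
    private final MessageDigest digest;

    public DigestHandler() {
        try {
            this.digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Hash each text chunk as it streams past.
        digest.update(new String(ch, start, length)
                .getBytes(StandardCharsets.UTF_8));
    }

    /** Hex digest of all text seen so far; call once, after parsing. */
    public String hexDigest() {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

The same idea extends to startElement/endElement if element names should be part of the checksum, as suggested in the reply below.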
 
 
-Original message-
> From:Wshrdryr Corp 
> Sent: Thursday 16th February 2017 0:43
> To: user@tika.apache.org
> Subject: Re: CRC ContentHandler
> 
> Hello Markus,  
> 
> Thanks for replying.  
> 
> I was hoping not to have to buffer entire media files due to size. Is there a 
> way to get the content segment as a stream? The internal buffering of a 
> stream might be more efficient and less prone to spikes.  
> 
> Java is not my native tongue. Ive been able to hack through other API 
> challenges while doing this project. Googling has given me some suspicions 
> but not a clear answer.  
> 
> Cheers. 
> 
> On Wed, Feb 15, 2017 at 3:26 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hello - i don't know if media files even produce SAX events, but if they do 
> you can catch them in your startElement, characters, and endElement methods. 
> I would start collecting element names (qName and/or attribute values) and 
> stuff in the characters method, and append those to a StringBuilder.
 
> 
 
> In the endDocument method you have collected every piece of information the 
> ContentHandler method receives. From thereon you just call 
> toString().hashCode() or whatever hashing algorithm you like on the contents 
> accumulated in your StringBuilder.
 
> 
 
> Regards,
 
> Markus
 
> 
 
> 
 
> 
 
> -Original message-
 
> > From:Wshrdryr Corp mailto:wshrd...@gmail.com>>
 
> > Sent: Wednesday 15th February 2017 23:22
 
> > To: user@tika.apache.org <mailto:user@tika.apache.org>
 
> > Subject: CRC ContentHandler
 
> >
 
> > Hello all, 
 
> >
 
> > I need to write a Tika ContentHandler which will return a CRC and/or hash 
> > of the non-metadata part of media files. 
 
> >
 
> > Can anyone point me in the right direction?
 
> >
 
> > Im new to Tika so please forgive me if this is an obvious question.
 
> >
 
> > TIA for any help.
 
> 


RE: CRC ContentHandler

2017-02-15 Thread Markus Jelsma
Hello - i don't know if media files even produce SAX events, but if they do you 
can catch them in your startElement, characters, and endElement methods. I 
would start collecting element names (qName and/or attribute values) and stuff 
in the characters method, and append those to a StringBuilder.

In the endDocument method you have collected every piece of information the 
ContentHandler method receives. From thereon you just call 
toString().hashCode() or whatever hashing algorithm you like on the contents 
accumulated in your StringBuilder. 

Regards,
Markus

 
 
-Original message-
> From:Wshrdryr Corp 
> Sent: Wednesday 15th February 2017 23:22
> To: user@tika.apache.org
> Subject: CRC ContentHandler
> 
> Hello all,  
> 
> I need to write a Tika ContentHandler which will return a CRC and/or hash of 
> the non-metadata part of media files.  
> 
> Can anyone point me in the right direction? 
> 
> Im new to Tika so please forgive me if this is an obvious question. 
> 
> TIA for any help. 


RE: Memory issues with the Tika Facade

2017-01-03 Thread Markus Jelsma
Hello - you should set Xmx yourself, 100 MB should be ok depending on the size 
of your documents. Finding the optimal Xmx is iterative, as long as no 
OutOfMemory occurs, your Xmx is either too high, or just spot on. If you hit an 
OutOfMemory regardless of Xmx there's probably a leak, but that rarely happens.

Having 8 GB of heap is not a good idea; the JVM can easily eat it all, whether 
it needs it or not.

Markus
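As a concrete illustration (the jar name and heap size are made up, not a recommendation for any particular workload):

```shell
# Cap the heap and iterate on the value until parsing succeeds without OOME.
java -Xmx100m -jar my-parsing-app.jar
```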

-Original message-
> From:Will Jones 
> Sent: Tuesday 3rd January 2017 19:41
> To: user@tika.apache.org
> Subject: Re: Memory issues with the Tika Facade
> 
> Hello Both, 
> 
> Thanks for the reply. Using VisualVM it shows me that 8GB is being reserved 
> (8GB Xmx), the Used memory quickly climbs up to around 6GB and eventually to 
> 8GB at which point the program will crash. If I trigger Garbage Collections 
> it does not save any memory. 
> 
> The files themselves are a mixture of PDF, JPG, and Office. The largest PDF 
> file is 20MB, the largest DOCX is 600KB. I have done some testing and it is 
> the PDF files that cause the issue (only running the JPG and Office files 
> causes no memory problems). 
> 
> I am using Tika 1.14. I had thought by disposing of the Tika facade each loop 
> iteration this would have freed up any memory used by the previous parse 
> (sorry I am a bit new to Java)? 
> 
> Thank you 
> 
> 
> 
> On 3 January 2017 at 17:56, Allison, Timothy B.  <mailto:talli...@mitre.org>> wrote:
> Concur with Markus.
 
> 
 
> Also, what type of files are these?  We know that very large .docx (think 
> "War and Peace") and .pptx can use up a crazy amount of memory.  We've added 
> new experimental parsers to handle those via SAX in trunk (coming in v 1.15), 
> and these parsers decrease memory usage dramatically.
 
> 
 
> 
 
> -Original Message-
 
> From: Markus Jelsma [mailto:markus.jel...@openindex.io 
> <mailto:markus.jel...@openindex.io>]
 
> Sent: Tuesday, January 3, 2017 12:23 PM
 
> To: user@tika.apache.org <mailto:user@tika.apache.org>
 
> Subject: RE: Memory issues with the Tika Facade
 
> 
 
> Hello - what is a large amount of memory, how do you determine it (make sure 
> you look at RES, not VIRT) and what are your JVM settings.
 
> 
 
> It is not uncommon for programs to allocate much memory if the default max 
> heap is used, 2 GB in my case. If your JVM eats too much, limit it by setting 
> Xmx to a lower level.
 
> 
 
> Markus
 
> 
 
> -Original message-
 
> > From:Will Jones mailto:systemdotf...@gmail.com>>
 
> > Sent: Tuesday 3rd January 2017 18:14
 
> > To: user@tika.apache.org <mailto:user@tika.apache.org>
 
> > Subject: Memory issues with the Tika Facade
 
> >
 
> > Hi,
 
> >
 
> > Big fan of what you are doing with Apache Tika. I have been using the Tika 
> > facade to fetch metadata on each file in a directory containing a large 
> > number of files. 
 
> >
 
> > It returns the data I need, but the running process very quickly consumes a 
> > large amount of memory as it proceeds through the files.
 
> >
 
> > What am I doing wrong? I have attached the code required to reproduce my 
> > problem below.
 
> >
 
> >
 
> > public class TikaTest {
 
> >
 
> >     public void tikaProcess(Path filePath) {
 
> >         Tika t = new Tika();
 
> >         try {
 
> >             Metadata metadata = new Metadata();
 
> >
 
> >             String result = t.parse(filePath, metadata).toString();
 
> >         }catch (Exception e){
 
> >             e.printStackTrace();
 
> >         }
 
> >     }
 
> >
 
> >     public static void main(String[] args) {
 
> >         TikaTest tt = new TikaTest();
 
> >         try {
 
> >             Files.list(Paths.get("g:/somedata/")).forEach(
 
> >                     path -> tt.tikaProcess(path)
 
> >             );
 
> >         }catch (Exception e) {
 
> >             e.printStackTrace();
 
> >         }
 
> >     }
 
> > }
 
> 
 
> 


RE: Memory issues with the Tika Facade

2017-01-03 Thread Markus Jelsma
Hello - what is a large amount of memory, how do you determine it (make sure 
you look at RES, not VIRT) and what are your JVM settings.

It is not uncommon for programs to allocate much memory if the default max heap 
is used, 2 GB in my case. If your JVM eats too much, limit it by setting Xmx to 
a lower level.

Markus
 
-Original message-
> From:Will Jones 
> Sent: Tuesday 3rd January 2017 18:14
> To: user@tika.apache.org
> Subject: Memory issues with the Tika Facade
> 
> Hi, 
> 
> Big fan of what you are doing with Apache Tika. I have been using the Tika 
> facade to fetch metadata on each file in a directory containing a large 
> number of files.  
> 
> It returns the data I need, but the running process very quickly consumes a 
> large amount of memory as it proceeds through the files. 
> 
> What am I doing wrong? I have attached the code required to reproduce my 
> problem below. 
> 
> 
> public class TikaTest {
> 
>     public void tikaProcess(Path filePath) {
>         Tika t = new Tika();
>         try {
>             Metadata metadata = new Metadata();
> 
>             String result = t.parse(filePath, metadata).toString();
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
> 
>     public static void main(String[] args) {
>         TikaTest tt = new TikaTest();
>         try {
>             Files.list(Paths.get("g:/somedata/")).forEach(
>                     path -> tt.tikaProcess(path)
>             );
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
> }


RE: Tika-server: shutdown on exceptions (esp. OOME)?

2016-11-04 Thread Markus Jelsma
By the way, if you run Tika embedded in your application and you expect to pass 
it lots of trash - which is usual when crawling the web - it is a good idea to 
launch a single thread for the parse job. Your application can wait for 
completion and if necessary terminate the thread after a timeout period.

Regards,
Markus
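The single-thread-with-timeout pattern described above can be sketched with a JDK ExecutorService (class and method names are hypothetical, not Nutch's actual code):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Runs a parse job on a worker thread and abandons it after a timeout,
// so one pathological document cannot stall the whole crawler.
public class TimedParse {
    /** Returns the job's result, or null if it times out or fails. */
    public static String parseWithTimeout(Callable<String> parseJob, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(parseJob);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck job
            return null;
        } catch (Exception e) {
            return null;         // the job itself failed
        } finally {
            pool.shutdownNow();  // don't leak the worker thread
        }
    }
}
```

Note that Future.cancel(true) only interrupts the worker; a parser that never checks its interrupt flag can keep spinning, which is one reason running tika-server in a separate process is the more robust isolation.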

 
 
-Original message-
> From:Egbert van der Wal 
> Sent: Friday 4th November 2016 9:18
> To: user@tika.apache.org
> Subject: Tika-server: shutdown on exceptions (esp. OOME)?
> 
> Hi,
> 
> In a web crawling application, we're using Tika to parse binary files 
> such as PDF that the crawler encounters to extract text from it.
> 
> However, due to the wide variety of garbage encountered on the internet, 
this isn't always successful, and sometimes Tika throws exceptions due to 
> this. For example the OutOfMemory exception I reported (and should be 
> fixed in the upcoming release): 
> https://issues.apache.org/jira/browse/TIKA-2045
> 
> This used to crash the entire application. I've recently separated this 
> by running Tika-server and sending the documents over HTTP to this 
> server. However, when sending such broken documents, the OutOfMemory 
error is still thrown in the Tika server, but it does not 
> terminate. It keeps running, but will either run *very* slow, doesn't 
> accept new connections or doesn't respond to them. The usual 
> 'undetermined state' after a OOME, I suppose.
> 
> Anyway, I'd like to fix this by having the server check regularly if the 
> server is still running and restart it if necessary. But for that to 
> happen, I need it to shutdown when a OOME occurs.
> 
> Is there anything I can use to make this happen? Do I need to change the 
> code or is there a possibility to configure this using a config file of 
> some sort?
> 
> Thanks!
> 
> Egbert van der Wal
> 


RE: Tika-server: shutdown on exceptions (esp. OOME)?

2016-11-04 Thread Markus Jelsma
Hello - you can set a JVM option (forgot its name) that is triggered when OOM 
occurs. It executes a shell script and passes the JVM's pid as the first 
argument. Your script can then kill the JVM and have systemd restart it, or 
restart it from within that script.

Regards,
Markus
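The option Markus refers to is most likely -XX:OnOutOfMemoryError, which runs an arbitrary command when an OutOfMemoryError is thrown (%p expands to the JVM's pid); Java 8u92 and later also offer -XX:+ExitOnOutOfMemoryError. The jar name below is illustrative:

```shell
# Kill the JVM on OOME and let a supervisor (e.g. systemd Restart=) bring it back.
java -XX:OnOutOfMemoryError="kill -9 %p" -jar tika-server.jar

# Or, on Java 8u92 and later, simply exit on OOME:
java -XX:+ExitOnOutOfMemoryError -jar tika-server.jar
```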

 
 
-Original message-
> From:Egbert van der Wal 
> Sent: Friday 4th November 2016 9:18
> To: user@tika.apache.org
> Subject: Tika-server: shutdown on exceptions (esp. OOME)?
> 
> Hi,
> 
> In a web crawling application, we're using Tika to parse binary files 
> such as PDF that the crawler encounters to extract text from it.
> 
> However, due to the wide variety of garbage encountered on the internet, 
> this isn't always successful, and sometimes Tika throws exceptions due to 
> this. For example the OutOfMemory exception I reported (and should be 
> fixed in the upcoming release): 
> https://issues.apache.org/jira/browse/TIKA-2045
> 
> This used to crash the entire application. I've recently separated this 
> by running Tika-server and sending the documents over HTTP to this 
> server. However, when sending such broken documents, the OutOfMemory 
> error is still thrown in the Tika server, but it does not 
> terminate. It keeps running, but will either run *very* slow, doesn't 
> accept new connections or doesn't respond to them. The usual 
> 'undetermined state' after a OOME, I suppose.
> 
> Anyway, I'd like to fix this by having the server check regularly if the 
> server is still running and restart it if necessary. But for that to 
> happen, I need it to shutdown when a OOME occurs.
> 
> Is there anything I can use to make this happen? Do I need to change the 
> code or is there a possibility to configure this using a config file of 
> some sort?
> 
> Thanks!
> 
> Egbert van der Wal
> 


RE: Code parser?

2016-09-28 Thread Markus Jelsma
Hello Mark,

Would SourceCodeParser suit your needs?
https://tika.apache.org/1.13/api/org/apache/tika/parser/code/SourceCodeParser.html

Regards,
Markus

-Original message-
> From:Mark Kerzner 
> Sent: Wednesday 28th September 2016 7:22
> To: Tika User 
> Subject: Code parser?
> 
> Hi, 
> 
> I want Tika to parse source code files, primarily Java. Is there anything 
> special I need to do? 
> 
> Right now my code seems to recognize Java as Microsoft Office Document. 
> 
> Thank you, 
> Mark 
> 
> Mark Kerzner, SHMsoft ,  
> Book a call with me here 
> Mobile: 713-724-2534
> Skype: mark.kerzner1
>  


RE: script tags in LinkContentHandler

2016-04-06 Thread Markus Jelsma
Hello! Yes, please open a ticket for it.

As for 2, in Nutch, you can instruct the Tika parser to use a different 
HtmlMapper. Use IdentityHtmlMapper! I forgot the property, but look it up in 
TikaParser.java, it is near the bottom. The default mapper is bad indeed if you 
want to grab stuff from normal elements.

M.

 
 
-Original message-
> From:Joseph Naegele 
> Sent: Wednesday 6th April 2016 22:13
> To: user@tika.apache.org
> Subject: RE: script tags in LinkContentHandler
> 
> Great, sounds good. Would you like me to open a ticket?
> 
> With respect to parsing outlinks in Nutch, there's actually two problems:
> 
> 1) 

RE: script tags in LinkContentHandler

2016-04-06 Thread Markus Jelsma
Yes indeed! Script is missing and that's a mistake. See discussion at 
TIKA-1835. We should open a new ticket for it.
Markus

 
 
-Original message-
> From:Ken Krugler 
> Sent: Tuesday 5th April 2016 22:24
> To: user@tika.apache.org
> Subject: Re: script tags in LinkContentHandler
> 
> Hi Joe, 
> I was looking at the version of this file in the (git) 
> Tika-2.0 branch, not the (svn) trunk, and that change isn’t yet in 2.0 - my 
> mistake. 
> I’d rolled in Markus’s patch directly to support these other 
> link types, but I wish I’d remembered the old TIKA-503 discussion, as it 
> would have been better to make that support conditional on using a different 
> constructor, as it’s usually not a good idea to surprise consumers of parse 
> output with new types of data (links). 
> I’ll take this discussion over to TIKA-1835 now. 
> — Ken  
> On Apr 5, 2016, at 12:53pm, Joseph Naegele <jnaeg...@grierforensics.com> wrote: 
> Thanks Ken, 
> I'm confused though. The LinkContentHandler in 1.12 now collects , , 
>  and , since https://issues.apache.org/jira/browse/TIKA-1835. 
> In my opinion, 

RE: [ANNOUNCE] Apache Tika 1.12 release

2016-02-15 Thread Markus Jelsma
Thanks! We'll upgrade Apache Nutch as soon as possible and finally integrate 
Boilerpipe extraction.
Markus
 
-Original message-
> From:Mattmann, Chris A (3980) 
> Sent: Monday 15th February 2016 20:38
> To: user@tika.apache.org
> Subject: Re: [ANNOUNCE] Apache Tika 1.12 release
> 
> Thanks Markus.
> 
> Looks like it is part of the Tika 1.12 tag: https://git.io/vgHNH
> 
> Also it just looks like it was omitted in CHANGES.txt I’ve fixed
> it in the updated version.
> 
> gi[chipotle:~/tmp/tika1.13] mattmann% git commit -m "Record change for
> TIKA-1835."
> [master 542bebc] Record change for TIKA-1835.
>  1 file changed, 3 insertions(+)
> [chipotle:~/tmp/tika1.13] mattmann% git push -u origin master
> Counting objects: 21, done.
> Delta compression using up to 4 threads.
> Compressing objects: 100% (3/3), done.
> Writing objects: 100% (3/3), 420 bytes | 0 bytes/s, done.
> Total 3 (delta 2), reused 0 (delta 0)
> To https://git-wip-us.apache.org/repos/asf/tika.git
>c5b9cb7..542bebc  master -> master
> Branch master set up to track remote branch master from origin.
> [chipotle:~/tmp/tika1.13] mattmann%
> 
> 
> 
> Cheers,
> Chris
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
> 
> 
> 
> 
> 
> -Original Message-
> From: Markus Jelsma 
> Reply-To: "user@tika.apache.org" 
> Date: Monday, February 15, 2016 at 11:29 AM
> To: "user@tika.apache.org" 
> Subject: RE: [ANNOUNCE] Apache Tika 1.12 release
> 
> >Great work! But i think TIKA-1835 is missing, at least in the
> >CHANGES.txt. I can't verify whether it is missing in the source, all
> >mirrors are still 404.
> >
> >Markus
> >
> > 
> > 
> >-Original message-
> >> From:Chris Mattmann 
> >> Sent: Monday 15th February 2016 19:45
> >> To: d...@tika.apache.org
> >> Cc: user@tika.apache.org
> >> Subject: [ANNOUNCE] Apache Tika 1.12 release
> >> 
> >> The Apache Tika project is pleased to announce the release of Apache
> >> Tika 1.12. The release contents have been pushed out to the main
> >> Apache release site and to the Central sync, so the releases should
> >> be available as soon as the mirrors get the syncs.
> >> 
> >> Apache Tika is a toolkit for detecting and extracting metadata and
> >> structured text content from various documents using existing parser
> >> libraries.
> >> 
> >> Apache Tika 1.12 contains a number of improvements and bug fixes.
> >> Details can be found in the changes file:
> >> http://www.apache.org/dist/tika/CHANGES-1.12.txt
> >> <http://www.apache.org/dist/tika/CHANGES-1.12.txt>
> >> 
> >> Apache Tika is available in source form from the following download
> >> page: http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.12-src.zip
> >> <http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.12-src.zip>
> >> 
> >> Apache Tika is also available in binary form or for use using Maven
> >> 2 from the Central Repository:
> >> http://repo1.maven.org/maven2/org/apache/tika/
> >> <http://repo1.maven.org/maven2/org/apache/tika/>
> >> 
> >> In the initial 48 hours, the release may not be available on all
> >> mirrors. When downloading from a mirror site, please remember to
> >> verify the downloads using signatures found on the Apache site:
> >> https://people.apache.org/keys/group/tika.asc
> >> <https://people.apache.org/keys/group/tika.asc>
> >> 
> >> For more information on Apache Tika, visit the project home page:
> >> http://tika.apache.org/ <http://tika.apache.org/>
> >> 
> >> — Chris Mattmann, on behalf of the Apache Tika community
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> 
> 


RE: [ANNOUNCE] Apache Tika 1.12 release

2016-02-15 Thread Markus Jelsma
Great work! But i think TIKA-1835 is missing, at least in the CHANGES.txt. I 
can't verify whether it is missing in the source, all mirrors are still 404.

Markus

 
 
-Original message-
> From:Chris Mattmann 
> Sent: Monday 15th February 2016 19:45
> To: d...@tika.apache.org
> Cc: user@tika.apache.org
> Subject: [ANNOUNCE] Apache Tika 1.12 release
> 
> The Apache Tika project is pleased to announce the release of Apache
> Tika 1.12. The release contents have been pushed out to the main
> Apache release site and to the Central sync, so the releases should
> be available as soon as the mirrors get the syncs.
> 
> Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser
> libraries.
> 
> Apache Tika 1.12 contains a number of improvements and bug fixes.
> Details can be found in the changes file:
> http://www.apache.org/dist/tika/CHANGES-1.12.txt
> 
> 
> Apache Tika is available in source form from the following download
> page: http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.12-src.zip
> 
> 
> Apache Tika is also available in binary form or for use using Maven
> 2 from the Central Repository:
> http://repo1.maven.org/maven2/org/apache/tika/
> 
> 
> In the initial 48 hours, the release may not be available on all
> mirrors. When downloading from a mirror site, please remember to
> verify the downloads using signatures found on the Apache site:
> https://people.apache.org/keys/group/tika.asc
> 
> 
> For more information on Apache Tika, visit the project home page:
> http://tika.apache.org/ 
> 
> — Chris Mattmann, on behalf of the Apache Tika community
> 
> 
> 
> 
> 
> 
> 


RE: [VOTE] Apache Tika 1.12 Release Candidate #1

2016-01-26 Thread Markus Jelsma
+1
 
 
-Original message-
> From:Mattmann, Chris A (3980) 
> Sent: Monday 25th January 2016 20:58
> To: user@tika.apache.org; d...@tika.apache.org
> Subject: [VOTE] Apache Tika 1.12 Release Candidate #1
> 
> Hi Folks,
> 
> A first candidate for the Tika 1.12 release is available at:
> 
>   https://dist.apache.org/repos/dist/dev/tika/
> 
> The release candidate is a zip archive of the sources in:
> https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db24
> 27f9e84bc4ff31e569ae661c
> 
> 
> The SHA1 checksum of the archive is:
> 30e64645af643959841ac3bb3c41f7e64eba7e5f
> 
> In addition, a staged maven repository is available here:
> 
> https://repository.apache.org/content/repositories/orgapachetika-1015/
> 
> 
> Please vote on releasing this package as Apache Tika 1.12.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Tika 1.12
> [ ] -1 Do not release this package because…
> 
> Cheers,
> Chris
> 
> P.S. Of course here is my +1.
> 
> 
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
> 
> 
> 


RE: [DISCUSS] Tika 1.12-rc1 (was Re: New Tika release)

2016-01-21 Thread Markus Jelsma
Chris - that would be awesome! Nutch 1.12 can then bundle Tika 1.12!
Markus
 
 
-Original message-
> From:Mattmann, Chris A (3980) 
> Sent: Thursday 21st January 2016 21:30
> To: user@tika.apache.org
> Subject: [DISCUSS] Tika 1.12-rc1 (was Re: New Tika release)
> 
> Fine by me. I can cut a 1.12-rc1 this weekend.
> 
> If I don’t hear objections from the other devs, I’ll go for it
> on Friday. Also this will be the first Git release, so should
> be fun! :)
> 
> Cheers,
> Chris
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++
> 
> -Original Message-
> From: Markus Jelsma 
> Reply-To: "user@tika.apache.org" 
> Date: Thursday, January 21, 2016 at 12:27 PM
> To: "user@tika.apache.org" 
> Subject: New Tika release
> 
> >Hello PMC,
> >
> >With TIKA-1835 committed Apache Nutch can finally fully support text and
> >link extraction via Boilerpipe, something many Nutch users (myself not
> >included) have been looking forward too for the last few years. We, as
> >Nutch PMC, cannot release Nutch with that support without Tika so our
> >users must wait until this is resolved and available. I do not want to
> >put additional burden to a Tika release manager or whatever, but i do
> >want to kindly beg the Tika PMC to discuss a possible early release of a
> >new Apache Tika.
> >
> >Please let me know what you think.
> >
> >Regards,
> >Markus
> 
> 


New Tika release

2016-01-21 Thread Markus Jelsma
Hello PMC,

With TIKA-1835 committed Apache Nutch can finally fully support text and link 
extraction via Boilerpipe, something many Nutch users (myself not included) 
have been looking forward to for the last few years. We, as Nutch PMC, cannot 
release Nutch with that support without Tika, so our users must wait until this 
is resolved and available. I do not want to put an additional burden on a Tika 
release manager or whatever, but i do want to kindly beg the Tika PMC to 
discuss a possible early release of a new Apache Tika.

Please let me know what you think.

Regards,
Markus


RE: [VOTE] Apache Tika 1.7 Release

2015-01-06 Thread Markus Jelsma
+1

 
 
-Original message-
> From:Sergey Beryozkin 
> Sent: Tuesday 6th January 2015 9:36
> To: user@tika.apache.org
> Subject: Re: [VOTE] Apache Tika 1.7 Release
> 
> +1
> Sergey
> On 06/01/15 09:59, Tyler Palsulich wrote:
> > Hi All,
> >
> > A candidate for the Tika 1.7 release is available at:
> > https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> > http://svn.apache.org/repos/asf/tika/tags/1.7-rc2/
> >
> > The SHA1 checksum of the archive is
> >  0307a8367ae6f8b1103824fd11337fd89e24e6a4.
> >
> > In addition, a staged maven repository is available here:
> > https://repository.apache.org/content/repositories/orgapachetika-1006/org/apache/tika/
> >
> > Please vote on releasing this package as Apache Tika 1.7.
> >
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> >  [ ] +1 Release this package as Apache Tika 1.7
> >  [ ] -1 Do not release this package because...
> >
> > Thanks!
> > Tyler
> >
> > P.S. Count this as my +1!
> 
> 


RE: [VOTE] Apache Tika 1.5 RC1

2014-02-05 Thread Markus Jelsma
+1

-Original message-
From: David Meikle
Sent: Wednesday 5th February 2014 2:59
To: d...@tika.apache.org
Cc: user@tika.apache.org
Subject: [VOTE] Apache Tika 1.5 RC1

Hi Guys,

A candidate for the Tika 1.5 release is now available at:

http://people.apache.org/~dmeikle/tika-1.5-rc1/ 


The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/tika/tags/1.5-rc1/ 


The SHA1 checksum of the archive is:

66adb7e73058da73a055a823bd61af48129c1179

A staged M2 repository can also be found on repository.apache.org 
here:

https://repository.apache.org/content/repositories/orgapachetika-1000 


Please vote on releasing this package as Apache Tika 1.5.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.5
[ ] -1 Do not release this package because...

Here is my +1 for the release.

Cheers,
Dave




RE: Script element not reported in custom handler

2013-10-09 Thread Markus Jelsma
All right, I removed the HTML_SCHEMA stuff and, inside HtmlHandler, added an 
exception for script in startElement:

} else if ("SCRIPT".equals(name)) {
startElementWithSafeAttributes("script", atts);
}

This goes well: the element is reported, but not its characters. To get those 
reported as well, I removed the bodyLevel check from characters():

if (bodyLevel > 0 && discardLevel == 0) {
super.characters(ch, start, length);
}

etc etc. This obviously breaks some unit tests:

  testElementOrdering(org.apache.tika.parser.html.HtmlParserTest)
  testBrokenFrameset(org.apache.tika.parser.html.HtmlParserTest)
  testBoilerplateDelegation(org.apache.tika.parser.html.HtmlParserTest)
  testLinkHrefResolution(org.apache.tika.parser.html.HtmlParserTest)
  testNewlineAndIndent(org.apache.tika.parser.html.HtmlParserTest)

Now, this is clearly not the right approach. I assume the best thing is to 
treat script similarly to bodyLevel and titleLevel? Add some scriptLevel 
counter and move on if we're inside a script? 
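That scriptLevel idea can be sketched with the plain JDK SAX API (no Tika classes; the class name and the test document are mine, purely illustrative — Tika's HtmlHandler would do this internally):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ScriptLevelDemo {

    // Tracks <script> nesting the same way HtmlHandler tracks bodyLevel,
    // so script text can be routed separately instead of being dropped.
    public static class Handler extends DefaultHandler {
        int scriptLevel = 0;
        final StringBuilder scriptText = new StringBuilder();
        final StringBuilder bodyText = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String name, Attributes atts) {
            if ("script".equalsIgnoreCase(name)) scriptLevel++;
        }

        @Override
        public void endElement(String uri, String local, String name) {
            if ("script".equalsIgnoreCase(name)) scriptLevel--;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Route characters depending on whether we are inside a script.
            (scriptLevel > 0 ? scriptText : bodyText).append(ch, start, length);
        }
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><script>var x = 1;</script><p>hello</p></body></html>";
        Handler h = new Handler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")), h);
        System.out.println("script=[" + h.scriptText + "] body=[" + h.bodyText.toString().trim() + "]");
    }
}
```

With a counter instead of a boolean, nested or malformed script elements still balance out, which is the same reason HtmlHandler uses bodyLevel rather than a flag.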
 
-Original message-
> From:Markus Jelsma 
> Sent: Wednesday 9th October 2013 11:25
> To: user@tika.apache.org
> Subject: Script element not reported in custom handler
> 
> Hi,
> 
> I'm building a new ContentHandler that needs to do some work on script 
> elements as well. But they are not reported in my startElement method. The 
> context has the IdentityHtmlMapper set and script does not get discarded in 
> Tika's own HtmlHandler. Instead, the script element is reported in 
> HtmlHandler but not in my custom handler.
> 
> The confusing thing is that i am able to get it in my handler when adding the 
> script element to TagSoup inside HtmlParser's constructor:
> HTML_SCHEMA.elementType("script", HTMLSchema.M_EMPTY, 65535, 0);
> 
> Without this, script and its characters are only reported inside 
> HtmlHandler, never in custom handlers.
> 
> I must be doing something wrong here, any hints?
> 
> Thanks,
> Markus
> 


Script element not reported in custom handler

2013-10-09 Thread Markus Jelsma
Hi,

I'm building a new ContentHandler that needs to do some work on script elements 
as well. But they are not reported in my startElement method. The context has 
the IdentityHtmlMapper set and script does not get discarded in Tika's own 
HtmlHandler. Instead, the script element is reported in HtmlHandler but not in 
my custom handler.

The confusing thing is that i am able to get it in my handler when adding the 
script element to TagSoup inside HtmlParser's constructor:
HTML_SCHEMA.elementType("script", HTMLSchema.M_EMPTY, 65535, 0);

Without this, script and its characters are only reported inside HtmlHandler, 
never in custom handlers.

I must be doing something wrong here, any hints?

Thanks,
Markus


RE: [VOTE] Apache TIka 1.4 Release Candidate #2

2013-06-18 Thread Markus Jelsma
+1, I'm already very happy with trunk and 1.4.

 
 
-Original message-
> From:Chris Mattmann 
> Sent: Tue 18-Jun-2013 06:47
> To: d...@tika.apache.org
> Cc: user@tika.apache.org
> Subject: Re: [VOTE] Apache TIka 1.4 Release Candidate #2
> 
> Hey Guys,
> 
> Just FYI on this, the VOTE is still going if folks have a
> chance to review, would appreciate it. So far, we've got
> 1 binding +1. :)
> 
> Cheers,
> Chris
> 
> 
> 
> -Original Message-
> From: jpluser 
> Reply-To: "d...@tika.apache.org" 
> Date: Sunday, June 16, 2013 11:06 AM
> To: "d...@tika.apache.org" 
> Cc: "user@tika.apache.org" 
> Subject: [VOTE] Apache TIka 1.4 Release Candidate #2
> 
> >Hi Guys,
> >
> >A second candidate for the Tika 1.4 release is available at:
> >
> >http://people.apache.org/~mattmann/apache-tika-1.4/rc2/
> >
> >The release candidate is a zip archive of the sources in:
> >
> >http://svn.apache.org/repos/asf/tika/tags/1.4-rc2/
> >
> >The SHA1 checksum of the archive is
> >84ce9ebc104ca348a3cd8e95ec31a96169548c13
> >
> >A staged M2 repository can also be found on repository.apache.org here:
> >
> >https://repository.apache.org/content/repositories/orgapachetika-022/
> >
> >
> >Please vote on releasing this package as Apache Tika 1.4.
> >The vote is open for the next 72 hours and passes if a majority of at
> >least three +1 Tika PMC votes are cast.
> >
> >[ ] +1 Release this package as Apache Tika 1.4
> >[ ] -1 Do not release this package because...
> >
> >Here is my +1 for the release.
> >
> >Cheers,
> >Chris
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> 
> 
> 


IdentityHtmlMapper not used by Boilerpipe?

2013-03-01 Thread Markus Jelsma
Hi,

We need div elements returned when we pass the stream through Boilerpipe from 
Nutch. We enable includeMarkup to get markup returned in the first place, but 
divs are not returned. In the ParseContext we set context.set(HtmlMapper.class, 
IdentityHtmlMapper.INSTANCE) but this is not honored for some reason.

For some reason DefaultHtmlMapper is being used in the background; we know this 
because we do get divs returned if we add DIV,div to the SAFE_ELEMENTS Map. 
This is not very good because we prefer not to modify this parser class and 
because the unit test 
testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) fails if the div is 
added to the DefaultHtmlMapper.SAFE_ELEMENTS.

Any ideas on how we can force the IdentityHtmlMapper to be used instead?

Thanks,
Markus


RE: Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

2013-02-12 Thread Markus Jelsma
Hi

Can you try Tika 1.3? It upgraded PDFBox from 1.7.0 to 1.7.1 and that fixed 
many issues with PDF parsing.

Cheers,
 
 
-Original message-
> From:Phani Kumar Samudrala 
> Sent: Tue 12-Feb-2013 11:30
> To: user@tika.apache.org
> Subject: Tika 1.2 PDF parse error  -  org.apache.pdfbox.cos.COSString cannot 
> be cast to org.apache.pdfbox.cos.COSDictionary
> 
> 
> I am using Tika 1.2 JAVA API to extract text from a PDF, I am getting the 
> following exception. I am getting this error for some PDF documents only and 
> for some PDFs it is working fine. I couldn't figure it out a reason for this. 
> When I tried using Tika 1.1 it works fine. Please let me if any of you have 
> seen this error and how to fix this?
> 
> Here is the exception:
> 
> 
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
> org.apache.tika.parser.pdf.PDFParser@1fbfd6
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at com.pc.TikaWithIndexing.main(TikaWithIndexing.java:53)
> Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString 
> cannot be cast to org.apache.pdfbox.cos.COSDictionary
>   at 
> org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
>   at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
>   at 
> org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
>   at 
> org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>   at 
> org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   ... 3 more
> 
> 
> Here is the code snippet in JAVA:
> 
> 
> String fileString = "C:/Bernard A J Am Coll Surg 2009.pdf";
> File file = new File(fileString);
> URL url = file.toURI().toURL();
> 
> ParseContext context = new ParseContext();
> Detector detector = new DefaultDetector();
> Parser parser = new AutoDetectParser(detector);
> Metadata metadata = new Metadata();
> context.set(Parser.class, parser); // ppt, word, xlsx, pdf, html
> ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
> InputStream input = TikaInputStream.get(url, metadata);
> ContentHandler handler = new BodyContentHandler(outputstream);
> parser.parse(input, handler, metadata, context);
> 
> input.close();
> outputstream.close();
> 
> 
> Thanks
> 
> 
> 
> 
> Disclaimer: This transmission, including attachments, is confidential, 
> proprietary, and may be privileged. It is intended solely for the intended 
> recipient. If you are not the intended recipient, you have received this 
> transmission in error and you are hereby advised that any review, disclosure, 
> copying, distribution, or use of this transmission, or any of the information 
> included therein, is unauthorized and strictly prohibited. If you have 
> received this transmission in error, please immediately notify the sender by 
> reply and permanently delete all copies of this transmission and its 
> attachments.
> 
> 
> 
> 
> 
> Disclaimer: This transmission, including attachments, is confidential, 
> proprietary, and may be privileged. It is intended solely for the intended 
> recipient. If you are not the intended recipient, you have received this 
> transmission in error and you are hereby advised that any review, disclosure, 
> copying, distribution, or use of this transmission, or any of the information 
> included therein, is unauthorized and strictly prohibited. If you have 
> received this transmission in error, please immediately notify the sender by 
> reply and permanently delete all copies of this transmission and its 
> attachments.
> 
> 


RE: Tika Test File repository

2013-01-21 Thread Markus Jelsma
Hi Andrew,

Each parser in the tika-parsers module has one or more unit tests attached. 
Just build the tika-parsers tree and test with $ mvn test. Test files are 
included in the source tree.

Cheers,
Markus
 
 
-Original message-
> From:Freeman, Andrew [USA] 
> Sent: Mon 21-Jan-2013 15:34
> To: user@tika.apache.org
> Subject: Tika Test File repository
> 
> Hello,
>   I am new to Tika and was wondering if anyone maintains a repository of 
> sample files to exercise ALL the datatypes that Tika can detect?
> 
> Andrew S. Freeman
> 
> BAH
> 
> 


RE: Meta tag in body, what does Tika do to them?

2012-08-30 Thread Markus Jelsma
It actually seems legitimate (or about to become so) to have it in the body, 
but it would break existing tests. Should the tests be updated if the meta and 
link tags are going to be allowed in the body?

http://dev.w3.org/html5/md/#content-models
If the itemprop attribute is present on link or meta, they are flow content and 
phrasing content. The link and meta elements may be used where phrasing content 
is expected if the itemprop attribute is present.
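A tiny sketch of the quoted rule as a predicate (the class and method names are mine, not Tika's): meta and link only count as phrasing content, and thus belong in the body, when itemprop is present:

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

public class MicrodataFlowContent {

    // HTML5 microdata rule quoted above: <meta> and <link> are phrasing
    // content (allowed where body text is expected) only with itemprop.
    public static boolean allowedInBody(String name, Attributes atts) {
        if (!"meta".equalsIgnoreCase(name) && !"link".equalsIgnoreCase(name)) {
            return true; // other elements are not restricted by this rule
        }
        return atts.getValue("itemprop") != null;
    }

    public static void main(String[] args) {
        AttributesImpl priceMeta = new AttributesImpl();
        priceMeta.addAttribute("", "itemprop", "itemprop", "CDATA", "price");
        AttributesImpl plainMeta = new AttributesImpl();

        System.out.println(allowedInBody("meta", priceMeta)); // true
        System.out.println(allowedInBody("meta", plainMeta)); // false
        System.out.println(allowedInBody("span", plainMeta)); // true
    }
}
```

A check like this could decide, per element, whether to relax the HEAD-only mapping instead of moving META wholesale as in the schema experiment below.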

 
 
-Original message-
> From:Markus Jelsma 
> Sent: Thu 30-Aug-2012 18:59
> To: user@tika.apache.org
> Subject: RE: Meta tag in body, what does Tika do to them?
> 
> Apparently meta tags get thrown out because they are mapped to the HEAD group 
> in html.tssl in TagSoup. If I replace the META element in the schema with a 
> group of 255 (belongs to anything), the unit test passes but other HtmlParser 
> tests fail. Although bad practice, we still want to support this kind of 
> microdata element.
> 
> We could try to expose the schema so that external code can override TagSoup's 
> schema elementTypes. This would allow my unit test and others to pass.
> 
> Any advice on what to do next? 
> 
> junit.framework.ComparisonFailure: 
> expected: but was:
> at junit.framework.Assert.assertEquals(Assert.java:85)
> at junit.framework.Assert.assertEquals(Assert.java:91)
> at 
> org.apache.tika.parser.html.HtmlParserTest.testParseAscii(HtmlParserTest.java:81)
>  
> -Original message-
> > From:Markus Jelsma 
> > Sent: Tue 28-Aug-2012 14:48
> > To: user@tika.apache.org
> > Subject: Meta tag in body, what does Tika do to them?
> > 
> > Hi,
> > 
> > We're testing TIKA-980 (MicrodataContentHandler for Apache Tika) and a lot 
> > of URL's work out just fine if microdata is implemented properly.  But 
> > we're also seeing a lot of webmasters putting meta tags with microdata 
> > properties right in the body! They apparently read Google's webmaster page 
> > [1] about invisible microdata and went along adding meta tags to the body 
> > as if it's normal practice.
> > 
> > Whenever the webmaster has for example:
> > 
> > 
> > 17.50
> > 
> > ..the MicrodataContentHandler trips over it and cannot assign price to an 
> > itemscope because the DOM seems to become reordered/normalized,  even when 
> > i (in a test) properly close the meta tag. What does Tika do to meta tags 
> > in the content when using the IdentityHtmlMapper? How can we read the meta 
> > tag as if it's just another tag? Is there some switch or setting i've 
> > missed?
> > 
> > [1]: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=146750
> > 
> > Thanks,
> > Markus
> > 
> 


RE: Meta tag in body, what does Tika do to them?

2012-08-30 Thread Markus Jelsma
Apparently meta tags get thrown out because they are mapped to the HEAD group 
in html.tssl in TagSoup. If I replace the META element in the schema with a 
group of 255 (belongs to anything), the unit test passes but other HtmlParser 
tests fail. Although bad practice, we still want to support this kind of 
microdata element.

We could try to expose the schema so that external code can override TagSoup's 
schema elementTypes. This would allow my unit test and others to pass.

Any advice on what to do next? 

junit.framework.ComparisonFailure: 
expected: but was:
at junit.framework.Assert.assertEquals(Assert.java:85)
at junit.framework.Assert.assertEquals(Assert.java:91)
at 
org.apache.tika.parser.html.HtmlParserTest.testParseAscii(HtmlParserTest.java:81)
 
-Original message-
> From:Markus Jelsma 
> Sent: Tue 28-Aug-2012 14:48
> To: user@tika.apache.org
> Subject: Meta tag in body, what does Tika do to them?
> 
> Hi,
> 
> We're testing TIKA-980 (MicrodataContentHandler for Apache Tika) and a lot of 
> URL's work out just fine if microdata is implemented properly.  But we're 
> also seeing a lot of webmasters putting meta tags with microdata properties 
> right in the body! They apparently read Google's webmaster page [1] about 
> invisible microdata and went along adding meta tags to the body as if it's 
> normal practice.
> 
> Whenever the webmaster has for example:
> 
> 
> 17.50
> 
> ..the MicrodataContentHandler trips over it and cannot assign price to an 
> itemscope because the DOM seems to become reordered/normalized,  even when i 
> (in a test) properly close the meta tag. What does Tika do to meta tags in 
> the content when using the IdentityHtmlMapper? How can we read the meta tag 
> as if it's just another tag? Is there some switch or setting i've missed?
> 
> [1]: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=146750
> 
> Thanks,
> Markus
> 


RE: Article and section tags

2012-08-30 Thread Markus Jelsma
Jira issue:
https://issues.apache.org/jira/browse/TIKA-985 
 
-Original message-
> From:Jukka Zitting 
> Sent: Thu 30-Aug-2012 14:09
> To: user@tika.apache.org
> Subject: Re: Article and section tags
> 
> Hi,
> 
> On Thu, Aug 30, 2012 at 2:05 PM, Markus Jelsma
>  wrote:
> > The issue is with TagSoup's schema where some HTML5 elements are missing.
> > I fixed it for now by adding some elements to the schema in the (newly 
> > added)
> > constructor of Tika's HtmlParser.
> 
> Looks like a reasonable workaround. Can you file a TIKA issue for this
> and attach a patch with your changes?
> 
> > I used 255 as memberOf value because the group constants are not defined in
> > the schema and i couldn't find their integer repr. in the html.tssl file in 
> > TagSoup.
> > This is not a very elegant solution so how should it be solved?
> 
> I think the ideal solution would be to have these changes included
> directly in TagSoup.
> 
> BR,
> 
> Jukka Zitting
> 


RE: Article and section tags

2012-08-30 Thread Markus Jelsma
The issue is with TagSoup's schema where some HTML5 elements are missing. I 
fixed it for now by adding some elements to the schema in the (newly added) 
constructor of Tika's HtmlParser.

public HtmlParser() {
super();

// Add some HTML5 elements
HTML_SCHEMA.elementType("section", HTMLSchema.M_ANY, 255, 0);
HTML_SCHEMA.elementType("article", HTMLSchema.M_ANY, 255, 0);
HTML_SCHEMA.elementType("time", HTMLSchema.M_ANY, 255, 0);
}

I used 255 as the memberOf value because the group constants are not defined in 
the schema and I couldn't find their integer representation in the html.tssl 
file in TagSoup. This is not a very elegant solution, so how should it be 
solved? Having these elements returned is very important for the 
MicrodataContentHandler as many websites that implement microdata use it on 
HTML5 elements, so the underlying parser must not throw them away.

Thanks,
Markus 
 
-Original message-
> From:Markus Jelsma 
> Sent: Wed 29-Aug-2012 14:35
> To: user@tika.apache.org
> Subject: RE: Article and section tags
> 
> I checked TagSoup's properties [1] and tried disabling the ignoreBogonsFeature 
> that was introduced with TIKA-599. My unit test using the new HTML5 element 
> now passes correctly. However, I cannot build Tika because 
> TestChmExtraction fails [3] and TestChmExtractor runs indefinitely, so I have 
> to terminate the build.
> 
> It seems that TagSoup treats the <article> and <section> elements as 
> unknown elements, but for some reason it does allow some other HTML5 
> elements. What can I do? Is this an issue that should be 
> solved in TagSoup (how)? Should we make the ignoreBogonsFeature configurable 
> via ParseContext? Other clever ideas?
> 
> Thanks!
> 
> [1]: http://mercury.ccil.org/~cowan/XML/tagsoup/#properties
> [2]: https://issues.apache.org/jira/browse/TIKA-599
> [3]: Running org.apache.tika.parser.chm.TestChmExtraction
> java.lang.NullPointerException
> at org.ccil.cowan.tagsoup.Element.(Element.java:39)
> at org.ccil.cowan.tagsoup.Parser.gi(Parser.java:970)
> at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:561)
> at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:104)
> at 
> org.apache.tika.parser.chm.CHMDocumentInformation.extract(CHMDocumentInformation.java:163)
> at 
> org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:74)
> at 
> org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
> at 
> org.apache.tika.parser.chm.TestChmExtraction$1.run(TestChmExtraction.java:58)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> java.lang.NullPointerException
> at org.ccil.cowan.tagsoup.Element.(Element.java:39)
> at org.ccil.cowan.tagsoup.Parser.gi(Parser.java:970)
> at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:561)
> at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:104)
> at 
> org.apache.tika.parser.chm.CHMDocumentInformation.extract(CHMDocumentInformation.java:163)
> at 
> org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:74)
> at 
> org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
> at 
> org.apache.tika.parser.chm.TestChmExtraction$1.run(TestChmExtraction.java:58)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> 
>  
> -Original message-
> > From:Markus Jelsma 
> > Sent: Wed 29-Aug-2012 13:40
> > To: user@tika.apache.org
> > Subject: Article and section tags
> > 
> > Hi,
> > 
> > I'm still testing internet pages for TIKA-980 and to my surprise it cannot 
> > deal with <article> and <section> tags. Whenever I print the tag's name in 
> > startElement I never see those elements, and therefore I cannot extract 
> > microdata. Where are those elements going? How can I get them? I use the 
> > IdentityHtmlMapper in the unit test.
> > 
> > Thanks,
> > Markus
> > 
> 


RE: Article and section tags

2012-08-29 Thread Markus Jelsma
I checked TagSoup's properties [1] and tried disabling the ignoreBogonsFeature 
that was introduced with TIKA-599. My unit test using the new HTML5 element now 
passes correctly. However, I cannot build Tika because TestChmExtraction fails 
[3] and TestChmExtractor runs indefinitely, so I have to terminate the build.

It seems that TagSoup treats the <article> and <section> elements as unknown 
elements, but for some reason it does allow some other HTML5 elements. What can 
I do? Is this an issue that should be solved in TagSoup (how)? Should we make 
the ignoreBogonsFeature configurable via ParseContext? Other clever ideas?

Thanks!

[1]: http://mercury.ccil.org/~cowan/XML/tagsoup/#properties
[2]: https://issues.apache.org/jira/browse/TIKA-599
[3]: Running org.apache.tika.parser.chm.TestChmExtraction
java.lang.NullPointerException
at org.ccil.cowan.tagsoup.Element.(Element.java:39)
at org.ccil.cowan.tagsoup.Parser.gi(Parser.java:970)
at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:561)
at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:104)
at 
org.apache.tika.parser.chm.CHMDocumentInformation.extract(CHMDocumentInformation.java:163)
at 
org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:74)
at 
org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
at 
org.apache.tika.parser.chm.TestChmExtraction$1.run(TestChmExtraction.java:58)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
java.lang.NullPointerException
at org.ccil.cowan.tagsoup.Element.(Element.java:39)
at org.ccil.cowan.tagsoup.Parser.gi(Parser.java:970)
at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:561)
at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:104)
at 
org.apache.tika.parser.chm.CHMDocumentInformation.extract(CHMDocumentInformation.java:163)
at 
org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:74)
at 
org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
at 
org.apache.tika.parser.chm.TestChmExtraction$1.run(TestChmExtraction.java:58)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

 
-Original message-
> From:Markus Jelsma 
> Sent: Wed 29-Aug-2012 13:40
> To: user@tika.apache.org
> Subject: Article and section tags
> 
> Hi,
> 
> I'm still testing internet pages for TIKA-980 and to my surprise it cannot 
> deal with <article> and <section> tags. Whenever I print the tag's name in 
> startElement I never see those elements, and therefore I cannot extract 
> microdata. Where are those elements going? How can I get them? I use the 
> IdentityHtmlMapper in the unit test.
> 
> Thanks,
> Markus
> 


Article and section tags

2012-08-29 Thread Markus Jelsma
Hi,

I'm still testing internet pages for TIKA-980 and to my surprise it cannot deal 
with <article> and <section> tags. Whenever I print the tag's name in 
startElement I never see those elements, and therefore I cannot extract 
microdata. Where are those elements going? How can I get them? I use the 
IdentityHtmlMapper in the unit test.

Thanks,
Markus


Body element has no attributes in startElement()

2012-08-29 Thread Markus Jelsma
Hello,

We have a unit test (TIKA-980) where we want to read the attributes for the 
body element just as we read attributes of all other elements. The body 
element, however, always yields zero attributes! It's very empty.

public void startElement(String uri, String local, String name,
    Attributes attributes) throws SAXException {
  System.out.print(local + ": " + Integer.toString(attributes.getLength()));
  ..
}

The HTML is very simple: <body itemscope itemtype="http://schema.org/WebPage"> 
but it always prints "body: 0". I can read attributes for all other elements 
and building the microdata works well, except when I have attributes in the 
body element. 

Any hints to share?

Thanks,
Markus


Meta tag in body, what does Tika do to them?

2012-08-28 Thread Markus Jelsma
Hi,

We're testing TIKA-980 (MicrodataContentHandler for Apache Tika) and a lot of 
URLs work out just fine if microdata is implemented properly. But we're also 
seeing a lot of webmasters putting meta tags with microdata properties right in 
the body! They apparently read Google's webmaster page [1] about invisible 
microdata and went along adding meta tags to the body as if it's normal 
practice.

Whenever the webmaster has for example:


17.50

..the MicrodataContentHandler trips over it and cannot assign price to an 
itemscope because the DOM seems to become reordered/normalized, even when I 
(in a test) properly close the meta tag. What does Tika do to meta tags in the 
content when using the IdentityHtmlMapper? How can we read the meta tag as if 
it's just another tag? Is there some switch or setting I've missed?

[1]: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=146750

Thanks,
Markus


RE: Logging in Tika

2012-08-27 Thread Markus Jelsma
Alright! I found it!
Thanks

 
 
-Original message-
> From:Jukka Zitting 
> Sent: Mon 27-Aug-2012 13:18
> To: user@tika.apache.org
> Subject: Re: Logging in Tika
> 
> Hi,
> 
> On Mon, Aug 27, 2012 at 12:32 PM, Markus Jelsma
>  wrote:
> > Are there any logging facilities in Tika? I'd like to log some warnings
> > but rather not throw an exception that terminates my parse.
> 
> You can use whatever logging framework you like.
> 
> Tika itself intentionally doesn't use or require any specific logging
> framework, but some of the parser libraries do, so in a typical Tika
> deployment you'd already have at least the Commons Logging, SLF4J and
> JUL interfaces available for logging.
> 
> BR,
> 
> Jukka Zitting
> 


Logging in Tika

2012-08-27 Thread Markus Jelsma
Hi,

Are there any logging facilities in Tika? I'd like to log some warnings but 
would rather not throw an exception that terminates my parse. 

Thanks,
Markus


How does Tika put whitespace between tags

2012-07-19 Thread Markus Jelsma
Hi,

We're having an issue with Boilerpipe and the lack of whitespace between tags 
and terms. The ordinary Tika HTML parser does the job right. Take the following 
HTML for example:

abcdefxyz

becomes without BP: abc def xyz
becomes with BP: abcdefxyz

How does the Tika parser determine when to put whitespace between tags? What 
about languages without whitespace? When testing with ordinary Chinese pages I 
see whitespace being added here too.
Also, any hints as to where to look for the problem in the Boilerpipe code are 
appreciated.
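For illustration only (this is a sketch, not Tika's actual implementation), a text-extracting SAX handler gets the "abc def xyz" behaviour by emitting a separator whenever a block-level element closes; the class name and the block list here are mine:

```java
import java.io.ByteArrayInputStream;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class BlockSeparatorDemo {

    // A few block-level names for the sketch; a real list would be longer.
    static final Set<String> BLOCKS = Set.of("p", "div", "li", "blockquote", "h1", "h2");

    public static class Handler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String name) {
            // Emit a separator when a block-level element closes, so
            // adjacent blocks do not run together as "abcdefxyz".
            if (BLOCKS.contains(name.toLowerCase())) text.append('\n');
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>abc</p><p>def</p><p>xyz</p></body></html>";
        Handler h = new Handler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(html.getBytes("UTF-8")), h);
        System.out.println(h.text.toString().trim().replace('\n', ' '));
    }
}
```

A path that feeds only the character events onward, skipping the element events that trigger these separators, would produce the concatenated "abcdefxyz" described above.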

Thanks,
Markus


RE: Surplus whitespace in outlink anchors not collapsed

2012-07-09 Thread Markus Jelsma
Yes, it makes sense. We'll collapse it in Nutch.
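Collapsing the surplus whitespace on the Nutch side is essentially a one-liner; a minimal sketch (the class and method names are hypothetical, not Nutch code):

```java
public class AnchorNormalizer {

    // Collapse runs of whitespace (spaces, newlines, tabs) in anchor text
    // to a single space, then trim the edges.
    public static String collapse(String anchor) {
        return anchor.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(collapse("i am an anchor \n\t\t bla bla"));
        // -> "i am an anchor bla bla"
    }
}
```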

Thanks
Markus

 
 
-Original message-
> From:Jukka Zitting 
> Sent: Mon 09-Jul-2012 17:17
> To: user@tika.apache.org
> Subject: Re: Surplus whitespace in outlink anchors not collapsed
> 
> Hi,
> 
> On Thu, Jul 5, 2012 at 7:51 PM, Markus Jelsma
>  wrote:
> > Is this a feature of Tika or a bug?
> 
> It's a feature at least until someone comes up with a compelling
> enough rationale why anchor text should be handled differently.
> 
> Note that deciding what to do with cases like "foo <a>bar</a>" or
> "foo<a>bar</a>" can be quite tricky. A client like an indexer that
> simply ignores all markup should ideally see those as "foo bar" and
> "foobar" respectively. It may be difficult to make a parser
> implementation that normalizes whitespace in and around anchors work
> correctly in all such cases.
> 
> > Do we have to remove surplus whitespace in Nutch ourselves?
> 
> I think that's the easiest solution here.
> 
> BR,
> 
> Jukka Zitting
> 


Surplus whitespace in outlink anchors not collapsed

2012-07-05 Thread Markus Jelsma
Hello,

With NUTCH-1233 we are going to rely on Tika for outlink extraction. It works 
nicely except for one small issue: consecutive whitespace in an anchor is not 
collapsed to a single character. The anchor text is identical to the HTML 
source and can have surplus spaces, newlines or tabulators:

<a>i am an anchor \n\t\t bla bla</a> does not become "i am an 
anchor bla bla".

Is this a feature of Tika or a bug? Do we have to remove surplus whitespace in 
Nutch ourselves?

Thanks!
Markus


Re: [VOTE] Apache Tika 1.1 release rc #1

2012-03-08 Thread Markus Jelsma
+1



On Wednesday 07 March 2012 22:35:27 Mattmann, Chris A (388J) wrote:
> Hi Folks,
> 
> A candidate for the Tika 1.1 release is available at:
> 
>   http://people.apache.org/~mattmann/apache-tika-1.1/rc1/
> 
> The release candidate is a zip archive of the sources in:
> 
>http://svn.apache.org/repos/asf/tika/tags/1.1/
> 
> The SHA1 checksum of the archive is
> d3185bb22fa3c7318488838989aff0cc9ee025df.
> 
> Please vote on releasing this package as Apache Tika 1.1.
> The vote is open for at least the next 72 hours and passes if a majority of
> at least three +1 Tika PMC votes are cast.
> 
>[ ] +1 Release this package as Apache Tika 1.1
>[ ] -1 Do not release this package because...
> 
> Thanks!
> 
> Cheers,
> Chris
> 
> P.S. Here's my +1.
> 
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++

-- 
Markus Jelsma - CTO - Openindex


Re: tika-core, tika-parser?

2012-02-08 Thread Markus Jelsma
Thanks! 

And about the deps, it seems Nutch does load them from the local jar caches. 
That solves the earlier confusion.

Cheers

On Wednesday 08 February 2012 15:26:42 Nick Burch wrote:
> On Wed, 8 Feb 2012, Markus Jelsma wrote:
> > When i start commenting out the parsers listed Tika won't build anymore
> > as a lot of unit tests begin to fail. Is this supposed to happen?
> 
> Yup, that's to be expected. If you remove a parser from that list, it
> won't show up in things like the AutoDetectParser, so many of the tests
> won't be able to find it and will then fail. (Tests that use the parser
> explicitly will still work, but most use AutoDetectParser or DefaultParser
> and won't)
> 
> I still think your best bet is probably to just include all of tika (core,
> parsers and dependencies) and go from there. If you do want to disable a
> few parsers, the simplest way is likely to build the full jar, then unpack
> and edit the services file
> 
> Nick


Re: tika-core, tika-parser?

2012-02-08 Thread Markus Jelsma


On Wednesday 08 February 2012 13:47:10 Nick Burch wrote:
> On Wed, 8 Feb 2012, Markus Jelsma wrote:
> > Interesting. We build Tika with Maven and copy the core jar to our Nutch
> > libs. That is the only Tika jar Nutch has, there are no parser libs
> > anywhere in Nutch but parsing works.
> 
> There must be other ones elsewhere, otherwise you wouldn't have any
> parsers!
> 
> I can only suggest you try getting a parser object, fetch its classloader
> and print the resource location of the class - that'll tell you where it
> came from
> 
> >> You've got a Tika parsers config file that says that the DWG parser is
> >> present, but you haven't included it. You should either include all the
> >> tika parsers, or not include the default
> >> org.apache.tika.parsers.Parsers config file that lists them

When i start commenting out the parsers listed Tika won't build anymore as a 
lot of unit tests begin to fail. Is this supposed to happen?

> > 
> > Hmm, i've been looking everywhere but i don't seem to find a config file
> > in either Nutch or Tika. Should it be included in Tika when i build it
> > through Maven as usual?
> 
> You're looking for occurences of
> META-INF/services/org.apache.tika.parser.Parser in your jars, by default
> it's in the tika-parsers jar, and in the jars of any third party parsers
> you have auto-loading
> 
> Nick

-- 
Markus Jelsma - CTO - Openindex
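Nick's trick above, fetching a class's classloader and printing the class resource location, can be sketched with JDK-only code (shown with java.lang.String so it runs standalone; to find which jar supplies a parser you would pass e.g. org.apache.tika.parser.html.HtmlParser instead):

```java
public class WhereFrom {
    // Print where a class was loaded from, to identify the jar supplying it.
    static String locate(Class<?> c) {
        String res = c.getName().replace('.', '/') + ".class";
        java.net.URL url = c.getClassLoader() == null
                ? ClassLoader.getSystemResource(res)   // bootstrap-loaded class
                : c.getClassLoader().getResource(res);
        return String.valueOf(url);
    }

    public static void main(String[] args) {
        System.out.println(locate(String.class));
        System.out.println(locate(WhereFrom.class));
    }
}
```

The printed URL (a jar:, file:, or jrt: location) pinpoints the artifact a class really came from, settling "which jar is parsing actually using" questions like the one in this thread.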


Re: tika-core, tika-parser?

2012-02-08 Thread Markus Jelsma


On Wednesday 08 February 2012 13:47:10 Nick Burch wrote:
> On Wed, 8 Feb 2012, Markus Jelsma wrote:
> > Interesting. We build Tika with Maven and copy the core jar to our Nutch
> > libs. That is the only Tika jar Nutch has, there are no parser libs
> > anywhere in Nutch but parsing works.
> 
> There must be other ones elsewhere, otherwise you wouldn't have any
> parsers!
> 
> I can only suggest you try getting a parser object, fetch its classloader
> and print the resource location of the class - that'll tell you where it
> came from
> 

I'll check the Nutch dev list about this. I must be missing things here.

> >> You've got a Tika parsers config file that says that the DWG parser is
> >> present, but you haven't included it. You should either include all the
> >> tika parsers, or not include the default
> >> org.apache.tika.parsers.Parsers config file that lists them
> > 
> > Hmm, i've been looking everywhere but i don't seem to find a config file
> > in either Nutch or Tika. Should it be included in Tika when i build it
> > through Maven as usual?
> 
> You're looking for occurences of
> META-INF/services/org.apache.tika.parser.Parser in your jars, by default
> it's in the tika-parsers jar, and in the jars of any third party parsers
> you have auto-loading

Yes, I found it! I'm now uncommenting parsers by trial and error to get the thing
working, as I don't know which parsers we have ;)
src/main/resources/META-INF/services/org.apache.tika.parser.Parser

Thanks!
> 
> Nick



Re: tika-core, tika-parser?

2012-02-08 Thread Markus Jelsma
Hi,

On Wednesday 08 February 2012 13:23:14 you wrote:
> On Wed, 8 Feb 2012, Markus Jelsma wrote:
> > In Nutch we have a copy of Tika-core. But with just that lib we also
> > have access to the Tika.parser API from the other module. How does this
> > all work because i have had confusing results in the past (and now).
> 
> Tika Core comes with the core of Tika, which includes a definition of how
> parsers work, but not any parsers
> 
> All the parsers themselves are in the Tika Parsers module. Most of the
> parsers have dependencies on third party libraries, it's normally
> recommended to use one of Maven or the OSGi Bundle to have these pulled in
> for you

Interesting. We build Tika with Maven and copy the core jar to our Nutch libs. 
That is the only Tika jar Nutch has, there are no parser libs anywhere in 
Nutch but parsing works.
My first guess would be that they are all embedded in the tika-core jar, but as you
say this isn't true. Have you got any idea how this all works, then?

> 
> > Right now we've added a class to org.apache.tika.parser.html but we get a
> > ClassNotFound with a newly compiled Tika. Our code compiles when we add
> > tika- parsers to the classpath, but when we run we get some obscure
> > exception:
> > 
> > Exception in thread "main" java.lang.NoClassDefFoundError: Could not
> > initialize class org.apache.tika.parser.dwg.DWGParser
> > 
> >at java.lang.Class.forName0(Native Method)
> >at java.lang.Class.forName(Class.java:247)
> >at sun.misc.Service$LazyIterator.next(Service.java:271)
> >at org.apache.nutch.parse.tika.TikaConfig.<init>(TikaConfig.java:149)
> >at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)
> 
> You've got a Tika parsers config file that says that the DWG parser is
> present, but you haven't included it. You should either include all the
> tika parsers, or not include the default org.apache.tika.parsers.Parsers
> config file that lists them

Hmm, i've been looking everywhere but i don't seem to find a config file in 
either Nutch or Tika. Should it be included in Tika when i build it through 
Maven as usual?

> 
> Nick

Thanks


tika-core, tika-parser?

2012-02-08 Thread Markus Jelsma
Hi,

In Nutch we have a copy of Tika-core. But with just that lib we also have 
access to the Tika.parser API from the other module. How does this all work? I have 
had confusing results in the past (and now).

Right now we've added a class to org.apache.tika.parser.html but we get a 
ClassNotFound with a newly compiled Tika. Our code compiles when we add tika-
parsers to the classpath, but when we run we get some obscure exception:

Exception in thread "main" java.lang.NoClassDefFoundError: Could not 
initialize class org.apache.tika.parser.dwg.DWGParser
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at sun.misc.Service$LazyIterator.next(Service.java:271)
at org.apache.nutch.parse.tika.TikaConfig.<init>(TikaConfig.java:149)
at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)
at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:255)
at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
at 
org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)

When we previously patched Tika in the core module all went perfectly well, but 
patching the parser module and getting it all compiled into tika-core.jar seems 
tricky. Any advice? What am I missing? How do the parser libs end up in the 
core jar?

Thanks
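The mechanism behind the exception above is the standard Java service-provider pattern: tika-parsers ships a META-INF/services/org.apache.tika.parser.Parser file listing parser classes, which are then loaded reflectively, so a listed-but-missing DWGParser fails at load time. A minimal sketch using a JDK service interface, since it runs standalone (with Tika on the classpath you would load org.apache.tika.parser.Parser instead):

```java
import java.nio.file.spi.FileSystemProvider;
import java.util.ServiceLoader;

public class Discover {
    public static void main(String[] args) {
        // ServiceLoader reads META-INF/services/<interface-name> (or module
        // `provides` directives) and instantiates each listed class. A class
        // that is listed there but absent from the classpath fails at this
        // point -- the NoClassDefFoundError situation from the mail.
        for (FileSystemProvider p : ServiceLoader.load(FileSystemProvider.class)) {
            System.out.println(p.getClass().getName());
        }
    }
}
```

This is why shipping only tika-core.jar while keeping the default services file causes trouble: the file names parsers the classpath cannot supply.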


Re: Using BP ImageExtractor

2012-02-07 Thread Markus Jelsma
Hi,

We build both BP and Tika from trunk for usage in Nutch. However, i am unsure 
how to use BP's ImageExtractor with Tika's API's. BP's API asks for a 
TextDocument object which we don't have. Is there another API i am unaware of 
we can use with Tika's TeeContentHandler? Or do you happen to have some 
example for this?
We can successfully extract images with BP standalone but we need to do it with 
Tika.

Thanks

> I've used BP for this purpose.
> You need to build from trunk.
> 
> --
> Dotan, @jondot <http://twitter.com/jondot>
> 
> On Tue, Feb 7, 2012 at 5:58 PM, Markus Jelsma 
wrote:
> > Hi,
> > 
> > For Apache Nutch we'd like to see if we can use Boilerpipe to extract a
> > meaningful image for a given document. The BP API provides a method to
> > return
> > a set of images for a given TextDocument object and an extractor.
> > 
> > Tika does not return us a TextDocument object after parsing so it seems i
> > cannot use the API with Tika as-is.
> > 
> > Right now Nutch is about to use the TeeContentHandler for retrieving
> > hyperlinks of the whole document plus parsed content by Boilerpipe (this
> > will
> > be committed when we upgrade to Tika 1.1). Is there an easy way to use
> > that ImageExtractor with Tika? If so, how and if not, what can we do?
> > 
> > Thanks


Using BP ImageExtractor

2012-02-07 Thread Markus Jelsma
Hi,

For Apache Nutch we'd like to see if we can use Boilerpipe to extract a 
meaningful image for a given document. The BP API provides a method to return 
a set of images for a given TextDocument object and an extractor.

Tika does not return us a TextDocument object after parsing so it seems i 
cannot use the API with Tika as-is. 

Right now Nutch is about to use the TeeContentHandler for retrieving 
hyperlinks of the whole document plus parsed content by Boilerpipe (this will 
be committed when we upgrade to Tika 1.1). Is there an easy way to use that 
ImageExtractor with Tika? If so, how and if not, what can we do?

Thanks


Re: LinkCH need Link.getMethod() and .getRel()

2011-12-21 Thread Markus Jelsma
Issue with patch. I omitted the method as this applies only to forms and we 
might actually not need it.

https://issues.apache.org/jira/browse/TIKA-824

On Wednesday 21 December 2011 11:56:12 Markus Jelsma wrote:
> Hi,
> 
> For Apache Nutch we require the method and rel of hyperlinks. The rel is,
> of course, used to be polite and not follow rel="nofollow", and the
> method is used to enable us to fetch those outlinks too.
> 
> The combination of TeeCH and LinkCH with some other CH works very well but
> this is some thing i'm stuck with. I've peeked in o.a.t.sax.Link*java and i
> think i could simply add those fields and default to null if not available.
> Would that be the way to go?
> 
> Thanks!

-- 
Markus Jelsma - CTO - Openindex


LinkCH need Link.getMethod() and .getRel()

2011-12-21 Thread Markus Jelsma
Hi,

For Apache Nutch we require the method and rel of hyperlinks. The rel is, of 
course, used to be polite and not follow rel="nofollow", and the method is 
used to enable us to fetch those outlinks too.

The combination of TeeCH and LinkCH with some other CH works very well but 
this is some thing i'm stuck with. I've peeked in o.a.t.sax.Link*java and i 
think i could simply add those fields and default to null if not available. 
Would that be the way to go?

Thanks!


Re: Boilerpipe and getting all URL's

2011-12-20 Thread Markus Jelsma
Excellent! I'll look into it!

Thanks!

> On Tue, Dec 20, 2011 at 9:48 AM, Markus Jelsma
> 
> wrote:
> > Hi,
> > 
> > How can i parse documents with the Boilerpipe content handler and still
> > be able to read all hyperlinks? Right now we parse twice, once to get
> > the text without boilerplate text and once to get all hyperlinks.
> 
> Use the TeeContentHandler and give it your BoilerpipContentHandler and your
> LinkContentHandler. Then use that to pass into the parser.
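The tee pattern suggested above (one parse pass feeding several handlers) can be sketched with plain JDK SAX classes; Tika's TeeContentHandler wraps the same idea, and the class names below are purely illustrative:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TeeDemo {
    // Collects character events into a buffer, standing in here for e.g.
    // a BoilerpipeContentHandler or a LinkContentHandler.
    static class Collector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        @Override public void characters(char[] ch, int start, int len) {
            text.append(ch, start, len);
        }
    }

    public static void main(String[] args) throws Exception {
        Collector links = new Collector();
        Collector body = new Collector();
        // The "tee": forward every SAX event to both downstream handlers,
        // so a single parse produces both results.
        DefaultHandler tee = new DefaultHandler() {
            @Override public void characters(char[] ch, int s, int l)
                    throws SAXException {
                links.characters(ch, s, l);
                body.characters(ch, s, l);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader("<p>hello</p>")), tee);
        System.out.println(links.text + " / " + body.text);  // hello / hello
    }
}
```

With Tika the equivalent is `new TeeContentHandler(boilerpipeHandler, linkHandler)` passed into `parser.parse(...)`, avoiding the double parse.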


Boilerpipe and getting all URL's

2011-12-20 Thread Markus Jelsma
Hi,

How can i parse documents with the Boilerpipe content handler and still be 
able to read all hyperlinks? Right now we parse twice, once to get the text 
without boilerplate text and once to get all hyperlinks.

Any advice?
Thanks
-- 
Markus Jelsma - CTO - Openindex


Re: [ANNOUNCE] Welcome Jerome Charron as Tika committer + PMC member

2011-12-12 Thread Markus Jelsma
cheers!

> Welcome Jerome!
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Mon, Dec 12, 2011 at 1:26 PM, Mattmann, Chris A (388J)
> 
>  wrote:
> > Hi Folks,
> > 
> > Please welcome Jerome Charron to the ranks of the Tika PMC and as a Tika
> > committer. He's just been VOTEd in and we're really happy to have him
> > around.
> > 
> > Jerome, please feel free to say a bit about yourself. Thanks and welcome
> > aboard!
> > 
> > Cheers,
> > Chris
> > 
> > ++
> > Chris Mattmann, Ph.D.
> > Senior Computer Scientist
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 171-266B, Mailstop: 171-246
> > Email: chris.a.mattm...@nasa.gov
> > WWW:   http://sunset.usc.edu/~mattmann/
> > ++
> > Adjunct Assistant Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++


Re: [ANNOUNCE] Apache Tika 1.0 released

2011-11-08 Thread Markus Jelsma
Great stuff!
Cheers!

> (...apologies for the cross posting...)
> 
> The Apache Tika project is pleased to announce the release of Apache Tika
> 1.0. The release contents have been pushed out to the main Apache release
> site and to the Maven Central sync, so the releases should be available as
> soon as the mirrors get the syncs.
> 
> Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser
> libraries.
> 
> Apache Tika 1.0 contains a number of improvements and bug fixes. Details
> can be found in the changes file:
> 
> http://www.apache.org/dist/tika/CHANGES-1.0.txt
> 
> Apache Tika is available in source form from the following download page:
> http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
> 
> Apache Tika is also available in binary form or for use using Maven 2 from
> the Central Maven Repository:
> 
> http://repo1.maven.org/maven2/org/apache/tika/
> 
> In the initial 48 hours, the release may not be available on all mirrors.
> When downloading from a mirror site, please remember to verify the
> downloads using signatures found on the Apache site:
> 
> http://www.apache.org/dist/tika/KEYS
> 
> For more information on Apache Tika, visit the project home page:
> 
> http://tika.apache.org/
> 
> -- Chris Mattmann (on behalf of the Apache Tika community)
> 
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++


Re: Resolving of relative URL's

2011-09-19 Thread Markus Jelsma
Jukka and others,

There are now several cases known to us where we would like to control URL 
resolving. All cases share one similarity: the URLs are relative in the 
original source. How could we instruct the parser, or modify the code, to do so?

Right now we need to come up with regular expressions to detect commonalities 
in URI segments and throw them away.

Thanks

> Hi,
> 
> On Mon, Sep 12, 2011 at 6:00 PM, Markus Jelsma
> 
>  wrote:
> > Yes! Nutch extracts all outlinks but there is a tedious crawler trap
> > regarding self-referring relative URLs. Consider
> > http://example.org/content/ with a list of relative links (menu on each
> > page) of which one or more is actually incorrect:
> > 
> > ../more-content/
> > ../other-content/
> > wrong-link/
> > ../even-more/content/
> > 
> > For pages without base href the wrong-link/ is resolved to
> > http://example.org/content/wrong-link/. The new page also contains the
> > same url list as above so the next wrong link is resolved as
> > http://example.org/content/wrong-link/wrong-link/..
> > 
> > An endless nightmare for a crawler :)
> 
> How would not resolving the links in Tika help in this case? To crawl
> the site, the crawler would in any case have to resolve the links, and
> come up with the exact same resolved URLs.
> 
> BR,
> 
> Jukka Zitting


Re: Resolving of relative URL's

2011-09-12 Thread Markus Jelsma


On Monday 12 September 2011 18:08:50 Jukka Zitting wrote:
> > For pages without base href the wrong-link/ is resolved to
> > http://example.org/content/wrong-link/. The new page also contains the
> > same url list as above so the next wrong link is resolved as
> > http://example.org/content/wrong-link/wrong-link/..
> > 
> > An endless nightmare for a crawler :)
> 
> How would not resolving the links in Tika help in this case? To crawl
> the site, the crawler would in any case have to resolve the links, and
> come up with the exact same resolved URLs.
> 

I could choose not to collect those relative URLs as outlinks. Right now I 
cannot determine whether a URL was originally relative.

> BR,
> 
> Jukka Zitting

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Resolving of relative URL's

2011-09-12 Thread Markus Jelsma
Hi,

On Monday 12 September 2011 17:35:49 Jukka Zitting wrote:
> Hi,
> 
> On Mon, Sep 12, 2011 at 4:58 PM, Markus Jelsma
> 
>  wrote:
> > Since TIKA-287 all relative URL's are resolved to absolutes regardless of
> > the presence of the base element. This is not always desired behaviour.
> 
> Can you describe a use case where that's not the desired behaviour? I
> would assume that a resolved URL is always preferred to an unresolved
> one.

Yes! Nutch extracts all outlinks, but there is a tedious crawler trap regarding 
self-referring relative URLs. Consider http://example.org/content/ with a 
list of relative links (menu on each page) of which one or more is actually 
incorrect:

../more-content/
../other-content/
wrong-link/
../even-more/content/

For pages without base href the wrong-link/ is resolved to 
http://example.org/content/wrong-link/. The new page also contains the same 
url list as above so the next wrong link is resolved as 
http://example.org/content/wrong-link/wrong-link/..

An endless nightmare for a crawler :)

> 
> > Would it be possible to use some setting to instruct the parser not to
> > resolve URL's if the base element doesn't exist or does not have an href
> > attribute with a valid absolute URL?
> 
> Currently Tika looks at the CONTENT_LOCATION and RESOURCE_NAME_KEY
> metadata keys for the default base URL. If neither is present and
> there is no  element, then URLs in the document will
> not be resolved.

Hm, testing with Nutch i see that URL's are always extracted. Seems at least 
one meta data key is present although i'm not too sure. In the Nutch code an 
empty org.apache.tika.metadata.Metadata object is passed to the parse() 
method.

> 
> BR,
> 
> Jukka Zitting

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
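The trap described in this thread is ordinary RFC 3986 reference resolution and can be reproduced with plain java.net.URI (host and path are the illustrative ones from the mail):

```java
import java.net.URI;

public class ResolveTrap {
    public static void main(String[] args) {
        URI base = URI.create("http://example.org/content/");
        // A self-referring relative link resolves one level deeper on every
        // hop, so each fetched page yields a brand-new URL for the crawler.
        URI hop1 = base.resolve("wrong-link/");
        URI hop2 = hop1.resolve("wrong-link/");
        System.out.println(hop1);  // http://example.org/content/wrong-link/
        System.out.println(hop2);  // http://example.org/content/wrong-link/wrong-link/
    }
}
```

Any resolver, Tika's or the crawler's own, produces the same chain, which is why the crawler side still needs a depth or repetition limit.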


Resolving of relative URL's

2011-09-12 Thread Markus Jelsma
Hi,

Since TIKA-287 all relative URL's are resolved to absolutes regardless of the 
presence of the base element. This is not always desired behaviour.

Would it be possible to use some setting to instruct the parser not to resolve 
URL's if the base element doesn't exist or does not have an href attribute 
with a valid absolute URL?


Thanks,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

