Re: XMLReaderUtils Contention

2021-12-06 Thread Sebastian Nagel
Hi Cristian, hi Tim, >> org.apache.tika.utils.XMLReaderUtils Contention waiting for a >> SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE I regularly count these messages in the log files of a large and highly concurrent web crawl with 160 threads fetching data and performing content t

Re: DcXMLParser to parse XML files

2021-11-16 Thread Sebastian Nagel
e XMLParser. Let me know > if I can help with this temporary workaround. > > Thank you for identifying this problem! > > Cheers, > > Tim > > On Thu, Nov 11, 2021 at 7:21 AM Sebastian Nagel > wrote: >> >> Hi, >> >> when is the Dublin C

DcXMLParser to parse XML files

2021-11-11 Thread Sebastian Nagel
Hi, when is the Dublin Core XML parser used to parse XML files? Is there a configuration required to enable the DcXMLParser? There is a difference between 1.27 and 2.1.0: $> java -jar tika-app-1.27.jar -J \ https://news.haltonhills.halinet.on.ca/dc.xml \ | jq '.[0]."dc:title"' "Deaths"

Re: [VOTE] Release Apache Tika 1.25 Candidate #2

2020-11-27 Thread Sebastian Nagel
+1 Integrated release candidate into Nutch: - successfully run Nutch unit tests - verified parsing of test documents (PDFs, images, HTML, RSS, tar/zip) Thanks! Sebastian On 11/25/20 1:20 PM, Tim Allison wrote: > A candidate for the Tika 1.25 release is available at: >   https://dist.apache.or

Re: [VOTE] Release Apache Tika 1.24.1 Candidate #1

2020-04-21 Thread Sebastian Nagel
+1 integrated release candidate into Nutch: tests pass and successfully run a sample crawl including also PDFs, MP3s, etc. On 4/17/20 11:38 PM, Tim Allison wrote: > > A candidate for the Tika 1.24.1 release is available at: >   https://dist.apache.org/repos/dist/dev/tika/ > > The release ca

Thread-safety and locking of methods Tika.detect(...) and MimeType.detect(...)

2018-05-17 Thread Sebastian Nagel
Hi, two questions regarding thread-safety and locking in Tika's MIME type detectors while investigating global locks in NUTCH-2578 (multi-threaded fetcher) [1]. First, are the methods Tika.detect(...) and MimeType.detect(...) thread-safe? I've found an answer from 2011 about Tika.detect(...) h

Re: Tika content detection and crawled "remote" content

2017-08-10 Thread Sebastian Nagel
and open issues for the problems with HTML and scripting languages. Thanks, Sebastian On 07/04/2017 12:18 PM, Sebastian Nagel wrote: > Hi, > > recently I've plugged in Tika's content detection into Common Crawl's crawler > (modified Nutch) with > the target to g

Re: Adding a WARC parser to Tika

2017-07-11 Thread Sebastian Nagel
FYI, for a similar task - testing crawler-commons sitemaps.org parser - I've started a small test tool which reads the sitemaps from WARC files: https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/pOLsCVwRsxY https://github.com/sebastian-nagel/sitemap-performance-test

Re: Tika content detection and crawled "remote" content

2017-07-06 Thread Sebastian Nagel
JIRA with small samples would be fantastic. I think >> working in desc order of >> most common to least would be best...php, asp, coldfusion. >> >> I'm about to cut 1.16, but I look forward to improving Tika with this >> tremendously useful data. >> >>

Re: Tika content detection and crawled "remote" content

2017-07-05 Thread Sebastian Nagel
now a new > field "mime-detected" which makes it easy to search or grep for confusion > pairs. > > This is an amazing step forward for our regression corpus. We used to rely > on the http headers and/or file suffix to oversample non-html. This will > allow far cleaner

Tika content detection and crawled "remote" content

2017-07-04 Thread Sebastian Nagel
Hi, recently I've plugged Tika's content detection into Common Crawl's crawler (modified Nutch) with the goal of getting clean and correct MIME types - the HTTP Content-Type may contain garbage and isn't always correct [1]. For the June 2017 crawl I've prepared a comparison of content types sen

encrypted PDF created with PDFMaker failed to parse

2013-05-23 Thread Sebastian Nagel
Hi, I have a bunch of PDF files - encrypted to prohibit changes and annotations (this matters because the documents are forms) - created by Acrobat PDFMaker. Tika (1.3/trunk) fails to parse these documents. A trial using NonSequentialParser (see PDFBOX-1554 and PDFBOX-1387) looks promising: text is