from:"Sebastian Nagel"

Re: XMLReaderUtils Contention

2021-12-06 Thread Sebastian Nagel

Hi Cristian, hi Tim, >> org.apache.tika.utils.XMLReaderUtils Contention waiting for a >> SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE I regularly count these messages in the log files of a large and highly concurrent web crawl with 160 threads fetching data and performing content t

Re: DcXMLParser to parse XML files

2021-11-16 Thread Sebastian Nagel

e XMLParser. Let me know > if I can help with this temporary workaround. > > Thank you for identifying this problem! > > Cheers, > > Tim > > On Thu, Nov 11, 2021 at 7:21 AM Sebastian Nagel > wrote: >> >> Hi, >> >> when is the Dublin C

DcXMLParser to parse XML files

2021-11-11 Thread Sebastian Nagel

Hi, when is the Dublin Core XML parser used to parse XML files? Is there a configuration required to enable the DcXMLParser? There is a difference between 1.27 and 2.1.0: $> java -jar tika-app-1.27.jar -J \ https://news.haltonhills.halinet.on.ca/dc.xml \ | jq '.[0]."dc:title"' "Deaths"

Re: [VOTE] Release Apache Tika 1.25 Candidate #2

2020-11-27 Thread Sebastian Nagel

+1 Integrated release candidate into Nutch: - successfully run Nutch unit tests - verified parsing of test documents (PDFs, images, HTML, RSS, tar/zip) Thanks! Sebastian On 11/25/20 1:20 PM, Tim Allison wrote: > A candidate for the Tika 1.25 release is available at: > https://dist.apache.or

Re: [VOTE] Release Apache Tika 1.24.1 Candidate #1

2020-04-21 Thread Sebastian Nagel

+1 integrated release candidate into Nutch: tests pass and successfully run a sample crawl including also PDFs, MP3s, etc. On 4/17/20 11:38 PM, Tim Allison wrote: > > A candidate for the Tika 1.24.1 release is available at: > https://dist.apache.org/repos/dist/dev/tika/ > > The release ca

Thread-safety and locking of methods Tika.detect(...) and MimeType.detect(...)

2018-05-17 Thread Sebastian Nagel

Hi, two questions regarding thread-safety and locking in Tika's MIME type detectors while investigating global locks in NUTCH-2578 (multi-threaded fetcher) [1]. First, are the methods Tika.detect(...) and MimeType.detect(...) thread-safe? I've found an answer from 2011 about Tika.detect(...) h

Re: Tika content detection and crawled "remote" content

2017-08-10 Thread Sebastian Nagel

and open issues for the problems with HTML and scripting languages. Thanks, Sebastian On 07/04/2017 12:18 PM, Sebastian Nagel wrote: > Hi, > > recently I've plugged in Tika's content detection into Common Crawl's crawler > (modified Nutch) with > the target to g

Re: Adding a WARC parser to Tika

2017-07-11 Thread Sebastian Nagel

FYI, for a similar task - testing crawler-commons sitemaps.org parser - I've started a small test tool which reads the sitemaps from WARC files: https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/pOLsCVwRsxY https://github.com/sebastian-nagel/sitemap-performance-test

Re: Tika content detection and crawled "remote" content

2017-07-06 Thread Sebastian Nagel

JIRA with small samples would be fantastic. I think >> working in desc order of >> most common to least would be best...php, asp, coldfusion. >> >> I'm about to cut 1.16, but I look forward to improving Tika with this >> tremendously useful data. >> >>

Re: Tika content detection and crawled "remote" content

2017-07-05 Thread Sebastian Nagel

now a new > field "mime-detected" which makes it easy to search or grep for confusion > pairs. > > This is an amazing step forward for our regression corpus. We used to rely > on the http headers and/or file suffix to oversample non-html. This will > allow far cleaner

Tika content detection and crawled "remote" content

2017-07-04 Thread Sebastian Nagel

Hi, recently I've plugged in Tika's content detection into Common Crawl's crawler (modified Nutch) with the target to get clean and correct MIME type - the HTTP Content-Type may contain garbage and isn't always correct [1]. For the June 2017 crawl I've prepared a comparison of content types sen

encrypted PDF created with PDFMaker failed to parse

2013-05-23 Thread Sebastian Nagel

Hi, I have a bunch of PDF files - encrypted to prohibit changes and annotations (this matters because documents are forms) - created by Acrobat PDFMaker. Tika (1.3/trunk) fails to parse these documents. A trial using NonSequentialParser (see PDFBOX-1554 and PDFBOX-1387) looks promising: text is

12 matches
