Hi Cristian, hi Tim,
>> org.apache.tika.utils.XMLReaderUtils Contention waiting for a
>> SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE
I regularly count these messages in the log files of a large and highly
concurrent web crawl with 160 threads fetching data and performing
content t
e XMLParser. Let me know
> if I can help with this temporary workaround.
>
> Thank you for identifying this problem!
>
> Cheers,
>
> Tim
>
> On Thu, Nov 11, 2021 at 7:21 AM Sebastian Nagel
> wrote:
>>
>> Hi,
>>
>> when is the Dublin C
Hi,
when is the Dublin Core XML parser used to parse XML files?
Is there a configuration required to enable the DcXMLParser?
There is a difference between 1.27 and 2.1.0:
$> java -jar tika-app-1.27.jar -J \
https://news.haltonhills.halinet.on.ca/dc.xml \
| jq '.[0]."dc:title"'
"Deaths"
+1
Integrated release candidate into Nutch:
- successfully ran Nutch unit tests
- verified parsing of test documents
(PDFs, images, HTML, RSS, tar/zip)
Thanks!
Sebastian
On 11/25/20 1:20 PM, Tim Allison wrote:
> A candidate for the Tika 1.25 release is available at:
> https://dist.apache.or
+1 integrated release candidate into Nutch: tests pass, and a sample
crawl that also included PDFs, MP3s, etc. ran successfully.
On 4/17/20 11:38 PM, Tim Allison wrote:
>
> A candidate for the Tika 1.24.1 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
>
> The release ca
Hi,
two questions regarding thread-safety and locking in Tika's MIME type detectors
while investigating global locks in NUTCH-2578 (multi-threaded fetcher) [1].
First, are the methods Tika.detect(...) and MimeType.detect(...) thread-safe?
I've found an answer from 2011 about Tika.detect(...)
h
and open
issues for the problems with HTML and scripting languages.
Thanks,
Sebastian
On 07/04/2017 12:18 PM, Sebastian Nagel wrote:
> Hi,
>
> recently I've plugged in Tika's content detection into Common Crawl's crawler
> (modified Nutch) with
> the target to g
FYI, for a similar task - testing the crawler-commons sitemaps.org parser -
I've started a small test tool which reads the sitemaps from WARC files:
https://groups.google.com/forum/?fromgroups#!topic/crawler-commons/pOLsCVwRsxY
https://github.com/sebastian-nagel/sitemap-performance-test
JIRA with small samples would be fantastic. I think
>> working in desc order of
>> most common to least would be best...php, asp, coldfusion.
>>
>> I'm about to cut 1.16, but I look forward to improving Tika with this
>> tremendously useful data.
>>
>>
now a new
> field "mime-detected" which makes it easy to search or grep for confusion
> pairs.
>
> This is an amazing step forward for our regression corpus. We used to rely
> on the http headers and/or file suffix to oversample non-html. This will
> allow far cleaner
Hi,
recently I've plugged Tika's content detection into Common Crawl's crawler
(modified Nutch) with the goal of getting clean and correct MIME types - the
HTTP Content-Type may contain garbage and isn't always correct [1].
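
A comparison of this kind can be illustrated with a minimal sketch that counts
"confusion pairs", i.e. cases where the Content-Type sent by the server
disagrees with the detected MIME type. Everything here is an assumption made
for illustration: the function names, the "unknown" placeholder, and the sample
records are hypothetical and are not taken from the crawl data or from Tika.

```python
from collections import Counter


def normalize_content_type(header_value):
    """Reduce a Content-Type value to a bare, lowercase media type."""
    if not header_value:
        return "unknown"
    # Drop parameters such as "; charset=utf-8" and surrounding whitespace.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type or "unknown"


def confusion_pairs(records):
    """Count (sent type, detected type) pairs that disagree."""
    pairs = Counter()
    for sent, detected in records:
        sent_norm = normalize_content_type(sent)
        detected_norm = normalize_content_type(detected)
        if sent_norm != detected_norm:
            pairs[(sent_norm, detected_norm)] += 1
    return pairs


# Hypothetical sample records: (HTTP Content-Type, detected MIME type).
sample = [
    ("text/html; charset=UTF-8", "text/html"),
    ("Text/HTML", "application/xhtml+xml"),
    ("text/html", "application/xhtml+xml"),
    ("application/octet-stream", "application/pdf"),
    ("", "text/plain"),
]

# Print the disagreeing pairs, most frequent first, as tab-separated lines.
for (sent, detected), count in confusion_pairs(sample).most_common():
    print(f"{count}\t{sent}\t{detected}")
```

Normalizing both values first keeps differences in case or charset parameters
from being counted as disagreements.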
For the June 2017 crawl I've prepared a comparison of content types sen
Hi,
I have a bunch of PDF files
- encrypted to prohibit changes and annotations
(this matters because documents are forms)
- created by Acrobat PDFMaker
Tika (1.3/trunk) fails to parse these documents.
A trial using NonSequentialParser (see PDFBOX-1554 and PDFBOX-1387) looks
promising:
text is
12 matches