RE: [COMPRESS] TIFF file identified as TAR
As always, thank you, Stefan! We might add a kluge at the Tika level to check for TIFF first... unless you'd like that kluge in your code? 😉 The reporter recommended one option: a conditional that checks the tarHeader variable to see whether it starts with one of the TIFF magic numbers (II: 49 49 2A 00 or MM: 4D 4D 00 2A).

-----Original Message-----
From: Stefan Bodewig [mailto:bode...@apache.org]
Sent: Tuesday, February 27, 2018 3:46 PM
To: Stefan Bodewig
Cc: Allison, Timothy B.; Commons Developers List
Subject: Re: [COMPRESS] TIFF file identified as TAR

On 2018-02-27, Stefan Bodewig wrote:

> On 2018-02-27, Allison, Timothy B. wrote:

>> On TIKA-2591 [0], a user reports that a specific type of TIFF is
>> being identified as a TAR file. Is this something we should try to
>> fix at the Tika level, or is this something that would be better
>> fixed in COMPRESS?

> TAR auto-detection is, erm, clumsy. But this is due to the format not
> being built for being detected.

> This is how it works right now:

> * read the first candidate header of 512 bytes
> * look at the eight bytes that contain the "ustar" string and the
>   version, and verify they look like something we support
> * verify the checksum of the candidate tar header

Actually I was mis-reading the code. It is either "ustar and version look good" or "parses as tar header with correct checksum". So the chance of false positives is bigger. Unfortunately this has proven necessary to detect all valid TAR archives:

https://issues.apache.org/jira/browse/COMPRESS-117

Stefan
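For illustration, a minimal sketch of the guard the reporter suggested (the tarHeader name comes from the discussion above; the class and helper method are hypothetical, not actual Compress code):

    // Hypothetical guard sketched from the TIKA-2591 suggestion: before
    // accepting a 512-byte candidate as a TAR header, bail out if it
    // starts with one of the TIFF magic numbers.
    public class TiffMagicGuard {
        private static final byte[] TIFF_LE = { 0x49, 0x49, 0x2A, 0x00 }; // "II" + 42, little-endian
        private static final byte[] TIFF_BE = { 0x4D, 0x4D, 0x00, 0x2A }; // "MM" + 42, big-endian

        static boolean startsWithTiffMagic(byte[] tarHeader) {
            return startsWith(tarHeader, TIFF_LE) || startsWith(tarHeader, TIFF_BE);
        }

        private static boolean startsWith(byte[] data, byte[] prefix) {
            if (data == null || data.length < prefix.length) {
                return false;
            }
            for (int i = 0; i < prefix.length; i++) {
                if (data[i] != prefix[i]) {
                    return false;
                }
            }
            return true;
        }
    }

A detector would call startsWithTiffMagic(tarHeader) before running the "ustar"/checksum heuristics described above and reject the candidate immediately on a match.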
[COMPRESS] TIFF file identified as TAR
COMPRESS colleagues,

On TIKA-2591 [0], a user reports that a specific type of TIFF is being identified as a TAR file. Is this something we should try to fix at the Tika level, or is this something that would be better fixed in COMPRESS?

Thank you!

Best,
Tim

[0] https://issues.apache.org/jira/browse/TIKA-2591
[compress] differences in implementation of Zip ibm vs. oracle?
Compress colleagues,

Over on https://bz.apache.org/bugzilla/show_bug.cgi?id=61275, a user submitted two .xlsx files generated with Apache POI, one by IBM's JVM and one by Oracle's JVM. The file generated with Oracle's JVM opens without issue; MSOffice complains about the file generated by IBM's JVM, although it can repair it. WinZip opens both without complaint. Does this ring a bell? Have you seen this before? Is there anything we can do on our (POI's) side to fix this?

Thank you.

Best,
Tim
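In case it helps with triage, a small sketch using Commons Compress's own ZipFile to dump per-entry metadata (the class is real Compress API; the diagnostic approach is just a suggestion):

    import java.io.File;
    import java.util.Enumeration;
    import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
    import org.apache.commons.compress.archivers.zip.ZipFile;

    public class DumpZipEntries {
        public static void main(String[] args) throws Exception {
            // Walk the central directory and print the fields most likely
            // to differ between the two writers.
            try (ZipFile zip = new ZipFile(new File(args[0]))) {
                for (Enumeration<ZipArchiveEntry> e = zip.getEntries(); e.hasMoreElements();) {
                    ZipArchiveEntry entry = e.nextElement();
                    System.out.printf("%-50s method=%d size=%d csize=%d crc=%08x%n",
                            entry.getName(), entry.getMethod(),
                            entry.getSize(), entry.getCompressedSize(), entry.getCrc());
                }
            }
        }
    }

Running this on both .xlsx files and diffing the output should narrow down whether the difference sits in entry metadata or in the raw record layout.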
[compress] FW: Tika content detection and crawled "remote" content
Fellow file-philes on [compress],

Sebastian Nagel has added file type identification via Apache Tika to Common Crawl. While Tika is not 100% accurate, this means that we have far better clarity on MIME type than relying on the HTTP header + file suffix. So, for testing purposes, you (or we over on Tika) can much more easily gather a small test corpus of files by MIME type. Many, many thanks to Sebastian and Common Crawl!

Cheers,
Tim

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
Sent: Tuesday, July 4, 2017 6:18 AM
To: u...@tika.apache.org
Subject: Tika content detection and crawled "remote" content

Hi,

recently I've plugged Tika's content detection into Common Crawl's crawler (modified Nutch) with the goal of getting clean and correct MIME types - the HTTP Content-Type may contain garbage and isn't always correct [1].

For the June 2017 crawl I've prepared a comparison of content types sent by the server in the HTTP header and as detected by Tika 1.15 [2]. It shows that content types from Tika are definitely cleaner (1,400 different content types vs. more than 6,000 content type "strings" from HTTP headers).

A look at the "confusions" where Content-Type and Tika differ shows a mixed picture: some pairs are plausible, e.g., if Tika changes the type to a more precise subtype or detects the MIME type at all:

      count  Tika-1.15              HTTP-Content-Type
 1001968023  application/xhtml+xml  text/html
    2298146  application/rss+xml    text/xml
     617435  application/rss+xml    application/xml
     613525  text/html              unk
     361525  application/xhtml+xml  unk
     297707  application/rdf+xml    application/xml

However, there are a few dubious decisions, esp. in the group of web server-side scripting languages (ASP, JSP, PHP, ColdFusion, etc.):

      count  Tika-1.15          HTTP-Content-Type
    2047739  text/x-php         text/html
     681629  text/asp           text/html
     193095  text/x-coldfusion  text/html
     172318  text/aspdotnet     text/html
     139033  text/x-jsp         text/html
      38415  text/x-cgi         text/html
      32092  text/x-php         text/xml
      18021  text/x-perl        text/html

Of course, due to misconfigurations some servers may deliver the script files unmodified, but in general I wouldn't expect that to happen for millions of pages.
I've checked some of the affected URLs:

- HTML fragment (no doctype declaration or opening <html> tag):
  https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
  http://www.privi.com/product-details.asp?cno=C10910011
  http://mental-ray.de/Root_alt/Default.asp
  http://ekyrs.org/support/index.php?action=profile
  http://cwmorse.eu5.org/lineal/mostrar.php?contador=200

- (overlong) comment block at the start of the HTML which "masks" the HTML declaration:
  http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
  http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
  https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
  https://de.e-stories.org/categories.php?&lan=nl&art=p

- HTML with some scripting fragments present:
  http://www.eco-ani-yao.org/shien/

- others are clearly HTML (looks more like a bug; at least, there is no simple explanation):
  http://www.proedinc.com/customer/content.aspx?redid=9
  http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
  http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
  http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79

Obviously certain file suffixes (.php, .aspx) should get less weight compared to the Content-Type sent by the responding server.

Now my question: where's the best place to fix this - in the crawler [3] or in Tika?

If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help! The URL index [4] now contains a new field "mime-detected" which makes it easy to search or grep for confusion pairs.

Thanks and best,
Sebastian

[1] https://github.com/commoncrawl/nutch/issues/3
[2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
    https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
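For anyone poking at these cases outside the crawler, a minimal sketch of how the resource-name hint feeds Tika's detection (Tika 1.x API; the command-line handling and the "text/html" header hint are illustrative):

    import java.io.InputStream;
    import java.nio.file.Paths;
    import org.apache.tika.detect.DefaultDetector;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;

    public class DetectWithHints {
        public static void main(String[] args) throws Exception {
            Detector detector = new DefaultDetector();
            try (InputStream in = TikaInputStream.get(Paths.get(args[0]))) {
                Metadata metadata = new Metadata();
                // The resource name (for the crawler, the URL path) is what
                // pulls *.php / *.aspx pages toward text/x-php and friends;
                // drop this line and detection relies on content plus the
                // Content-Type hint alone.
                metadata.set(Metadata.RESOURCE_NAME_KEY, args[0]);
                metadata.set(Metadata.CONTENT_TYPE, "text/html");
                MediaType type = detector.detect(in, metadata);
                System.out.println(type);
            }
        }
    }

Running this with and without the RESOURCE_NAME_KEY line should show whether the suffix hint, rather than the byte-level heuristics, is what flips the decision for the URLs above.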
RE: [COMPRESS] zip-bomb prevention for Z?
> enum wouldn't work for formats added via ServiceLoader. LZO supports a couple
> of names of its own and you couldn't inject them into the enum.

Doh! Got it. New code base... Sorry.
RE: [COMPRESS] zip-bomb prevention for Z?
>> If there is anything COMPRESS can do to detect and avoid the situation,
>> then please open an issue over here.

Done: COMPRESS-385, PR submitted.

>> If we wanted to add such a method, what would the return value be? One of
>> the String constants contained inside the *Factory classes, likely. Tika
>> would have to be prepared for new strings popping up when using a newer
>> version of Compress (1.14 will add "lz4-framed" for example).

Y, I'm ok with a String... perhaps longer term or for 2.0, move to an enum?

Thank you for the heads-up! I opened COMPRESS-386 to discuss adding a threshold for the table size.

As always, thank you!

Cheers,
Tim
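For the record, a minimal sketch of how the String-returning detection could be used from Tika, assuming the static detect method from the COMPRESS-385 PR lands roughly as submitted (it needs a mark-supported stream):

    import java.io.BufferedInputStream;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.commons.compress.compressors.CompressorStreamFactory;

    public class DetectCompressor {
        public static void main(String[] args) throws Exception {
            // detect() peeks at the signature bytes, so the stream must
            // support mark/reset -- hence the BufferedInputStream wrapper.
            try (InputStream in = new BufferedInputStream(
                    Files.newInputStream(Paths.get(args[0])))) {
                // Returns one of the String constants from
                // CompressorStreamFactory, e.g. "gz", "bzip2", "xz", "z" --
                // with new names such as "lz4-framed" appearing in newer
                // releases.
                System.out.println(CompressorStreamFactory.detect(in));
            }
        }
    }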
[COMPRESS] zip-bomb prevention for Z?
On TIKA-1631 [1], users have observed that a corrupt .Z file can cause an OOM in Internal_.InternalLZWStream.initializeTable. Should we try to protect against this at the Tika level, or should we open an issue on commons-compress's JIRA?

A second question: we're creating a stream with the CompressorStreamFactory when all we want to do is detect. Is there a recommended way to detect the type of compressor without creating a stream?

Thank you!

Best,
Tim

[1] https://issues.apache.org/jira/browse/TIKA-1631
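As a postscript, a sketch of the table-size threshold that COMPRESS-386 (opened in the reply above) asks for, assuming the memory-limit constructor that Compress later shipped; the 1024 KiB cap is purely illustrative:

    import java.io.BufferedInputStream;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.commons.compress.compressors.CompressorInputStream;
    import org.apache.commons.compress.compressors.CompressorStreamFactory;

    public class GuardedDecompress {
        public static void main(String[] args) throws Exception {
            // The second argument caps the memory (in KiB) that
            // format-internal structures, such as the LZW code table of a
            // corrupt .Z file, may claim; the factory then fails fast
            // instead of triggering an OOM.
            CompressorStreamFactory factory = new CompressorStreamFactory(false, 1024);
            try (InputStream in = new BufferedInputStream(
                         Files.newInputStream(Paths.get(args[0])));
                 CompressorInputStream cin = factory.createCompressorInputStream(in)) {
                byte[] buffer = new byte[8192];
                while (cin.read(buffer) != -1) {
                    // drain; a real caller would hand the bytes to a parser
                }
            }
        }
    }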
[COMPRESS and others] FW: Any interest in running Apache Tika as part of CommonCrawl?
All,

We just heard back from a very active member of Common Crawl. I don't want to clog up our dev lists with this discussion (more than I have!), but I do want to invite all to participate in the discussion, planning and potential patches. If you'd like to participate, please join us here: https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0

I've tried to follow Commons' vernacular, and I've added [COMPRESS] to the Subject line. Please invite others who might have an interest in this work.

Best,
Tim

From: Allison, Timothy B.
Sent: Tuesday, April 07, 2015 8:39 AM
To: 'Stephen Merity'; common-cr...@googlegroups.com
Subject: RE: Any interest in running Apache Tika as part of CommonCrawl?

Stephen,

Thank you very much for responding so quickly and for all of your work on Common Crawl. I don't want to speak for all of us, but given the feedback I've gotten so far from some of the dev communities, I think we would very much appreciate the chance to be tested on a monthly basis as part of the regular Common Crawl process. I think we'll still want to run more often in our own sandbox(es) on the slice of CommonCrawl we have, but the monthly testing against new data, from my perspective at least, would be a huge win for all of us.

In addition to parsing binaries and extracting text, Tika (via PDFBox, POI and many others) can also offer metadata (e.g. EXIF from images), which users of CommonCrawl might find of use.

I'll forward this to some of the relevant dev lists to invite others to participate in the discussion on the common-crawl list. Thank you, again. I very much look forward to collaborating.

Best,
Tim

From: Stephen Merity [mailto:step...@commoncrawl.org]
Sent: Tuesday, April 07, 2015 3:57 AM
To: common-cr...@googlegroups.com
Cc: mattm...@apache.org; talli...@apache.org; dmei...@apache.org; til...@apache.org; n...@apache.org
Subject: Re: Any interest in running Apache Tika as part of CommonCrawl?

Hi Tika team!

We'd certainly be interested in working with Apache Tika on such an undertaking. At the very least, we're glad that Julien has provided you with content to battle-test Tika with! As you've noted, the text extraction performed to produce WET files is focused primarily on HTML files, leaving many other file types uncovered. The existing text extraction is quite efficient and part of the same process that generates the WAT files, meaning there's next to no overhead.

Performing extraction with Tika at the scale of Common Crawl would be an interesting challenge. Running it as a one-off likely wouldn't be too much of a challenge and would also give Tika the benefit of a wider variety of documents (both well formed and malformed) to test against. Running it on a frequent basis or as part of the crawl pipeline would be more challenging, but something we can certainly discuss, especially if there's strong community desire for it!

On Fri, Apr 3, 2015 at 5:23 AM, tallison314...@gmail.com wrote:

CommonCrawl currently has the WET format that extracts plain text from web pages. My guess is that this is text stripping from text-y formats. Let me know if I'm wrong!

Would there be any interest in adding another format, WETT (WET-Tika), or in supplementing the current WET by using Tika to extract contents from binary formats too: PDF, MSWord, etc.?
Julien Nioche kindly carved out 220 GB for us to experiment with on TIKA-1302 (https://issues.apache.org/jira/browse/TIKA-1302) on a Rackspace VM. But I'm wondering now if it would make more sense to have CommonCrawl run Tika as part of its regular process and make the output available in one of your standard formats. CommonCrawl consumers would get Tika output, and the Tika dev community (including its dependencies, PDFBox, POI, etc.) could get the stack traces to help prioritize bug fixes.

Cheers,
Tim

--
Regards,
Stephen Merity
Data Scientist @ Common Crawl