Submission to ApacheCon on Tika
Hey Guys, I submitted the below talk on Apache Tika, Nutch and Solr to ApacheCon NA 2014: Real Data Science: Exploring the FBI's Vault dataset with Apache Tika, Nutch and Solr Event ApacheCon North America Submission Type Lightning Talk Category Developer Biography Chris Mattmann has a wealth of experience in software design, and in the construction of large-scale data-intensive systems. His work has infected a broad set of communities, ranging from helping NASA unlock data from its next generation of earth science system satellites, to assisting graduate students at the University of Southern California (his Alma mater) in the study of software architecture, all the way to helping industry and open source as a member of the Apache Software Foundation. When he's not busy being busy, he's spending time with his lovely wife and son braving the mean streets of Southern California. Abstract Apache Tika is a content detection and analysis toolkit allowing automated MIME type identification and rapid parsing of text and metadata from over 1200 types of files, including all major file types from the Internet Assigned Numbers Authority's MIME database. In this talk I'll show you how to practically use Apache Tika to explore the FBI's vault of declassified PDF documents, how to use Apache Nutch to pull down the dataset, and how to use Solr to ingest and geoclassify the documents so that you can build a map of FBI PDF documents corresponding to your favorite conspiracies throughout the USA. I've taught this material in my CSCI 572 Search Engines class at USC and it's a big hit. These are normally three assignments, so I will do my best to boil down their essence into a 45-60 minute talk replete with danger and excitement. Audience Developers interested in using Tika, Nutch and Solr. Folks interested in the FBI vault dataset. GIS wonks. The like. Experience Level Intermediate Benefits to the Ecosystem The core of the talk will be Tika, but there will be some Nutch magic and some Solr magic at very basic levels. The benefit to the ecosystem will be a real display of data science on a real dataset. Technical Requirements I need an internet connection and a projector. Status New Cheers, Chris
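For readers who haven't touched Tika before, here is a minimal sketch of the kind of usage the abstract describes: detect a document's MIME type and extract its text so it can be shipped to Solr. The file path and class name are placeholders for illustration; only the org.apache.tika.Tika facade calls are real API.

{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class VaultFileSniffer {

    public static void main(String[] args) throws IOException, TikaException {
        // Placeholder path: any file pulled down from the FBI Vault crawl.
        File doc = new File("vault/sample-document.pdf");

        Tika tika = new Tika();

        // MIME type detection against Tika's registry of IANA media types.
        String mimeType = tika.detect(doc);
        System.out.println("Detected type: " + mimeType);

        // Plain-text extraction; the text can then be posted to Solr for indexing.
        String text = tika.parseToString(doc);
        System.out.println("Extracted " + text.length() + " characters of text");
    }
}
{code}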
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886677#comment-13886677 ] Tejas Patil commented on NUTCH-1465: Interesting comments [~wastl-nagel]. Re "filters and normalizers": By default I have kept those ON, but they can be disabled using "-noFilter" and "-noNormalize". Re "default content limits" and "fetch timeout": +1. Agree with you. Re "Processing sitemap indexes fails": +1. Nice catch. Re "Fetch intervals of 1 second or 1 hour may cause troubles": Currently, Injector allows users to provide a custom fetch interval with any value, eg. 1 sec. It makes sense not to correct it, as the user wants Nutch to use that custom fetch interval. If we view sitemaps as a custom seed list given by a content owner, then it would make sense to follow the intervals. But as you said, sitemaps can be wrongly set or outdated, so the intervals might be incorrect. The question boils down to: we already blindly accept the user's custom information in inject, so should we blindly assume that sitemaps are correct as well? I have no strong opinion about either side of the argument. (PS: the default 'db.fetch.schedule.adaptive.min_interval' is 1 min, so 1 hr would be allowed as per db.fetch.schedule.adaptive.min_interval <= interval) Re "SitemapReducer overwriting": >> _"If a sitemap does not specify one of score, modified time, or fetch >> interval, these values are set to zero."_ Nope. See [SiteMapURL.java|https://code.google.com/p/crawler-commons/source/browse/trunk/src/main/java/crawlercommons/sitemaps/SiteMapURL.java] (a) score: crawler-commons assigns a default score of 0.5 if none was provided in the sitemap. We can do this: if an old entry has a score other than 0.5, it can be preserved, else update it. For a new entry, use scoring plugins when the score equals 0.5, else preserve the sitemap score. Limitation: it's not possible to distinguish whether a score of 0.5 came from the sitemap or is the default assigned when the tag was absent. (b) fetch interval: crawler-commons does NOT set a fetch interval if none was provided in the sitemap. So we are sure that whatever value is used is coming from <changefreq>. Validation might be needed as per the comments above. (c) modified time: same as fetch interval, unless parsed from the sitemap file, modified time is set to NULL. The only possible validation is to drop values greater than the current time. > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, > NUTCH-1465-trunk.v5.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
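To make the (a)/(b)/(c) proposals above concrete, here is a rough sketch of the merge rules being discussed. It is not the NUTCH-1465 patch: the class name, the sitemapInterval parameter, and the merge policy are illustrative assumptions; only crawler-commons' SiteMapURL and Nutch's CrawlDatum are real classes.

{code:java}
import java.util.Date;

import crawlercommons.sitemaps.SiteMapURL;
import org.apache.nutch.crawl.CrawlDatum;

/** Illustrative only: a sketch of the merge rules discussed above, not the actual SitemapReducer. */
public class SitemapMergeSketch {

    // crawler-commons assigns this priority when the sitemap carries no <priority> tag.
    private static final double DEFAULT_SITEMAP_PRIORITY = 0.5d;

    /**
     * Merge sitemap-provided hints into an existing CrawlDb entry, preserving
     * old values where the sitemap value is absent or just a default.
     * sitemapInterval stands in for whatever the sitemap job derives from
     * <changefreq>; it is assumed to be <= 0 when the tag was missing.
     */
    public static void merge(SiteMapURL sitemapUrl, int sitemapInterval, CrawlDatum existing) {
        // (a) score: 0.5 may be a real sitemap value or the crawler-commons
        // default, so only overwrite when it differs from 0.5.
        double priority = sitemapUrl.getPriority();
        if (priority != DEFAULT_SITEMAP_PRIORITY) {
            existing.setScore((float) priority);
        }

        // (b) fetch interval: only present if <changefreq> was parsed.
        if (sitemapInterval > 0) {
            existing.setFetchInterval(sitemapInterval);
        }

        // (c) modified time: drop values in the future, keep the rest.
        Date lastModified = sitemapUrl.getLastModified();
        if (lastModified != null && lastModified.getTime() <= System.currentTimeMillis()) {
            existing.setModifiedTime(lastModified.getTime());
        }
    }
}
{code}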
[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class
[ https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sertac TURKEL updated NUTCH-1645: - Attachment: (was: NUTCH-1645-v4.patch) > Junit Test Case for Adaptive Fetch Schedule class > - > > Key: NUTCH-1645 > URL: https://issues.apache.org/jira/browse/NUTCH-1645 > Project: Nutch > Issue Type: Test >Affects Versions: 2.2.1 >Reporter: Talat UYARER >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1645-v2.patch, NUTCH-1645-v3.patch, > NUTCH-1645.patch > > > Currently there is no test case for the AdaptiveFetchSchedule class. This > issue adds a JUnit test for it. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class
[ https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sertac TURKEL updated NUTCH-1645: - Attachment: NUTCH-1645-v4.patch Hi [~lewismc], I took [~wastl-nagel]'s comment into account and updated the patch file. Could you review it again? > Junit Test Case for Adaptive Fetch Schedule class > - > > Key: NUTCH-1645 > URL: https://issues.apache.org/jira/browse/NUTCH-1645 > Project: Nutch > Issue Type: Test >Affects Versions: 2.2.1 >Reporter: Talat UYARER >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1645-v2.patch, NUTCH-1645-v3.patch, > NUTCH-1645-v4.patch, NUTCH-1645.patch > > > Currently there is no test case for the AdaptiveFetchSchedule class. This > issue adds a JUnit test for it. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
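For orientation, the rule a test like NUTCH-1645 exercises is that the adaptive schedule shortens the fetch interval when a page has changed and lengthens it when it hasn't. The sketch below checks that rule against a tiny stand-in re-implementation, not against org.apache.nutch.crawl.AdaptiveFetchSchedule itself; the class and the rates are illustrative values, not the Nutch defaults.

{code:java}
import static org.junit.Assert.assertTrue;

import org.junit.Test;

/**
 * Illustrative sketch only: exercises the adaptive-schedule rule
 * (shrink the interval on change, grow it otherwise) against a
 * minimal re-implementation rather than the Nutch class.
 */
public class AdaptiveRuleSketchTest {

    // Illustrative rates; the real values come from
    // db.fetch.schedule.adaptive.inc_rate / dec_rate in the Nutch configuration.
    private static final float INC_RATE = 0.2f;
    private static final float DEC_RATE = 0.2f;

    private float adapt(float intervalSec, boolean pageModified) {
        return pageModified ? intervalSec * (1.0f - DEC_RATE)
                            : intervalSec * (1.0f + INC_RATE);
    }

    @Test
    public void intervalShrinksWhenPageChanges() {
        assertTrue(adapt(3600f, true) < 3600f);
    }

    @Test
    public void intervalGrowsWhenPageIsUnchanged() {
        assertTrue(adapt(3600f, false) > 3600f);
    }
}
{code}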
[jira] [Commented] (NUTCH-1719) DomainStatistics fails in 2.x because URL is not unreversed
[ https://issues.apache.org/jira/browse/NUTCH-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886511#comment-13886511 ] Hudson commented on NUTCH-1719: --- SUCCESS: Integrated in Nutch-nutchgora #905 (See [https://builds.apache.org/job/Nutch-nutchgora/905/]) NUTCH-1719 DomainStatistics fails in 2.x because URL is not unreversed (lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1562774) * /nutch/branches/2.x/CHANGES.txt * /nutch/branches/2.x/src/java/org/apache/nutch/util/domain/DomainStatistics.java > DomainStatistics fails in 2.x because URL is not unreversed > --- > > Key: NUTCH-1719 > URL: https://issues.apache.org/jira/browse/NUTCH-1719 > Project: Nutch > Issue Type: Bug >Reporter: Gerhard Gossen > Fix For: 2.3 > > Attachments: domainstats.patch > > > With Nutch 2.x, {{org.apache.nutch.util.domain.DomainStatistics}} always > returns the counts only for {{FETCHED}}/{{NOT_FETCHED}}. The reason is that > the mapper tries to create a java.net.URL directly from the row key without > unreversing it first and silently ignores the thrown exception. > The attached patch calls TableUtil.unreverseUrl first. In my test (against > current 2.x-trunk) it produces correct results. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
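The committed change boils down to un-reversing the 2.x row key before handing it to java.net.URL. Here is a sketch of that idea with the Gora/MapReduce plumbing of the real DomainStatistics mapper omitted; the wrapper class is invented for illustration, while TableUtil.unreverseUrl is the real Nutch 2.x helper.

{code:java}
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.nutch.util.TableUtil;

/** Sketch of the NUTCH-1719 fix, outside the real DomainStatistics mapper. */
public class RowKeyToHost {

    /**
     * Row keys in Nutch 2.x are stored reversed (something like
     * "org.apache.nutch:http/..."), so they must be un-reversed before
     * java.net.URL can parse them.
     */
    public static String hostOf(String rowKey) {
        try {
            String url = TableUtil.unreverseUrl(rowKey);
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            // Without the unreverse step this exception fired for every key
            // and was silently swallowed, leaving only the
            // FETCHED/NOT_FETCHED counters populated.
            return null;
        }
    }
}
{code}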
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886489#comment-13886489 ] Sebastian Nagel commented on NUTCH-1465: SitemapReducer overwrites the score, modified time, and fetch interval of existing CrawlDb entries with the values from the sitemap. Is this the desired behavior? What about forgotten, hopelessly outdated sitemaps? Or bogus values (last mod in the future)? If a sitemap does not specify one of score, modified time, or fetch interval, these values are set to zero. In this case, we should definitely not overwrite existing values. Newly added entries should get assigned db.fetch.interval.default and a reasonable score, eg. 0.5 as recommended by [[2|http://www.sitemaps.org/protocol.html#xmlTagDefinitions]]. But that may depend on scoring plugins. Comments? > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, > NUTCH-1465-trunk.v5.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
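As a sketch of the suggestion for newly added entries (not the actual SitemapReducer code; the class is invented for illustration, CrawlDatum and Hadoop's Configuration are real), defaults could come from db.fetch.interval.default, with the protocol-recommended 0.5 score left for scoring plugins to adjust afterwards:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.crawl.CrawlDatum;

/** Sketch only: defaults for CrawlDb entries newly added from a sitemap. */
public class NewEntryDefaults {

    public static CrawlDatum newSitemapEntry(Configuration conf) {
        CrawlDatum datum = new CrawlDatum();
        datum.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
        // Fall back to the crawl-wide default interval (30 days) rather than zero.
        datum.setFetchInterval(conf.getInt("db.fetch.interval.default", 2592000));
        // 0.5 as suggested by the sitemaps.org protocol; scoring plugins may adjust it.
        datum.setScore(0.5f);
        return datum;
    }
}
{code}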
[jira] [Resolved] (NUTCH-1719) DomainStatistics fails in 2.x because URL is not unreversed
[ https://issues.apache.org/jira/browse/NUTCH-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1719. - Resolution: Fixed Committed @revision 1562774 in 2.x HEAD Thank you [~gerhard.gossen] for the patch :) > DomainStatistics fails in 2.x because URL is not unreversed > --- > > Key: NUTCH-1719 > URL: https://issues.apache.org/jira/browse/NUTCH-1719 > Project: Nutch > Issue Type: Bug >Reporter: Gerhard Gossen > Fix For: 2.3 > > Attachments: domainstats.patch > > > With Nutch 2.x, {{org.apache.nutch.util.domain.DomainStatistics}} always > returns the counts only for {{FETCHED}}/{{NOT_FETCHED}}. The reason is that > the mapper tries to create a java.net.URL directly from the row key without > unreversing it first and silently ignores the thrown exception. > The attached patch calls TableUtil.unreverseUrl first. In my test (against > current 2.x-trunk) it produces correct results. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886453#comment-13886453 ] Sebastian Nagel commented on NUTCH-1465: Thanks, [~tejasp] for the improvements! Testing continued... Sitemaps are treated the same as ordinary URLs/docs. But there are some differences. Shouldn't we relax the default limits and filters and trust the restrictions specified in the sitemap protocol? * URL filters and normalizers: maybe you want to exclude .gz docs per suffix filter but still fetch gzipped sitemaps. That's not possible. Is it really necessary to normalize/filter sitemap URLs? If yes, this should be optional. * The default content limits {http,ftp,file}.content.limit (64 kB) are quite small even for mid-size sitemaps. Ok, you could set it per {{-D...}}, but why not increase it to SiteMapParser.MAX_BYTES_ALLOWED? * Maybe we also want to increase the fetch timeout. Processing sitemap indexes fails: * the check sitemap.isIndex() skips all referenced sitemaps * the protocol for a sitemap index and the referenced sub-sitemaps may be different (eg., one sub-sitemap could be https while others are http) * if processing one of the referenced sitemaps fails, the remaining sub-sitemaps are not processed Fetch intervals are taken unchecked from <changefreq>. Should we limit them to reasonable values (db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max)? Fetch intervals of 1 second or 1 hour may cause troubles. [[1|http://www.sitemaps.org/protocol.html#xmlTagDefinitions]] explicitly says that the value of <changefreq> "is considered a hint and not a command". > Support sitemaps in Nutch > - > > Key: NUTCH-1465 > URL: https://issues.apache.org/jira/browse/NUTCH-1465 > Project: Nutch > Issue Type: New Feature > Components: parser >Reporter: Lewis John McGibbney >Assignee: Tejas Patil > Fix For: 1.8 > > Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, > NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, > NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, > NUTCH-1465-trunk.v5.patch > > > I recently came across this rather stagnant codebase[0] which is ASL v2.0 > licensed and appears to have been used successfully to parse sitemaps as per > the discussion here[1]. > [0] http://sourceforge.net/projects/sitemap-parser/ > [1] > http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html -- This message was sent by Atlassian JIRA (v6.1.5#6160)
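One possible shape for the clamping suggested above (an illustrative helper, not part of any attached patch): the property names are the standard Nutch ones, and the fallback values mirror the 1-minute minimum mentioned in the thread and the stock 90-day maximum.

{code:java}
import org.apache.hadoop.conf.Configuration;

/** Illustrative clamp for sitemap-provided fetch intervals (in seconds). */
public class SitemapIntervalClamp {

    public static int clamp(int sitemapIntervalSec, Configuration conf) {
        // <changefreq> is only a hint, so keep the interval inside sane bounds.
        float min = conf.getFloat("db.fetch.schedule.adaptive.min_interval", 60.0f);
        int max = conf.getInt("db.fetch.interval.max", 7776000); // 90 days
        return (int) Math.max(min, Math.min(max, sitemapIntervalSec));
    }
}
{code}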