Submission to ApacheCon on Tika

2014-01-30 Thread Chris Mattmann
Hey Guys,

I submitted the below talk on Apache Tika, Nutch and Solr to ApacheCon NA
2014:

Real Data Science: Exploring the FBI's Vault dataset with Apache Tika,
Nutch and Solr
Event: ApacheCon North America
Submission Type: Lightning Talk
Category: Developer
Biography: Chris Mattmann has a wealth of experience in software design,
and in the construction of large-scale data-intensive systems. His work
has infected a broad set of communities, ranging from helping NASA unlock
data from its next generation of earth science system satellites, to
assisting graduate students at the University of Southern California (his
alma mater) in the study of software architecture, all the way to helping
industry and open source as a member of the Apache Software Foundation.
When he's not busy being busy, he's spending time with his lovely wife and
son braving the mean streets of Southern California.
Abstract: Apache Tika is a content detection and analysis toolkit allowing
automated MIME type identification and rapid parsing of text and metadata
from over 1200 types of files, including all major file types from the
Internet Assigned Numbers Authority's MIME database. In this talk I'll show
you how to practically use Apache Tika to explore the FBI's vault of
declassified PDF documents, how to use Apache Nutch to pull down the
dataset, and how to use Solr to ingest and geoclassify the documents so
that you can build a map of FBI PDF documents corresponding to your
favorite conspiracies throughout the USA. I've taught this material in my
CSCI 572 Search Engines class at USC and it's a big hit. These are normally
three assignments, so I will do my best to boil down their essence into a
45-60 minute talk replete with danger and excitement.
Audience: Developers interested in using Tika, Nutch and Solr. Folks
interested in the FBI vault dataset. GIS wonks. The like.
Experience Level: Intermediate
Benefits to the Ecosystem: The core of the talk will be Tika, but there
will be some Nutch magic, and some Solr magic at very basic levels. The
benefit to the ecosystem will be a real display of data science on a real
dataset.
Technical Requirements: I need an internet connection and a projector.
Status: New




Cheers,
Chris




[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-30 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886677#comment-13886677
 ] 

Tejas Patil commented on NUTCH-1465:


Interesting comments [~wastl-nagel].

Re "filters and normalizers" : By default I have kept those ON, but they can be 
disabled by using "-noFilter" and "-noNormalize".
Re "default content limits" and "fetch timeout": +1. Agree with you.
Re "Processing sitemap indexes fails" : +1. Nice catch.
Re "Fetch intervals of 1 second or 1 hour may cause troubles" : Currently, 
Injector allows users to provide a custom fetch interval with any value, e.g. 1 
sec. It makes sense not to correct it, as the user wants Nutch to use that 
custom fetch interval. If we view sitemaps as a custom seed list given by a 
content owner, then it would make sense to follow the intervals. But as you 
said, sitemaps can be wrongly set or outdated, so the intervals might be 
incorrect. The question boils down to: we blindly accept the user's custom 
information in inject. Should we blindly assume that sitemaps are correct or 
not? I have no strong opinion about either side of the argument.

(PS : Default 'db.fetch.schedule.adaptive.min_interval' is 1 min so would allow 
1 hr as per db.fetch.schedule.adaptive.min_interval <= interval)

Re "SitemapReducer overwriting" : 
>> _"If a sitemap does not specify one of score, modified time, or fetch 
>> interval this value is set to zero."_
Nope. See 
[SiteMapURL.java|https://code.google.com/p/crawler-commons/source/browse/trunk/src/main/java/crawlercommons/sitemaps/SiteMapURL.java]

 (a) score : Crawler commons assigns a default score of 0.5 if there was none 
provided in sitemap. 
We can do this: if an old entry has a score other than 0.5, it can be 
preserved; otherwise update it. For a new entry, use scoring plugins if the 
score equals 0.5; otherwise preserve it.
Limitation: It's not possible to distinguish whether a score of 0.5 comes from 
the sitemap or is the default used when <priority> was absent.
 (b) fetch interval : Crawler commons does NOT set fetch interval if there was 
none provided in sitemap. So we are sure that whatever value is used is coming 
from <changefreq>. Validation might be needed as per comments above.
 (c) modified time : Same as fetch interval, unless parsed from sitemap file, 
modified time is set to NULL. Only possible validation is to drop values 
greater than current time.
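
For illustration only, here is a minimal Python sketch of one possible reading of the merge rules (a) to (c) above. It is not the SitemapReducer code from the patch: the Entry record, the merge function and the plugin_score parameter are hypothetical names, and the only value taken from this thread is the 0.5 default score.

```python
from dataclasses import dataclass
from typing import Optional

# Default score assigned when a sitemap gives none (per the comment above).
SITEMAP_DEFAULT_SCORE = 0.5


@dataclass
class Entry:
    """Hypothetical stand-in for a CrawlDb entry or a parsed sitemap URL."""
    score: float
    fetch_interval: Optional[int] = None  # seconds; None means "not provided"
    modified_time: Optional[int] = None   # epoch seconds; None means "not provided"


def merge(old: Optional[Entry], sitemap: Entry, now: int, plugin_score: float) -> Entry:
    """Combine an existing entry (or None) with the values read from a sitemap."""
    if old is None:
        # New entry: 0.5 may only be the default, so defer to the scoring plugins.
        score = plugin_score if sitemap.score == SITEMAP_DEFAULT_SCORE else sitemap.score
        interval, modified = None, None
    else:
        # Existing entry: keep a score that differs from 0.5, else take the sitemap's.
        score = old.score if old.score != SITEMAP_DEFAULT_SCORE else sitemap.score
        interval, modified = old.fetch_interval, old.modified_time

    # (b) A fetch interval is only present when the sitemap really specified one.
    if sitemap.fetch_interval is not None:
        interval = sitemap.fetch_interval

    # (c) Drop modified times that lie in the future.
    if sitemap.modified_time is not None and sitemap.modified_time <= now:
        modified = sitemap.modified_time

    return Entry(score, interval, modified)
```

Note that the sketch shares the limitation described in (a): a sitemap that explicitly sets a score of 0.5 is treated the same as one that sets none.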

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class

2014-01-30 Thread Sertac TURKEL (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sertac TURKEL updated NUTCH-1645:
-

Attachment: (was: NUTCH-1645-v4.patch)

> Junit Test Case for Adaptive Fetch Schedule class
> -
>
> Key: NUTCH-1645
> URL: https://issues.apache.org/jira/browse/NUTCH-1645
> Project: Nutch
>  Issue Type: Test
>Affects Versions: 2.2.1
>Reporter: Talat UYARER
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1645-v2.patch, NUTCH-1645-v3.patch, 
> NUTCH-1645.patch
>
>
> Currently there is no test case for Adaptive Fetch Schedule. A JUnit test 
> should be written for it.





[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class

2014-01-30 Thread Sertac TURKEL (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sertac TURKEL updated NUTCH-1645:
-

Attachment: NUTCH-1645-v4.patch

Hi [~lewismc], I took [~wastl-nagel]'s comment into account and updated the 
patch file. Could you review it again?

> Junit Test Case for Adaptive Fetch Schedule class
> -
>
> Key: NUTCH-1645
> URL: https://issues.apache.org/jira/browse/NUTCH-1645
> Project: Nutch
>  Issue Type: Test
>Affects Versions: 2.2.1
>Reporter: Talat UYARER
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1645-v2.patch, NUTCH-1645-v3.patch, 
> NUTCH-1645-v4.patch, NUTCH-1645.patch
>
>
> Currently there is no test case for Adaptive Fetch Schedule. A JUnit test 
> should be written for it.





[jira] [Commented] (NUTCH-1719) DomainStatistics fails in 2.x because URL is not unreversed

2014-01-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886511#comment-13886511
 ] 

Hudson commented on NUTCH-1719:
---

SUCCESS: Integrated in Nutch-nutchgora #905 (See 
[https://builds.apache.org/job/Nutch-nutchgora/905/])
NUTCH-1719 DomainStatistics fails in 2.x because URL is not unreversed 
(lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1562774)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/util/domain/DomainStatistics.java


> DomainStatistics fails in 2.x because URL is not unreversed
> ---
>
> Key: NUTCH-1719
> URL: https://issues.apache.org/jira/browse/NUTCH-1719
> Project: Nutch
>  Issue Type: Bug
>Reporter: Gerhard Gossen
> Fix For: 2.3
>
> Attachments: domainstats.patch
>
>
> With Nutch 2.x, {{org.apache.nutch.util.domain.DomainStatistics}} always 
> returns the counts only for {{FETCHED}}/{{NOT_FETCHED}}. The reason is that 
> the mapper tries to create a java.net.URL directly from the row key without 
> unreversing it first and silently ignores the thrown exception.
> The attached patch calls TableUtil.unreverseUrl first. In my test (against 
> current 2.x-trunk) it produces correct results.
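
To illustrate why the row key cannot be parsed as a URL directly, here is a small Python sketch of a reversed-URL key and its inverse. It is not Nutch's TableUtil code. The key layout (reversed host, then ":" and the scheme, an optional ":port", then the path) is an assumption made for this example, and reverse_url and unreverse_url are hypothetical names.

```python
from urllib.parse import urlsplit


def reverse_url(url: str) -> str:
    """Build a row key with the host labels reversed (assumed layout)."""
    parts = urlsplit(url)
    key = ".".join(reversed(parts.hostname.split("."))) + ":" + parts.scheme
    if parts.port is not None:
        key += f":{parts.port}"
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return key + path


def unreverse_url(key: str) -> str:
    """Turn a reversed row key back into an ordinary URL."""
    slash = key.find("/")
    head, path = (key, "") if slash == -1 else (key[:slash], key[slash:])
    fields = head.split(":")  # [reversed host, scheme, optional port]
    host = ".".join(reversed(fields[0].split(".")))
    port = f":{fields[2]}" if len(fields) > 2 else ""
    return f"{fields[1]}://{host}{port}{path}"


key = reverse_url("http://bar.foo.com:8983/to/index.html?a=b")
print(key)                 # reversed key, not a valid URL on its own
print(unreverse_url(key))  # original URL, safe to hand to a URL parser
```

A mapper that wants the host or domain of a row has to apply the inverse step before parsing, which is the step the description above says was missing.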





[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-30 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886489#comment-13886489
 ] 

Sebastian Nagel commented on NUTCH-1465:


SitemapReducer overwrites score, modified time, and fetch interval of existing 
CrawlDb entries with the values from sitemap. Is this the desired behavior? 
What about a forgotten, hopelessly outdated sitemap? Or bogus values (last mod 
in the future)?
If a sitemap does not specify one of score, modified time, or fetch interval 
this value is set to zero. In this case, we should definitely not overwrite 
existing values. Newly added entries should get assigned 
db.fetch.interval.default and a reasonable score, eg. 0.5 as recommended by 
[[2|http://www.sitemaps.org/protocol.html#xmlTagDefinitions]]. But that may 
depend on scoring plugins. Comments?






[jira] [Resolved] (NUTCH-1719) DomainStatistics fails in 2.x because URL is not unreversed

2014-01-30 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1719.
-

Resolution: Fixed

Committed @revision 1562774 in 2.x HEAD
Thank you [~gerhard.gossen] for the patch :)






[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-30 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886453#comment-13886453
 ] 

Sebastian Nagel commented on NUTCH-1465:


Thanks, [~tejasp] for the improvements! Testing continued...

Sitemaps are treated the same as ordinary URLs/docs. But there are some 
differences. Shouldn't we relax default limits and filters and trust the 
restrictions specified in the sitemap protocol?
* URL filters and normalizers: maybe you want to exclude .gz docs per suffix 
filter but still fetch gzipped sitemaps. That's not possible. Is it really 
necessary to normalize/filter sitemap URLs? If yes, this should be optional.
* default content limits {http,ftp,file}.content.limit (64 kB) are quite small 
even for mid-size sitemaps. Ok, you could set it per {{-D...}} but why not 
increase it to SiteMapParser.MAX_BYTES_ALLOWED?
* maybe we also want to increase the fetch timeout

Processing sitemap indexes fails:
* the check sitemap.isIndex() skips all referenced sitemaps
* protocol for sitemap index and referenced sub-sitemaps may be different (eg., 
one sub-sitemap could be https while others are http)
* if processing one of the referenced sitemaps fails, the remaining 
sub-sitemaps are not processed
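
As a rough illustration of the last point, here is a Python sketch of a loop over the sub-sitemaps of an index in which one failing sub-sitemap does not stop the rest. It is not the code from the patch: process_index and the fetch_and_parse callback are hypothetical. The callback receives each sub-sitemap's full URL, so it can choose the protocol (http or https) per sub-sitemap.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_index(
    sub_sitemaps: Iterable[str],
    fetch_and_parse: Callable[[str], List[str]],
) -> Tuple[List[str], Dict[str, str]]:
    """Collect URLs from every sub-sitemap, recording failures instead of aborting."""
    urls: List[str] = []
    failures: Dict[str, str] = {}
    for sitemap_url in sub_sitemaps:
        try:
            # Each sub-sitemap is fetched by its own URL and scheme.
            urls.extend(fetch_and_parse(sitemap_url))
        except Exception as exc:
            # Remember the failure and continue with the remaining sub-sitemaps.
            failures[sitemap_url] = str(exc)
    return urls, failures
```

The caller can then decide whether recorded failures should be retried, logged or counted, while the URLs from the healthy sub-sitemaps are still processed.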

Fetch intervals are taken unchecked from <changefreq>. Should we limit them to 
reasonable values (db.fetch.schedule.adaptive.min_interval <= interval <= 
db.fetch.interval.max)? Fetch intervals of 1 second or 1 hour may cause 
troubles. [[1|http://www.sitemaps.org/protocol.html#xmlTagDefinitions]] 
explicitly says that <changefreq> "is considered a hint and not a command".
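
The proposed range check can be sketched in a few lines of Python. This is only an illustration. The function name is hypothetical, and the default bounds below (60 seconds and 90 days) are placeholder assumptions standing in for db.fetch.schedule.adaptive.min_interval and db.fetch.interval.max, which in practice would be read from the configuration.

```python
def clamp_fetch_interval(
    interval_s: float,
    min_interval_s: float = 60.0,       # assumed stand-in for the min_interval property
    max_interval_s: float = 7776000.0,  # assumed stand-in for db.fetch.interval.max
) -> float:
    """Force a sitemap-provided fetch interval into [min_interval_s, max_interval_s]."""
    return max(min_interval_s, min(interval_s, max_interval_s))


print(clamp_fetch_interval(1))     # a 1 second interval is raised to the lower bound
print(clamp_fetch_interval(3600))  # a 1 hour interval already lies inside the range
```

With these placeholder bounds a 1 hour interval passes unchanged, so catching that case as well would need a larger lower bound.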




