Component fetching during parsing. (vertical crawling)

2010-07-20 Thread Ferdy

Hello,

We are currently using a heavily modified version of nutch. The main 
reason for this is that we not only fetch the urls that the QueueFeeder 
submits, but also additional resources from urls that are constructed 
during parsing. For example, let's say the QueueFeeder submits an html 
page to the fetcher, and after the fetch the page gets parsed. Nothing 
special so far. However, the parser decides it also needs some images 
on the page. Perhaps these images link to other html pages, and we 
might want to fetch those too. All of this is needed to parse 
information about the particular url we started with. We like to call 
these extra fetch urls Components, because they are additional 
resources required to parse the initial html page that was selected 
for fetching.


At first we tried to solve this vertical crawling problem by using 
multiple crawl cycles. Each crawl simply selects the outlinks that are 
needed for parsing the initial html page. A single inspection can span 
2, 3 or 4 cycles (depending on the inspection's graph depth). There are 
several problems with this approach: for one, the crawldb is cluttered 
with all these component urls, and secondly, inspection completion 
times can be very long.


As an alternative we decided to let the parser fetch the needed 
components on the fly, so that additional urls are instantly added to 
the fetcher lists. Every fetched url is either a non-component (the 
QueueFeeder fed it; start parsing this resource) or a component (the 
fetcher hands the resource over to the parser that requested it). In 
order to keep parsers alive we always try to fetch components first, 
while still respecting fetch politeness. A downside of this solution is 
that the total running time of a fetch task is more difficult to 
anticipate. For example, if you inject and generate 100 urls and they 
are fetched in a single task, you might end up fetching a total of 1100 
urls (assuming each inspection needs 10 components). We found this 
behaviour to be acceptable.
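
To make the mechanism concrete, below is a minimal, hypothetical Java 
sketch (not our actual fetcher code; the class and method names are made 
up for illustration) of a fetch list that serves component requests 
before regular QueueFeeder items, so that a parser blocked on a component 
is unblocked as quickly as possible:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative only: a fetch list that serves component urls before regular ones. */
class ComponentAwareFetchList {

  /** A url to fetch, plus a future the requesting parser can block on (null for regular urls). */
  static class FetchItem {
    final String url;
    final CompletableFuture<byte[]> result; // completed by the fetcher thread
    FetchItem(String url, CompletableFuture<byte[]> result) {
      this.url = url;
      this.result = result;
    }
  }

  private final BlockingQueue<FetchItem> regular = new LinkedBlockingQueue<>();
  private final BlockingQueue<FetchItem> components = new LinkedBlockingQueue<>();

  /** Called by the QueueFeeder for normal fetch list entries. */
  void addRegular(String url) {
    regular.add(new FetchItem(url, null));
  }

  /** Called by a parser that needs an extra resource; returns a future it can wait on. */
  CompletableFuture<byte[]> requestComponent(String url) {
    CompletableFuture<byte[]> f = new CompletableFuture<>();
    components.add(new FetchItem(url, f));
    return f;
  }

  /** Fetcher threads poll components first so blocked parsers are unblocked quickly.
   *  Real code would also enforce per-host politeness before handing out an item. */
  FetchItem next() throws InterruptedException {
    FetchItem item = components.poll();
    return (item != null) ? item : regular.take();
  }
}

A fetcher thread that takes an item with a non-null future would complete 
it with the fetched content, handing the component straight back to the 
waiting parser instead of the normal parse pipeline.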


Because of our custom version of nutch we cannot easily upgrade to newer 
versions (we're still using modified fetcher classes from nutch 0.9). 
We often end up fixing bugs that have already been fixed by the 
community. Also, other users might benefit from our changes.


Therefore we propose to redesign our vertical crawling system from 
scratch for the newer nutch versions, should there be any interest from 
the community. Perhaps we are not the only ones to implement such a 
system with nutch. So, what are your thoughts about this?


Ferdy.


Re: Component fetching during parsing. (vertical crawling)

2010-07-20 Thread Andrzej Bialecki
On 2010-07-20 14:30, Ferdy wrote:
 Hello,
 
 We are currently using a heavily modified version of nutch. The main
 reason for this is that we not only fetch the urls that the QueueFeeder
 submits, but also additional resources from urls that are constructed
 during parsing. For example, let's say the QueueFeeder submits an html
 page to the fetcher, and after the fetch the page gets parsed. Nothing
 special so far. However, the parser decides it also needs some images
 on the page. Perhaps these images link to other html pages, and we
 might want to fetch those too. All of this is needed to parse
 information about the particular url we started with. We like to call
 these extra fetch urls Components, because they are additional
 resources required to parse the initial html page that was selected
 for fetching.
 
 At first we tried to solve this vertical crawling problem by using
 multiple crawl cycles. Each crawl simply selects the outlinks that are
 needed for parsing the initial html page. A single inspection can span
 2, 3 or 4 cycles (depending on the inspection's graph depth). There are
 several problems with this approach: for one, the crawldb is cluttered
 with all these component urls, and secondly, inspection completion
 times can be very long.
 
 As an alternative we decided to let the parser fetch the needed
 components on the fly, so that additional urls are instantly added to
 the fetcher lists. Every fetched url is either a non-component (the
 QueueFeeder fed it; start parsing this resource) or a component (the
 fetcher hands the resource over to the parser that requested it). In
 order to keep parsers alive we always try to fetch components first,
 while still respecting fetch politeness. A downside of this solution is
 that the total running time of a fetch task is more difficult to
 anticipate. For example, if you inject and generate 100 urls and they
 are fetched in a single task, you might end up fetching a total of 1100
 urls (assuming each inspection needs 10 components). We found this
 behaviour to be acceptable.
 
 Because of our custom version of nutch we cannot easily upgrade to newer
 versions (we're still using modified fetcher classes from nutch 0.9).
 We often end up fixing bugs that have already been fixed by the
 community. Also, other users might benefit from our changes.
 
 Therefore we propose to redesign our vertical crawling system from
 scratch for the newer nutch versions, should there be any interest from
 the community. Perhaps we are not the only ones to implement such a
 system with nutch. So, what are your thoughts about this?

If I understand your use case properly, what you are really talking
about is a custom Fetcher - a strategy to fetch complete pages
(together with the resources needed to display the page) should be
possible to implement in a custom fetcher without changing other
areas of Nutch.
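
For readers who have not looked at this part of Nutch, the following 
rough sketch (plain Java, outside of Nutch; the CompletePageFetcher class 
and its regex-based extraction are illustrative assumptions, not Nutch 
APIs) shows the general idea of fetching a page together with the image 
resources it references in a single pass:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: fetch a page and the image resources it references in one pass. */
class CompletePageFetcher {

  private static final Pattern IMG_SRC =
      Pattern.compile("<img[^>]+src=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

  private final HttpClient client = HttpClient.newHttpClient();

  /** Fetches the page, then its embedded images, returning all fetched urls. */
  List<String> fetchComplete(String pageUrl) throws Exception {
    List<String> fetched = new ArrayList<>();
    String html = get(pageUrl);
    fetched.add(pageUrl);

    Matcher m = IMG_SRC.matcher(html);
    while (m.find()) {
      // Resolve relative resource urls against the page url.
      String resourceUrl = URI.create(pageUrl).resolve(m.group(1)).toString();
      get(resourceUrl);                 // a real fetcher would also apply politeness delays
      fetched.add(resourceUrl);
    }
    return fetched;
  }

  private String get(String url) throws Exception {
    HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
    return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
  }
}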


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

2010-07-20 Thread Julien Nioche

  Now that you mention upgrade solutions from 1.x to 2.0 I suggest that
  we open a JIRA to discuss this. IMHO we probably don't want to keep
  the 'old' code in src/java when we merge but could have the code for
  the conversion utilities and the Nutch 1.x jars in the contrib/
  directory instead.

 I wouldn't favor a Nutch contrib going forward. Contribs lead to
 umbrella-projects which Apache is moving away from b/c it typically creates
 different committer lists (those who can commit to contrib and those with
 commit privs to the full source code base, etc.), different lifecycles and
 ultimately incubates/grows mini-projects within larger ones.


I meant putting the migration code and the 1.x Nutch jars in the contrib
directory of the trunk - that shouldn't require a different committer
list, should it?



 If someone needs Nutch 1.x jars they can grab them from the Apache distros
 or we can publish them to Maven central. As for conversion and removal of
 src/java, I'm not sure I get that? Why should we remove src/java? Merge
 means adapt existing rather than replace.


I was talking about removing deprecated Nutch objects (old Writables which
we needed for storing things in Hadoop MapFiles) from the source after the
merge, once they are no longer used by Nutch 2.0.

The point made by Dogacan was that they would be needed if we want to
provide conversion tools so that people could convert their old crawldbs and
segments into our shiny new Gora-based architecture.
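
As a rough illustration of what such a conversion tool could look like 
(this is a sketch under stated assumptions, not an existing Nutch utility; 
LegacyCrawlDbReader, LegacyDatum and GoraWebPageStore below are 
hypothetical placeholders for the old Writable-based readers and the new 
Gora-backed storage):

import java.io.IOException;
import java.util.Map;

/** Hypothetical view of an old-style (1.x) crawldb reader. */
interface LegacyCrawlDbReader extends AutoCloseable {
  /** Returns the next url -> record pair, or null when exhausted. */
  Map.Entry<String, LegacyDatum> next() throws IOException;
}

/** Minimal stand-in for the fields a 1.x crawldb entry carried. */
class LegacyDatum {
  byte status;
  long fetchTime;
  float score;
}

/** Hypothetical stand-in for the new Gora-backed web page store. */
interface GoraWebPageStore extends AutoCloseable {
  void put(String url, byte status, long fetchTime, float score) throws IOException;
}

/** Sketch of a one-shot crawldb conversion: stream old records into the new store. */
class CrawlDbConverter {
  void convert(LegacyCrawlDbReader in, GoraWebPageStore out) throws Exception {
    Map.Entry<String, LegacyDatum> e;
    long count = 0;
    while ((e = in.next()) != null) {
      LegacyDatum d = e.getValue();
      out.put(e.getKey(), d.status, d.fetchTime, d.score);
      count++;
    }
    System.out.println("Converted " + count + " crawldb entries");
  }
}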


 
 
  Also, I realize that I am the last person to talk about this, but can we get
  some reviews for these changes?
 
  I could have filed a JIRA for the branch NutchBase indeed (but haven't).
  Again, NutchBase is a transitional / test / development repository before we
  merge things into trunk. Changes to the trunk are made properly i.e. through
  JIRA with patches and peer review. Or maybe I should indeed open a JIRA for
  NutchBase every time I do a bit of cleanup or port new plugins to the 2.0 API?

 Nah, IMHO I think it's OK to muck around in the branch, so long as when the
 branch gets merged (incrementally rather than wholesale), we can review
 those. So, the way it would work is this:

 A. branch cleaned up, SVN commits, etc., stable working
 B. at some point, branch ready to be merged (assumption: branch devel stops)
 C. define branch merge into 3-5 patches
 D. foreach patch in C:
    create JIRA issue for patch
    call for review of patch
    if no objections, then commit in 24-48 hours
 E. trunk now ready for 2.0 development
 F. schedule current open issues for 2.0, grab any low hanging fruit (1-2 days)
 G. all other issues pushed out to 2.1
 H. release 2.0


Andrzej and I are in the process of porting the last missing tests in
NutchBase and debugging Gora along the way. There is just a handful of
plugins which have not been ported and I should have that finished pretty
quickly. Hopefully we'll get to (A) soonish and can then follow the plan
above.

However we still need to address the issue raised by Dogacan, i.e. shall we
provide tools to convert from 1.x structures to 2.0, and if so, how shall we
organise it. Again - some things have been removed from NutchBase for the
sake of clarity, but since they are in the trunk they are not lost and we can
decide what to do with them later.

J.

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com


[jira] Resolved: (NUTCH-856) Use Tika for parsing feed

2010-07-20 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-856.
-

Resolution: Fixed

Thanks Chris for reviewing and committing TIKA-466. I will mark the issue as 
closed as soon as Tika 0.8 is released and used in Nutch.

 Use Tika for parsing feed
 -

 Key: NUTCH-856
 URL: https://issues.apache.org/jira/browse/NUTCH-856
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0


 We currently have 2 plugins for dealing with feeds: 
 * feeds
 * parse-rss
 I have proposed https://issues.apache.org/jira/browse/TIKA-466 which would at 
 least cover the functionality of parse-rss. If/when this is added to Tika 
 then we should be able to remove parse-rss and rely on Tika instead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

2010-07-20 Thread Julien Nioche
Thanks for your comments Chris


  However we still need to address the issue raised by Dogacan, i.e. shall we
  provide tools to convert from 1.x structures to 2.0, and if so, how shall we
  organise it. Again - some things have been removed from NutchBase for the
  sake of clarity, but since they are in the trunk they are not lost and we
  can decide what to do with them later.

 Maybe we can provide a couple of encapsulated upgrade tools that contain
 internal versions of the necessary Nutch 1.x classes that live inside of the
 Tool class. This way they are hidden and not cluttering the sources, but the
 point is still accomplished.
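
As a rough illustration of the pattern described above (purely a sketch; 
the LegacyParseText nested class is a made-up stand-in for whatever 1.x 
Writables a real upgrade tool would embed), the legacy class can live as 
a private static nested type inside a Hadoop Tool so that nothing leaks 
into the public sources:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/** Sketch of an upgrade tool that hides a legacy 1.x Writable as a nested class. */
public class Legacy2GoraUpgradeTool extends Configured implements Tool {

  /** Private copy of an old-style Writable; kept internal so it does not clutter src/java. */
  private static class LegacyParseText implements Writable {
    private String text = "";

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeUTF(text);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      text = in.readUTF();
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    // A real tool would open the old segment/crawldb files here, read records
    // with nested legacy Writables such as LegacyParseText, and write them out
    // through the new storage layer.
    System.out.println("Would convert: " + String.join(", ", args));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Legacy2GoraUpgradeTool(), args));
  }
}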



+1

Jul

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com


[jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.

2010-07-20 Thread Scott Gonyea (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Gonyea updated NUTCH-855:
---

Fix Version/s: 2.0
  Description: 
This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
1. Meta Tags that are supplied with your Crawl URLs, during injection, will be 
propagated throughout the outlinks of those Crawl URLs.
2. When you index your URLs, the meta tags that you specified with your URLs 
will be indexed alongside those URLs--and can be directly queried, assuming you 
have done everything else correctly.

The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited 
in the form of:
www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
or:
http://slashdot.org/  corp_owner=Geeknet  will_it_blend=indubitably
http://engadget.com/  corp_owner=Weblogs  genre=geeksquad_thriller

To activate this plugin, you must modify two properties in your nutch-site.xml:
1. plugin.includes
   add: urlmeta
   to:   <value>...</value>
   ie: <value>urlmeta|parse-tika|scoring-opic|...</value>
2. urlmeta.tags
   Insert a comma-delimited list of metatags. Using the above example:
   <value>corp_owner, will_it_blend, genre</value>
   Note that you do not need to include the tag with every URL. However, you 
must specify each tag if you want it to be propagated and later indexed.


  was:
This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
1. Meta Tags that are supplied with your Crawl URLs, during injection, will be 
propagated throughout the outlinks of those Crawl URLs.
2. When you index your URLs, the meta tags that you specified with your URLs 
will be indexed alongside those URLs--and can be directly queried, assuming you 
have done everything else correctly.

The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited 
in the form of:
[www.url.com]\t[key1]=[value1]\t[key2]=[value2]...[keyN]=[valueN]
or:
http://slashdot.org/  corp_owner=Geeknet  will_it_blend=indubitably
http://engadget.com/  corp_owner=Weblogs  genre=geeksquad_thriller

To activate this plugin, you must modify two properties in your nutch-site.xml:
1. plugin.includes
   from: index-(basic|anchor)
   to:   index-(basic|anchor|urlmeta)
2. urlmeta.tags
   Insert a comma-delimited list of metatags. Using the above example:
   <value>corp_owner, will_it_blend, genre</value>
   Note that you do not need to include the tag with every URL. However, you 
must specify each tag if you want it to be propagated and later indexed.



Updated comments, revised patch is now available. It's more robust to the 
nefarious null and his NullPointerException cabal.
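
As a side note for anyone trying the seed-file format above, here is a 
tiny, hypothetical sketch (not part of the patch; SeedLineParser is a 
made-up name) of how a tab-delimited injection line splits into a URL 
plus its metatags:

import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: split one seed-file line into a url and its key=value metatags. */
class SeedLineParser {

  static Map.Entry<String, Map<String, String>> parse(String line) {
    String[] fields = line.split("\t");
    String url = fields[0];
    Map<String, String> meta = new LinkedHashMap<>();
    for (int i = 1; i < fields.length; i++) {
      int eq = fields[i].indexOf('=');
      if (eq > 0) {
        meta.put(fields[i].substring(0, eq), fields[i].substring(eq + 1));
      }
    }
    return Map.entry(url, meta);
  }

  public static void main(String[] args) {
    // Example line from the issue description, with tabs between the fields.
    String line = "http://slashdot.org/\tcorp_owner=Geeknet\twill_it_blend=indubitably";
    System.out.println(parse(line)); // http://slashdot.org/={corp_owner=Geeknet, will_it_blend=indubitably}
  }
}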

 ScoringFilter and IndexingFilter: To allow for the propagation of URL 
 Metatags and their subsequent indexing.
 -

 Key: NUTCH-855
 URL: https://issues.apache.org/jira/browse/NUTCH-855
 Project: Nutch
  Issue Type: New Feature
  Components: generator, indexer
Affects Versions: 1.1
Reporter: Scott Gonyea
 Fix For: 1.2, 2.0

 Attachments: nutch-855.txt

   Original Estimate: 168h
  Remaining Estimate: 168h

 This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
 1. Meta Tags that are supplied with your Crawl URLs, during injection, will 
 be propagated throughout the outlinks of those Crawl URLs.
 2. When you index your URLs, the meta tags that you specified with your URLs 
 will be indexed alongside those URLs--and can be directly queried, assuming 
 you have done everything else correctly.
 The flat-file of URLs you are injecting should, per NUTCH-655, be 
 tab-delimited in the form of:
 www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
 or:
 http://slashdot.org/  corp_owner=Geeknet  will_it_blend=indubitably
 http://engadget.com/  corp_owner=Weblogs  genre=geeksquad_thriller
 To activate this plugin, you must modify two properties in your 
 nutch-site.xml:
 1. plugin.includes
    add: urlmeta
    to:   <value>...</value>
    ie: <value>urlmeta|parse-tika|scoring-opic|...</value>
 2. urlmeta.tags
    Insert a comma-delimited list of metatags. Using the above example:
    <value>corp_owner, will_it_blend, genre</value>
    Note that you do not need to include the tag with every URL. However, you 
 must specify each tag if you want it to be propagated and later indexed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.