Component fetching during parsing. (vertical crawling)
Hello,

We are currently using a heavily modified version of Nutch. The main reason is that we fetch not only the urls that the QueueFeeder submits, but also additional resources from urls that are constructed during parsing. For example, say the QueueFeeder submits an html page to the fetcher, and after the fetch the page gets parsed. Nothing special so far. However, the parser decides it also needs some images on the page. Perhaps these images link to other html pages, and we might want to fetch those too. All of this is needed to parse information about the particular url we started with. We like to call these extra fetch urls Components, because they are additional resources required to parse the initial html page that was selected for fetching.

At first we tried to solve this vertical crawling problem by using multiple crawl cycles. Each crawl simply selects the outlinks that are needed for parsing the initial html page. A single inspection can span 2, 3 or 4 cycles (depending on the inspection's graph depth). This approach has several problems: for one, the crawldb gets cluttered with all these component urls, and secondly, inspection completion times can be very long.

As an alternative we decided to let the parser fetch needed components on the fly, so that additional urls are instantly added to the fetcher lists. Every fetched url is either a non-component (the QueueFeeder fed it; start parsing this resource) or a component (the fetcher hands the resource over to the parser that requested it). In order to keep parsers alive we always try to fetch components first, while still respecting fetch politeness. A downside of this solution is that the total running time of a fetch task becomes harder to anticipate. For example, if you inject and generate 100 urls and they are fetched in a single task, you might end up fetching a total of 1100 urls (assuming each inspection needs 10 components). We found this behaviour to be acceptable.

Because of our custom version of Nutch we cannot easily upgrade to newer versions (we're still using modified fetcher classes from Nutch 0.9). We often end up fixing bugs that have already been fixed by the community. Also, other users might benefit from our changes too. Therefore we propose to redesign our vertical crawling system from scratch for the newer Nutch versions, should there be any interest from the community. Perhaps we are not the only ones to implement such a system with Nutch.

So, what are your thoughts about this?

Ferdy.
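The component-first scheduling described above could be sketched, in very reduced form, like this (a minimal illustration, not the actual modified fetcher; all class and method names here are made up for the example):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a fetch queue that serves parser-requested
 * "component" urls before QueueFeeder urls, so parsers blocked on
 * missing components are unblocked as soon as possible.
 */
public class ComponentFetchQueue {

    /** One url scheduled for fetching, tagged with its origin. */
    public static final class FetchItem {
        public final String url;
        public final boolean component; // true if requested by a parser mid-parse
        FetchItem(String url, boolean component) {
            this.url = url;
            this.component = component;
        }
    }

    private final Deque<FetchItem> components = new ArrayDeque<>();
    private final Deque<FetchItem> feederItems = new ArrayDeque<>();

    /** QueueFeeder path: normal generated urls. */
    public synchronized void addFeederUrl(String url) {
        feederItems.addLast(new FetchItem(url, false));
    }

    /** Parser path: components discovered during parsing jump the queue. */
    public synchronized void addComponentUrl(String url) {
        components.addLast(new FetchItem(url, true));
    }

    /** Fetcher threads poll here; components go first to keep parsers alive. */
    public synchronized FetchItem next() {
        if (!components.isEmpty()) {
            return components.pollFirst();
        }
        return feederItems.pollFirst(); // null when both queues are empty
    }
}
```

A real implementation would additionally key both queues by host, so that per-server politeness delays still apply to component fetches.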
Re: Component fetching during parsing. (vertical crawling)
On 2010-07-20 14:30, Ferdy wrote:
> [...full proposal quoted above, snipped...]

If I understand your use case properly, this is really a custom Fetcher that you are talking about - a strategy to fetch complete pages (together with the resources that relate to the display of the page) should be possible to implement in a custom fetcher without changing other Nutch areas.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java
Now that you mention upgrade solutions from 1.x to 2.0, I suggest that we open a JIRA to discuss this. IMHO we probably don't want to keep the 'old' code in src/java when we merge, but could have the code for the conversion utilities and the Nutch 1.x jars in the contrib/ directory instead.

I wouldn't favor a Nutch contrib going forward. Contribs lead to umbrella-projects, which Apache is moving away from because it typically creates different committer lists (those who can commit to contrib and those with commit privs to the full source code base, etc.), different lifecycles, and ultimately incubates/grows mini-projects within larger ones.

I meant putting the migration code and 1.x Nutch jars in the contrib directory of the trunk - that shouldn't require a different committers list, or should it? If someone needs Nutch 1.x jars they can grab them from the Apache distros, or we can publish them to Maven central.

As for conversion and removal of src/java, I'm not sure I get that? Why should we remove src/java? Merge means adapt existing rather than replace.

I was talking about removing deprecated Nutch objects (old Writables which we needed for storing things in Hadoop MapFiles) from the src after the merge, once they are not used by Nutch 2.0. The point made by Dogacan was that they would be needed if we want to provide conversion tools so that people could convert their old crawldbs and segments into our shiny new Gora-based architecture.

Also, I realize that I am the last person to talk about this, but can we get some reviews for these changes?

I could have filed a JIRA for the branch NutchBase indeed (but haven't). Again, NutchBase is a transitional / test / development repository before we merge things into trunk. Changes to the trunk are made properly, i.e. through JIRA with patches and peer review. Or maybe I should indeed open a JIRA for NutchBase every time I do a bit of cleanup or port new plugins to the 2.0 API?
Nah, IMHO I think it's OK to muck around in the branch, so long as when the branch gets merged (incrementally rather than wholesale), we can review those. So, the way it would work is this:

A. branch cleaned up, SVN commits, etc., stable working
B. at some point, branch ready to be merged (assumption: branch devel stops)
C. define branch merge into 3-5 patches
D. for each patch in C:
   - create JIRA issue for patch
   - call for review of patch
   - if no objections, then commit in 24-48 hours
E. trunk now ready for 2.0 development
F. schedule current open issues for 2.0, grab any low hanging fruit (1-2 days)
G. all other issues pushed out to 2.1
H. release 2.0

Andrzej and myself are in the process of porting the last missing tests in NutchBase and debugging Gora along the way. There is just a handful of plugins which have not been ported and I should finish that pretty quickly. Hopefully we'll get to (A) soonish and can then follow the plan above.

However we still need to address the issue raised by Dogacan, i.e. shall we provide tools to convert from 1.x structures to 2.0, and if so how shall we organise it. Again - some things have been removed from NutchBase for the sake of clarity, but since they are in the trunk they are not lost and we can decide what to do with them later.

J.

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
[jira] Resolved: (NUTCH-856) Use Tika for parsing feed
[ https://issues.apache.org/jira/browse/NUTCH-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-856.
---------------------------------
Resolution: Fixed

Thanks Chris for reviewing and committing TIKA-466. I will mark the issue as closed as soon as Tika 0.8 is released and used in Nutch.

Use Tika for parsing feed
-------------------------
Key: NUTCH-856
URL: https://issues.apache.org/jira/browse/NUTCH-856
Project: Nutch
Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 2.0

We currently have 2 plugins for dealing with feeds:
* feeds
* parse-rss

I have proposed https://issues.apache.org/jira/browse/TIKA-466 which would at least cover the functionalities of parse-rss. If/when this is added to Tika then we should be able to remove parse-rss and rely on Tika instead.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java
Thanks for your comments Chris

However we still need to address the issue raised by Dogacan, i.e. shall we provide tools to convert from 1.x structures to 2.0, and if so how shall we organise it. Again - some things have been removed from NutchBase for the sake of clarity, but since they are in the trunk they are not lost and we can decide what to do with them later.

Maybe we can provide a couple of encapsulated upgrade tools that contain internal versions of the necessary Nutch 1.x classes, living inside the Tool class. This way they are hidden and not cluttering the sources, but the point is still accomplished.

+1

Jul

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
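One way the encapsulated upgrade tool could be organised, sketched with stand-in types (LegacyDatum, WebPage, the field names, and the conversion logic are all hypothetical placeholders, not the real Nutch 1.x or Gora classes):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of an "encapsulated upgrade tool": the legacy 1.x
 * record layout lives as a private nested class inside the tool, so the
 * old classes never clutter the main 2.0 source tree.
 */
public class CrawlDbUpgradeTool {

    /** Stand-in for a legacy 1.x CrawlDatum, kept private inside the tool. */
    private static final class LegacyDatum {
        final String url;
        final int status;
        final float score;
        LegacyDatum(String url, int status, float score) {
            this.url = url;
            this.status = status;
            this.score = score;
        }
    }

    /** Stand-in for the new storage-backed page record. */
    public static final class WebPage {
        public final Map<String, Object> fields = new LinkedHashMap<>();
    }

    /**
     * Convert one legacy entry into the new record shape. A real tool would
     * iterate over the old MapFiles inside a MapReduce job and write the
     * results through the new storage layer instead.
     */
    public static WebPage convert(String url, int status, float score) {
        LegacyDatum old = new LegacyDatum(url, status, score);
        WebPage page = new WebPage();
        page.fields.put("url", old.url);
        page.fields.put("status", old.status);
        page.fields.put("score", old.score);
        return page;
    }
}
```

The point of the nesting is purely organisational: only the tool compiles against the old layout, and deleting the tool later removes the legacy code with it.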
[jira] Updated: (NUTCH-855) ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
[ https://issues.apache.org/jira/browse/NUTCH-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Gonyea updated NUTCH-855:
-------------------------------
Fix Version/s: 2.0

Description:
This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs - and can be directly queried, assuming you have done everything else correctly.

The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
www.url.com\tkey1=value1\tkey2=value2\t...\tkeyN=valueN
for example:
http://slashdot.org/\tcorp_owner=Geeknet\twill_it_blend=indubitably
http://engadget.com/\tcorp_owner=Weblogs\tgenre=geeksquad_thriller

To activate this plugin, you must modify two properties in your nutch-site.xml:
1. plugin.includes: add urlmeta to the <value>...</value>, i.e. <value>urlmeta|parse-tika|scoring-opic|...</value>
2. urlmeta.tags: insert a comma-delimited list of metatags. Using the above example: <value>corp_owner, will_it_blend, genre</value>

Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.

was:
This plugin is designed to enhance the NUTCH-655 patch, by doing two things:
1. Meta Tags that are supplied with your Crawl URLs, during injection, will be propagated throughout the outlinks of those Crawl URLs.
2. When you index your URLs, the meta tags that you specified with your URLs will be indexed alongside those URLs - and can be directly queried, assuming you have done everything else correctly.

The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited in the form of:
[www.url.com]\t[key1]=[value1]\t[key2]=[value2]...[keyN]=[valueN]
for example:
http://slashdot.org/\tcorp_owner=Geeknet\twill_it_blend=indubitably
http://engadget.com/\tcorp_owner=Weblogs\tgenre=geeksquad_thriller

To activate this plugin, you must modify two properties in your nutch-site.xml:
1. plugin.includes: from index-(basic|anchor) to index-(basic|anchor|urlmeta)
2. urlmeta.tags: insert a comma-delimited list of metatags. Using the above example: <value>corp_owner, will_it_blend, genre</value>

Note that you do not need to include the tag with every URL. However, you must specify each tag if you want it to be propagated and later indexed.

Updated comments; revised patch is now available. It's more robust to the nefarious null and his NullPointerException cabal.

ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags and their subsequent indexing.
Key: NUTCH-855
URL: https://issues.apache.org/jira/browse/NUTCH-855
Project: Nutch
Issue Type: New Feature
Components: generator, indexer
Affects Versions: 1.1
Reporter: Scott Gonyea
Fix For: 1.2, 2.0
Attachments: nutch-855.txt
Original Estimate: 168h
Remaining Estimate: 168h
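Put together, the two properties the plugin description talks about would sit in nutch-site.xml roughly as follows (the plugin list in plugin.includes is abbreviated and illustrative; keep whatever other plugins your install actually needs):

```xml
<!-- nutch-site.xml: enable the urlmeta plugin and declare the tags to carry -->
<property>
  <name>plugin.includes</name>
  <value>urlmeta|protocol-http|parse-tika|index-(basic|anchor|urlmeta)|scoring-opic</value>
</property>
<property>
  <name>urlmeta.tags</name>
  <value>corp_owner, will_it_blend, genre</value>
</property>
```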