[jira] [Commented] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time
[ https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596832#comment-14596832 ] Michael Joyce commented on NUTCH-2045: -- +1 this is great > index-basic incorrect assignment of next fetch time (page.getFetchTime()) as > page fetch time > > > Key: NUTCH-2045 > URL: https://issues.apache.org/jira/browse/NUTCH-2045 > Project: Nutch > Issue Type: Bug > Components: plugin > Affects Versions: 2.3, 1.10 > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney > Fix For: 1.11, 2.3.1 > > Attachments: NUTCH-2045.patch > > > The issue here was flagged up when using the indexer-elastic plugin, where the page > fetch time is incorrectly assigned as the NEXT fetch time as opposed to the > time at which the page was actually fetched (prevFetchTime). > The ML thread for this issue can be found below > http://www.mail-archive.com/user%40nutch.apache.org/msg13661.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time
[ https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2045: Attachment: NUTCH-2045.patch Patch for 2.X HEAD > index-basic incorrect assignment of next fetch time (page.getFetchTime()) as > page fetch time > > > Key: NUTCH-2045 > URL: https://issues.apache.org/jira/browse/NUTCH-2045 > Project: Nutch > Issue Type: Bug > Components: plugin > Affects Versions: 2.3, 1.10 > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney > Fix For: 1.11, 2.3.1 > > Attachments: NUTCH-2045.patch > > > The issue here was flagged up when using the indexer-elastic plugin, where the page > fetch time is incorrectly assigned as the NEXT fetch time as opposed to the > time at which the page was actually fetched (prevFetchTime). > The ML thread for this issue can be found below > http://www.mail-archive.com/user%40nutch.apache.org/msg13661.html
[jira] [Updated] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time
[ https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2045: Patch Info: Patch Available > index-basic incorrect assignment of next fetch time (page.getFetchTime()) as > page fetch time > > > Key: NUTCH-2045 > URL: https://issues.apache.org/jira/browse/NUTCH-2045 > Project: Nutch > Issue Type: Bug > Components: plugin > Affects Versions: 2.3, 1.10 > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney > Fix For: 1.11, 2.3.1 > > Attachments: NUTCH-2045.patch > > > The issue here was flagged up when using the indexer-elastic plugin, where the page > fetch time is incorrectly assigned as the NEXT fetch time as opposed to the > time at which the page was actually fetched (prevFetchTime). > The ML thread for this issue can be found below > http://www.mail-archive.com/user%40nutch.apache.org/msg13661.html
[jira] [Created] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time
Lewis John McGibbney created NUTCH-2045: --- Summary: index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time Key: NUTCH-2045 URL: https://issues.apache.org/jira/browse/NUTCH-2045 Project: Nutch Issue Type: Bug Components: plugin Affects Versions: 1.10, 2.3 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.11, 2.3.1 The issue here was flagged up when using the indexer-elastic plugin, where the page fetch time is incorrectly assigned as the NEXT fetch time as opposed to the time at which the page was actually fetched (prevFetchTime). The ML thread for this issue can be found below http://www.mail-archive.com/user%40nutch.apache.org/msg13661.html
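The distinction drawn in the report can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the attached NUTCH-2045.patch: PageTimes is a hypothetical stand-in for the page object, and the only facts taken from the report are that getFetchTime() holds the next fetch time and that prevFetchTime holds the time of the fetch that actually took place.

```java
// Hypothetical stand-in for the page record; only the two timestamps matter here.
final class PageTimes {
    private final long fetchTime;      // next scheduled fetch, epoch millis
    private final long prevFetchTime;  // time the page was actually fetched, epoch millis

    PageTimes(long fetchTime, long prevFetchTime) {
        this.fetchTime = fetchTime;
        this.prevFetchTime = prevFetchTime;
    }

    long getFetchTime() { return fetchTime; }
    long getPrevFetchTime() { return prevFetchTime; }
}

final class FetchTimeSketch {
    // Behaviour described in the report: the NEXT fetch time ends up in the index.
    static long indexedTimeAsReported(PageTimes page) {
        return page.getFetchTime();
    }

    // Behaviour the report asks for: index the time of the actual fetch.
    static long indexedTimeAsRequested(PageTimes page) {
        return page.getPrevFetchTime();
    }

    public static void main(String[] args) {
        long fetchedAt = 1_434_967_200_000L;                  // illustrative value only
        long nextFetch = fetchedAt + 30L * 24 * 60 * 60 * 1000; // 30 days later, also illustrative
        PageTimes page = new PageTimes(nextFetch, fetchedAt);
        System.out.println("as reported:  " + indexedTimeAsReported(page));
        System.out.println("as requested: " + indexedTimeAsRequested(page));
    }
}
```

With a 30-day refetch interval, the reported behaviour would date every indexed page 30 days into the future, which is presumably why it was noticed on the indexing side.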
[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler/weeklyreport" by CihadGuzel
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "GoogleSummerOfCode/SitemapCrawler/weeklyreport" page has been changed by CihadGuzel: https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport?action=diff&rev1=8&rev2=9 '''Title :''' Sitemap detection is done. - The robots.txt file is checked while the fetcher job is run. If the robots.txt file has any sitemap URLs, these are written to the database. A column called sitemap (stm) is added to the db schema. The URLs in the stm column will be parsed at a later stage. + robots.txt is a file on the website that can list the site's sitemap URLs, so the sitemap URL list of a website can be read from this file. + + Nutch reads the robots.txt file while the fetcher job is running. The file is checked by the new code block of the sitemap crawler. If it contains any sitemap URLs, these are written to the stm (sitemap) column of the webpage table in the database. + + The stm (sitemap) column is added to the webpage schema for the sitemap crawler. The URLs in the stm column will be parsed at a later stage. || '''Week :''' 3 & 4 (8 June 2015 - 21 June 2015) ||
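The detection step described in this report can be sketched as follows. This is a minimal sketch under assumptions, not the project's code: RobotsSitemapSketch and extractSitemaps are hypothetical names, and writing the result to the stm column is left out. It only shows how "Sitemap:" lines can be pulled out of the text of a robots.txt file.

```java
import java.util.ArrayList;
import java.util.List;

final class RobotsSitemapSketch {
    private static final String PREFIX = "sitemap:";

    // Collects the values of all "Sitemap:" lines found in the text of a robots.txt file.
    static List<String> extractSitemaps(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Simplification: everything after '#' is treated as a comment,
            // so a sitemap URL containing a fragment would be cut short.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            // Match the directive name case-insensitively.
            if (line.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
                String url = line.substring(PREFIX.length()).trim();
                if (!url.isEmpty()) {
                    sitemaps.add(url);
                }
            }
        }
        return sitemaps;
    }

    public static void main(String[] args) {
        String robots = String.join("\n",
                "User-agent: *",
                "Disallow: /private/",
                "Sitemap: http://www.example.com/sitemap1.xml",
                "sitemap: http://www.example.com/sitemap2.xml");
        // Prints the two sitemap URLs found above.
        System.out.println(extractSitemaps(robots));
    }
}
```

In the workflow the report describes, the returned list is what would be stored in the stm column for the host and parsed at a later stage.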
Re: Unsubscribe
Send an email to dev-unsubscr...@nutch.apache.org All of this information is available on the main ASF website at: https://www.apache.org/foundation/mailinglists.html Additionally, the appropriate emails are listed on the Nutch site with mailto links to do all the heavy lifting for you: https://nutch.apache.org/mailing_lists.html If the documentation on how to properly unsubscribe isn't clear please let us know how we can improve it. -- Jimmy On Sun, Jun 21, 2015 at 10:37 AM, Aaplavi Sharma wrote: > >
[jira] [Comment Edited] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596239#comment-14596239 ] Asitang Mishra edited comment on NUTCH-2038 at 6/22/15 5:09 PM: From what I understand the problem is that a url filter in nutch has a very simple interface (has no provision for content) and is only "fired" in the generator step. problems: [~chrismattmann]: 1> Cannot make it a part of the core, should be a plugin and be called as a general plugin from the core (right now in my patch, it is more visible than a general plugin). 2> Should be a url filter and not a scoring filter to preserve the simplicity and transparency of the methodology. [~wastl-nagel]: 1> Should not read content or call tika in the plugin as it will be a hadoop job and also not lightweight. 2> Should be a scoring filter as the interface in place already supports such an improvement. I may suggest that if we all agree to let it be a url filter (and that's completely up to you guys) then what I can do is either enhance the already present urlfilter interface or make an abstract class (which will be very generic and has a filter function that takes some args and a string), and call all the url filters from the parser as well, but this time not fire the original filter() function (keep it for the generator). Fire the new filter function from the parser. That way the only viable change in NUTCH will be that now the parser will also be calling urlfilters (and this will be very generic). That way we also don't need to read the crawl db or call tika for my specific filter. was (Author: asitang): From what I understand the problem is that a url filter in nutch has a very simple interface (has no provision for content) and is only "fired" in the generator step. problems: [~chrismattmann]: 1> Cannot make it a part of the core, should be a plugin and be called as a general plugin from the core. 2> Should be a url filter and not a scoring filter to preserve the simplicity and transparency of the methodology. [~wastl-nagel]: 1> Should not read content or call tika in the plugin as it will be a hadoop job and also not lightweight. 2> Should be a scoring filter as the interface in place already supports such an improvement. I may suggest that if we all agree to let it be a url filter (and that's completely up to you guys) then what I can do is either enhance the already present urlfilter interface or make an abstract class (which will be very generic and has a filter function that takes some args and a string), and call all the url filters from the parser as well, but this time not fire the original filter() function (keep it for the generator). Fire the new filter function from the parser. That way the only viable change in NUTCH will be that now the parser will also be calling urlfilters (and this will be very generic). That way we also don't need to read the crawl db or call tika for my specific filter. > Naive Bayes classifier based url filter > --- > > Key: NUTCH-2038 > URL: https://issues.apache.org/jira/browse/NUTCH-2038 > Project: Nutch > Issue Type: New Feature > Components: fetcher, injector, parser > Reporter: Asitang Mishra > Assignee: Chris A. Mattmann > Labels: memex, nutch > Fix For: 1.11 > > > A url filter that will filter out the urls (after the parsing stage, will > keep only those urls that contain some "hot words" provided again in a list.) > from the pages that are classified irrelevant by the classifier.
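One way to picture the proposal in this comment is the sketch below. It is an illustration under assumptions, not code from any patch on the issue: ContextAwareUrlFilter, HotWordUrlFilter and the "relevant" argument key are all hypothetical names. The one-argument filter() keeps the existing convention for URL filters (return the URL to keep it, null to drop it) and would stay with the generator; the two-argument overload stands for the new function that the parser would fire.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical abstract class along the lines the comment suggests.
abstract class ContextAwareUrlFilter {
    // Generator-side filter: return the URL to keep it, null to drop it.
    String filter(String url) {
        return url; // by default the generator-side behaviour is a pass-through
    }

    // Parser-side filter: receives the URL plus some generic arguments.
    abstract String filter(String url, Map<String, String> args);
}

// Keeps every outlink of a relevant page; for an irrelevant page it keeps
// only the outlinks whose URL contains one of the configured hot words.
final class HotWordUrlFilter extends ContextAwareUrlFilter {
    private final List<String> hotWords;

    HotWordUrlFilter(List<String> hotWords) {
        this.hotWords = hotWords;
    }

    @Override
    String filter(String url, Map<String, String> args) {
        // Assumption: the caller passes the classifier's verdict under "relevant".
        if (Boolean.parseBoolean(args.get("relevant"))) {
            return url;
        }
        String lowerUrl = url.toLowerCase(Locale.ROOT);
        for (String word : hotWords) {
            if (lowerUrl.contains(word.toLowerCase(Locale.ROOT))) {
                return url;
            }
        }
        return null;
    }
}
```

Because the classifier's verdict arrives as an argument, a filter written this way needs neither the crawl db nor a second Tika pass, which is the property the comment is after.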
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596239#comment-14596239 ] Asitang Mishra commented on NUTCH-2038: --- From what I understand the problem is that a url filter in nutch has a very simple interface (has no provision for content) and is only "fired" in the generator step. problems: [~chrismattmann]: 1> Cannot make it a part of the core, should be a plugin and be called as a general plugin from the core. 2> Should be a url filter and not a scoring filter to preserve the simplicity and transparency of the methodology. [~wastl-nagel]: 1> Should not read content or call tika in the plugin as it will be a hadoop job and also not lightweight. 2> Should be a scoring filter as the interface in place already supports such an improvement. I may suggest that if we all agree to let it be a url filter (and that's completely up to you guys) then what I can do is either enhance the already present urlfilter interface or make an abstract class (which will be very generic and has a filter function that takes some args and a string), and call all the url filters from the parser as well, but this time not fire the original filter() function (keep it for the generator). Fire the new filter function from the parser. That way the only viable change in NUTCH will be that now the parser will also be calling urlfilters (and this will be very generic). That way we also don't need to read the crawl db or call tika for my specific filter. > Naive Bayes classifier based url filter > --- > > Key: NUTCH-2038 > URL: https://issues.apache.org/jira/browse/NUTCH-2038 > Project: Nutch > Issue Type: New Feature > Components: fetcher, injector, parser > Reporter: Asitang Mishra > Assignee: Chris A. Mattmann > Labels: memex, nutch > Fix For: 1.11 > > > A url filter that will filter out the urls (after the parsing stage, will > keep only those urls that contain some "hot words" provided again in a list.) > from the pages that are classified irrelevant by the classifier.
[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler/weeklyreport" by CihadGuzel
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "GoogleSummerOfCode/SitemapCrawler/weeklyreport" page has been changed by CihadGuzel: https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport?action=diff&rev1=7&rev2=8 Comment: reproduce report1 '''Title :''' Sitemap url injection is done. - Sitemap list injection is provided on this week working. The Url path of sitemap files wanted to be injected can be defined from seed file. + Sitemap list injection was implemented in this week's work. The URL paths of the sitemap files to be injected can be defined in the seed file. In addition, some preparatory work for the following tasks is being done. Example: - If you have two sitemap files, you can define them in the seed file as follow: + If you have two sitemap files for "http://www.example.com/", you can define them in the seed file as follows: * ''http://www.example.com/ sitemaps: sitemap1.xml sitemap2.xml'' * ''http://www.example2.com/'' * ''http://www.example3.com/'' + + Then you can run InjecterJob, so that the sitemap URLs are injected into the db. The injected URLs are marked as sitemaps. || '''Week :''' 2 (1 June 2015 - 7 June 2015) ||
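The seed-file convention in this report can be illustrated with a small parser. This is a sketch under assumptions rather than the injector's actual code: SeedLineSketch and sitemapUrls are hypothetical names, and the sitemap entries are assumed to be paths relative to the seed URL, which the report does not state explicitly.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

final class SeedLineSketch {
    // Returns the absolute sitemap URLs declared on one seed line, or an
    // empty list when the line carries no "sitemaps:" marker.
    static List<String> sitemapUrls(String seedLine) {
        String[] tokens = seedLine.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        if (tokens.length < 2 || !tokens[1].equals("sitemaps:")) {
            return result;
        }
        URI base = URI.create(tokens[0]);
        for (int i = 2; i < tokens.length; i++) {
            // Assumption: each entry is resolved against the seed URL.
            result.add(base.resolve(tokens[i]).toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] seeds = {
            "http://www.example.com/ sitemaps: sitemap1.xml sitemap2.xml",
            "http://www.example2.com/",
            "http://www.example3.com/"
        };
        for (String seed : seeds) {
            System.out.println(seed + " -> " + sitemapUrls(seed));
        }
    }
}
```

Only the first seed line yields sitemap URLs here; the other two are ordinary seeds and would be injected as usual.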
[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler" by CihadGuzel
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "GoogleSummerOfCode/SitemapCrawler" page has been changed by CihadGuzel: https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler?action=diff&rev1=8&rev2=9 === Documentation === Documents will be added here. + + === Source Code === + + [[https://github.com/cguzel/nutch-sitemapCrawler|source code on github]] === Jira Issues === * https://issues.apache.org/jira/browse/NUTCH-1741
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595470#comment-14595470 ] Sebastian Nagel commented on NUTCH-2038: The scoring filter interface is complex, you're right, and not easy to understand. But scoring filters are powerful and can do a lot of "magic" aside from pure "scoring", e.g., limiting crawl by linkage depth and focused crawling. The ScoringFilter interface is complex because it must fit into the Nutch workflow. In 2.x the interface is simpler because the workflow and the underlying data structures are simpler (one web table vs. segments with multiple subdirectories). Plugins should be lightweight in terms of using resources and it's surely not ideal if they run MapReduce jobs (findDatumForUrl must do this in 1.x) or fetch content again via Tika. > Naive Bayes classifier based url filter > --- > > Key: NUTCH-2038 > URL: https://issues.apache.org/jira/browse/NUTCH-2038 > Project: Nutch > Issue Type: New Feature > Components: fetcher, injector, parser > Reporter: Asitang Mishra > Assignee: Chris A. Mattmann > Labels: memex, nutch > Fix For: 1.11 > > > A url filter that will filter out the urls (after the parsing stage, will > keep only those urls that contain some "hot words" provided again in a list.) > from the pages that are classified irrelevant by the classifier.