[jira] [Commented] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time

2015-06-22 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596832#comment-14596832
 ] 

Michael Joyce commented on NUTCH-2045:
--

+1 this is great

> index-basic incorrect assignment of next fetch time (page.getFetchTime()) as 
> page fetch time
> 
>
> Key: NUTCH-2045
> URL: https://issues.apache.org/jira/browse/NUTCH-2045
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 2.3, 1.10
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11, 2.3.1
>
> Attachments: NUTCH-2045.patch
>
>
> The issue here was flagged up when using the indexer-elastic plugin, where the 
> page fetch time is incorrectly assigned the NEXT fetch time as opposed to the 
> time at which the page was actually fetched (prevFetchTime).
> The ML thread for this issue can be found below
> http://www.mail-archive.com/user%40nutch.apache.org/msg13661.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time

2015-06-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2045:

Attachment: NUTCH-2045.patch

Patch for 2.X HEAD



[jira] [Updated] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time

2015-06-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2045:

Patch Info: Patch Available



[jira] [Created] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time

2015-06-22 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2045:
---

 Summary: index-basic incorrect assignment of next fetch time 
(page.getFetchTime()) as page fetch time
 Key: NUTCH-2045
 URL: https://issues.apache.org/jira/browse/NUTCH-2045
 Project: Nutch
  Issue Type: Bug
  Components: plugin
Affects Versions: 1.10, 2.3
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11, 2.3.1


The issue here was flagged up when using the indexer-elastic plugin, where the 
page fetch time is incorrectly assigned the NEXT fetch time as opposed to the 
time at which the page was actually fetched (prevFetchTime).
The ML thread for this issue can be found below
http://www.mail-archive.com/user%40nutch.apache.org/msg13661.html
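
The mix-up described above can be illustrated with a minimal sketch. It is 
not the Nutch code or the attached patch: the Page class, the two functions 
and the "tstamp" field name are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a crawled page record (times in epoch millis)."""
    fetch_time: int       # when the page is next due to be fetched
    prev_fetch_time: int  # when the page was actually last fetched


def index_fields_before(page: Page) -> dict:
    # Behaviour described in the report: the NEXT fetch time ends up indexed.
    return {"tstamp": page.fetch_time}


def index_fields_after(page: Page) -> dict:
    # Intended behaviour: index the time the page was actually fetched.
    return {"tstamp": page.prev_fetch_time}


page = Page(fetch_time=1_437_000_000_000, prev_fetch_time=1_434_000_000_000)
print(index_fields_before(page))  # timestamp lies in the future
print(index_fields_after(page))   # timestamp of the real fetch
```

With the first variant every indexed document appears to have been fetched at 
its next scheduled fetch, which is what made the problem visible in the index.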





[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler/weeklyreport" by CihadGuzel

2015-06-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "GoogleSummerOfCode/SitemapCrawler/weeklyreport" page has been changed by 
CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport?action=diff&rev1=8&rev2=9

  
  '''Title :''' Sitemap detection is done. 
  
- Robot.txt file is checked while fetcher job is run. If robot.txt file have 
any sitemap urls, these are written to database. A column called sitemap(stm) 
for sitemap is added to db schema. The urls in stm column from db will be 
parsed at the next time.
+ robots.txt is a file on the website that can list the site's sitemap URLs, 
so the sitemap URLs of a website can be read from this file. 
+ 
+ Nutch reads the robots.txt file while the fetcher job is running. The new 
sitemap crawler code checks the file; if it lists any sitemap URLs, these are 
written to the stm (sitemap) column of the webpage table in the database.
+ 
+ The stm (sitemap) column was added to the webpage schema for the sitemap 
crawler. The URLs in the stm column of the db will be parsed the next time 
around.
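
The detection step in this report can be sketched as follows. This is a 
minimal illustration, not the sitemap crawler's code: the function name is 
hypothetical, and treating "#" as the start of a comment is a simplification.

```python
from typing import List


def extract_sitemap_urls(robots_txt: str) -> List[str]:
    """Collect the values of all 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")     # split at the first colon only
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


robots = """User-agent: *
Disallow: /private
Sitemap: http://www.example.com/sitemap1.xml
sitemap: http://www.example.com/sitemap2.xml  # field name is case-insensitive
"""
# In the report, the resulting list is what gets stored in the stm column.
print(extract_sitemap_urls(robots))
```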
  
  
  || '''Week :''' 3 & 4 (8 June 2015 - 21 June 2015) ||


Re: Unsubscribe

2015-06-22 Thread Michael Joyce
Send an email to dev-unsubscr...@nutch.apache.org

All of this information is available on the main ASF website at:
https://www.apache.org/foundation/mailinglists.html

Additionally, the appropriate emails are listed on the Nutch site with
mailto links to do all the heavy lifting for you:
https://nutch.apache.org/mailing_lists.html

If the documentation on how to properly unsubscribe isn't clear please let
us know how we can improve it.


-- Jimmy

On Sun, Jun 21, 2015 at 10:37 AM, Aaplavi Sharma  wrote:

>
>


[jira] [Comment Edited] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-22 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596239#comment-14596239
 ] 

Asitang Mishra edited comment on NUTCH-2038 at 6/22/15 5:09 PM:


From what I understand, the problem is that a URL filter in Nutch has a very 
simple interface (it has no provision for content) and is only "fired" in the 
generator step.

Problems raised:
[~chrismattmann]: 
1. It cannot be part of the core; it should be a plugin and be called as a 
general plugin from the core (right now, in my patch, it is more visible than 
a general plugin).
2. It should be a URL filter and not a scoring filter, to preserve the 
simplicity and transparency of the methodology.
[~wastl-nagel]: 
1. It should not read content or call Tika in the plugin, as that would mean 
running a Hadoop job and would not be lightweight. 
2. It should be a scoring filter, as the interface in place already supports 
such an improvement.


If we all agree to let it be a URL filter (and that's completely up to you 
guys), I can either enhance the existing urlfilter interface or make an 
abstract class (which will be very generic and has a filter function that 
takes some args and a string), and call all the URL filters from the parser 
as well. This time the parser would not fire the original filter() function 
(that stays for the generator) but the new filter function. That way the only 
change in Nutch will be that the parser also calls the URL filters (and this 
will be very generic), and we don't need to read the crawl db or call Tika 
for my specific filter.
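
The two-hook idea proposed here could look roughly like the sketch below. It 
is written in Python for brevity rather than as Nutch Java code, and every 
name in it (ContextUrlFilter, filter_with_context, RequireAnchorText, 
run_parser_filters) is hypothetical. The only convention borrowed from the 
existing interface is that a filter returns the URL to keep it and nothing to 
reject it.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Optional


class ContextUrlFilter(ABC):
    """Hypothetical base class; none of these names come from Nutch."""

    def filter(self, url: str) -> Optional[str]:
        # Original URL-only hook, still used by the generator. Accepts by default.
        return url

    @abstractmethod
    def filter_with_context(self, url: str, **context) -> Optional[str]:
        """New hook for the parser: return the URL to keep it, None to drop it."""


class RequireAnchorText(ContextUrlFilter):
    """Toy filter: drop outlinks that arrive without anchor text."""

    def filter_with_context(self, url: str, **context) -> Optional[str]:
        return url if context.get("anchor", "").strip() else None


def run_parser_filters(filters: Iterable[ContextUrlFilter], url: str,
                       **context) -> Optional[str]:
    # Chain the filters; stop as soon as one of them rejects the URL.
    for f in filters:
        url = f.filter_with_context(url, **context)
        if url is None:
            return None
    return url


filters = [RequireAnchorText()]
print(run_parser_filters(filters, "http://www.example.com/a", anchor="About"))
print(run_parser_filters(filters, "http://www.example.com/a", anchor=""))
```

Keeping filter() untouched means existing URL filters keep working in the 
generator, while only filters that care about extra arguments implement the 
new hook.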


> Naive Bayes classifier based url filter
> ---
>
> Key: NUTCH-2038
> URL: https://issues.apache.org/jira/browse/NUTCH-2038
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, injector, parser
>Reporter: Asitang Mishra
>Assignee: Chris A. Mattmann
>  Labels: memex, nutch
> Fix For: 1.11
>
>
> A URL filter that, after the parsing stage, filters out the URLs from pages 
> that the classifier marks as irrelevant, keeping only those URLs that contain 
> some "hot words" provided in a list.
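
The behaviour in this description can be sketched as below. It is only an 
illustration under assumptions: filter_outlinks is a hypothetical name, and 
toy_classifier is a placeholder standing in for the trained Naive Bayes 
model, which is not reproduced here.

```python
from typing import Callable, Iterable, List


def filter_outlinks(page_text: str, outlinks: Iterable[str],
                    is_relevant: Callable[[str], bool],
                    hot_words: Iterable[str]) -> List[str]:
    """Keep all outlinks of relevant pages; otherwise keep only hot-word URLs."""
    if is_relevant(page_text):
        return list(outlinks)
    lowered = [w.lower() for w in hot_words]
    return [u for u in outlinks if any(w in u.lower() for w in lowered)]


def toy_classifier(text: str) -> bool:
    # Placeholder for the real classifier (assumption for this demo only).
    return "nutch" in text.lower()


links = ["http://www.example.com/crawler-news", "http://www.example.com/ads"]
print(filter_outlinks("All about Nutch", links, toy_classifier, ["crawler"]))
print(filter_outlinks("Unrelated page", links, toy_classifier, ["crawler"]))
```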





[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-22 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596239#comment-14596239
 ] 

Asitang Mishra commented on NUTCH-2038:
---

From what I understand, the problem is that a URL filter in Nutch has a very 
simple interface (it has no provision for content) and is only "fired" in the 
generator step.

Problems raised:
[~chrismattmann]: 
1. It cannot be part of the core; it should be a plugin and be called as a 
general plugin from the core.
2. It should be a URL filter and not a scoring filter, to preserve the 
simplicity and transparency of the methodology.
[~wastl-nagel]: 
1. It should not read content or call Tika in the plugin, as that would mean 
running a Hadoop job and would not be lightweight. 
2. It should be a scoring filter, as the interface in place already supports 
such an improvement.


If we all agree to let it be a URL filter (and that's completely up to you 
guys), I can either enhance the existing urlfilter interface or make an 
abstract class (which will be very generic and has a filter function that 
takes some args and a string), and call all the URL filters from the parser 
as well. This time the parser would not fire the original filter() function 
(that stays for the generator) but the new filter function. That way the only 
change in Nutch will be that the parser also calls the URL filters (and this 
will be very generic), and we don't need to read the crawl db or call Tika 
for my specific filter.



[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler/weeklyreport" by CihadGuzel

2015-06-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "GoogleSummerOfCode/SitemapCrawler/weeklyreport" page has been changed by 
CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport?action=diff&rev1=7&rev2=8

Comment:
reproduce report1

  
  '''Title :''' Sitemap url injection is done.
  
- Sitemap list injection is provided on this week working. The Url path of 
sitemap files wanted to be injected can be defined from seed file. 
+ Sitemap list injection is provided on this week working. The url path of 
sitemap files wanted to be injected can be defined from seed file. 
  
  In addition, Some preparatory work related to following works is being done. 
  
  
  Example:
  
- If you have two sitemap files, you can define them in the seed file as follow:
+ If you have two sitemap files for "http://www.example.com/", you can define 
them in the seed file as follows:
  
   * ''http://www.example.com/ sitemaps: sitemap1.xml sitemap2.xml'' 
   * ''http://www.example2.com/
   * ''http://www.example3.com/''
+ 
+ Then you can run InjectorJob, which injects the sitemap URLs into the db. 
The injected URLs are marked as sitemaps.
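
Reading one seed line in the format shown above might be sketched as follows. 
The function name is hypothetical, and resolving the sitemap file names 
against the site URL is an assumption; the report does not say how relative 
names are handled.

```python
import re
from typing import List, Tuple
from urllib.parse import urljoin


def parse_seed_line(line: str) -> Tuple[str, List[str]]:
    """Split 'URL sitemaps: a.xml b.xml' into the URL and its sitemap URLs."""
    parts = re.split(r"\s+sitemaps:\s*", line.strip(), maxsplit=1)
    url = parts[0]
    names = parts[1].split() if len(parts) == 2 else []
    # Assumption: sitemap names are relative to the seed URL.
    return url, [urljoin(url, name) for name in names]


print(parse_seed_line("http://www.example.com/ sitemaps: sitemap1.xml sitemap2.xml"))
print(parse_seed_line("http://www.example3.com/"))
```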
  
  || '''Week :''' 2 (1 June 2015 - 7 June 2015) ||
  


[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler" by CihadGuzel

2015-06-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "GoogleSummerOfCode/SitemapCrawler" page has been changed by CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler?action=diff&rev1=8&rev2=9

  === Documentation ===
  Documents will be added here.
  
+ 
+ === Source Code ===
+ 
+ [[https://github.com/cguzel/nutch-sitemapCrawler|source code on github]]
  === Jira Issues ===
  
   * https://issues.apache.org/jira/browse/NUTCH-1741


[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

2015-06-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595470#comment-14595470
 ] 

Sebastian Nagel commented on NUTCH-2038:


The scoring filter interface is complex, you're right, and not easy to 
understand. But scoring filters are powerful and can do a lot of "magic" aside 
from pure "scoring", e.g., limiting crawl by linkage depth and focused 
crawling. The ScoringFilter interface is complex because it must fit into the 
Nutch workflow. In 2.x the interface is simpler because the workflow and the 
underlying data structures are simpler (one web table vs. segments with 
multiple subdirectories). Plugins should be lightweight in terms of using 
resources and it's surely not ideal if they run MapReduce jobs (findDatumForUrl 
must do this in 1.x) or fetch content again via Tika.
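
As a small illustration of the "magic" mentioned above, limiting a crawl by 
linkage depth only needs a depth value passed from a page to its outlinks. 
The sketch below is not the ScoringFilter interface; propagate_depth and its 
parameters are hypothetical names.

```python
from typing import Dict, Iterable


def propagate_depth(parent_depth: int, outlinks: Iterable[str],
                    max_depth: int) -> Dict[str, int]:
    """Assign depth parent+1 to each outlink, or drop them past max_depth."""
    child_depth = parent_depth + 1
    if child_depth > max_depth:
        return {}  # outlinks beyond the limit are never scheduled
    return {url: child_depth for url in outlinks}


print(propagate_depth(0, ["http://www.example.com/a"], max_depth=1))
print(propagate_depth(1, ["http://www.example.com/b"], max_depth=1))
```

Because the decision uses only metadata carried along with the link, it 
stays lightweight: no extra job and no second look at the page content.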
