[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler" by CihadGuzel

2015-08-23 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "GoogleSummerOfCode/SitemapCrawler" page has been changed by CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler?action=diff&rev1=10&rev2=11

* Week2 (1June-7June): Sitemap detection will be done. FetcherJob will be updated for sitemaps.
* Week3-4 (8June-21June): The parser process will be updated for sitemap file parsing. New parser plugins can be developed.
* Week5 (22June-28June): DbUpdaterJob will be updated for sitemaps.
-   * Midterm(26June-3 July): By this point the sitemap life cycle will have been implemented in outline. What has been done and what remains will be evaluated. The sitemap crawler will have been brought to a simply working state.
+   * Midterm(26June-3 July): Up to this stage the sitemap life cycle has been developed according to the outline, and the sitemap crawler runs in a basic form. The work done so far and the work remaining will be evaluated.
* Week6-7 (29June-12July): Sitemap ranking mechanism will be developed.
* Week8 (13July-19July): Sitemap black list, sitemap file detection and error detection will be implemented
* Week9 (20July-26July): Frequency mechanism will be developed


[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler/weeklyreport" by CihadGuzel

2015-08-23 Thread Apache Wiki

The "GoogleSummerOfCode/SitemapCrawler/weeklyreport" page has been changed by 
CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport?action=diff&rev1=11&rev2=12

Comment:
The weekly report has been updated

  
  
  = Week : 5 (22 June 2015 - 28 June 2015) =
- ...
  
+ '''Title :''' DbUpdater is updated 
+ 
+ DbUpdaterJob is updated for sitemaps. Detected sitemaps are written to the crawldb as new rows, so they will be crawled in the next crawl cycle.
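The step above can be sketched as a toy model: a simplified crawldb keyed by URL, where each detected sitemap is added as a new unfetched row. All names here are illustrative, not Nutch's actual Gora/DbUpdaterJob API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy sketch of the Week 5 change: detected sitemap URLs are
 *  written to the crawldb as new rows so the next cycle fetches them. */
public class SitemapDbUpdate {

    /** Toy crawldb: url -> status ("unfetched" rows are picked up next cycle). */
    static Map<String, String> crawlDb = new LinkedHashMap<>();

    /** Add each detected sitemap URL as a new unfetched row,
     *  leaving rows that already exist untouched. */
    public static void addSitemaps(List<String> sitemapUrls) {
        for (String url : sitemapUrls) {
            crawlDb.putIfAbsent(url, "unfetched");
        }
    }

    public static void main(String[] args) {
        crawlDb.put("http://example.org/", "fetched");
        addSitemaps(List.of("http://example.org/sitemap.xml"));
        // the sitemap becomes a new unfetched row; the existing row is untouched
        assert "unfetched".equals(crawlDb.get("http://example.org/sitemap.xml"));
        assert "fetched".equals(crawlDb.get("http://example.org/"));
    }
}
```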
+ 
+ = Week : 6 & 7 (29 June 2015 - 12 July 2015) =
+ 
+ '''Title :''' Sitemap parse plugin was abandoned.
+ 
+ The parser plugin was abandoned after consultation with the mentors; the parse process was embedded in the core instead. The sitemap parser will be activated when the "sitemap" parameter is given.
+ The midterm report was also prepared. Up to this stage the sitemap life cycle has been developed according to the outline, and the sitemap crawler runs in a basic form. The work done so far and the work remaining has been evaluated.
+ 
+ 
+ = Week : 8 (13 July 2015 - 19 July 2015) =
+ 
+ '''Title :''' Sitemap file detection 
+ 
+ Sitemap file detection is implemented. Detection is activated according to the parameters given at fetch time.
+ 
+ = Week : 9 (20 July 2015 - 26 July 2015) =
+ 
+ '''Title :''' frequency & priority
+ 
+ A processSitemapParse function was created in ParseUtil, and the parser process was updated for sitemaps. The fetch interval is updated according to the frequency value from the sitemap.
+ A priority field was also added to the crawldb for the priority value from the sitemap.
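The fetch-interval update from the sitemap frequency could look roughly like this. It is a sketch under the assumption that the frequency values are the `<changefreq>` values defined by the sitemaps.org protocol; the class and method names are hypothetical, not the actual ParseUtil code.

```java
import java.util.Map;

/** Illustrative sketch (not the actual Nutch code): map a sitemap
 *  <changefreq> value to a fetch interval in seconds. */
public class ChangeFreqInterval {

    // changefreq values defined by the sitemaps.org protocol
    private static final Map<String, Integer> INTERVALS = Map.of(
            "always", 60,            // re-fetch as often as practical
            "hourly", 3600,
            "daily", 86400,
            "weekly", 7 * 86400,
            "monthly", 30 * 86400,
            "yearly", 365 * 86400);

    /** Returns the fetch interval for a changefreq value, or the given
     *  default when the value is absent or unrecognized (e.g. "never"). */
    public static int fetchInterval(String changeFreq, int defaultInterval) {
        if (changeFreq == null) return defaultInterval;
        return INTERVALS.getOrDefault(changeFreq.toLowerCase(), defaultInterval);
    }

    public static void main(String[] args) {
        assert fetchInterval("daily", 2592000) == 86400;
        assert fetchInterval("never", 2592000) == 2592000; // leave the default
    }
}
```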
+ 
+ 
+ = Week : 10 & 11 (27 July 2015 - 9 August 2015) =
+ 
+ '''Title :''' Review & code cleaning
+ 
+ Some improvements were made according to my mentor's review, and code cleaning was done. Sitemap score logic was not developed, because it would affect the current Nutch scoring logic; it can be added later once that has been evaluated.
+ 
+ = Week : 12 (10 August 2015 - 17 August 2015) =
+ 
+ '''Title :''' Testing
+ 
+ Some problems in the Nutch test classes were fixed. Sitemap tests were prepared, and documentation for the sitemap crawler was written.
+ 
+ = Week : 13 (18 August 2015 - 21 August 2015) =
+ 
+ '''Title :''' Final evaluation
+ 
+ The final document was prepared.
+ 


[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler/weeklyreport" by CihadGuzel

2015-08-23 Thread Apache Wiki

The "GoogleSummerOfCode/SitemapCrawler/weeklyreport" page has been changed by 
CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport?action=diff&rev1=12&rev2=13

  
  '''Title :''' Final evaluation
  
- The final document was prepared.
+ The final document was prepared. The Nutch wiki was updated.
  


[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler" by CihadGuzel

2015-08-23 Thread Apache Wiki

The "GoogleSummerOfCode/SitemapCrawler" page has been changed by CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler?action=diff&rev1=11&rev2=12

Comment:
Final report link

  === Reports ===
   *  
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport|Weekly
 Report]]
   *  
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/midtermreport|Midterm
 Report]]
+  *  
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/midtermreport|Final
 Report]]
  
  === Documentation ===
  Documents will be added here.


[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler" by CihadGuzel

2015-08-23 Thread Apache Wiki

The "GoogleSummerOfCode/SitemapCrawler" page has been changed by CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler?action=diff&rev1=12&rev2=13

  === Reports ===
   *  
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport|Weekly
 Report]]
   *  
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/midtermreport|Midterm
 Report]]
-  *  
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/midtermreport|Final
 Report]]
+  *  
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/finalreport|Final
 Report]]
  
  === Documentation ===
  Documents will be added here.


[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler/finalreport" by CihadGuzel

2015-08-23 Thread Apache Wiki

The "GoogleSummerOfCode/SitemapCrawler/finalreport" page has been changed by 
CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/finalreport

Comment:
final report 

New page:
= Support Sitemap Crawler in Nutch 2.x Midterm Report =

||'''Title :'''GSOC 2015 Midterm Report||
||'''Reporting Date :'''25th June 2015||
||'''Issue :''' 
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler|NUTCH-1741 - 
Support Sitemap Crawler in Nutch 2.x]]||
||'''Student :'''Cihad Güzel - cguz...@gmail.com||
||'''Mentors :'''[[https://wiki.apache.org/nutch/LewisJohnMcgibbney|Lewis 
John McGibbney]], [[https://wiki.apache.org/nutch/talat|Talat Uyarer]]||
||'''Development Codebase :'''[[https://github.com/cguzel/nutch-sitemapCrawler|Github Repo Url]]||

<>

== Abstract ==

In the Nutch crawler, URLs can be obtained only from pages that were scanned before, and this method is expensive. An up-to-date sitemap file, however, can list all of a site's URLs, so the sitemap files of a website should be crawled. Thanks to this development, the Nutch project will have sitemap crawler support.

== Introduction ==
A sitemap is a file that guides crawlers over a website, and it comes in several formats (simple text, XML, RSS 2.0, Atom 0.3 & 1.0).
An up-to-date sitemap file can list all of a site's URLs, so websites can be crawled faster by means of the sitemap crawler that will be developed. In addition, information such as the "change frequency", "last update time" and "priority" of the pages can be detected. In short, a better URL list will be obtained easily and quickly from the sitemap file thanks to this software. Another advantage is that the process stays under the user's control. Finally, when the project is concluded:

 * The Nutch project will have sitemap crawler support thanks to this development.
 * A better URL list will be obtained by filtering sitemaps according to quality criteria.
 * Unwanted sitemaps can be ignored.
 * The management and configuration of the sitemap crawler are under the user's control.

== Project Details ==
The aim is to strengthen the Nutch project with sitemap crawler support. The main target is to detect sitemaps that contain correct URLs and to crawl them; with a sitemap crawler it is easy and fast to find correct URLs. The software will make the following features possible:

 1. Sitemap detection: sitemap files will be detected automatically, if available.
 * Sitemap list injection: sitemap URLs will be injected by using the Nutch injector.
 * A "change frequency" mechanism must be supported by the crawler.
 * Multi-sitemap support.
 * Sitemap constraints: a sitemap file may be at most 10 MB in size and contain at most 50,000 URLs.
 * Sitemaps must contain only inlinks; outlinks must be ignored.
 * The sitemap crawler is part of the Nutch life cycle [3]. It is designed according to these cases:
   * Sitemap URLs can be injected from the seed list.
   * Sitemap files can be detected automatically from crawled sites.
   * The user may want to crawl only sitemaps.
   * The user may want to crawl all URLs except sitemaps.
   * A sitemap file can reference another sitemap file.
   * A sitemap file can be in zip format.
   * A sitemap file may be larger than 50 MB; for this case some limits must be defined.
   * A sitemap file may include more than 50,000 URLs; for this case some limits must be defined.
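The size and URL-count constraints listed above can be expressed as a small check. This is a sketch with hypothetical names, not the project's actual code.

```java
/** Illustrative sketch of the sitemap constraints listed above:
 *  a sitemap file may be at most 10 MB and contain at most 50,000 URLs. */
public class SitemapLimits {
    static final long MAX_BYTES = 10L * 1024 * 1024; // 10 MB
    static final int MAX_URLS = 50_000;

    /** True if the sitemap is within both limits. */
    public static boolean withinLimits(long sizeBytes, int urlCount) {
        return sizeBytes <= MAX_BYTES && urlCount <= MAX_URLS;
    }

    public static void main(String[] args) {
        assert withinLimits(1024, 100);
        assert !withinLimits(MAX_BYTES + 1, 100); // file too large
        assert !withinLimits(1024, 50_001);       // too many URLs
    }
}
```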

=== The advantages of the process of developing the project ===

 1. The new features to be developed can be integrated easily thanks to the Nutch plugin design and the Nutch life cycle.
 * The current Nutch plugins can be used.
 * There is earlier work on sitemap crawling in the Nutch project (NUTCH-1741 [1], NUTCH-1465 [2]); the process is improved by taking the weak and strong sides of that work into account.

== How does Nutch 2.x process sitemaps? ==

There are two use cases supported in Nutch's sitemap processing:
 1. Sitemaps are considered "remote seed lists". Crawl administrators can prepare a list of sitemap links and fetch only those sitemap pages. This suits targeted crawls of specific hosts well. The sitemap URLs are directly injected, fetched and parsed when the "-sitemap" parameter is passed. Nutch uses the Crawler Commons project for parsing sitemaps.
 2. For an open web crawl it is not possible to track each host and collect the sitemap links manually. Nutch automatically gets the sitemaps for all the hosts seen in the crawls and injects the URLs from those sitemaps into the crawldb when the "-stmDetect" parameter is passed at fetch time. This requires a list of all hosts seen throughout the duration of the crawl: Nutch's HostDb stores all the hosts seen in the long crawl, and the link to each host's robots.txt is generated by pre-pending "http://".
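The host-to-robots.txt step of the second use case can be sketched as follows; the class and method names are illustrative, not Nutch's HostDb API.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Sketch of the second use case above: HostDb stores bare host names
 *  seen during the crawl, and the robots.txt link for each host is
 *  generated by pre-pending "http://". */
public class RobotsFromHostDb {

    /** Build the robots.txt URL for one host, as described above. */
    public static String robotsTxtUrl(String host) {
        return "http://" + host + "/robots.txt";
    }

    /** Map every host seen in the crawl to its robots.txt link. */
    public static List<String> robotsLinks(List<String> hosts) {
        return hosts.stream()
                    .map(RobotsFromHostDb::robotsTxtUrl)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> links = robotsLinks(List.of("example.org", "example.com"));
        assert links.get(0).equals("http://example.org/robots.txt");
        assert links.size() == 2;
    }
}
```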

[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler/finalreport" by CihadGuzel

2015-08-23 Thread Apache Wiki

The "GoogleSummerOfCode/SitemapCrawler/finalreport" page has been changed by 
CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/finalreport?action=diff&rev1=1&rev2=2

Comment:
final report 

- = Support Sitemap Crawler in Nutch 2.x Midterm Report =
+ = Support Sitemap Crawler in Nutch 2.x Final Report =
  
- ||'''Title :'''GSOC 2015 Midterm Report||
+ ||'''Title :'''GSOC 2015 Final Report||
- ||'''Reporting Date :'''25th June 2015||
+ ||'''Reporting Date :'''23rd August 2015||
  ||'''Issue :''' 
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler|NUTCH-1741 - 
Support Sitemap Crawler in Nutch 2.x]]||
  ||'''Student :'''Cihad Güzel - cguz...@gmail.com||
  ||'''Mentors :'''[[https://wiki.apache.org/nutch/LewisJohnMcgibbney|Lewis 
John McGibbney]], [[https://wiki.apache.org/nutch/talat|Talat Uyarer]]||


[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler" by CihadGuzel

2015-08-23 Thread Apache Wiki

The "GoogleSummerOfCode/SitemapCrawler" page has been changed by CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler?action=diff&rev1=13&rev2=14

* Week6-7 (29June-12July): Sitemap ranking mechanism will be developed.
* Week8 (13July-19July): Sitemap black list, sitemap file detection and error detection will be implemented
* Week9 (20July-26July): Frequency mechanism will be developed
-   * Week10 (27July-2Agust): The filter plugins will be updated or new filter 
plugins is will be developed.
+   * Week10 (27July-2August): The filter plugins will be updated or new filter plugins will be developed.
-   * Week11 (3Agust-9Agust): Code review and code cleaning.
+   * Week11 (3August-9August): Code review and code cleaning.
-   * Week12-13 (10Agust-23Agust): Further refine tests and documentation for 
the whole project.
+   * Week12-13 (10August-23August): Further refine tests and documentation for 
the whole project.
  
  
'''Features that will be developed after GSOC:''' Sitemap crawler report 
page, Sitemap monitoring page, Video Sitemaps crawler.


[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-08-23 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1679:

Attachment: NUTCH-1679-2.patch

I have another solution.
With a new link, in DbUpdaterReducer we only add the url: no status, no fetch time or any other info.
 - So if this link already exists in the database, we don't override anything.
 - Otherwise it is actually a new link; it will have status = 0 (the default value) and we will initialize it (set status, fetch time, ...) in the Generator instead.
I tested it with the HBase backend on nutch-2.3.
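The idea can be illustrated with a toy in-memory store; all names here are hypothetical and merely model the patch's ordering (the DbUpdater step records only the bare URL, and the Generator later initializes rows still at the default status).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the patch idea above: a new link never overwrites an
 *  existing row, and rows left at the default status are initialized
 *  by the Generator step. */
public class LinkUpdateSketch {
    static final int STATUS_DEFAULT = 0;   // row exists but is uninitialized
    static final int STATUS_UNFETCHED = 1;
    static final int STATUS_FETCHED = 2;

    static Map<String, Integer> db = new HashMap<>();

    /** DbUpdater step: record the link without touching existing state. */
    static void addLink(String url) {
        db.putIfAbsent(url, STATUS_DEFAULT);
    }

    /** Generator step: initialize rows that are still at the default. */
    static void generate() {
        db.replaceAll((url, status) ->
                status == STATUS_DEFAULT ? STATUS_UNFETCHED : status);
    }

    public static void main(String[] args) {
        db.put("http://a/", STATUS_FETCHED); // A was crawled in cycle 1
        addLink("http://a/");                // B links back to A: no override
        addLink("http://b/");                // genuinely new link
        generate();
        assert db.get("http://a/") == STATUS_FETCHED;   // A's state survives
        assert db.get("http://b/") == STATUS_UNFETCHED; // B is initialized
    }
}
```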

> UpdateDb using batchId, link may override crawled page.
> ---
>
> Key: NUTCH-1679
> URL: https://issues.apache.org/jira/browse/NUTCH-1679
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Tien Nguyen Manh
>Priority: Critical
> Fix For: 2.3.1
>
> Attachments: NUTCH-1679-2.patch, NUTCH-1679.patch
>
>
> The problem is in the HBase store; not sure about other stores.
> Suppose at the first crawl cycle we crawl link A, then get an outlink B.
> In the second cycle we crawl link B, which also has a link pointing to A.
> In the second updatedb we load only page B from the store, and will add A as a new
> link because it doesn't know A already exists in the store, and so will override A.
> UpdateDb must be run without batchId, or we must set additionsAllowed=false.
> Here is the code for a new page:
>   page = new WebPage();
>   schedule.initializeSchedule(url, page);
>   page.setStatus(CrawlStatus.STATUS_UNFETCHED);
>   try {
>     scoringFilters.initialScore(url, page);
>   } catch (ScoringFilterException e) {
>     page.setScore(0.0f);
>   }
> The new page will override the old page's status, score, fetchTime, fetchInterval,
> retries and metadata[CASH_KEY].
>  - I think we can change something here so that the new page will only update one
> column, for example 'link', and if it is really a new page we can initialize
> all the above fields in the generator.
> - Or we add a checkAndPut operation to the store, so that when adding a new page
> we check whether it already exists first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)