Re: [WELCOME] Feng Lu as Apache Nutch PMC and Committer

2013-03-17 Thread feng lu
Thanks a lot to everyone for inviting me.

I'm a software engineer in China, and I have been using Apache Nutch for three
years. In our team, I am mainly responsible for modifying Nutch 1.x to suit
the requirements of our database, MongoDB. I have also written a simple database
abstraction layer, similar to Apache Gora, to adapt to different databases. In
the process, I found myself liking these places, @user @dev @jira, more and
more, because there I can get help from others and others can get help from
me. Finally, I am also very pleased to make some contributions to
Apache Nutch.

A question that has been troubling me for a long time is what the target of
Nutch 1.x is. Is Nutch 1.x just a transitional version on the way to Nutch 2.x,
or can they coexist because Nutch 1.x has a different data processing method
from Nutch 2.x? As Julien said, Nutch 1.x is great for batch processing and
2.x for large scale processing. Perhaps, as more and more people use NoSQL as
their back-end DB, the developers should focus more on the development of
Nutch 2.x, to ensure its stability and improve its functionality.

Best Regards
Feng


[jira] [Commented] (NUTCH-1533) Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage

2013-03-17 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604614#comment-13604614
 ] 

lufeng commented on NUTCH-1533:
---

Hi Lewis

I'm sorry, I did not make it clear. In my opinion, prevFetchTime and 
prevModifiedTime are used together. Either both are set to 0L, when the 
status is CrawlStatus.STATUS_RETRY or CrawlStatus.STATUS_GONE, or both are 
set to a value, when the status is CrawlStatus.NOTMODIFIED.

Yes, you are right, both methods should set prevModifiedTime as well. I will 
modify the patch later. 
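
To make the pairing described above concrete, here is a minimal, self-contained sketch. It is not the Nutch code or the attached patch: the class PrevTimesSketch, the Status enum and the Times holder are hypothetical stand-ins for the real CrawlStatus constants and WebPage fields, and only illustrate that the two values are always set together.

```java
// Hypothetical sketch: none of these types are Nutch classes.
class PrevTimesSketch {

    // Stand-in for the crawl statuses mentioned in the comment.
    enum Status { RETRY, GONE, NOTMODIFIED }

    // Holds the two values that are always updated as a pair.
    static final class Times {
        final long prevFetchTime;
        final long prevModifiedTime;

        Times(long prevFetchTime, long prevModifiedTime) {
            this.prevFetchTime = prevFetchTime;
            this.prevModifiedTime = prevModifiedTime;
        }
    }

    // Returns the pair of previous times to store for the given status.
    static Times update(Status status, long fetchTime, long modifiedTime) {
        switch (status) {
            case RETRY:
            case GONE:
                // Both values are reset together.
                return new Times(0L, 0L);
            case NOTMODIFIED:
                // Both values are carried over together.
                return new Times(fetchTime, modifiedTime);
            default:
                throw new IllegalArgumentException("Unhandled status: " + status);
        }
    }

    public static void main(String[] args) {
        Times gone = update(Status.GONE, 1000L, 900L);
        System.out.println(gone.prevFetchTime + " " + gone.prevModifiedTime);

        Times same = update(Status.NOTMODIFIED, 1000L, 900L);
        System.out.println(same.prevFetchTime + " " + same.prevModifiedTime);
    }
}
```

Returning both values from one place is a design choice of this sketch; it makes it hard to update one of them and forget the other.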

Thanks Lewis.





 Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and 
 setBatchId() accessors in o.a.n.storage.WebPage
 

 Key: NUTCH-1533
 URL: https://issues.apache.org/jira/browse/NUTCH-1533
 Project: Nutch
  Issue Type: Bug
  Components: storage
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1533.patch, NUTCH-1533v2.patch


 NUTCH-1532 needs to obtain a batchId to add to NutchDocument prior to 
 indexing. This is currently not available as we do not store the information 
 in the WebPage. Additionally, we do not store the other ModifiedTime values 
 but incorrectly set them in o.a.n.crawl.FetchSchedule#setFetchSchedule.
 All the above accessors should be implemented.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2013-03-17 Thread Roberto Gardenier (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604649#comment-13604649
 ] 

Roberto Gardenier commented on NUTCH-585:
-

Will this patch be implemented in Nutch at all? I've seen this patch / feature 
request marked for every version from 1.4 up to 1.7 now. 
Even though the patch works with Nutch 1.5 through 1.5.1, I wonder whether 
this will ever become part of Nutch, [~markus17]?

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.7

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch


 We are using nutch to index our own web sites; we would like not to index 
 certain parts of our pages, because we know they are not relevant (for 
 instance, there are several links to change the background color) and 
 generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, maybe factoring out the comment 
 strings as constants in the configuration files (say parser.html.ignore.start 
 and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet. We look forward to any 
 expression of interest, or to an explanation of why what we are doing is 
 plain wrong!
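
As a rough illustration of the idea described above, the following sketch removes everything between a start marker and a stop marker before the text would be handed to an indexer. It is not the code from any of the attached patches: the class IgnoreBlockSketch and the method stripIgnored are hypothetical, and the markers are passed in as plain arguments rather than read from the proposed parser.html.ignore.start and parser.html.ignore.stop properties.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: not part of the parse-html plugin or any attached patch.
class IgnoreBlockSketch {

    // Removes each region that starts with startMarker and ends with stopMarker,
    // markers included. A start marker with no matching stop marker is left as is.
    static String stripIgnored(String html, String startMarker, String stopMarker) {
        Pattern ignored = Pattern.compile(
                Pattern.quote(startMarker) + ".*?" + Pattern.quote(stopMarker),
                Pattern.DOTALL); // DOTALL so an ignored region may span several lines
        return ignored.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>keep</p>"
                + "<!-- START-IGNORE --><a href=\"#\">red background</a><!-- STOP-IGNORE -->"
                + "<p>also keep</p>";
        // prints "<p>keep</p><p>also keep</p>"
        System.out.println(stripIgnored(html, "<!-- START-IGNORE -->", "<!-- STOP-IGNORE -->"));
    }
}
```

The reluctant quantifier keeps two separate ignored regions on the same page from being merged into one, which would otherwise drop the relevant text between them.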



[jira] [Updated] (NUTCH-1533) Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage

2013-03-17 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1533:
--

Attachment: NUTCH-1533-v3.patch

Added prevModifiedTime to both FetchSchedule methods used in the 
DbUpdateReducer class when the crawl status is retry or gone. Thanks Lewis.
