Re: [WELCOME] Feng Lu as Apache Nutch PMC and Committer
Thanks a lot to everyone for inviting me. I'm a software engineer in China, and I have been using Apache Nutch for three years. In our team, I am mainly responsible for modifying Nutch 1.x to suit the requirements of our database, MongoDB. For that I also wrote a simple database abstraction layer, like Apache Gora, to adapt to different databases. Along the way I found myself liking these places more and more: @user, @dev and @jira, because there I can get help from others and others can get help from me. Finally, I am also very pleased to make some contribution to Apache Nutch.

One question has been troubling me for a long time: what is the target of Nutch 1.x? Is Nutch 1.x just a transitional version on the way to Nutch 2.x, or can the two coexist because Nutch 1.x has a different data processing method from Nutch 2.x? As Julien said, Nutch 1.x is great for batch processing and 2.x for large-scale processing. Perhaps, as more and more people use NoSQL as their back-end DB, the developers should focus more on the development of Nutch 2.x, ensuring its stability and improving its functionality.

Best Regards
Feng
[jira] [Commented] (NUTCH-1533) Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage
[ https://issues.apache.org/jira/browse/NUTCH-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604614#comment-13604614 ]

lufeng commented on NUTCH-1533:
---

Hi Lewis,

I'm sorry, I did not make it clear. In my opinion, prevFetchTime and prevModifiedTime are used together: either both are set to 0L, for CrawlStatus.STATUS_RETRY and CrawlStatus.STATUS_GONE, or both are set to a value, for CrawlStatus.NOTMODIFIED. Yes, you are right, both methods should set prevModifiedTime as well. I will modify the patch later.

Thanks Lewis.

Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage

Key: NUTCH-1533
URL: https://issues.apache.org/jira/browse/NUTCH-1533
Project: Nutch
Issue Type: Bug
Components: storage
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Priority: Minor
Fix For: 2.2
Attachments: NUTCH-1533.patch, NUTCH-1533v2.patch

NUTCH-1532 needs to obtain a batchId to add to NutchDocument prior to indexing. This is currently not available as we do not store the information in the WebPage. Additionally, we do not store the other ModifiedTime values but incorrectly set them in o.a.n.crawl.FetchSchedule#setFetchSchedule. All the above accessors should be implemented.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
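The pairing rule described in the comment on NUTCH-1533 (both "previous" fields cleared on retry or gone, both given a value on not-modified) could be sketched roughly as follows. This is only an illustration: Status, PageTimes and PrevTimeUpdater are hypothetical stand-ins, not the Nutch CrawlStatus, WebPage or FetchSchedule classes, and the values stored in the not-modified case are passed in by the caller because the thread does not say which values are used.

```java
/** Hypothetical status values; not the Nutch CrawlStatus constants. */
enum Status { RETRY, GONE, NOTMODIFIED }

/** Hypothetical holder for the two fields discussed in the comment. */
final class PageTimes {
    long prevFetchTime;
    long prevModifiedTime;
}

/** Hypothetical helper showing that the two fields always change together. */
final class PrevTimeUpdater {

    /**
     * @param fetchTime    value to store as prevFetchTime when not modified (assumed input)
     * @param modifiedTime value to store as prevModifiedTime when not modified (assumed input)
     */
    static void update(PageTimes page, Status status, long fetchTime, long modifiedTime) {
        switch (status) {
            case RETRY:
            case GONE:
                // Retry and gone: clear both fields, never just one of them.
                page.prevFetchTime = 0L;
                page.prevModifiedTime = 0L;
                break;
            case NOTMODIFIED:
                // Not modified: set both fields, never just one of them.
                page.prevFetchTime = fetchTime;
                page.prevModifiedTime = modifiedTime;
                break;
        }
    }

    public static void main(String[] args) {
        PageTimes page = new PageTimes();
        update(page, Status.NOTMODIFIED, 1000L, 900L);
        System.out.println(page.prevFetchTime + " " + page.prevModifiedTime);
        update(page, Status.GONE, 2000L, 1900L);
        System.out.println(page.prevFetchTime + " " + page.prevModifiedTime);
    }
}
```

Keeping both assignments in the same branch is the point of the sketch: a path that touches prevFetchTime without prevModifiedTime is the inconsistency the comment is about.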
[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604649#comment-13604649 ]

Roberto Gardenier commented on NUTCH-585:
---

Will this patch be implemented in Nutch at all? I've seen this patch / feature request being marked for releases from 1.4 up to 1.7 now. Even though the patch works with Nutch 1.5 through 1.5.1, I wonder whether this will become part of Nutch at any time, [~markus17]?

[PARSE-HTML plugin] Block certain parts of HTML code from being indexed

Key: NUTCH-585
URL: https://issues.apache.org/jira/browse/NUTCH-585
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.7
Attachments: blacklist_whitelist_plugin.patch, nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch

We are using Nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and generate spurious matches. We have modified the plugin so that it ignores HTML code between certain HTML comments, like

<!-- START-IGNORE -->
... ignored part ...
<!-- STOP-IGNORE -->

We feel this might be useful to someone else, maybe factoring out the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml). We are almost ready to contribute our code snippet. Looking forward to any expression of interest - or to an explanation of why what we are doing is plain wrong!
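The comment-delimited approach described in NUTCH-585 could be sketched along the following lines. This is not the code from any of the attached patches and not part of the parse-html plugin; IgnoreBlockStripper is a hypothetical helper that works on the raw HTML string with a regular expression, and the two marker strings are constructor arguments standing in for the proposed parser.html.ignore.start and parser.html.ignore.stop settings.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; not part of Nutch or of the attached patches. */
final class IgnoreBlockStripper {

    private final Pattern ignoredBlock;

    /**
     * @param startMarker text inside the opening comment, e.g. "START-IGNORE"
     * @param stopMarker  text inside the closing comment, e.g. "STOP-IGNORE"
     */
    IgnoreBlockStripper(String startMarker, String stopMarker) {
        // Pattern.quote keeps the marker text from being read as regex syntax.
        // DOTALL lets the ignored part span several lines, and the reluctant
        // ".*?" pairs each start marker with the nearest following stop marker.
        this.ignoredBlock = Pattern.compile(
            "<!--\\s*" + Pattern.quote(startMarker) + "\\s*-->"
                + ".*?"
                + "<!--\\s*" + Pattern.quote(stopMarker) + "\\s*-->",
            Pattern.DOTALL);
    }

    /** Returns the HTML with every marked block removed, markers included. */
    String strip(String html) {
        return ignoredBlock.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        IgnoreBlockStripper stripper =
            new IgnoreBlockStripper("START-IGNORE", "STOP-IGNORE");
        String html = "<p>Keep me</p>"
            + "<!-- START-IGNORE --><a href=\"?bg=red\">red</a><!-- STOP-IGNORE -->"
            + "<p>Keep me too</p>";
        // prints <p>Keep me</p><p>Keep me too</p>
        System.out.println(stripper.strip(html));
    }
}
```

In this sketch a start marker with no matching stop marker leaves the text untouched; whether an unterminated block should instead be ignored to the end of the page is a design decision the issue does not settle.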
[jira] [Updated] (NUTCH-1533) Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage
[ https://issues.apache.org/jira/browse/NUTCH-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lufeng updated NUTCH-1533:
---

Attachment: NUTCH-1533-v3.patch

Added prevModifiedTime to both FetchSchedule methods called in the DbUpdateReducer class when the crawl status is retry or gone. Thanks Lewis.

Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage

Key: NUTCH-1533
URL: https://issues.apache.org/jira/browse/NUTCH-1533
Project: Nutch
Issue Type: Bug
Components: storage
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Priority: Minor
Fix For: 2.2
Attachments: NUTCH-1533.patch, NUTCH-1533v2.patch, NUTCH-1533-v3.patch

NUTCH-1532 needs to obtain a batchId to add to NutchDocument prior to indexing. This is currently not available as we do not store the information in the WebPage. Additionally, we do not store the other ModifiedTime values but incorrectly set them in o.a.n.crawl.FetchSchedule#setFetchSchedule. All the above accessors should be implemented.