Re: Problems running some ant targets on recent trunk
It's actually a bit more twisted than that: see https://issues.apache.org/jira/browse/NUTCH-1818. This separation of the test and runtime dependencies has actually been very good for exposing inconsistencies in the way the existing build worked. The issue should be solved now, thanks for reporting it.

On 17 July 2014 10:18, Julien Nioche lists.digitalpeb...@gmail.com wrote:

In this case it is the target compile-test of lib-regex-filter which fails. Should it really be called for target runtime?

  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-regex-filter"/>
    <ant target="compile-test" inheritall="false" dir="../lib-regex-filter"/>
  </target>

This is indeed the source of the problem: the second <ant> call should not be there, as the test classes are not required at that stage. Both urlfilter-automaton and urlfilter-regex have the same problem, which was not apparent until I introduced a cleaner separation between the compilation and test deps. I've fixed that for trunk in revision 1611303.

As for the issue with calling 'ant test' from a plugin dir, this is not a new issue: we get the same thing with a fresh copy of Nutch 1.8. It's just that the test task for the plugins assumes that the core classes and ivy jars have already been resolved.

Thanks

Julien

On 16 July 2014 23:10, Sebastian Nagel wastl.na...@googlemail.com wrote:

Hi,

I have some problems running ant targets on recent trunk:

% ant runtime
fails if run from scratch (after ant clean) but succeeds after ant test or ant nightly.

In a plugin folder, e.g., src/plugin/parse-metatags:
% ant test

The error causing the failure is always:
.../trunk/src/plugin/build-plugin.xml:190: .../trunk/build/test/lib does not exist.

e.g. within the chain of calls:

BUILD FAILED
.../trunk/build.xml:112: The following error occurred while executing this line:
.../trunk/src/plugin/build.xml:63: The following error occurred while executing this line:
.../trunk/src/plugin/urlfilter-automaton/build.xml:25: The following error occurred while executing this line:
.../trunk/src/plugin/build-plugin.xml:190: .../trunk/build/test/lib does not exist.

Indeed the directory does not exist because it's removed by target clean-lib.

In this case it is the target compile-test of lib-regex-filter which fails. Should it really be called for target runtime?

  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-regex-filter"/>
    <ant target="compile-test" inheritall="false" dir="../lib-regex-filter"/>
  </target>

Thanks,
Sebastian

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068374#comment-14068374 ]

Julien Nioche commented on NUTCH-1708:
--------------------------------------

I like the approach and this would be the best way of solving the issue. +1 to commit.

Re. the field orig in 2.x: sounds like a duplicate of 'id' indeed. Let's do its removal separately.

Thanks

use same id when indexing and deleting redirects
------------------------------------------------
Key: NUTCH-1708
URL: https://issues.apache.org/jira/browse/NUTCH-1708
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel
Fix For: 1.9
Attachments: NUTCH-1708-2x-v1.patch, NUTCH-1708-trunk-v1.patch

Redirect targets are indexed using the representative URL:
* in Fetcher the repr URL is determined by URLUtil.chooseRepr() and stored in the CrawlDatum (CrawlDb). The repr URL is either the source or the target URL of the redirect pair.
* the NutchField url is filled by the basic indexing filter with the repr URL
* the id field used as unique key is filled from url per solrindex-mapping.xml

Deletion of redirects is done in IndexerMapReduce.reduce() by key, which is the URL of the redirect source. If the source URL is chosen as repr URL, a redirect target may get erroneously deleted. Test crawl with seed {{http://wiki.apache.org/nutch}} which redirects to {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates that the same URL is deleted and added:
{code}
delete http://wiki.apache.org/nutch
add http://wiki.apache.org/nutch
{code}

-- This message was sent by Atlassian JIRA (v6.2#6252)
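The add/delete key mismatch described in the issue can be illustrated with a small sketch. Note that chooseRepr() below is a simplified stand-in for URLUtil.chooseRepr(), and the Map-backed "index" is purely illustrative, not Nutch's indexer:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the NUTCH-1708 idea: index additions and deletions
// must be keyed on the SAME (representative) URL, so deleting a redirect
// source can never wipe out its target. chooseRepr() is a simplified
// stand-in for URLUtil.chooseRepr(), not the real implementation.
public class RedirectKeying {

    // Simplified assumption: prefer the redirect target when present.
    static String chooseRepr(String src, String dst) {
        return dst != null ? dst : src;
    }

    // The key a reducer should use for BOTH "add" and "delete" actions.
    static String indexKey(String src, String dst) {
        return chooseRepr(src, dst);
    }

    public static void main(String[] args) {
        Map<String, String> index = new HashMap<>();
        String src = "http://wiki.apache.org/nutch";   // redirect source
        String dst = "http://wiki.apache.org/nutch/";  // redirect target

        // Add the target document under the representative URL ...
        index.put(indexKey(src, dst), "doc");
        // ... and issue the delete under the SAME key, not the raw source
        // URL, so that add and delete refer to the same index entry.
        index.remove(indexKey(src, dst));

        System.out.println(index.isEmpty()); // prints "true"
    }
}
```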
[jira] [Created] (NUTCH-1820) remove field orig which duplicates id
Sebastian Nagel created NUTCH-1820:
-----------------------------------
Summary: remove field orig which duplicates id
Key: NUTCH-1820
URL: https://issues.apache.org/jira/browse/NUTCH-1820
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 2.2.1
Reporter: Sebastian Nagel
Priority: Trivial
Fix For: 2.3

The indexing filter plugin index-basic (2.x only) adds a field orig which contains the real URL (not the reprUrl) and duplicates the field id (also regarding the field params: stored=true indexed=true). The field orig should be removed from index-basic and schema.xml.
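On the schema side, the change would amount to dropping the duplicated field declaration. The fragment below is an illustrative sketch only (field names from the issue; the exact attribute layout in the shipped schema.xml is assumed, not quoted):

```xml
<!-- Sketch of the 2.x schema.xml fields: id and orig carry the same value
     with the same params (stored=true indexed=true), so orig is redundant -->
<field name="id"   type="string" stored="true" indexed="true"/>
<field name="orig" type="string" stored="true" indexed="true"/>  <!-- to be removed -->
```

Any consumer currently reading orig would switch to id, since both hold the real (non-repr) URL.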
[jira] [Created] (NUTCH-1821) Nutch Crawl class for EMR
Luis Lopez created NUTCH-1821:
------------------------------
Summary: Nutch Crawl class for EMR
Key: NUTCH-1821
URL: https://issues.apache.org/jira/browse/NUTCH-1821
Project: Nutch
Issue Type: Wish
Affects Versions: 1.6
Environment: Amazon EMR
Reporter: Luis Lopez

Hi all,

Some of us are using Amazon EMR to deploy/run Nutch, and from what I've been reading in the users mailing list there are two common issues people run into: first, EMR supports only up to Hadoop 1.0.3 (which is kind of old), and second, from Nutch 1.8 onwards the Crawl class has been deprecated. The first issue poses a problem when we try to deploy recent Nutch versions; the most recent version that is supported by EMR is 1.6. The second issue is that EMR receives a jar and a main class to do its job, and from 1.8 the Crawl class has been removed.

After some tests we completed a branch (from Nutch 1.6 + Hadoop 1.0.3) that improves the old Crawl class so it scales. Since 1.6 is an old version, I wonder how we can contribute back to those that need to use ElasticMapReduce. The things we did are:

a) Add the number of fetchers as a parameter to the Crawl class. For some reason the generator was always defaulting to one fetch list (see: http://stackoverflow.com/questions/10264183/why-does-nutch-only-run-the-fetch-step-on-one-hadoop-node-when-the-cluster-has), creating just one fetch map task. With the new parameter we can adjust the number of map tasks to fit the cluster size.

b) Index documents on each crawl cycle and not at the end. We had performance/memory issues when we tried to index all the documents after the whole crawl was done, so we moved the indexing step into the main crawl cycle.

c) We added an option to delete segments after their content is indexed into Solr. It saves HDFS space, since the EC2 instances we use don't have a lot of space.

So far these fixes have allowed us to scale out Nutch and be efficient with Amazon EMR clusters. If you guys think there is some value in these changes, we can submit a patch file.

Luis.
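The loop structure described in (a)-(c) can be sketched as follows. This is not the real Nutch Crawl class; all names below are illustrative stand-ins, and the "steps" merely record the order of operations to show where indexing and segment deletion move to:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (NOT the real Nutch Crawl class) of changes (a)-(c):
// (a) numFetchers is forwarded to the generator so the fetch step gets
//     that many map tasks instead of one,
// (b) indexing runs inside every cycle instead of once at the end,
// (c) each segment is deleted as soon as its content has been indexed,
//     to save HDFS space.
public class EmrCrawlLoop {

    final List<String> log = new ArrayList<>(); // records the step order

    void crawl(int depth, int numFetchers, boolean deleteSegments) {
        for (int cycle = 0; cycle < depth; cycle++) {
            String segment = "segment-" + cycle;
            // (a) one fetch list per fetcher -> numFetchers fetch map tasks
            log.add("generate " + segment + " lists=" + numFetchers);
            log.add("fetch " + segment);
            log.add("parse " + segment);
            log.add("updatedb " + segment);
            // (b) index within the cycle, not after the whole crawl
            log.add("index " + segment);
            // (c) reclaim space once the segment is indexed
            if (deleteSegments) {
                log.add("delete " + segment);
            }
        }
    }

    public static void main(String[] args) {
        EmrCrawlLoop loop = new EmrCrawlLoop();
        loop.crawl(2, 4, true);
        for (String step : loop.log) {
            System.out.println(step);
        }
    }
}
```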
[jira] [Updated] (NUTCH-1821) Nutch Crawl class for EMR
[ https://issues.apache.org/jira/browse/NUTCH-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luis Lopez updated NUTCH-1821:
------------------------------

Description:

Hi all,

Some of us are using Amazon EMR to deploy/run Nutch, and from what I've been reading in the users mailing list there are two common issues people run into: first, EMR supports only up to Hadoop 1.0.3 (which is kind of old), and second, from Nutch 1.8 onwards the Crawl class has been deprecated. The first issue poses a problem when we try to deploy recent Nutch versions; the most recent version that is supported by EMR is 1.6. The second issue is that EMR receives a jar and a main class to do its job, and from 1.8 the Crawl class has been removed.

After some tests we completed a branch (from Nutch 1.6 + Hadoop 1.0.3) that improves the old Crawl class so it scales. Since 1.6 is an old version, I wonder how we can contribute back to those that need to use ElasticMapReduce. The things we did are:

a) Add the number of fetchers as a parameter to the Crawl class. For some reason the generator was always defaulting to one fetch list (see: http://stackoverflow.com/questions/10264183/why-does-nutch-only-run-the-fetch-step-on-one-hadoop-node-when-the-cluster-has), creating just one fetch map task. With the new parameter we can adjust the number of map tasks to fit the cluster size.

b) Index documents on each crawl cycle and not at the end. We had performance/memory issues when we tried to index all the documents after the whole crawl was done, so we moved the indexing step into the main crawl cycle.

c) We added an option to delete segments after their content is indexed into Solr. It saves HDFS space, since the EC2 instances we use don't have a lot of space.

So far these fixes have allowed us to scale out Nutch and be efficient with Amazon EMR clusters. If you guys think there is some value in these changes, we can submit a patch file.

Luis.
was: Hi all, Some of us are using Amazon EMR to deploy/run Nutch and from what I've been reading in the users mailing list there are 2 common issues people run into... first, EMR supports up to Hadoop 1.0.3 (which is kind of old) and second, from Nutch 1.8+ the Crawl class has been deprecated. The first issue poses a problem when we try to deploy recent Nutch versions. The most recent version that is supported by EMR is 1.6, the second issue is that EMR receives a jar and main class to do its job and from 1.8 the Crawl class has been removed. After some tests we completed a branch (from Nutch 1.6 + Hadoop 1.0.3) that improves the old Crawl class so it scales, since 1.6 is an old version I wonder how can we contribute back to those that need to use ElasticMapreduce. The things we did are: a) Add num fetchers as a parameter to the Crawl class. For some reason the generator was always defaulting to one list( see: http://stackoverflow.com/questions/10264183/why-does-nutch-only-run-the-fetch-step-on-one-hadoop-node-when-the-cluster-has) creating just one fetch map task... with the new parameter we can adjust the map tasks to fit the cluster size. b) Index documents on each Crawl cycle and not at the end. We had performance/memory issues when we tried to index all the documents when the whole crawl is done, we moved the index part into the main Crawl cycle. c) We added an option to delete segments after their content is indexed into Solr. It saves HDF space since the EC2 instances we use don't have a lot of space. So far these fixes have allowed us to scale out Nutch be efficient with Amazon EMR clusters. If you guys think that there is some value on these changes we can submit a patch file. Luis. 
Nutch Crawl class for EMR
-------------------------
Key: NUTCH-1821
URL: https://issues.apache.org/jira/browse/NUTCH-1821
Project: Nutch
Issue Type: Wish
Affects Versions: 1.6
Environment: Amazon EMR
Reporter: Luis Lopez
Labels: Amazon, Crawler, EMR, performance
[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069567#comment-14069567 ]

Alexander Kingson commented on NUTCH-1679:
------------------------------------------

Hi,

I was suggesting to close the datastore by adding this function to the reducer class:

  protected void cleanup(Context context) throws IOException, InterruptedException {
    store.close();
  };

Also, I found another issue, where inlink data is wiped out in the next updatedb stages. To solve this issue I replaced these lines

  if (page.getInlinks() != null) {
    page.getInlinks().clear();
  }

with

  if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
    page.getInlinks().clear();
  }

This code change is tested only in one case, and I am not sure if it solves the issue permanently.

Thanks.
Alex.

UpdateDb using batchId, link may override crawled page.
-------------------------------------------------------
Key: NUTCH-1679
URL: https://issues.apache.org/jira/browse/NUTCH-1679
Project: Nutch
Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Tien Nguyen Manh
Priority: Critical
Fix For: 2.3
Attachments: NUTCH-1679.patch

The problem is in the HBase store, not sure about other stores. Suppose at the first crawl cycle we crawl link A, then get an outlink B. In the second cycle we crawl link B, which also has a link pointing to A. In the second updatedb we load only page B from the store, and will add A as a new link, because updatedb doesn't know A already exists in the store and will override A. UpdateDb must be run without batchId, or we must set additionsAllowed=false.

Here is the code for a new page:

  page = new WebPage();
  schedule.initializeSchedule(url, page);
  page.setStatus(CrawlStatus.STATUS_UNFETCHED);
  try {
    scoringFilters.initialScore(url, page);
  } catch (ScoringFilterException e) {
    page.setScore(0.0f);
  }

The new page will override the old page's status, score, fetchTime, fetchInterval, retries, metadata[CASH_KEY].
- I think we can change something here so that a new page will only update one column, for example 'link', and if it is really a new page we can initialize all the above fields in the generator
- or we add a checkAndPut operator to the store, so when adding a new page we will check if it already exists first
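The guarded clear proposed in the comment above can be sketched in isolation. WebPage and the reducer context are replaced here by a plain Map, and the clear-then-replace behavior is an assumption for illustration, not the exact Nutch code path:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the guarded clear from the comment: only wipe a page's stored
// inlinks when the current updatedb pass actually carries fresh inlink
// data, so a pass that sees no inlinks cannot erase earlier ones.
// The Map stands in for WebPage's inlinks; clear-then-replace is assumed.
public class InlinkGuard {

    static void applyInlinks(Map<String, String> storedInlinks,
                             Map<String, String> inlinkedScoreData) {
        // The old behavior cleared unconditionally; the fix clears only
        // when replacement data is present.
        if (storedInlinks != null && !inlinkedScoreData.isEmpty()) {
            storedInlinks.clear();
            storedInlinks.putAll(inlinkedScoreData);
        }
    }

    public static void main(String[] args) {
        Map<String, String> inlinks = new HashMap<>();
        inlinks.put("http://a/", "anchor A");

        // An updatedb pass with no inlink data leaves the page untouched.
        applyInlinks(inlinks, new HashMap<>());
        System.out.println(inlinks.size()); // prints "1"
    }
}
```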
[jira] [Comment Edited] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069567#comment-14069567 ]

Alexander Kingson edited comment on NUTCH-1679 at 7/22/14 12:15 AM:
--------------------------------------------------------------------

Hi,

I was suggesting to close the datastore by adding this function to the reducer class:

  protected void cleanup(Context context) throws IOException, InterruptedException {
    store.close();
  }

Also, I found another issue, where inlink data is wiped out in the next updatedb stages. To solve this issue I replaced these lines

  if (page.getInlinks() != null) {
    page.getInlinks().clear();
  }

with

  if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
    page.getInlinks().clear();
  }

This code change is tested only in one case, and I am not sure if it solves the issue permanently.

Thanks.
Alex.

was (Author: alxksn):
Hi,

I was suggesting to close the datastore by adding this function to the reducer class:

  protected void cleanup(Context context) throws IOException, InterruptedException {
    store.close();
  };

Also, I found another issue, where inlink data is wiped out in the next updatedb stages. To solve this issue I replaced these lines

  if (page.getInlinks() != null) {
    page.getInlinks().clear();
  }

with

  if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
    page.getInlinks().clear();
  }

This code change is tested only in one case, and I am not sure if it solves the issue permanently.

Thanks.
Alex.

UpdateDb using batchId, link may override crawled page.
-------------------------------------------------------
Key: NUTCH-1679
URL: https://issues.apache.org/jira/browse/NUTCH-1679
Project: Nutch
Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Tien Nguyen Manh
Priority: Critical
Fix For: 2.3
Attachments: NUTCH-1679.patch

The problem is in the HBase store, not sure about other stores. Suppose at the first crawl cycle we crawl link A, then get an outlink B. In the second cycle we crawl link B, which also has a link pointing to A. In the second updatedb we load only page B from the store, and will add A as a new link, because updatedb doesn't know A already exists in the store and will override A. UpdateDb must be run without batchId, or we must set additionsAllowed=false.

Here is the code for a new page:

  page = new WebPage();
  schedule.initializeSchedule(url, page);
  page.setStatus(CrawlStatus.STATUS_UNFETCHED);
  try {
    scoringFilters.initialScore(url, page);
  } catch (ScoringFilterException e) {
    page.setScore(0.0f);
  }

The new page will override the old page's status, score, fetchTime, fetchInterval, retries, metadata[CASH_KEY].
- I think we can change something here so that a new page will only update one column, for example 'link', and if it is really a new page we can initialize all the above fields in the generator
- or we add a checkAndPut operator to the store, so when adding a new page we will check if it already exists first
[jira] [Comment Edited] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069567#comment-14069567 ]

Alexander Kingson edited comment on NUTCH-1679 at 7/22/14 12:46 AM:
--------------------------------------------------------------------

Hi,

I was suggesting to close the datastore by adding this function to the reducer class:

  protected void cleanup(Context context) throws IOException, InterruptedException {
    store.close();
  }

Also, I found another issue, where inlinks data is wiped out in the next updatedb stages. To solve this issue I replaced these lines

  if (page.getInlinks() != null) {
    page.getInlinks().clear();
  }

with

  if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
    page.getInlinks().clear();
  }

This code change is tested only in one case, and I am not sure if it solves the issue permanently. Basically, if not clearing inlinks data in each call to the reduce function does not cause overlap of inlinks data between keys, then this code change solves the issue.

Thanks.
Alex.

was (Author: alxksn):
Hi,

I was suggesting to close the datastore by adding this function to the reducer class:

  protected void cleanup(Context context) throws IOException, InterruptedException {
    store.close();
  }

Also, I found another issue, where inlink data is wiped out in the next updatedb stages. To solve this issue I replaced these lines

  if (page.getInlinks() != null) {
    page.getInlinks().clear();
  }

with

  if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
    page.getInlinks().clear();
  }

This code change is tested only in one case, and I am not sure if it solves the issue permanently.

Thanks.
Alex.

UpdateDb using batchId, link may override crawled page.
-------------------------------------------------------
Key: NUTCH-1679
URL: https://issues.apache.org/jira/browse/NUTCH-1679
Project: Nutch
Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Tien Nguyen Manh
Priority: Critical
Fix For: 2.3
Attachments: NUTCH-1679.patch

The problem is in the HBase store, not sure about other stores. Suppose at the first crawl cycle we crawl link A, then get an outlink B. In the second cycle we crawl link B, which also has a link pointing to A. In the second updatedb we load only page B from the store, and will add A as a new link, because updatedb doesn't know A already exists in the store and will override A. UpdateDb must be run without batchId, or we must set additionsAllowed=false.

Here is the code for a new page:

  page = new WebPage();
  schedule.initializeSchedule(url, page);
  page.setStatus(CrawlStatus.STATUS_UNFETCHED);
  try {
    scoringFilters.initialScore(url, page);
  } catch (ScoringFilterException e) {
    page.setScore(0.0f);
  }

The new page will override the old page's status, score, fetchTime, fetchInterval, retries, metadata[CASH_KEY].
- I think we can change something here so that a new page will only update one column, for example 'link', and if it is really a new page we can initialize all the above fields in the generator
- or we add a checkAndPut operator to the store, so when adding a new page we will check if it already exists first
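The checkAndPut alternative suggested at the end of the issue can be sketched as follows. The Map-of-rows "store" is a stand-in for an HBase-like backend, and the column names are illustrative; the point is that page fields are initialized only for genuinely new rows, while existing rows get just the link column touched:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the checkAndPut idea from NUTCH-1679: before updatedb creates
// a "new" page for a discovered link, check whether a row already exists.
// Only new rows get status/score initialized; existing rows keep their
// crawled state and only the link column is updated.
public class CheckAndPutStore {

    final Map<String, Map<String, Object>> rows = new HashMap<>();

    // Returns true if a new row was created, false if the row existed.
    boolean checkAndPut(String url, String inlink) {
        Map<String, Object> row = rows.get(url);
        if (row == null) {
            row = new HashMap<>();
            // initialize fields ONLY for a genuinely new page
            row.put("status", "STATUS_UNFETCHED");
            row.put("score", 0.0f);
            rows.put(url, row);
            row.put("link", inlink);
            return true;
        }
        // existing page: touch only the link column, never the crawl state
        row.put("link", inlink);
        return false;
    }

    public static void main(String[] args) {
        CheckAndPutStore store = new CheckAndPutStore();
        System.out.println(store.checkAndPut("http://a/", "http://b/")); // prints "true"
        System.out.println(store.checkAndPut("http://a/", "http://c/")); // prints "false"
    }
}
```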