Re: Problems running some ant targets on recent trunk

2014-07-21 Thread Julien Nioche
It's actually a bit more twisted than that: see
https://issues.apache.org/jira/browse/NUTCH-1818

This separation of the test and runtime dependencies has actually been very
good for exposing inconsistencies in the way the existing build worked. The
issue should be solved now, thanks for reporting it.


On 17 July 2014 10:18, Julien Nioche lists.digitalpeb...@gmail.com wrote:

 In this case it is the target compile-test of lib-regex-filter which fails.
 Should it really be called for target runtime?
   <target name="deps-jar">
     <ant target="jar" inheritall="false" dir="../lib-regex-filter/"/>
     <ant target="compile-test" inheritall="false" dir="../lib-regex-filter/"/>
   </target>


 This is indeed the source of the problem; the second line should not be
 there: the test classes are not required at that stage.
 Both urlfilter-automaton and urlfilter-regex have the same problem, which
 was not apparent until I introduced a cleaner separation between the
 compilation and test deps.

 I've fixed that for trunk in revision 1611303.

 As for the issue with calling 'ant test' from a plugin dir, this is not a
 new issue: we get the same thing with a fresh copy of Nutch 1.8. It's just
 that the test task for the plugins assumes that the core classes and Ivy
 jars have already been resolved.

 Thanks

 Julien


 On 16 July 2014 23:10, Sebastian Nagel wastl.na...@googlemail.com wrote:

 Hi,

 I have some problems running ant targets on recent trunk:

 % ant runtime
 fails if run from scratch (after ant clean)
 but it succeeds after ant test or ant nightly.

 In a plugin folder, e.g., src/plugin/parse-metatags:
 % ant test


 The error causing the failure is always:
  .../trunk/src/plugin/build-plugin.xml:190: .../trunk/build/test/lib does
 not exist.
 e.g. within the chain of calls:
 BUILD FAILED
 .../trunk/build.xml:112: The following error occurred while executing
 this line:
 .../trunk/src/plugin/build.xml:63: The following error occurred while
 executing this line:
 .../trunk/src/plugin/urlfilter-automaton/build.xml:25: The following
 error occurred while executing
 this line:
 .../trunk/src/plugin/build-plugin.xml:190: .../trunk/build/test/lib does
 not exist.

 Indeed the directory does not exist because it's removed by target
 clean-lib.
 In this case it is the target compile-test of lib-regex-filter which
 fails.
 Should it really be called for target runtime?

   <target name="deps-jar">
     <ant target="jar" inheritall="false" dir="../lib-regex-filter/"/>
     <ant target="compile-test" inheritall="false" dir="../lib-regex-filter/"/>
   </target>


 Thanks,
 Sebastian








-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-07-21 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068374#comment-14068374
 ] 

Julien Nioche commented on NUTCH-1708:
--

I like the approach and this would be the best way of solving the issue. +1 to
commit.

Re: field 'orig' in 2.x: sounds like a duplicate of 'id' indeed. Let's do its
removal separately.

Thanks

 use same id when indexing and deleting redirects
 

 Key: NUTCH-1708
 URL: https://issues.apache.org/jira/browse/NUTCH-1708
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel
 Fix For: 1.9

 Attachments: NUTCH-1708-2x-v1.patch, NUTCH-1708-trunk-v1.patch


 Redirect targets are indexed using the representative URL:
 * In the Fetcher the repr URL is determined by URLUtil.chooseRepr() and stored in
 the CrawlDatum (CrawlDb). The repr URL is either the source or the target URL of
 the redirect pair.
 * The NutchField 'url' is filled by the basic indexing filter with the repr URL.
 * The 'id' field used as unique key is filled from 'url' per solrindex-mapping.xml.
 Deletion of redirects is done in IndexerMapReduce.reduce() by key, which is the
 URL of the redirect source. If the source URL is chosen as the repr URL, a
 redirect target may get erroneously deleted.
 Test crawl with seed {{http://wiki.apache.org/nutch}}, which redirects to
 {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates that
 the same URL is deleted and added:
 {code}
 delete  http://wiki.apache.org/nutch
 add http://wiki.apache.org/nutch
 {code}
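 A minimal, self-contained illustration of the key mismatch (a sketch only, not the
 attached patch; the in-memory map below just stands in for the Solr index):
 {code}
 import java.util.HashMap;
 import java.util.Map;

 // Toy model of the index: key = document id, value = indexed content.
 public class RedirectIdMismatchDemo {
   public static void main(String[] args) {
     Map<String, String> index = new HashMap<>();

     String source = "http://wiki.apache.org/nutch";   // redirect source
     String target = "http://wiki.apache.org/nutch/";  // redirect target
     String repr = source;  // assume URLUtil.chooseRepr() picked the source URL

     // The target page is added under the repr URL (via the url -> id mapping) ...
     index.put(repr, "content of " + target);

     // ... but the redirect source is deleted by its own URL, which here is the
     // same key, so the freshly indexed target disappears.
     index.remove(source);

     System.out.println(index.isEmpty() ? "target erroneously deleted" : "target survived");
     // Keying the delete on the same id used for the add avoids this.
   }
 }
 {code}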





[jira] [Created] (NUTCH-1820) remove field orig which duplicates id

2014-07-21 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-1820:
--

 Summary: remove field orig which duplicates id
 Key: NUTCH-1820
 URL: https://issues.apache.org/jira/browse/NUTCH-1820
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 2.2.1
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 2.3


The indexing filter plugin index-basic (2.x only) adds a field orig which 
contains the real URL (not the reprUrl) and duplicates the field id (also 
regarding the field params: stored=true indexed=true). The field orig 
should be removed from index-basic and schema.xml.





[jira] [Created] (NUTCH-1821) Nutch Crawl class for EMR

2014-07-21 Thread Luis Lopez (JIRA)
Luis Lopez created NUTCH-1821:
-

 Summary: Nutch Crawl class for EMR
 Key: NUTCH-1821
 URL: https://issues.apache.org/jira/browse/NUTCH-1821
 Project: Nutch
  Issue Type: Wish
Affects Versions: 1.6
 Environment: Amazon EMR
Reporter: Luis Lopez


Hi all,

Some of us are using Amazon EMR to deploy/run Nutch, and from what I've been
reading in the users mailing list there are 2 common issues people run into:
first, EMR supports only up to Hadoop 1.0.3 (which is kind of old), and second,
from Nutch 1.8 on the Crawl class has been deprecated.

The first issue poses a problem when we try to deploy recent Nutch versions;
the most recent version that is supported by EMR is 1.6. The second issue is
that EMR receives a jar and a main class to do its job, and from 1.8 the Crawl
class has been removed.

After some tests we completed a branch (from Nutch 1.6 + Hadoop 1.0.3) that
improves the old Crawl class so it scales. Since 1.6 is an old version, I wonder
how we can contribute back to those who need to use Elastic MapReduce.

The things we did are:

a) Add the number of fetchers as a parameter to the Crawl class.
For some reason the generator was always defaulting to one fetch list (see:
http://stackoverflow.com/questions/10264183/why-does-nutch-only-run-the-fetch-step-on-one-hadoop-node-when-the-cluster-has),
creating just one fetch map task; with the new parameter we can adjust the number
of map tasks to fit the cluster size.
b) Index documents on each crawl cycle and not at the end.
We had performance/memory issues when we tried to index all the documents once
the whole crawl was done, so we moved the indexing step into the main crawl cycle.
c) We added an option to delete segments after their content is indexed into
Solr. It saves HDFS space, since the EC2 instances we use don't have a lot of
space. (A rough sketch of how these pieces fit together follows the list.)
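Roughly, the resulting loop looks like the following sketch (class and method
names here are placeholders, not the actual branch code):
{code}
// Placeholder sketch of a per-cycle crawl driver; the stubs stand in for the
// corresponding Nutch jobs (Generator, Fetcher, ParseSegment, CrawlDb update, indexer).
public class EmrCrawlLoopSketch {

  public static void main(String[] args) throws Exception {
    int rounds = args.length > 0 ? Integer.parseInt(args[0]) : 1;
    int numFetchers = args.length > 1 ? Integer.parseInt(args[1]) : 1; // (a) fetch lists / map tasks
    boolean deleteSegments = true;                                     // (c) save HDFS space on small EC2 disks

    for (int i = 0; i < rounds; i++) {
      String segment = generate(numFetchers); // pass numFetchers through to the Generator
      fetch(segment);
      parse(segment);
      updateDb(segment);
      indexToSolr(segment);                   // (b) index inside each cycle, not at the end
      if (deleteSegments) {
        deleteSegment(segment);               // (c) drop the segment once its content is indexed
      }
    }
  }

  // Stubs standing in for the real Nutch jobs.
  static String generate(int numFetchers) { return "segment-" + System.currentTimeMillis(); }
  static void fetch(String segment) {}
  static void parse(String segment) {}
  static void updateDb(String segment) {}
  static void indexToSolr(String segment) {}
  static void deleteSegment(String segment) {}
}
{code}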

So far these fixes have allowed us to scale out Nutch and be efficient with Amazon
EMR clusters. If you guys think that there is some value in these changes we
can submit a patch file.

Luis.





[jira] [Updated] (NUTCH-1821) Nutch Crawl class for EMR

2014-07-21 Thread Luis Lopez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Lopez updated NUTCH-1821:
--

Description: 
Hi all,

Some of us are using Amazon EMR to deploy/run Nutch, and from what I've been
reading in the users mailing list there are 2 common issues people run into:
first, EMR supports only up to Hadoop 1.0.3 (which is kind of old), and second,
from Nutch 1.8 on the Crawl class has been deprecated.

The first issue poses a problem when we try to deploy recent Nutch versions;
the most recent version that is supported by EMR is 1.6. The second issue is
that EMR receives a jar and a main class to do its job, and from 1.8 the Crawl
class has been removed.

After some tests we completed a branch (from Nutch 1.6 + Hadoop 1.0.3) that
improves the old Crawl class so it scales. Since 1.6 is an old version, I wonder
how we can contribute back to those who need to use Elastic MapReduce.

The things we did are:

a) Add the number of fetchers as a parameter to the Crawl class.
For some reason the generator was always defaulting to one fetch list (see:
http://stackoverflow.com/questions/10264183/why-does-nutch-only-run-the-fetch-step-on-one-hadoop-node-when-the-cluster-has),
creating just one fetch map task; with the new parameter we can adjust the number
of map tasks to fit the cluster size.
b) Index documents on each crawl cycle and not at the end.
We had performance/memory issues when we tried to index all the documents once
the whole crawl was done, so we moved the indexing step into the main crawl cycle.
c) We added an option to delete segments after their content is indexed into
Solr. It saves HDFS space, since the EC2 instances we use don't have a lot of
space.

So far these fixes have allowed us to scale out Nutch and be efficient with
Amazon EMR clusters. If you guys think that there is some value in these
changes we can submit a patch file.

Luis.

  was:
Hi all,

Some of us are using Amazon EMR to deploy/run Nutch and from what I've been 
reading in the users mailing list there are 2 common issues people run into... 
first, EMR supports up to Hadoop 1.0.3 (which is kind of old) and second, from 
Nutch 1.8+ the Crawl class has been deprecated. 

The first issue poses a problem when we try to deploy recent Nutch versions. 
The most recent version that is supported by EMR is 1.6, the second issue is 
that EMR receives a jar and main class to do its job and from 1.8 the Crawl 
class has been removed.

After some tests we completed a branch (from Nutch 1.6 + Hadoop 1.0.3) that 
improves the old Crawl class so it scales, since 1.6 is an old version I wonder 
how can we contribute back to those that need to use ElasticMapreduce.

The things we did are:

a) Add num fetchers as a parameter to the Crawl class.
For some reason the generator was always defaulting to one list( see: 
http://stackoverflow.com/questions/10264183/why-does-nutch-only-run-the-fetch-step-on-one-hadoop-node-when-the-cluster-has)
 creating just one fetch map task... with the new parameter we can adjust the 
map tasks to fit the cluster size.
b) Index documents on each Crawl cycle and not at the end.
 We had performance/memory issues when we tried to index all the documents 
when the whole crawl is done, we moved the index part into the main Crawl cycle.
c) We added an option to delete segments after their content is indexed into 
Solr. It saves HDF space since the EC2 instances we use don't have a lot of 
space.

So far these fixes have allowed us to scale out Nutch be efficient with Amazon 
EMR clusters. If you guys think that there is some value on these changes we 
can  submit a patch file.

Luis.


 Nutch Crawl class for EMR
 -

 Key: NUTCH-1821
 URL: https://issues.apache.org/jira/browse/NUTCH-1821
 Project: Nutch
  Issue Type: Wish
Affects Versions: 1.6
 Environment: Amazon EMR
Reporter: Luis Lopez
  Labels: Amazon, Crawler, EMR, performance

 Hi all,
 Some of us are using Amazon EMR to deploy/run Nutch, and from what I've been
 reading in the users mailing list there are 2 common issues people run
 into: first, EMR supports only up to Hadoop 1.0.3 (which is kind of old), and
 second, from Nutch 1.8 on the Crawl class has been deprecated.
 The first issue poses a problem when we try to deploy recent Nutch versions;
 the most recent version that is supported by EMR is 1.6. The second issue is
 that EMR receives a jar and a main class to do its job, and from 1.8 the Crawl
 class has been removed.
 After some tests we completed a branch (from Nutch 1.6 + Hadoop 1.0.3) that
 improves the old Crawl class so it scales. Since 1.6 is an old version, I
 wonder how we can contribute back to those who need to use Elastic MapReduce.
 The things we did are:
 a) Add num fetchers as a parameter to the Crawl class.
 For some reason the generator was 

[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2014-07-21 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069567#comment-14069567
 ] 

Alexander Kingson commented on NUTCH-1679:
--

Hi,

I was suggesting to close the datastore by adding this function 

protected void cleanup(Context context) throws IOException,
InterruptedException {
store.close();
};

to the reducer class.
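
For context, here is a minimal sketch of where such an override sits in a Hadoop
Reducer (the class name, key/value types and the store field are placeholders, not
the actual Nutch reducer):
{code}
import java.io.Closeable;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class StoreClosingReducerSketch extends Reducer<Text, Text, Text, Text> {

  // Stands in for the Gora data store opened elsewhere (e.g. in setup()).
  private Closeable store;

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    if (store != null) {
      store.close(); // release the datastore connection when the reducer finishes
    }
  }
}
{code}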

Also, I found another issue, where inlink data is wiped out in subsequent
updatedb stages. To solve this issue I replaced these lines:

if (page.getInlinks() != null) {
  page.getInlinks().clear();
}

with
if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
  page.getInlinks().clear();
}

This code change has been tested in only one case, and I am not sure whether it
solves the issue permanently.

Thanks.
Alex.

 UpdateDb using batchId, link may override crawled page.
 ---

 Key: NUTCH-1679
 URL: https://issues.apache.org/jira/browse/NUTCH-1679
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Tien Nguyen Manh
Priority: Critical
 Fix For: 2.3

 Attachments: NUTCH-1679.patch


 The problem is in the HBase store; not sure about other stores.
 Suppose in the first crawl cycle we crawl link A, then get an outlink B.
 In the second cycle we crawl link B, which also has a link pointing to A.
 In the second updatedb we load only page B from the store, and will add A as a new
 link because it doesn't know that A already exists in the store, and will override A.
 UpdateDb must be run without a batchId, or we must set additionsAllowed=false.
 Here is the code for a new page:
   page = new WebPage();
   schedule.initializeSchedule(url, page);
   page.setStatus(CrawlStatus.STATUS_UNFETCHED);
   try {
     scoringFilters.initialScore(url, page);
   } catch (ScoringFilterException e) {
     page.setScore(0.0f);
   }
 The new page will override the old page's status, score, fetchTime, fetchInterval,
 retries, metadata[CASH_KEY].
 - I think we can change something here so that a new page only updates one
 column, for example 'link', and if it is really a new page, we can initialize
 all the above fields in the generator.
 - Or we add a checkAndPut operation to the store, so that when adding a new page we
 check whether it already exists first.
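 A rough sketch of the checkAndPut idea, using the HBase 1.x client API directly
 (the table, column family and qualifier below are placeholders; Nutch/Gora would
 wire this differently):
 {code}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.TableName;
 import org.apache.hadoop.hbase.client.Connection;
 import org.apache.hadoop.hbase.client.ConnectionFactory;
 import org.apache.hadoop.hbase.client.Put;
 import org.apache.hadoop.hbase.client.Table;
 import org.apache.hadoop.hbase.util.Bytes;

 public class CheckAndPutSketch {
   public static void main(String[] args) throws Exception {
     Configuration conf = HBaseConfiguration.create();
     try (Connection connection = ConnectionFactory.createConnection(conf);
          Table webpage = connection.getTable(TableName.valueOf("webpage"))) {

       byte[] row = Bytes.toBytes("com.example.www:http/"); // reversed-URL row key (placeholder)
       byte[] family = Bytes.toBytes("f");                  // placeholder column family
       byte[] status = Bytes.toBytes("st");                 // placeholder qualifier

       Put newPage = new Put(row);
       newPage.addColumn(family, status, Bytes.toBytes(1)); // e.g. an "unfetched" status marker

       // Expected value null means "only write if this cell does not exist yet",
       // so a row that was already crawled is not overwritten with a blank page.
       boolean added = webpage.checkAndPut(row, family, status, null, newPage);
       System.out.println(added ? "new page inserted" : "page already exists, left untouched");
     }
   }
 }
 {code}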





[jira] [Comment Edited] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2014-07-21 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069567#comment-14069567
 ] 

Alexander Kingson edited comment on NUTCH-1679 at 7/22/14 12:15 AM:


Hi,

I was suggesting to close the datastore by adding this function 

protected void cleanup(Context context) throws IOException,
InterruptedException {
store.close();
}

to the reducer class.

Also, I found another issue, where inlink data is wiped out in subsequent
updatedb stages. To solve this issue I replaced these lines:

if (page.getInlinks() != null) {
  page.getInlinks().clear();
}

with
if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
  page.getInlinks().clear();
}

This code change has been tested in only one case, and I am not sure whether it
solves the issue permanently.

Thanks.
Alex.


was (Author: alxksn):
Hi,

I was suggesting to close the datastore by adding this function 

protected void cleanup(Context context) throws IOException,
InterruptedException {
store.close();
};

to reducer class.

Also, I found  another issue, when inlink data is wiped out in the next 
updatedb stages. To solve this issue I replaced these lines

if (page.getInlinks() != null) {
  page.getInlinks().clear();
}

with
if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
  page.getInlinks().clear();
}

This code change is tested only in one case, and I am not sure if it solves  
the issue permanently. 

Thanks.
Alex.

 UpdateDb using batchId, link may override crawled page.
 ---

 Key: NUTCH-1679
 URL: https://issues.apache.org/jira/browse/NUTCH-1679
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Tien Nguyen Manh
Priority: Critical
 Fix For: 2.3

 Attachments: NUTCH-1679.patch


 The problem is in the HBase store; not sure about other stores.
 Suppose in the first crawl cycle we crawl link A, then get an outlink B.
 In the second cycle we crawl link B, which also has a link pointing to A.
 In the second updatedb we load only page B from the store, and will add A as a new
 link because it doesn't know that A already exists in the store, and will override A.
 UpdateDb must be run without a batchId, or we must set additionsAllowed=false.
 Here is the code for a new page:
   page = new WebPage();
   schedule.initializeSchedule(url, page);
   page.setStatus(CrawlStatus.STATUS_UNFETCHED);
   try {
     scoringFilters.initialScore(url, page);
   } catch (ScoringFilterException e) {
     page.setScore(0.0f);
   }
 The new page will override the old page's status, score, fetchTime, fetchInterval,
 retries, metadata[CASH_KEY].
 - I think we can change something here so that a new page only updates one
 column, for example 'link', and if it is really a new page, we can initialize
 all the above fields in the generator.
 - Or we add a checkAndPut operation to the store, so that when adding a new page we
 check whether it already exists first.





[jira] [Comment Edited] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2014-07-21 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069567#comment-14069567
 ] 

Alexander Kingson edited comment on NUTCH-1679 at 7/22/14 12:46 AM:


Hi,

I was suggesting to close the datastore by adding this function 

protected void cleanup(Context context) throws IOException,
InterruptedException {
store.close();
}

to the reducer class.

Also, I found another issue, where inlinks data is wiped out in subsequent
updatedb stages. To solve this issue I replaced these lines:

if (page.getInlinks() != null) {
  page.getInlinks().clear();
}

with
if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
  page.getInlinks().clear();
}

This code change has been tested in only one case, and I am not sure whether it
solves the issue permanently. Basically, if not clearing the inlinks data in each
call to the reduce function does not cause inlinks data to overlap between keys,
then this code change solves the issue.

Thanks.
Alex.


was (Author: alxksn):
Hi,

I was suggesting to close the datastore by adding this function 

protected void cleanup(Context context) throws IOException,
InterruptedException {
store.close();
}

to reducer class.

Also, I found  another issue, when inlink data is wiped out in the next 
updatedb stages. To solve this issue I replaced these lines

if (page.getInlinks() != null) {
  page.getInlinks().clear();
}

with
if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
  page.getInlinks().clear();
}

This code change is tested only in one case, and I am not sure if it solves  
the issue permanently. 

Thanks.
Alex.

 UpdateDb using batchId, link may override crawled page.
 ---

 Key: NUTCH-1679
 URL: https://issues.apache.org/jira/browse/NUTCH-1679
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Tien Nguyen Manh
Priority: Critical
 Fix For: 2.3

 Attachments: NUTCH-1679.patch


 The problem is in the HBase store; not sure about other stores.
 Suppose in the first crawl cycle we crawl link A, then get an outlink B.
 In the second cycle we crawl link B, which also has a link pointing to A.
 In the second updatedb we load only page B from the store, and will add A as a new
 link because it doesn't know that A already exists in the store, and will override A.
 UpdateDb must be run without a batchId, or we must set additionsAllowed=false.
 Here is the code for a new page:
   page = new WebPage();
   schedule.initializeSchedule(url, page);
   page.setStatus(CrawlStatus.STATUS_UNFETCHED);
   try {
     scoringFilters.initialScore(url, page);
   } catch (ScoringFilterException e) {
     page.setScore(0.0f);
   }
 The new page will override the old page's status, score, fetchTime, fetchInterval,
 retries, metadata[CASH_KEY].
 - I think we can change something here so that a new page only updates one
 column, for example 'link', and if it is really a new page, we can initialize
 all the above fields in the generator.
 - Or we add a checkAndPut operation to the store, so that when adding a new page we
 check whether it already exists first.


