Build failed in Jenkins: Nutch-trunk #1586

2011-08-26 Thread Apache Jenkins Server
See 

--
[...truncated 986 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A src/plugin/parse-html/src/test/org/apache/nutch/parse/html
A src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestRobotsMetaProcessor.java
A src/plugin/parse-html/s

[jira] [Commented] (NUTCH-937) When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins because MapReduce will not unpack plugin/ directory from the job's pack (due to MAPREDUCE-967)

2011-08-26 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092064#comment-13092064
 ] 

Lewis John McGibbney commented on NUTCH-937:


This is not strictly true: Nutch does not contain or include hadoop-core 
0.20.2, it depends upon it, just as it depends on many other artifacts. The 
lengthy discussion above addresses the problem this issue was created to 
solve.

@Julien: I agree with this; it would be nice to offer this flexibility as, 
quite obviously, there are a number of Hadoop distros that people can (and 
do) run Nutch on. 

> When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins 
> because MapReduce will not unpack plugin/ directory from the job's pack (due 
> to MAPREDUCE-967)
> -
>
> Key: NUTCH-937
> URL: https://issues.apache.org/jira/browse/NUTCH-937
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.2
> Environment: hadoop 0.21 or cloudera hadoop 0.20.2+737
>Reporter: Claudio Martella
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> Jobs running on hadoop 0.21 or cloudera cdh 0.20.2+737 will fail because
> of missing plugins (i.e.):
> 10/10/28 12:22:21 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 10/10/28 12:22:22 INFO mapred.FileInputFormat: Total input paths to process : 1
> 10/10/28 12:22:23 INFO mapred.JobClient: Running job: job_201010271826_0002
> 10/10/28 12:22:24 INFO mapred.JobClient:  map 0% reduce 0%
> 10/10/28 12:22:39 INFO mapred.JobClient: Task Id : attempt_201010271826_0002_m_00_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:379)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:317)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> at org.apache.hadoop.mapred.Child.main(Child.java:211)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 9 more
> Caused by: java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> ... 14 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 17 more
> Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
> at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
> at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
> ... 22 more
> 10/10/28 12:22:40 INFO mapred.JobClient: Task Id : attempt_201010271826_0002_m_01_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:379)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:317)

[jira] [Commented] (NUTCH-937) When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins because MapReduce will not unpack plugin/ directory from the job's pack (due to MAPREDUCE-967)

2011-08-26 Thread Radim Kolar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092059#comment-13092059
 ] 

Radim Kolar commented on NUTCH-937:
---

nutch-1.4 contains hadoop-core 0.20.2. If Nutch 1.4 is compatible with a 
higher version, then the included hadoop-core should be upgraded. I hope that 
will fix my problems with running Nutch locally.

> When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins 
> because MapReduce will not unpack plugin/ directory from the job's pack (due 
> to MAPREDUCE-967)
> -
>
> Key: NUTCH-937
> URL: https://issues.apache.org/jira/browse/NUTCH-937
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.2
> Environment: hadoop 0.21 or cloudera hadoop 0.20.2+737
>Reporter: Claudio Martella
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> Jobs running on hadoop 0.21 or cloudera cdh 0.20.2+737 will fail because
> of missing plugins (i.e.):
> 10/10/28 12:22:21 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 10/10/28 12:22:22 INFO mapred.FileInputFormat: Total input paths to process : 1
> 10/10/28 12:22:23 INFO mapred.JobClient: Running job: job_201010271826_0002
> 10/10/28 12:22:24 INFO mapred.JobClient:  map 0% reduce 0%
> 10/10/28 12:22:39 INFO mapred.JobClient: Task Id : attempt_201010271826_0002_m_00_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:379)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:317)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> at org.apache.hadoop.mapred.Child.main(Child.java:211)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 9 more
> Caused by: java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> ... 14 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 17 more
> Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
> at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
> at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
> ... 22 more
> 10/10/28 12:22:40 INFO mapred.JobClient: Task Id : attempt_201010271826_0002_m_01_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:379)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:317)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGro

[Nutch Wiki] Trivial Update of "FrontPage" by LewisJohnMcgibbney

2011-08-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "FrontPage" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=224&rev2=225

   * MultiLingualSupport - ''In development''.
   * FixingOpicScoring - ''In planning''.
   * HowToContribute
-  * TaskList -- Tasks for Nutch developers.
+  * TaskList -- Tasks for Nutch developers. /!\ :Severe update required: /!\
   * [[Committer's_Rules]] -- Committers should follow these guidelines when deciding which branch to use for committing patches and when to commit.
   * [[Release_HOWTO]]
   * [[Website_Update_HOWTO]]


[Nutch Wiki] Trivial Update of "FrontPage" by LewisJohnMcgibbney

2011-08-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "FrontPage" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=223&rev2=224

   * [[Image_Search_Design]]
   * [[NutchOSGi]]
   * StrategicGoals
-  * IndexStructure
+  * IndexStructure /!\ :This page needs a slight update to provide more information on plugins and the data they send to Solr for indexing: /!\
   * [[Getting_Started]]
   * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6)
   * [[NutchMavenSupport|Using Nutch as a Maven dependency]]


[Nutch Wiki] Trivial Update of "IndexStructure" by LewisJohnMcgibbney

2011-08-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "IndexStructure" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/IndexStructure?action=diff&rev1=7&rev2=8

  
  The index structure formed after indexing is shown below : 
  
- ||'''FieldName'''||'''Stored'''||'''Index'''|| '''IndexingFilter''' ||'''Comment'''||
+ ||'''Field Name'''||'''Stored'''||'''Index'''|| '''Indexing Filter/Plugin''' ||'''Comment'''||
- ||boost|| YES ||  NotIndexed  ||  Indexer || ||
+ ||boost|| YES ||  Not Indexed ||  Indexer || ||
- ||digest  ||  YES ||  NotIndexed  ||  Indexer || ||
+ ||digest  ||  YES ||  Not Indexed ||  Indexer || ||
- ||lang||  YES ||  UnTokenized ||  language-identifier || ||
+ ||lang||  YES ||  Un-Tokenized||  language-identifier || ||
- ||segment ||  YES ||  NotIndexed  ||  Indexer || ||
+ ||segment ||  YES ||  Not Indexed ||  Indexer || ||
  ||tstamp  ||  YES ||  Tokenized   ||  Indexer || ||
  ||anchor  ||  NO  ||  Tokenized   ||  index-anchor || Indexing filter that indexes all inbound '''anchor text''' for a document.||
  ||title   ||  YES ||  Tokenized   ||  index-basic || Adds basic searchable '''title field''' to a document. Also indexed by index-more ||
- ||site||  NO  ||  UnTokenized ||  index-basic || Adds basic searchable '''site field''' to a document. ||
+ ||site||  NO  ||  Un-Tokenized||  index-basic || Adds basic searchable '''site field''' to a document. ||
  ||host||  NO  ||  Tokenized   ||  index-basic || Adds basic searchable '''hostname field''' to a document. ||
  ||url ||  YES ||  Tokenized   ||  index-basic || Adds basic searchable '''URL field''' to a document. ||
  ||content ||  NO  ||  Tokenized   ||  index-basic || Adds basic searchable '''content field''' to a document. ||
  ||lastModified||  YES ||  NotIndexed  ||  index-more || ||
- ||date||  NO  ||  UnTokenized ||  index-more || ||
+ ||date||  NO  ||  Un-Tokenized||  index-more || ||
- ||contentLength   ||  YES ||  NotIndexed  ||  index-more || ||
+ ||contentLength   ||  YES ||  Not Indexed ||  index-more || ||
- ||type||  NO  ||  UnTokenized ||  index-more  ||  contentType,primaryType,subType (all mime-types) ||
+ ||type||  NO  ||  Un-Tokenized||  index-more  ||  contentType,primaryType,subType (all mime-types) ||
- ||primaryType ||  YES ||  UnTokenized ||  index-more  ||  primaryType (mime-type) ||
+ ||primaryType ||  YES ||  Un-Tokenized||  index-more  ||  primaryType (mime-type) ||
- ||subType ||  YES ||  UnTokenized ||  index-more  ||  subType (mime-type) ||
+ ||subType ||  YES ||  Un-Tokenized||  index-more  ||  subType (mime-type) ||
- ||  tld || YES  || UnTokenized / NotStored(based on conf) || tld || see http://issues.apache.org/jira/browse/NUTCH-439 ||
+ ||  tld || YES  || Un-Tokenized / NotStored(based on conf) || tld || see http://issues.apache.org/jira/browse/NUTCH-439 ||
- ||  category||NO|| UnTokenized || index-url-category || see http://issues.apache.org/jira/browse/NUTCH-386 ||
+ ||  category||NO|| Un-Tokenized || index-url-category || see http://issues.apache.org/jira/browse/NUTCH-386 ||
  ||  subcollection   ||YES || Tokenized || subcollection || see subcollection plugin ||
  
  

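The table above maps each index field to the indexing filter plugin that produces it. For readers wondering what such a filter looks like, below is a minimal, hypothetical IndexingFilter sketch against the Nutch 1.x trunk API of this period; the ExampleIndexingFilter name and the "example" field are illustrative, not part of any shipped plugin. A real plugin would also register the class under the org.apache.nutch.indexer.IndexingFilter extension point in its plugin.xml and be enabled through the plugin.includes property.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    // Hypothetical indexing filter; adds one illustrative field per document.
    public class ExampleIndexingFilter implements IndexingFilter {

      private Configuration conf;

      // Called once per document being indexed; add fields to doc and return
      // it, or return null to drop the document from the index.
      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        doc.add("example", parse.getData().getTitle());
        return doc;
      }

      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      public Configuration getConf() {
        return conf;
      }
    }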

[Nutch Wiki] Trivial Update of "IndexStructure" by LewisJohnMcgibbney

2011-08-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "IndexStructure" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/IndexStructure?action=diff&rev1=6&rev2=7

  ||lang||  YES ||  UnTokenized ||  language-identifier || ||
  ||segment ||  YES ||  NotIndexed  ||  Indexer || ||
  ||tstamp  ||  YES ||  Tokenized   ||  Indexer || ||
- ||anchor  ||  NO  ||  Tokenized   ||  index-anchor || Indexing filter that indexes all inbound anchor text for a document.||
+ ||anchor  ||  NO  ||  Tokenized   ||  index-anchor || Indexing filter that indexes all inbound '''anchor text''' for a document.||
- ||title   ||  YES ||  Tokenized   ||  index-basic || Adds basic searchable title field to a document. Also indexed by index-more ||
+ ||title   ||  YES ||  Tokenized   ||  index-basic || Adds basic searchable '''title field''' to a document. Also indexed by index-more ||
- ||site||  NO  ||  UnTokenized ||  index-basic || Adds basic searchable site field to a document. ||
+ ||site||  NO  ||  UnTokenized ||  index-basic || Adds basic searchable '''site field''' to a document. ||
- ||host||  NO  ||  Tokenized   ||  index-basic || Adds basic searchable hostname field to a document. ||
+ ||host||  NO  ||  Tokenized   ||  index-basic || Adds basic searchable '''hostname field''' to a document. ||
- ||url ||  YES ||  Tokenized   ||  index-basic || Adds basic searchable URL field to a document. ||
+ ||url ||  YES ||  Tokenized   ||  index-basic || Adds basic searchable '''URL field''' to a document. ||
- ||content ||  NO  ||  Tokenized   ||  index-basic ||  content ||
+ ||content ||  NO  ||  Tokenized   ||  index-basic || Adds basic searchable '''content field''' to a document. ||
  ||lastModified||  YES ||  NotIndexed  ||  index-more || ||
  ||date||  NO  ||  UnTokenized ||  index-more || ||
  ||contentLength   ||  YES ||  NotIndexed  ||  index-more || ||


[Nutch Wiki] Trivial Update of "IndexStructure" by LewisJohnMcgibbney

2011-08-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "IndexStructure" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/IndexStructure?action=diff&rev1=5&rev2=6

  ||segment ||  YES ||  NotIndexed  ||  Indexer || ||
  ||tstamp  ||  YES ||  Tokenized   ||  Indexer || ||
  ||anchor  ||  NO  ||  Tokenized   ||  index-anchor || Indexing filter that indexes all inbound anchor text for a document.||
- ||title   ||  YES ||  Tokenized   ||  index-basic ||  also by index-more ||
- ||site||  NO  ||  UnTokenized ||  index-basic || ||
- ||host||  NO  ||  Tokenized   ||  index-basic ||  hostname ||
- ||url ||  YES ||  Tokenized   ||  index-basic || ||
+ ||title   ||  YES ||  Tokenized   ||  index-basic || Adds basic searchable title field to a document. Also indexed by index-more ||
+ ||site||  NO  ||  UnTokenized ||  index-basic || Adds basic searchable site field to a document. ||
+ ||host||  NO  ||  Tokenized   ||  index-basic || Adds basic searchable hostname field to a document. ||
+ ||url ||  YES ||  Tokenized   ||  index-basic || Adds basic searchable URL field to a document. ||
  ||content ||  NO  ||  Tokenized   ||  index-basic ||  content ||
  ||lastModified||  YES ||  NotIndexed  ||  index-more || ||
  ||date||  NO  ||  UnTokenized ||  index-more || ||


[Nutch Wiki] Trivial Update of "IndexStructure" by LewisJohnMcgibbney

2011-08-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "IndexStructure" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/IndexStructure?action=diff&rev1=4&rev2=5

  ||lang||  YES ||  UnTokenized ||  language-identifier || ||
  ||segment ||  YES ||  NotIndexed  ||  Indexer || ||
  ||tstamp  ||  YES ||  Tokenized   ||  Indexer || ||
- ||anchor  ||  NO  ||  Tokenized   ||  index-basic || ||
+ ||anchor  ||  NO  ||  Tokenized   ||  index-anchor || Indexing filter that indexes all inbound anchor text for a document.||
  ||title   ||  YES ||  Tokenized   ||  index-basic ||  also by index-more ||
  ||site||  NO  ||  UnTokenized ||  index-basic || ||
  ||host||  NO  ||  Tokenized   ||  index-basic ||  hostname ||


[jira] [Commented] (NUTCH-386) Plugin to index categories by url rules

2011-08-26 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091846#comment-13091846
 ] 

Lewis John McGibbney commented on NUTCH-386:


What is the position with this one? I'm in the middle of updating the wiki and 
there was a reference to this issue. I'm assuming that you could close it 
again as Won't Fix. Legacy?

> Plugin to index categories by url rules
> ---
>
> Key: NUTCH-386
> URL: https://issues.apache.org/jira/browse/NUTCH-386
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Ernesto De Santis
>Priority: Minor
> Attachments: index-url-category-0.1.zip, index-url-category.jar
>
>
> The compressed zip has a install_notes.txt file with instructions.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[Nutch Wiki] Trivial Update of "IndexStructure" by LewisJohnMcgibbney

2011-08-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "IndexStructure" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/IndexStructure?action=diff&rev1=3&rev2=4

  ||type||  NO  ||  UnTokenized ||  index-more  ||  contentType,primaryType,subType (all mime-types) ||
  ||primaryType ||  YES ||  UnTokenized ||  index-more  ||  primaryType (mime-type) ||
  ||subType ||  YES ||  UnTokenized ||  index-more  ||  subType (mime-type) ||
- ||  domain  || NO   || Tokenized  || index-domain  || see http://issues.apache.org/jira/browse/NUTCH-445 ||
  ||  tld || YES  || UnTokenized / NotStored(based on conf) || tld || see http://issues.apache.org/jira/browse/NUTCH-439 ||
  ||  category||NO|| UnTokenized || index-url-category || see http://issues.apache.org/jira/browse/NUTCH-386 ||
  ||  subcollection   ||YES || Tokenized || subcollection || see subcollection plugin ||


[Nutch Wiki] Trivial Update of "MapReduce" by LewisJohnMcgibbney

2011-08-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "MapReduce" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/MapReduce?action=diff&rev1=7&rev2=8

+ = How Map and Reduce operations are actually carried out =
+ == Introduction ==
+ 
+ This document describes how MapReduce operations are carried out in Hadoop. If you are not familiar with the Google [[http://labs.google.com/papers/mapreduce.html|MapReduce]] programming model you should get acquainted with it first.
+ 
  [[http://weblogs.java.net/blog/tomwhite/archive/2005/09/mapreduce.html#more|"Excerpt from TomWhite's blog: MapReduce"]]<>
- 
-  * MapReduce is the brainchild of Google and is very well documented by Jeffrey Dean and Sanjay Ghemawat in their paper [[http://labs.google.com/papers/mapreduce.html|"MapReduce: Simplified Data Processing on Large Clusters"]].
  
   * In essence, it allows massive data sets to be processed in a distributed fashion by breaking the processing into many small computations of two types:
    1. A Map operation that transforms the input into an intermediate representation.
@@ -10, +13 @@

  
   * This processing model is ideal for the operations a search engine indexer like Nutch or Google needs to perform - like computing inlinks for URLs, or building inverted indexes - and it will [[attachment:Presentations/mapred.pdf|"transform Nutch"]] into a scalable, distributed search engine.
  
+ <>
+ 
+ == Map ==
+ 
+ As the Map operation is parallelized the input file set is first split to several pieces called [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapreduce/lib/input/FileSplit.html|FileSplits]]. If an individual file is so large that it will affect seek time it will be split to several Splits. The splitting does not know anything about the input file's internal logical structure, for example line-oriented text files are split on arbitrary byte boundaries. Then a new map task is created per !FileSplit.
+ 
+ When an individual map task starts it will open a new output writer per configured reduce task. It will then proceed to read its !FileSplit using the [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/RecordReader.html|RecordReader]] it gets from the specified [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat]]. !InputFormat parses the input and generates key-value pairs. !InputFormat must also handle records that may be split on the !FileSplit boundary. For example [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html|TextInputFormat]] will read the last line of the !FileSplit past the split boundary and, when reading other than the first !FileSplit, !TextInputFormat ignores the content up to the first newline.
+ 
+ It is not necessary for the !InputFormat to generate both meaningful keys ''and'' values. For example the default output from !TextInputFormat consists of input lines as values and somewhat meaninglessly line start file offsets as keys - most applications only use the lines and ignore the offsets.
+ 
+ As key-value pairs are read from the !RecordReader they are passed to the configured [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html|Mapper]]. The user supplied Mapper does whatever it wants with the input pair and calls [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html#collect(org.apache.hadoop.io.WritableComparable,%20org.apache.hadoop.io.Writable)|OutputCollector.collect]] with key-value pairs of its own choosing. The output it generates must use one key class and one value class. This is because the Map output will be written into a SequenceFile which has per-file type information and all the records must have the same type (use subclassing if you want to output different data structures). The Map input and output key-value pairs are not necessarily related typewise or in cardinality.
+ 
+ When Mapper output is collected it is partitioned, which means that it will be written to the output specified by the [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapreduce/Partitioner.html|Partitioner]]. The default [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapreduce/lib/partition/HashPartitioner.html|HashPartitioner]] uses the hashcode function on the key's class (which means that this hashcode function must be good in order to achieve an even workload across the reduce tasks). See [[http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/MapTask.java?view=markup|MapTask]] for details.
+ 
+ N input files will generate M map tasks to be run and each map task will generate as many output files as there are reduce tas
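To make the Map-side walkthrough above concrete, here is a minimal word-count Mapper against the old org.apache.hadoop.mapred API that the page links to. This is a generic Hadoop sketch, not Nutch code, and the class name is illustrative; note how it ignores the byte-offset keys produced by TextInputFormat and emits a single key class and value class through OutputCollector.collect, exactly as the page requires.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative Mapper for the old (org.apache.hadoop.mapred) API.
    public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      // Called once per record handed over by the RecordReader. The key is
      // the line's byte offset from TextInputFormat; like most applications,
      // we ignore it and use only the line value.
      public void map(LongWritable offset, Text line,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        for (String token : line.toString().split("\\s+")) {
          if (token.isEmpty()) {
            continue;
          }
          word.set(token);
          // One key class (Text) and one value class (IntWritable) per job,
          // since the collected output is written to a typed SequenceFile.
          output.collect(word, ONE);
        }
      }
    }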

[Nutch Wiki] Trivial Update of "Archive and Legacy" by LewisJohnMcgibbney

2011-08-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "Archive and Legacy" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/Archive%20and%20Legacy?action=diff&rev1=15&rev2=16

  === Development and Old Nutch 2.0 ===
   * InstallingWeb2
   * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture (old)
- 
+  * JavaDemoApplication - A simple demonstration of how to use the Nutch API in a Java application
  === Pre-Nutch 1.3 Plugin Resources ===
   * OldPluginCentral
  


[Nutch Wiki] Trivial Update of "FrontPage" by LewisJohnMcgibbney

2011-08-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "FrontPage" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=222&rev2=223

   * StrategicGoals
   * IndexStructure
   * [[Getting_Started]]
-  * JavaDemoApplication - A simple demonstration of how to use the Nutch API in a Java application
   * ApacheConUs2009MeetUp - List of topics for !MeetUp at !ApacheCon US 2009 in Oakland (Nov 2-6)
   * [[NutchMavenSupport|Using Nutch as a Maven dependency]]
  


[no subject]

2011-08-26 Thread gaurav bagga



[jira] [Commented] (NUTCH-937) When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins because MapReduce will not unpack plugin/ directory from the job's pack (due to MAPREDUCE-967)

2011-08-26 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091747#comment-13091747
 ] 

Julien Nioche commented on NUTCH-937:
-

@Radim: Nutch is based on the Apache distribution of Hadoop and 1.4 already 
works with it. No one suggested that it should be based on something different. 
The point here is that if we can get it to work on other distributions by 
simply adding a default parameter then it is probably worth doing. 

@Ferdy: I don't agree that embedding the property within nutch-site has no 
effect -> it does work. You probably have a different issue.
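For readers following the thread: the parameter being discussed is, as I understand it, the job-jar unpack pattern introduced by MAPREDUCE-967, which defaults to unpacking only classes/ and lib/ and therefore skips Nutch's plugins/ directory. Below is a sketch of the kind of override meant here, placed in conf/nutch-site.xml so it becomes part of the job configuration; verify the property name and default value against the Hadoop version you actually deploy on.

    <property>
      <name>mapreduce.job.jar.unpack.pattern</name>
      <value>(?:classes/|lib/|plugins/).*</value>
      <description>Also unpack plugins/ from the job jar so that the
      Nutch plugin repository can find its plugins on the task
      trackers (see MAPREDUCE-967).</description>
    </property>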

> When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins 
> because MapReduce will not unpack plugin/ directory from the job's pack (due 
> to MAPREDUCE-967)
> -
>
> Key: NUTCH-937
> URL: https://issues.apache.org/jira/browse/NUTCH-937
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.2
> Environment: hadoop 0.21 or cloudera hadoop 0.20.2+737
>Reporter: Claudio Martella
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> Jobs running on hadoop 0.21 or cloudera cdh 0.20.2+737 will fail because
> of missing plugins (i.e.):
> 10/10/28 12:22:21 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 10/10/28 12:22:22 INFO mapred.FileInputFormat: Total input paths to process : 1
> 10/10/28 12:22:23 INFO mapred.JobClient: Running job: job_201010271826_0002
> 10/10/28 12:22:24 INFO mapred.JobClient:  map 0% reduce 0%
> 10/10/28 12:22:39 INFO mapred.JobClient: Task Id : attempt_201010271826_0002_m_00_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:379)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:317)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> at org.apache.hadoop.mapred.Child.main(Child.java:211)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 9 more
> Caused by: java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> ... 14 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 17 more
> Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
> at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
> at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
> ... 22 more
> 10/10/28 12:22:40 INFO mapred.JobClient: Task Id : attempt_201010271826_0002_m_01_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:379)
> at org.apache.hadoop.mapred.MapTask.run(MapTask

[jira] [Commented] (NUTCH-937) When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins because MapReduce will not unpack plugin/ directory from the job's pack (due to MAPREDUCE-967)

2011-08-26 Thread Radim Kolar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091740#comment-13091740
 ] 

Radim Kolar commented on NUTCH-937:
---

We should stick with hadoop 0.20.203.0, not CDH, and make nutch-1.4 work with 
it.

> When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins 
> because MapReduce will not unpack plugin/ directory from the job's pack (due 
> to MAPREDUCE-967)
> -
>
> Key: NUTCH-937
> URL: https://issues.apache.org/jira/browse/NUTCH-937
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.2
> Environment: hadoop 0.21 or cloudera hadoop 0.20.2+737
>Reporter: Claudio Martella
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> Jobs running on hadoop 0.21 or cloudera cdh 0.20.2+737 will fail because
> of missing plugins (i.e.):
> 10/10/28 12:22:21 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 10/10/28 12:22:22 INFO mapred.FileInputFormat: Total input paths to process : 1
> 10/10/28 12:22:23 INFO mapred.JobClient: Running job: job_201010271826_0002
> 10/10/28 12:22:24 INFO mapred.JobClient:  map 0% reduce 0%
> 10/10/28 12:22:39 INFO mapred.JobClient: Task Id : attempt_201010271826_0002_m_00_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:379)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:317)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> at org.apache.hadoop.mapred.Child.main(Child.java:211)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 9 more
> Caused by: java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> ... 14 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 17 more
> Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
> at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
> at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
> ... 22 more
> 10/10/28 12:22:40 INFO mapred.JobClient: Task Id : attempt_201010271826_0002_m_01_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
> at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:379)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:317)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> at org.apache.hadoop.mapred.Child.main(Child.java:211)
> 

[jira] [Resolved] (NUTCH-990) protocol-httpclient fails with short pages

2011-08-26 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-990.
-

   Resolution: Fixed
Fix Version/s: (was: 1.3)
   1.4

A patch has been committed recently that fixes the issues with compressed short 
pages - check out the code from SVN.
See https://issues.apache.org/jira/browse/NUTCH-1089

Note that protocol-httpclient still needs replacing and is considered broken. 
See https://issues.apache.org/jira/browse/NUTCH-1086

> protocol-httpclient fails with short pages
> --
>
> Key: NUTCH-990
> URL: https://issues.apache.org/jira/browse/NUTCH-990
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Reporter: Gabriele Kahlout
>Priority: Minor
> Fix For: 1.4
>
> Attachments: hadoop.log
>
>
> Using protocol-http with a few words html pages works fine. But with 
> protocol-httpclient the same pages disappear from the index, although they 
> are still fetched.
> Those small files are useful for quick testing. 
> Steps to reproduce:
> $ svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 nutch-1.3
> Checked out revision 1097214.
> $ cd nutch-1.3
> $ xmlstarlet edit -L -u 
> "/configuration/property[name='http.agent.name']"/value -v 'test' 
> conf/nutch-default.xml
> $ ant
> Download to runtime/local the following script and seeds list file. They 
> assume a $HADOOP_HOME environment variable. It's a 1.3 adaptation of [1].
> http://dp4j.sf.net/debug/whole-web-crawling-incremental
> http://dp4j.sf.net/debug/urls
> $ cd runtime/local
> This will empty your Solr index (-f) and crawl:
> $ ./whole-web-crawling-incremental -f .
> Now Check Solr index searching for artificial and you will find the page 
> pointed to in urls.
> Now change plugin-includes in conf/nutch-default to use protocol-httpclient 
> instead of protocol-http and re-run the script. No more results in solr. Try 
> again with http and the results return.
> [1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
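An aside on the reproduction steps quoted above: the xmlstarlet edit to conf/nutch-default.xml can equally be expressed as an override in conf/nutch-site.xml, which is the conventional place for local settings. A minimal sketch (the agent name 'test' matches the steps above):

    <?xml version="1.0"?>
    <!-- conf/nutch-site.xml: local overrides of conf/nutch-default.xml -->
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>test</value>
      </property>
    </configuration>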




Re: Why URLNormalizer doesn't implement the Pluggable?

2011-08-26 Thread Julien Nioche
Resending your messages every hour won't get you more answers - quite the
opposite.

On 26 August 2011 09:28, Kaiwii Ho  wrote:

>
> I'm a freshman learning about Nutch.
> Here I have several questions:
> 1. URLNormalizer is a kind of ExtensionPoint. But why doesn't it implement
> Pluggable as the other extension points do? And furthermore, is there any
> difference between URLNormalizer and the other ExtensionPoints that leads
> to URLNormalizer not implementing Pluggable?
> 2. While URLNormalizer does not implement Pluggable, everything seems to
> work well. So I wonder whether it is necessary to make an extension point
> implement Pluggable at all?
>
> Is it a bug or a deliberate trick?
>
> Waiting for your answer, and thank you!
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] [Reopened] (NUTCH-990) protocol-httpclient fails with short pages

2011-08-26 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reopened NUTCH-990:
-


> protocol-httpclient fails with short pages
> --
>
> Key: NUTCH-990
> URL: https://issues.apache.org/jira/browse/NUTCH-990
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Reporter: Gabriele Kahlout
>Priority: Minor
> Fix For: 1.3
>
> Attachments: hadoop.log
>
>
> Using protocol-http with a few words html pages works fine. But with 
> protocol-httpclient the same pages disappear from the index, although they 
> are still fetched.
> Those small files are useful for quick testing. 
> Steps to reproduce:
> $ svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 nutch-1.3
> Checked out revision 1097214.
> $ cd nutch-1.3
> $ xmlstarlet edit -L -u 
> "/configuration/property[name='http.agent.name']"/value -v 'test' 
> conf/nutch-default.xml
> $ ant
> Download to runtime/local the following script and seeds list file. They 
> assume a $HADOOP_HOME environment variable. It's a 1.3 adaptation of [1].
> http://dp4j.sf.net/debug/whole-web-crawling-incremental
> http://dp4j.sf.net/debug/urls
> $ cd runtime/local
> This will empty your Solr index (-f) and crawl:
> $ ./whole-web-crawling-incremental -f .
> Now Check Solr index searching for artificial and you will find the page 
> pointed to in urls.
> Now change plugin-includes in conf/nutch-default to use protocol-httpclient 
> instead of protocol-http and re-run the script. No more results in solr. Try 
> again with http and the results return.
> [1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-990) protocol-httpclient fails with short pages

2011-08-26 Thread Stephan Grotz (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091682#comment-13091682
 ] 

Stephan Grotz commented on NUTCH-990:
-

Same here - I've been trying to fetch https pages through protocol-httpclient 
but am getting the same error. If that library has a lot of underlying issues, 
which alternative library should we use? Any recommendation appreciated :)

> protocol-httpclient fails with short pages
> --
>
> Key: NUTCH-990
> URL: https://issues.apache.org/jira/browse/NUTCH-990
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Reporter: Gabriele Kahlout
>Priority: Minor
> Fix For: 1.3
>
> Attachments: hadoop.log
>
>
> Using protocol-http with a few words html pages works fine. But with 
> protocol-httpclient the same pages disappear from the index, although they 
> are still fetched.
> Those small files are useful for quick testing. 
> Steps to reproduce:
> $ svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 nutch-1.3
> Checked out revision 1097214.
> $ cd nutch-1.3
> $ xmlstarlet edit -L -u 
> "/configuration/property[name='http.agent.name']"/value -v 'test' 
> conf/nutch-default.xml
> $ ant
> Download to runtime/local the following script and seeds list file. They 
> assume a $HADOOP_HOME environment variable. It's a 1.3 adaptation of [1].
> http://dp4j.sf.net/debug/whole-web-crawling-incremental
> http://dp4j.sf.net/debug/urls
> $ cd runtime/local
> This will empty your Solr index (-f) and crawl:
> $ ./whole-web-crawling-incremental -f .
> Now Check Solr index searching for artificial and you will find the page 
> pointed to in urls.
> Now change plugin-includes in conf/nutch-default to use protocol-httpclient 
> instead of protocol-http and re-run the script. No more results in solr. Try 
> again with http and the results return.
> [1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




Are there any tutorial for writing regex-normalize.xml?

2011-08-26 Thread Kaiwii Ho
I'm going to write my own regex-normalize.xml. Are there any tutorials for
writing regex-normalize.xml?
Waiting for your help, and thank you!
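For anyone else looking: there is no formal tutorial beyond the comments in conf/regex-normalize.xml itself, but the format is small. The file used by the urlnormalizer-regex plugin is a regex-normalize root element containing regex entries, each with a Java-regular-expression pattern and a substitution that may reference capture groups. A minimal sketch; the session-id rule is illustrative only, adapt the pattern to your own URLs:

    <?xml version="1.0"?>
    <!-- Sketch of a regex-normalize.xml for urlnormalizer-regex. Rules are
         applied in order; this one strips a ;jsessionid=... path parameter. -->
    <regex-normalize>
      <regex>
        <pattern>(?i);jsessionid=[^?#&amp;]*</pattern>
        <substitution></substitution>
      </regex>
    </regex-normalize>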


Why URLNormalizer doesn't implement the Pluggable?

2011-08-26 Thread Kaiwii Ho
I'm a freshman learning about Nutch.
Here I have several questions:
1. URLNormalizer is a kind of ExtensionPoint. But why doesn't it implement
Pluggable as the other extension points do? And furthermore, is there any
difference between URLNormalizer and the other ExtensionPoints that leads
to URLNormalizer not implementing Pluggable?
2. While URLNormalizer does not implement Pluggable, everything seems to
work well. So I wonder whether it is necessary to make an extension point
implement Pluggable at all?

Is it a bug or a deliberate trick?

Waiting for your answer, and thank you!
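For readers following this thread, the normalizer itself is small whatever the answer about Pluggable turns out to be. Below is a minimal sketch mirroring the pass-through normalizer (urlnormalizer-pass) that appears in the checkout listing at the top of this digest; the interface signature shown is the Nutch 1.x org.apache.nutch.net.URLNormalizer as I recall it, and the class name is illustrative.

    import java.net.MalformedURLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLNormalizer;

    // Pass-through normalizer sketch: returns every URL unchanged. A real
    // normalizer would rewrite urlString here, possibly per scope.
    public class ExampleURLNormalizer implements URLNormalizer {

      private Configuration conf;

      public String normalize(String urlString, String scope)
          throws MalformedURLException {
        return urlString;
      }

      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      public Configuration getConf() {
        return conf;
      }
    }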