Build failed in Jenkins: Nutch-trunk #1484

2011-05-11 Thread Apache Jenkins Server
See 

--
[...truncated 1018 lines...]
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A src/plugin/parse-html/src/test/org/apache/nutch/parse/html
A src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestRobotsMetaProcessor.java
A src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
A src/plugin/parse-html/src/java
A src/plugin/parse-html/src/java/org
A src/plugin/parse-html/src/java/org/apache
A src/plugin/parse-html/src/java/org/apache/nutch
A src/plugin/parse-html/src/java/org/apache/nutch/parse
A src/plugin/parse-html/src/java/org/apache/nutch/parse/html
A src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
A src/plugin/parse-html/src/java/org/apache/nutch/parse/html/XMLCharacterRecognizer.java
A src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMBuilder.java
A src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
A src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HTMLMetaProcessor.java
A src/plugin/parse-html/src/java/org/apache/nutch/parse/html/package.html
AU src/plugin/parse-html/plugin.xml
AU src/plugin/parse-html/build.xml
A src/plugin/urlfilter-domain
A src/plugin/urlfilter-domain/ivy.xml
A src/plugin/urlfilter-domain/src
A src/plugin/urlfilter-domain/src/test
A src/plugin/urlfilter-domain/src/test/org
A src/plugin/urlfilter-domain/src/test/org/apache
A src/plugin/urlfilter-domain/src/test/org/apache/nutch
A src/plugin/urlfilter-domain/src/test/org/apache/nutch/urlfilter
A src/plugin/urlfilter-domain/src/test/org/apache/nutch/urlfilter/domain
A src/plugin/urlfilter-domain/src/test/org/apache/nutch/urlfilter/domain/TestDomainURLFilter.java
A src/plugin/urlfilter-domain/src/java
A src/plugin/urlfilter-domain/src/java/org
A src/plugin/urlfilter-domain/src/java/org/apache
A src/plugin/urlfilter-domain/src/java/org/apache/nutch
A src/plug

[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2011-05-11 Thread Gabriele Kahlout (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriele Kahlout updated NUTCH-961:
---

Attachment: NUTCH-961-1.3-tikaparser1.patch

Modified to also include the necessary changes to parse-plugins.xml.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
> Fix For: 2.0
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, 
> NUTCH-961-1.3-tikaparser1.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2011-05-11 Thread Gabriele Kahlout (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriele Kahlout updated NUTCH-961:
---

Attachment: NUTCH-961-1.3-tikaparser1.patch

Same as NUTCH-961-1.3-tikaparser.patch by Markus, but adds the necessary 
configuration to nutch-default.xml (not nutch-site.xml!), as discussed on the 
mailing list or privately some time ago.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
> Fix For: 2.0
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.



[jira] [Commented] (NUTCH-937) When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins because MapReduce will not unpack plugin/ directory from the job's pack (due to MAPREDUCE-967)

2011-05-11 Thread Viksit Gaur (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031939#comment-13031939
 ] 

Viksit Gaur commented on NUTCH-937:
---

A workaround for this is:

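- Set the following in nutch-site.xml:

<property>
  <name>mapreduce.job.jar.unpack.pattern</name>
  <value>(?:classes/|lib/|plugins/).*</value>
</property>

<property>
  <name>plugin.folders</name>
  <value>${job.local.dir}/../jars/plugins</value>
</property>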

- Recreate the nutch job file using ant.

This error will now vanish, but I'm guessing people will hit NUTCH-993 
afterwards.
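
To illustrate why the extra `plugins/` alternative matters, here is a minimal Java sketch that checks job-file entry names against the unpack pattern. It assumes Hadoop applies the pattern with a full match on each entry name, and that the default value is `(?:classes/|lib/).*`; neither is stated in this thread. The class name and the entry names are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

public class UnpackPatternDemo {
    // Assumed Hadoop default for mapreduce.job.jar.unpack.pattern (not stated in the thread).
    static final Pattern ASSUMED_DEFAULT = Pattern.compile("(?:classes/|lib/).*");
    // Pattern from the workaround in nutch-site.xml.
    static final Pattern WORKAROUND = Pattern.compile("(?:classes/|lib/|plugins/).*");

    /** Returns true when the whole entry name matches, i.e. the entry would be unpacked. */
    static boolean isUnpacked(Pattern pattern, String entryName) {
        return pattern.matcher(entryName).matches();
    }

    public static void main(String[] args) {
        // Hypothetical job-file entries, chosen only for illustration.
        List<String> entries = List.of(
            "classes/org/apache/nutch/crawl/Injector.class",
            "lib/example.jar",
            "plugins/urlnormalizer-pass/plugin.xml",
            "nutch-default.xml");
        for (String entry : entries) {
            System.out.printf("%-50s default=%b workaround=%b%n",
                entry, isUnpacked(ASSUMED_DEFAULT, entry), isUnpacked(WORKAROUND, entry));
        }
    }
}
```

Under these assumptions, only the workaround pattern accepts entries below `plugins/`, which is what lets `plugin.folders` point at the unpacked directory.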

> When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins 
> because MapReduce will not unpack plugin/ directory from the job's pack (due 
> to MAPREDUCE-967)
> -
>
> Key: NUTCH-937
> URL: https://issues.apache.org/jira/browse/NUTCH-937
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.2
> Environment: hadoop 0.21 or cloudera hadoop 0.20.2+737
>Reporter: Claudio Martella
>
> Jobs running on hadoop 0.21 or cloudera cdh 0.20.2+737 will fail because 
> of missing plugins (e.g.):
> 10/10/28 12:22:21 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the same.
> 10/10/28 12:22:22 INFO mapred.FileInputFormat: Total input paths to
> process : 1
> 10/10/28 12:22:23 INFO mapred.JobClient: Running job: job_201010271826_0002
> 10/10/28 12:22:24 INFO mapred.JobClient:  map 0% reduce 0%
> 10/10/28 12:22:39 INFO mapred.JobClient: Task Id :
> attempt_201010271826_0002_m_00_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:379)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:317)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> at org.apache.hadoop.mapred.Child.main(Child.java:211)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 9 more
> Caused by: java.lang.RuntimeException: Error in configuring object
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> ... 14 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 17 more
> Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
> at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
> at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:70)
> ... 22 more
> 10/10/28 12:22:40 INFO mapred.JobClient: Task Id :
> attempt_201010271826_0002_m_01_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:379)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:317)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
>

[jira] [Commented] (NUTCH-985) MoreIndexingFilter doesn't use properly formatted date fields for Solr

2011-05-11 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031682#comment-13031682
 ] 

Markus Jelsma commented on NUTCH-985:
-

From dev@nutch:
> For now a quick fix for the moreindexingfilter would be OK, but we can
> maybe create a new issue for 1.4 and rely on Date objects everywhere then
> format it properly in the SOLRWriter. We could of course do the latter now,
> but since I have no time to do it in the short term and don't want to twist
> your arm I'll let you decide

In that case I'll go with this quicker fix: test and commit it, then open a 
new issue for trunk and 1.4 once that version is added.

> MoreIndexingFilter doesn't use properly formatted date fields for Solr
> --
>
> Key: NUTCH-985
> URL: https://issues.apache.org/jira/browse/NUTCH-985
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.3, 2.0
>Reporter: Dietrich Schmidt
>Assignee: Markus Jelsma
> Fix For: 1.3, 2.0
>
> Attachments: NUTCH-985-trunk-1.patch, NUTCH-985.1.3-1.patch, 
> indexlastmodifieddate.jar
>
>
> I am using the index-more plugin to parse the lastModified data in web
> pages in order to store it in a Solr data field.
> In solrindex-mapping.xml I am mapping lastModified to a field "changed" in 
> Solr:
> 
> However, when posting data to Solr the SolrIndexer posts it as a long,
> not as a date:
> <field name="changed">107932680</field>
> <field name="tstamp">20110414144140188</field>
> <field name="date">20040315</field>
> Solr rejects the data because of the improper data type.
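
As a rough illustration of the conversion at stake, the sketch below formats a long of milliseconds since the epoch as an ISO 8601 UTC string, the form Solr date fields expect. This is not the attached patch: the class and method names are hypothetical, and it assumes the long holds epoch milliseconds.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class SolrDateDemo {
    // ISO 8601 in UTC with a literal 'Z' suffix; the sub-second part is dropped.
    private static final DateTimeFormatter SOLR_DATE =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'").withZone(ZoneOffset.UTC);

    /** Formats milliseconds since the epoch (an assumption about the input) as a date string. */
    static String toSolrDate(long epochMillis) {
        return SOLR_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate(0L)); // prints 1970-01-01T00:00:00Z
    }
}
```

Formatting at the point where the document is written keeps the rest of the pipeline working with plain longs, which is the quick-fix approach discussed in the comment above.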



Re: Update schema to get solrdedup working again

2011-05-11 Thread Julien Nioche
Resending to dev@nutch - had sent to markus only


>
>> We still need to do
>> something about the moreindexing filter.
>>
>> https://issues.apache.org/jira/browse/NUTCH-985
>>
>
> For now a quick fix for the moreindexingfilter would be OK, but we can
> maybe create a new issue for 1.4 and rely on Date objects everywhere then
> format it properly in the SOLRWriter. We could of course do the latter now,
> but since I have no time to do it in the short term and don't want to twist
> your arm I'll let you decide
>
>
>
>>
>> On Thursday 05 May 2011 15:34:56 Julien Nioche wrote:
>> > Hi Markus,
>> >
>> > Sorry for the late reply. Definitely +1 to change to Date in the schema, it
>> > is the right thing to do and it's also the right time to do it
>> >
>> > Thanks
>> >
>> > Julien
>> >
>> > On 28 April 2011 12:43, Markus Jelsma wrote:
>> > > Hi devs,
>> > >
>> > > The Solr schema must be updated as well to get dedup to work in 1.3. This
>> > > is because in December last year index-basic seems to have been updated to
>> > > write properly formatted dates to Solr but the schema field was still a long.
>> > >
>> > > Somehow Solr accepted (this is a bug) the input but cannot cope with the
>> > > output, nor could Nutch convert the date to the internally used long
>> > > (which it now can). The remaining issue is to update the field to use date
>> > > instead of long. But this will break existing Solr setups for sure because
>> > > of field incompatibility.
>> > >
>> > > I propose to update the field, regardless of current Solr setups, because
>> > > of the assumption that 1) an index can always be recreated from segments
>> > > and 2) the current indexer assumes the Solr bug remains in 3.1 and higher
>> > > as well.
>> > >
>> > > I haven't tested it with 3.1 but the bug is in 1.4.1 for sure.
>> > >
>> > > Thoughts?
>> > >
>> > > Cheers,
>> > > --
>> > > Markus Jelsma - CTO - Openindex
>> > > http://www.linkedin.com/in/markus17
>> > > 050-8536620 / 06-50258350
>>
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>


