Build failed in Jenkins: Nutch-trunk #1530

2011-06-28 Thread Apache Jenkins Server
See 

Changes:

[markus] NUTCH-1012 Cannot handle illegal charset

--
[...truncated 936 lines...]
A src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api
AU src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java
A src/plugin/lib-regex-filter/src/java
A src/plugin/lib-regex-filter/src/java/org
A src/plugin/lib-regex-filter/src/java/org/apache
A src/plugin/lib-regex-filter/src/java/org/apache/nutch
A src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter
A src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api
AU src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java
AU src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
AU src/plugin/lib-regex-filter/plugin.xml
AU src/plugin/lib-regex-filter/build.xml
A src/plugin/feed
A src/plugin/feed/sample
A src/plugin/feed/sample/rsstest.rss
A src/plugin/feed/ivy.xml
A src/plugin/feed/src
A src/plugin/feed/src/test
A src/plugin/feed/src/test/org
A src/plugin/feed/src/test/org/apache
A src/plugin/feed/src/test/org/apache/nutch
A src/plugin/feed/src/test/org/apache/nutch/parse
A src/plugin/feed/src/test/org/apache/nutch/parse/feed
A src/plugin/feed/src/test/org/apache/nutch/parse/feed/TestFeedParser.java
A src/plugin/feed/src/java
A src/plugin/feed/src/java/org
A src/plugin/feed/src/java/org/apache
A src/plugin/feed/src/java/org/apache/nutch
A src/plugin/feed/src/java/org/apache/nutch/parse
A src/plugin/feed/src/java/org/apache/nutch/parse/feed
A src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java
A src/plugin/feed/src/java/org/apache/nutch/indexer
A src/plugin/feed/src/java/org/apache/nutch/indexer/feed
A src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java
A src/plugin/feed/plugin.xml
A src/plugin/feed/build.xml
A src/plugin/subcollection
A src/plugin/subcollection/ivy.xml
A src/plugin/subcollection/src
A src/plugin/subcollection/src/test
A src/plugin/subcollection/src/test/org
A src/plugin/subcollection/src/test/org/apache
A src/plugin/subcollection/src/test/org/apache/nutch
A src/plugin/subcollection/src/test/org/apache/nutch/collection
A src/plugin/subcollection/src/test/org/apache/nutch/collection/TestSubcollection.java
A src/plugin/subcollection/src/java
A src/plugin/subcollection/src/java/org
A src/plugin/subcollection/src/java/org/apache
A src/plugin/subcollection/src/java/org/apache/nutch
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plug

[jira] [Commented] (NUTCH-888) Remove parse-rss

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056993#comment-13056993
 ] 

Hudson commented on NUTCH-888:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Remove parse-rss
> 
>
> Key: NUTCH-888
> URL: https://issues.apache.org/jira/browse/NUTCH-888
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.3, 2.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.3, 2.0
>
>
> See https://issues.apache.org/jira/browse/NUTCH-887
> {quote}
> CM : I wrote parse-rss back in 2005, and used commons-feedparser from Kevin 
> Burton and his crew. At the time it was well developed, and a little more 
> flexible and easier for me to pick up than Rome. Since then however, its 
> development has really become stagnant and it is no longer maintained.
> In terms of functionality, the two are roughly equivalent, so there isn't 
> much difference.
> {quote}
> Already +1 from Andrzej and Chris. Will remove it tomorrow if there aren't 
> any objections in the meantime 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-991) SolrDedup must issue a commit

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056991#comment-13056991
 ] 

Hudson commented on NUTCH-991:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> SolrDedup must issue a commit
> -
>
> Key: NUTCH-991
> URL: https://issues.apache.org/jira/browse/NUTCH-991
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.3, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.3, 2.0
>
> Attachments: NUTCH-991-1.3-1.patch, NUTCH-991-trunk-1.patch
>
>
> Title says it all. SolrDedup job doesn't commit but it should.





[jira] [Commented] (NUTCH-983) Upgrade SolrJ

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056989#comment-13056989
 ] 

Hudson commented on NUTCH-983:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Upgrade SolrJ
> -
>
> Key: NUTCH-983
> URL: https://issues.apache.org/jira/browse/NUTCH-983
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.3, 2.0
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 1.3, 2.0
>
>
> Solr 3.1 was released a while ago. The Javabin format changed between 1.4.1 
> and 3.1, so our SolrJ 1.4.1 cannot send documents to 3.1. Since Nutch 2.0 
> won't be released within a short period, I believe it would be a good idea 
> to upgrade our SolrJ to 3.1. New Solr users are encouraged to use Solr 3.1 or 
> upgrade, so I expect more users will want to use 3.1 as well. Any thoughts?





[jira] [Commented] (NUTCH-999) Normalise String representation for Dates in IndexingFilters

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056988#comment-13056988
 ] 

Hudson commented on NUTCH-999:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Normalise String representation for Dates in IndexingFilters
> 
>
> Key: NUTCH-999
> URL: https://issues.apache.org/jira/browse/NUTCH-999
> Project: Nutch
>  Issue Type: Task
>  Components: indexer
>Affects Versions: 2.0
>Reporter: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-999.patch
>
>
> NUTCH-997 has been applied to Nutch-1.3 so that various indexing filters 
> store Date objects as field values. However, in trunk NutchDocuments can 
> have only String values, which means that we will have to convert the Dates to 
> Strings in each indexing filter.
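
The conversion described in this issue could look roughly like the sketch below. It is only an illustration under stated assumptions: `DateFieldFormatter` and `toIndexString` are hypothetical names, not part of Nutch, and the ISO-8601 UTC output format is an assumption rather than the format chosen in the attached patch. It also relies on `java.time`, which is newer than the JDKs in use when this message was written.

```java
import java.time.format.DateTimeFormatter;
import java.util.Date;

// Hypothetical helper, not a Nutch class: turns a Date into a String field value.
final class DateFieldFormatter {
    private DateFieldFormatter() {}

    /** Formats the date as an ISO-8601 instant in UTC (assumed target format). */
    static String toIndexString(Date date) {
        return DateTimeFormatter.ISO_INSTANT.format(date.toInstant());
    }

    public static void main(String[] args) {
        // The epoch is used as a fixed, easily checked input.
        System.out.println(toIndexString(new Date(0L))); // prints 1970-01-01T00:00:00Z
    }
}
```

Keeping the conversion in one helper would let every indexing filter emit the same String form.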





[jira] [Commented] (NUTCH-1006) meta equiv with single quotes not accepted

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056990#comment-13056990
 ] 

Hudson commented on NUTCH-1006:
---

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> meta equiv with single quotes not accepted
> --
>
> Key: NUTCH-1006
> URL: https://issues.apache.org/jira/browse/NUTCH-1006
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2, 1.3, 1.4, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1006-104.patch, NUTCH-1006-2.0.patch
>
>
> As posted by Alex F:
> the regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is not
> suitable for sites using single quotes for 
>   Example: 
>   We experienced a couple of pages with that kind of quotes and Nutch-1.2
> was not able to handle it.
> Is there any fallback or would it be good to use the following
> regex: "]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>" (single
> or regular quotes are accepted)?
> See this thread:
> http://lucene.472066.n3.nabble.com/Character-encoding-on-Html-Pages-td3034850.html
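
A minimal sketch of a quote-tolerant match follows. The pattern quoted in the message above appears to have lost its leading part in this archive, so the pattern below is an assumption written for illustration, not the regex from `HtmlParser` or from the attached patches; `MetaContentTypeMatcher` and `declaresContentType` are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class.
final class MetaContentTypeMatcher {
    private MetaContentTypeMatcher() {}

    // Assumed pattern: the http-equiv value may be wrapped in double quotes,
    // single quotes, or no quotes at all. Matching is case-insensitive.
    static final Pattern META = Pattern.compile(
        "<meta\\s+[^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*>",
        Pattern.CASE_INSENSITIVE);

    /** Returns true if the HTML contains a content-type meta declaration. */
    static boolean declaresContentType(String html) {
        return META.matcher(html).find();
    }

    public static void main(String[] args) {
        String single = "<meta http-equiv='Content-Type' content='text/html; charset=utf-8'>";
        String regular = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        System.out.println(declaresContentType(single));
        System.out.println(declaresContentType(regular));
    }
}
```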





[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056992#comment-13056992
 ] 

Hudson commented on NUTCH-967:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Upgrade to Tika 0.9
> ---
>
> Key: NUTCH-967
> URL: https://issues.apache.org/jira/browse/NUTCH-967
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.3, 2.0
>Reporter: Markus Jelsma
>Assignee: Julien Nioche
> Fix For: 1.3, 2.0
>
> Attachments: NUTCH-967-1.3-2.patch, NUTCH-967-1.3-3.patch, 
> NUTCH-967-1.3.patch
>
>






[jira] [Commented] (NUTCH-1010) ContentLength not trimmed

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056987#comment-13056987
 ] 

Hudson commented on NUTCH-1010:
---

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> ContentLength not trimmed
> -
>
> Key: NUTCH-1010
> URL: https://issues.apache.org/jira/browse/NUTCH-1010
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.3, 1.4
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1010-1.4.patch, NUTCH-1010-2.0.patch
>
>
> Somewhere in some component the ContentLength field is not trimmed. This 
> allows a seemingly numeric field to be treated as a string by the indexer 
> when it carries leading or trailing whitespace. The result is a 
> hard-to-debug exception with no way to identify the bad document (amongst 
> thousands) or the bad field.
> {code}
> Jun 22, 2011 1:03:42 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.NumberFormatException: For input string: "32717 "
> at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
> at java.lang.Long.parseLong(Long.java:419)
> at java.lang.Long.parseLong(Long.java:468)
> {code}
> This can be quickly fixed in the index-more plugin by simply calling trim() 
> when adding the field.
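
The failure and the proposed fix can be reproduced in isolation. The sketch below is not the index-more code: `ContentLengthField` and `normalize` are hypothetical names used only to show that the untrimmed value from the stack trace is rejected by `Long.parseLong` while the trimmed value parses.

```java
// Hypothetical helper, not a Nutch class.
final class ContentLengthField {
    private ContentLengthField() {}

    /** Strips surrounding whitespace so the value can be parsed as a number. */
    static String normalize(String rawContentLength) {
        return rawContentLength == null ? null : rawContentLength.trim();
    }

    public static void main(String[] args) {
        String raw = "32717 "; // value taken from the stack trace above
        try {
            Long.parseLong(raw);
        } catch (NumberFormatException e) {
            System.out.println("untrimmed value rejected: " + e.getMessage());
        }
        System.out.println("trimmed value parses as " + Long.parseLong(normalize(raw)));
    }
}
```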





[jira] [Commented] (NUTCH-995) Generate POM file using the Ivy makepom task

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056985#comment-13056985
 ] 

Hudson commented on NUTCH-995:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Generate POM file using the Ivy makepom task 
> -
>
> Key: NUTCH-995
> URL: https://issues.apache.org/jira/browse/NUTCH-995
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.3
>
> Attachments: NUTCH-955-1.3.patch, NUTCH-997.branch-1.3.v2.patch, 
> mvn-template-build.patch
>
>
> We currently have a pom.xml file in the SVN repository and use it for 
> publishing our artefacts. The trouble with this is that we need to keep its 
> content in sync with our ivy file. Instead we could use the makepom task 
> (http://ant.apache.org/ivy/history/2.2.0/use/makepom.html) to generate the 
> pom.xml automatically.
> The existing pom.xml for 1.3 needs fixing anyway as it declares dependencies 
> to GORA and has the wrong versions for some dependencies.





[jira] [Commented] (NUTCH-989) index-basic plugin doesn't use Solr date fieldType

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056982#comment-13056982
 ] 

Hudson commented on NUTCH-989:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> index-basic plugin doesn't use Solr date fieldType
> --
>
> Key: NUTCH-989
> URL: https://issues.apache.org/jira/browse/NUTCH-989
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.3, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.3, 2.0
>
>
> The index-basic plugin actually sends over a properly formatted date with 
> millis but the schema isn't configured to use the dateField.





[jira] [Commented] (NUTCH-986) Dedup fails due to date format (long)

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056984#comment-13056984
 ] 

Hudson commented on NUTCH-986:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Dedup fails due to date format (long)
> -
>
> Key: NUTCH-986
> URL: https://issues.apache.org/jira/browse/NUTCH-986
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.3, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.3, 2.0
>
> Attachments: NUTCH-986-1.3-1.patch, NUTCH-986-1.3-2.patch, 
> NUTCH-986-trunk-1.patch, NUTCH-986-trunk-2.patch
>
>
> As already mentioned on the list, dedup also fails because of invalid date 
> formats.
> Apr 19, 2011 10:34:50 AM 
> org.apache.solr.request.BinaryResponseWriter$Resolver 
> getDoc
> WARNING: Error reading a field from document : 
> SolrDocument[{digest=7ff92a31c58e43a34fd45bc6d87cda03}]
> java.lang.NumberFormatException: For input string: "2011-04-19T08:16:31.675Z"
> at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
> at java.lang.Long.parseLong(Long.java:419)
> at java.lang.Long.valueOf(Long.java:525)
> at org.apache.solr.schema.LongField.toObject(LongField.java:82)
> 
> Strangely enough, Solr seems to allow updates of long fields with a formatted 
> date. In Nutch 1.2 the tstamp field is actually a long but in 1.3 the field 
> is a valid Solr date format. This exception is only triggered using the 
> javabin response writer so there's something weird in Solr too.
> We need to either change the tstamp field back to a long or update the Solr 
> example schema and fix SolrDeleteDuplicates to use the formatted date instead 
> of the long.
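
The mismatch between the two tstamp representations can be shown in a few lines. This is a sketch only: `TstampParser` and `toEpochMillis` are hypothetical names, the tolerant fallback is an illustration rather than either of the two fixes proposed above, and `java.time` is newer than the JDKs in use when this message was written.

```java
import java.time.Instant;

// Hypothetical helper, not part of Nutch or Solr.
final class TstampParser {
    private TstampParser() {}

    /** Accepts either epoch milliseconds (1.2 style) or an ISO-8601 instant (1.3 style). */
    static long toEpochMillis(String tstamp) {
        try {
            return Long.parseLong(tstamp);
        } catch (NumberFormatException e) {
            // A formatted date such as the one in the log above lands here.
            return Instant.parse(tstamp).toEpochMilli();
        }
    }

    public static void main(String[] args) {
        String formatted = "2011-04-19T08:16:31.675Z"; // value from the log above
        long millis = toEpochMillis(formatted);
        // Converting back should reproduce the original string.
        System.out.println(Instant.ofEpochMilli(millis));
    }
}
```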





[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056986#comment-13056986
 ] 

Hudson commented on NUTCH-1012:
---

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])
NUTCH-1012 Cannot handle illegal charset

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1140696
Files : 
* /nutch/trunk/src/java/org/apache/nutch/util/EncodingDetector.java
* /nutch/trunk/CHANGES.txt


> Cannot handle illegal charset $charset
> --
>
> Key: NUTCH-1012
> URL: https://issues.apache.org/jira/browse/NUTCH-1012
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1012-1.4.patch
>
>
> Pages returning:
> {code}
> Content-Type: text/html; charset=$charset
> {code}
> cause:
> {code}
> Error parsing: http://host/: failed(2,200): 
> java.nio.charset.IllegalCharsetNameException: $charset
> Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: 
> Followed by 3999
> ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
> {code}
> Stack trace:
> {code}
> 2011-06-24 01:14:23,442 WARN  parse.html - 
> java.nio.charset.IllegalCharsetNameException: $charset
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.checkName(Charset.java:284)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.lookup2(Charset.java:458)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.lookup(Charset.java:437)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.isSupported(Charset.java:479)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> java.lang.Thread.run(Thread.java:662)
> 2011-06-24 01:14:23,443 WARN  parse.ParseSegment - Error parsing: 
> http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
> {code}
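
The stack trace shows `Charset.isSupported` throwing rather than returning false, because `$charset` is not a legal charset name. One way to guard such a lookup is sketched below. It is not the change committed to `EncodingDetector`; `SafeCharsetCheck` and `isUsable` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not a Nutch class.
final class SafeCharsetCheck {
    private SafeCharsetCheck() {}

    /** Returns true only for a legal charset name that this JVM supports. */
    static boolean isUsable(String name) {
        if (name == null) {
            return false; // Charset.isSupported(null) would throw IllegalArgumentException
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // Illegal names such as "$charset" are treated as unusable.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isUsable("$charset"));
        System.out.println(isUsable("UTF-8"));
    }
}
```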





[jira] [Commented] (NUTCH-994) Fine tune Solr schema

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056983#comment-13056983
 ] 

Hudson commented on NUTCH-994:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Fine tune Solr schema
> -
>
> Key: NUTCH-994
> URL: https://issues.apache.org/jira/browse/NUTCH-994
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.3, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.3, 2.0
>
> Attachments: NUTCH-994-all.patch
>
>
> The supplied schema is old and doesn't use more advanced fieldTypes such as 
> Trie based (since Solr 1.4) and perhaps other improvements. We need to fine 
> tune the schema.





[Nutch Wiki] Trivial Update of "bin/nutch_readdb" by LewisJohnMcgibbney

2011-06-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "bin/nutch_readdb" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_readdb?action=diff&rev1=9&rev2=10

Comment:
trivial formatting

  Readdb is an alias for org.apache.nutch.crawl.CrawlDbReader
  
- The CrawlDbReader implements all the read-only parts of accessing our web 
database. It provides us with a read utility for the CrawlDB.
+ The CrawlDbReader implements all the read-only parts of accessing our web 
database. It provides us with a read utility for the crawldb.
  
  Usage: 
  


[Nutch Wiki] Update of "bin/nutch_readdb" by LewisJohnMcgibbney

2011-06-28 Thread Apache Wiki
The "bin/nutch_readdb" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_readdb?action=diff&rev1=8&rev2=9

Comment:
Update to reflect Nutch 1.3 API

  The CrawlDbReader implements all the read-only parts of accessing our web 
database. It provides us with a read utility for the CrawlDB.
  
  Usage: 
+ 
  {{{
- bin/nutch org.apache.nutch.crawl.CrawlDbReader (-local | -ndfs 
)  [-pageurl url] | [-pagemd5 md5] | [-dumppageurl] | 
[-dumppagemd5] | [-toppages ] | [-linkurl url] | [-linkmd5 md5] | 
[-dumplinks] | [-stats]
+ bin/nutch org.apache.nutch.crawl.CrawlDbReader  (-stats | -dump 
 | -topN   [] | -url )
- }}}
+ }}} 
  
- '''(-local | -ndfs )''':
+ '': The location of the crawldb directory we wish to read and 
obtain information from.
  
- '':
+ '''-stats''': This prints the overall statistics to System.out.
  
- '''[-pageurl url]''':
+ '''-dump ''': Enables us to dump the whole crawldb to a text file in 
any  we wish to specify.
  
- '''[-pagemd5 md5]''':
+ '''-topN   []''': This dumps the top  urls sorted 
by score relevance to any  we wish to specify. If the [] 
parameter is passed in the command the reader will skip records with scores 
below this particular value. This can significantly improve retrieval 
performance of statistics or crawldb dump results.
  
- '''[-dumppageurl]''':
+ '''-url ''': This simply prints information of any particular  to 
System.out.
  
- '''[-dumppagemd5]''':
  
- '''[-toppages ]''':
- 
- '''[-linkurl url]''':
- 
- '''[-linkmd5 md5]''':
- 
- '''[-dumplinks]''':
- 
- '''[-stats]''':
  
  CommandLineOptions
  


[Nutch Wiki] Trivial Update of "bin/nutch readlinkdb" by LewisJohnMcgibbney

2011-06-28 Thread Apache Wiki
The "bin/nutch readlinkdb" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch%20readlinkdb?action=diff&rev1=1&rev2=2

Comment:
formatting

  Readlinkdb is an alias for org.apache.nutch.crawl.LinkDbReader
  
  This reader class enables us to obtain various information from within a 
linkdb. The two types of information we can retrieve are
- '''i.''' A dump of the whole linkdb which is then written to a text file for 
easy viewing.
+  * A dump of the whole linkdb which is then written to a text file for easy 
viewing.
- '''ii.''' Specific information relating to a specific URL. 
+  * Specific information relating to a specific URL. 
+ 
  /!\ :TODO: More could be added to the above e.g. what is the nature and 
structure of the information we retrieve from a dump of the linkdb and a 
specific URL. /!\ 
+ 
  Usage: 
  {{{
  bin/nutch Usage: LinkDbReader  (-dump  | -url )


[Nutch Wiki] Update of "bin/nutch readlinkdb" by LewisJohnMcgibbney

2011-06-28 Thread Apache Wiki
The "bin/nutch readlinkdb" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch%20readlinkdb

Comment:
Update to reflect Nutch 1.3 API

New page:
Readlinkdb is an alias for org.apache.nutch.crawl.LinkDbReader

This reader class enables us to obtain various information from within a 
linkdb. The two types of information we can retrieve are
'''i.''' A dump of the whole linkdb which is then written to a text file for 
easy viewing.
'''ii.''' Specific information relating to a specific URL. 
/!\ :TODO: More could be added to the above e.g. what is the nature and 
structure of the information we retrieve from a dump of the linkdb and a 
specific URL. /!\ 
Usage: 
{{{
bin/nutch Usage: LinkDbReader  (-dump  | -url )
}}}

'': This is the linkdb directory we wish to read and obtain 
information from.


'''-dump ''': This parameter dumps the whole linkdb to a text file in 
any  we wish to specify.


'''-url ''': The -url argument provides us with information about a 
specific . This is written to System.out.



CommandLineOptions


[jira] [Commented] (NUTCH-1023) Trivial error in error message for org.apache.nutch.crawl.LinkDbReader

2011-06-28 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056750#comment-13056750
 ] 

Lewis John McGibbney commented on NUTCH-1023:
-

I will submit a patch in a week or so.

> Trivial error in error message for org.apache.nutch.crawl.LinkDbReader
> --
>
> Key: NUTCH-1023
> URL: https://issues.apache.org/jira/browse/NUTCH-1023
> Project: Nutch
>  Issue Type: Improvement
>  Components: linkdb
>Affects Versions: 1.3
>Reporter: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.4, 2.0
>
>
> The following line in the above class has a trivial error in syntax before 
> the -dump parameter. Instead of a curly bracket, it should be consistent with 
> the round bracket.
> 126   System.err.println("Usage: LinkDbReader  {-dump  | 
> -url )");
>  





[jira] [Created] (NUTCH-1023) Trivial error in error message for org.apache.nutch.crawl.LinkDbReader

2011-06-28 Thread Lewis John McGibbney (JIRA)
Trivial error in error message for org.apache.nutch.crawl.LinkDbReader
--

 Key: NUTCH-1023
 URL: https://issues.apache.org/jira/browse/NUTCH-1023
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Affects Versions: 1.3
Reporter: Lewis John McGibbney
Priority: Trivial
 Fix For: 1.4, 2.0


The following line in the above class has a trivial error in syntax before the 
-dump parameter. Instead of a curly bracket, it should be consistent with the 
round bracket.

126   System.err.println("Usage: LinkDbReader  {-dump  | 
-url )");
 





[Nutch Wiki] Trivial Update of "bin/nutch mergedb" by LewisJohnMcgibbney

2011-06-28 Thread Apache Wiki
The "bin/nutch mergedb" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch%20mergedb?action=diff&rev1=3&rev2=4

  
  Usage: 
  {{{
- bin/nutch CrawlDbMerger   [  
...] [-normalize] [-filter]");
+ bin/nutch CrawlDbMerger   [  
...] [-normalize] [-filter]
  }}}
  
  '': This allows us to specify a name for the new merged 
output crawldb. 


[Nutch Wiki] Trivial Update of "bin/nutch mergedb" by LewisJohnMcgibbney

2011-06-28 Thread Apache Wiki
The "bin/nutch mergedb" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch%20mergedb?action=diff&rev1=2&rev2=3

  Mergedb is an alias for org.apache.nutch.crawl.CrawlDbMerger
  
- This tool merges several crawldb's into one, optionally filtering URLs 
through the current URLFilters, to skip prohibited pages. It is possible to use 
this tool just for filtering - in that case only one crawldb should be 
specified in arguments. If more than one CrawlDb contains information about the 
same URL, only the most recent version is retained, as determined by the value 
of ' org.apache.nutch.crawl.CrawlDatum#getFetchTime()'. However, all 
metadata information from all versions is accumulated, with newer values taking 
precedence over older values.
+ This tool merges several crawldb's into one, optionally filtering URLs 
through the current URLFilters, to skip prohibited pages. It is possible to use 
this tool just for filtering - in that case only one crawldb should be 
specified in arguments. If more than one crawldb contains information about the 
same URL, only the most recent version is retained, as determined by the value 
of  org.apache.nutch.crawl.CrawlDatum#getFetchTime(). However, all metadata 
information from all versions is accumulated, with newer values taking 
precedence over older values.
   
  
  Usage: 


[Nutch Wiki] Trivial Update of "bin/nutch mergedb" by LewisJohnMcgibbney

2011-06-28 Thread Apache Wiki
The "bin/nutch mergedb" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch%20mergedb?action=diff&rev1=1&rev2=2

Comment:
trivial formatting

  Mergedb is an alias for org.apache.nutch.crawl.CrawlDbMerger
  
- This tool merges several CrawlDb's into one, optionally filtering URLs 
through the current URLFilters, to skip prohibited pages. It is possible to use 
this tool just for filtering - in that case only one CrawlDb should be 
specified in arguments. If more than one CrawlDb contains information about the 
same URL, only the most recent version is retained, as determined by the value 
of {@link org.apache.nutch.crawl.CrawlDatum#getFetchTime()}. However, all 
metadata information from all versions is accumulated, with newer values taking 
precedence over older values.
+ This tool merges several crawldb's into one, optionally filtering URLs 
through the current URLFilters, to skip prohibited pages. It is possible to use 
this tool just for filtering - in that case only one crawldb should be 
specified in arguments. If more than one CrawlDb contains information about the 
same URL, only the most recent version is retained, as determined by the value 
of ' org.apache.nutch.crawl.CrawlDatum#getFetchTime()'. However, all 
metadata information from all versions is accumulated, with newer values taking 
precedence over older values.
   
  
  Usage: 


[Nutch Wiki] Update of "bin/nutch mergedb" by LewisJohnMcgibbney

2011-06-28 Thread Apache Wiki

The "bin/nutch mergedb" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch%20mergedb

Comment:
Update to reflect changes in Nutch 1.3 API and classes.

New page:
Mergedb is an alias for org.apache.nutch.crawl.CrawlDbMerger

This tool merges several CrawlDb's into one, optionally filtering URLs through 
the current URLFilters, to skip prohibited pages. It is possible to use this 
tool just for filtering - in that case only one CrawlDb should be specified in 
arguments. If more than one CrawlDb contains information about the same URL, 
only the most recent version is retained, as determined by the value of {@link 
org.apache.nutch.crawl.CrawlDatum#getFetchTime()}. However, all metadata 
information from all versions is accumulated, with newer values taking 
precedence over older values.
 

Usage: 
{{{
bin/nutch CrawlDbMerger   [  ...] 
[-normalize] [-filter]
}}}

'': This allows us to specify a name for the new merged 
output crawldb. 

'': Only one crawldb parameter is used if we only wish to filter 
URLs through the current URLFilters. This enables us to filter unwanted pages 
from the crawldb.

'''[  ...]]''': Two or more crawldb arguments can be 
passed if we wish to undertake a merging of crawldb's. More information 
regarding duplication of URLs and URL metadata etc can be found above.

'''[-normalize]''': If we think that URLs need to be normalized before being 
merged, we pass this parameter. This runs the URLNormalizer on URLs in 
the crawldb(s), although this is usually not required.

'''[-filter]''': Enables us to filter URLs through the current URLFilters. This 
can be used in conjunction with a single crawldb argument.
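
The merge rule described above (the entry with the newest fetch time wins, while metadata from all versions accumulates with newer values taking precedence) can be sketched in plain Java. This is only an illustration under stated assumptions: Entry and merge are hypothetical names, not Nutch classes, and the real CrawlDbMerger works on CrawlDatum objects inside a MapReduce job.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MergeSketch {
  /** Hypothetical stand-in for one crawldb record; not the real CrawlDatum. */
  static final class Entry {
    final long fetchTime;
    final Map<String, String> metadata;

    Entry(long fetchTime, Map<String, String> metadata) {
      this.fetchTime = fetchTime;
      this.metadata = metadata;
    }
  }

  /** Merges all versions of one URL: newest fetch time wins, metadata accumulates. */
  static Entry merge(List<Entry> versions) {
    if (versions.isEmpty()) {
      throw new IllegalArgumentException("at least one version is required");
    }
    // Sort oldest first so that newer metadata values overwrite older ones.
    List<Entry> sorted = new ArrayList<>(versions);
    sorted.sort(Comparator.comparingLong((Entry e) -> e.fetchTime));

    Map<String, String> merged = new LinkedHashMap<>();
    for (Entry e : sorted) {
      merged.putAll(e.metadata);
    }
    Entry newest = sorted.get(sorted.size() - 1);
    return new Entry(newest.fetchTime, merged);
  }

  public static void main(String[] args) {
    Map<String, String> oldMeta = new LinkedHashMap<>();
    oldMeta.put("a", "old");
    oldMeta.put("b", "keep");
    Map<String, String> newMeta = new LinkedHashMap<>();
    newMeta.put("a", "new");

    List<Entry> versions = new ArrayList<>();
    versions.add(new Entry(200L, newMeta));
    versions.add(new Entry(100L, oldMeta));

    Entry result = merge(versions);
    System.out.println(result.fetchTime + " " + result.metadata);
  }
}
```

Sorting by fetch time before copying the metadata keeps the "newer values take precedence" rule in one place, whatever order the crawldbs were passed in.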


CommandLineOptions


Relevance of -dir parameter in org.apache.nutch.crawl.Crawl

2011-06-28 Thread lewis john mcgibbney
Hi,

There seem to be some conflicting arguments which currently exist in the
above class. My understanding is that we don't need the -dir parameter
anymore as this previously specified the location for the Lucene index. Is
this correct?

If this is the case then code within the class needs to be cleaned up, if
not then can someone explain to me why we require the -dir parameter. If the
former then I am happy to open a JIRA and complete when I arrive home.

Thank you

-- 
*Lewis*


[Nutch Wiki] Trivial Update of "bin/nutch_crawl" by LewisJohnMcgibbney

2011-06-28 Thread Apache Wiki

The "bin/nutch_crawl" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_crawl?action=diff&rev1=16&rev2=17

  
  Usage: 
  {{{
- bin/nutch org.apache.nutch.crawl.Crawl  [-solr ] [-threads 
n] [-depth i] }}}
+ bin/nutch org.apache.nutch.crawl.Crawl  [-solr ] [-threads 
n] [-depth i] [-topN N]
+ }}}
  
  '': Contains text files with URL lists. This must be an existing 
directory. Example would be ${NUTCH_HOME}/urls
  


[Nutch Wiki] Trivial Update of "bin/nutch_crawl" by LewisJohnMcgibbney

2011-06-28 Thread Apache Wiki

The "bin/nutch_crawl" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_crawl?action=diff&rev1=15&rev2=16

  
  Usage: 
  {{{
+ bin/nutch org.apache.nutch.crawl.Crawl  [-solr ] [-threads 
n] [-depth i] }}}
- bin/nutch org.apache.nutch.crawl.Crawl (-local | -ndfs ) 
 [-threads n] [-depth i] [-showThreadID] [-solrindex s]
- }}}
  
- '': Contains text files with URL lists. This must be 
an existing directory. Example would be ${NUTCH_HOME}/urls
+ '': Contains text files with URL lists. This must be an existing 
directory. Example would be ${NUTCH_HOME}/urls
+ 
+ '''[-solr ]''': Enables us to pass our Solr instance as an indexing 
parameter to simplify the process of indexing with Solr.
  
  '''[-threads n]''': This parameter enables you to choose how many threads 
Nutch should use when crawling.
  
  '''[-depth i]''': You can tell Nutch how deep it should crawl. If you don't 
specify a value, Nutch uses 5 as its default. 
  For example, if you pass -depth 1 as the parameter, Nutch will only index the 
first level. If you pass -depth 2 (or more), Nutch will follow outlinks to that 
many levels.
  
+ '''[-topN N]''': The maximum number of outlinks Nutch will obtain from any 
one page.
- '''[-solrindex s]''': Enables us to pass our Solr instance as an indexing 
parameter to simplify the process of indexing with Solr.
- 
- '''[-showThreadID]''': 
- 
- '''-local''':
- 
- '''-ndfs ''':
- 
  
  CommandLineOptions
  


Re: [jira] [Commented] (NUTCH-1019) Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy

2011-06-28 Thread Markus Jelsma
have a pleasant holiday!

> [
> https://issues.apache.org/jira/browse/NUTCH-1019?page=com.atlassian.jira.p
> lugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056715#comm
> ent-13056715 ]
> 
> Lewis John McGibbney commented on NUTCH-1019:
> -
> 
> Yes I will do when I get home from vacation. As this is trivial I'm sure it
> can wait until next week when I will be able to get it sorted out.
> 
> > Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy
> > -
> > 
> > Key: NUTCH-1019
> > URL: https://issues.apache.org/jira/browse/NUTCH-1019
> > 
> > Project: Nutch
> >  
> >  Issue Type: Improvement
> >  Components: documentation
> >
> >Affects Versions: 1.4, 2.0
> >
> >Reporter: Lewis John McGibbney
> >Priority: Trivial
> >
> > Fix For: 1.4, 2.0
> > 
> > When updating the wiki documentation for command line options, I noticed
> > that the comment on line 51 of the above class is inaccurate and needs
> > to be updated to reflect changes. Although this is a trivial task I
> > won't be able to commit until 2nd week July. Can I ask someone else
> > please?
> 
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1019) Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy

2011-06-28 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056715#comment-13056715
 ] 

Lewis John McGibbney commented on NUTCH-1019:
-

Yes I will do when I get home from vacation. As this is trivial I'm sure it can 
wait until next week when I will be able to get it sorted out.

> Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy
> -
>
> Key: NUTCH-1019
> URL: https://issues.apache.org/jira/browse/NUTCH-1019
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.4, 2.0
>Reporter: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.4, 2.0
>
>
> When updating the wiki documentation for command line options, I noticed that 
> the comment on line 51 of the above class is inaccurate and needs to be 
> updated to reflect changes. Although this is a trivial task I won't be able 
> to commit until 2nd week July. Can I ask someone else please?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-961:


Patch Info: [Patch Available]

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, 
> NUTCH-961-1.3-tikaparser1.patch, NUTCH-961v2.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler which can be used to 
> extract boilerplate content from HTML pages. We should see how we can expose 
> Boilerpipe in the Nutch configuration.





[jira] [Resolved] (NUTCH-1012) Cannot handle illegal charset $charset

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1012.
--

Resolution: Fixed
  Assignee: Markus Jelsma

Committed for 1.4 rev. 1140695 and for trunk in rev. 1140696.

> Cannot handle illegal charset $charset
> --
>
> Key: NUTCH-1012
> URL: https://issues.apache.org/jira/browse/NUTCH-1012
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1012-1.4.patch
>
>
> Pages returning:
> {code}
> Content-Type: text/html; charset=$charset
> {code}
> cause:
> {code}
> Error parsing: http://host/: failed(2,200): 
> java.nio.charset.IllegalCharsetNameException: $charset
> Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: 
> Followed by 3999
> ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
> {code}
> Stack trace:
> {code}
> 2011-06-24 01:14:23,442 WARN  parse.html - 
> java.nio.charset.IllegalCharsetNameException: $charset
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.checkName(Charset.java:284)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.lookup2(Charset.java:458)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.lookup(Charset.java:437)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.isSupported(Charset.java:479)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> java.lang.Thread.run(Thread.java:662)
> 2011-06-24 01:14:23,443 WARN  parse.ParseSegment - Error parsing: 
> http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
> {code}
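
The stack trace above shows Charset.isSupported throwing for a syntactically illegal name instead of returning false. A minimal sketch of guarding that call follows. It is not the committed NUTCH-1012 patch: the class name, method name and fallback parameter are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

final class CharsetGuardSketch {
  /**
   * Returns the given charset name if the JVM supports it, otherwise the fallback.
   * Illegal names such as "$charset" are treated like unsupported ones.
   */
  static String resolveOrDefault(String name, String fallback) {
    if (name == null) {
      return fallback; // isSupported(null) would throw IllegalArgumentException
    }
    try {
      return Charset.isSupported(name) ? name : fallback;
    } catch (IllegalCharsetNameException e) {
      // The name contains characters that are not allowed in charset names.
      return fallback;
    }
  }

  public static void main(String[] args) {
    System.out.println(resolveOrDefault("$charset", "utf-8"));
    System.out.println(resolveOrDefault("UTF-8", "windows-1252"));
  }
}
```

With a guard like this, a bad Content-Type header degrades to the fallback encoding instead of failing the whole parse.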





[jira] [Commented] (NUTCH-1000) Add option not to commit to Solr

2011-06-28 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056576#comment-13056576
 ] 

Markus Jelsma commented on NUTCH-1000:
--

Committed for 1.4 in rev. 1140685.

> Add option not to commit to Solr
> 
>
> Key: NUTCH-1000
> URL: https://issues.apache.org/jira/browse/NUTCH-1000
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.3, 1.4, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1000-1.4-2.patch, NUTCH-1000-1.4.patch
>
>
> We need an option to prevent a job from sending a commit to Solr. A commit 
> can take a lot of resources (cache warming) and it's not always necessary to 
> commit after index, dedup or clean, especially if they are run immediately 
> after the other.





[jira] [Updated] (NUTCH-1000) Add option not to commit to Solr

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1000:
-

Attachment: NUTCH-1000-1.4-2.patch

Added patch with method overrides for indexer and dedup. The lack of them 
caused the crawl.java to fail. It didn't show up in my builds before just now.



> Add option not to commit to Solr
> 
>
> Key: NUTCH-1000
> URL: https://issues.apache.org/jira/browse/NUTCH-1000
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.3, 1.4, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1000-1.4-2.patch, NUTCH-1000-1.4.patch
>
>
> We need an option to prevent a job from sending a commit to Solr. A commit 
> can take a lot of resources (cache warming) and it's not always necessary to 
> commit after index, dedup or clean, especially if they are run immediately 
> after the other.





[jira] [Issue Comment Edited] (NUTCH-1017) Exception getting mime type by name

2011-06-28 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056564#comment-13056564
 ] 

Markus Jelsma edited comment on NUTCH-1017 at 6/28/11 3:12 PM:
---

Another curiosity, this error is not to be found in the hadoop.log file 
although it clearly uses the logging facility:


{code}
  public MimeType forName(String name) {
    try {
      return this.mimeTypes.forName(name);
    } catch (MimeTypeException e) {
      LOG.warning("Exception getting mime type by name: [" + name
          + "]: Message: " + e.getMessage());
      return null;
    }
  }
{code}
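
One possible explanation, offered as an assumption rather than a confirmed diagnosis: warning(String) is a method of java.util.logging.Logger, and the "Jun 27, 2011 9:23:27 PM ... forName" lines in the issue description look like that framework's default console format. A java.util.logging logger writes to its own handlers and does not go through the log4j configuration that feeds hadoop.log. The sketch below uses a hypothetical logger name and a capturing handler to show where such a message ends up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class JulRoutingSketch {
  /** Collects records in memory so the routing can be inspected. */
  static final class CapturingHandler extends Handler {
    final List<String> messages = new ArrayList<>();

    @Override
    public void publish(LogRecord record) {
      messages.add(record.getLevel() + ": " + record.getMessage());
    }

    @Override
    public void flush() {
    }

    @Override
    public void close() {
    }
  }

  /** Logs one warning through java.util.logging and returns what the handler saw. */
  static List<String> logThroughJul(String name) {
    Logger log = Logger.getLogger("sketch.MimeUtil"); // hypothetical logger name
    log.setUseParentHandlers(false); // keep the demo off the default console handler
    CapturingHandler handler = new CapturingHandler();
    log.addHandler(handler);
    try {
      log.warning("Exception getting mime type by name: [" + name + "]");
    } finally {
      log.removeHandler(handler);
    }
    return handler.messages;
  }

  public static void main(String[] args) {
    System.out.println(logThroughJul("Mime-Type"));
  }
}
```

If that assumption holds, the message only reaches handlers registered with java.util.logging, which would explain why it shows up on the console but not in hadoop.log.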

  was (Author: markus17):
Another curiosity, this error is not to be found in the hadoop.log file! 
Does anyone know where this is coming from? 
  
> Exception getting mime type by name
> ---
>
> Key: NUTCH-1017
> URL: https://issues.apache.org/jira/browse/NUTCH-1017
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> Large crawls of `bad` websites tend to produce a lot of parsing errors. One 
> of them is related to retrieving mime types, so it seems:
> {code}
> WARNING: Exception getting mime type by name: []: Message: 
> Invalid media type name: 
> Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
> WARNING: Exception getting mime type by name: []: Message: 
> Invalid media type name: 
> Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
> WARNING: Exception getting mime type by name: [Mime-Type]: Message: Invalid 
> media type name: Mime-Type
> Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
> WARNING: Exception getting mime type by name: []: Message: 
> Invalid media type name: 
> Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
> WARNING: Exception getting mime type by name: [text/html charset=utf-8]: 
> Message: Invalid media type name: text/html charset=utf-8
> {code}





[jira] [Commented] (NUTCH-1017) Exception getting mime type by name

2011-06-28 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056564#comment-13056564
 ] 

Markus Jelsma commented on NUTCH-1017:
--

Another curiosity, this error is not to be found in the hadoop.log file! Does 
anyone know where this is coming from? 

> Exception getting mime type by name
> ---
>
> Key: NUTCH-1017
> URL: https://issues.apache.org/jira/browse/NUTCH-1017
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> Large crawls of `bad` websites tend to produce a lot of parsing errors. One 
> of them is related to retrieving mime types, so it seems:
> {code}
> WARNING: Exception getting mime type by name: []: Message: 
> Invalid media type name: 
> Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
> WARNING: Exception getting mime type by name: []: Message: 
> Invalid media type name: 
> Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
> WARNING: Exception getting mime type by name: [Mime-Type]: Message: Invalid 
> media type name: Mime-Type
> Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
> WARNING: Exception getting mime type by name: []: Message: 
> Invalid media type name: 
> Jun 27, 2011 9:23:27 PM org.apache.nutch.util.MimeUtil forName
> WARNING: Exception getting mime type by name: [text/html charset=utf-8]: 
> Message: Invalid media type name: text/html charset=utf-8
> {code}





[jira] [Updated] (NUTCH-1021) Migrate OutlinkExtractor from Apache ORO to java.util.regex

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1021:
-

Patch Info: [Patch Available]

> Migrate OutlinkExtractor from Apache ORO to java.util.regex 
> 
>
> Key: NUTCH-1021
> URL: https://issues.apache.org/jira/browse/NUTCH-1021
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1021-1.4-2.patch, NUTCH-1021-1.4.patch
>
>
> Migrate from deprecated ORO to Java util regex.





[jira] [Updated] (NUTCH-1021) Migrate OutlinkExtractor from Apache ORO to java.util.regex

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1021:
-

Attachment: NUTCH-1021-1.4-2.patch

Reworked patch to pass unit test. It still complains about a failed build but I 
cannot seem to find the cause, maybe somewhere in the plugins. Is there an 
easier way to find plugins that fail the test? This is really cumbersome.

> Migrate OutlinkExtractor from Apache ORO to java.util.regex 
> 
>
> Key: NUTCH-1021
> URL: https://issues.apache.org/jira/browse/NUTCH-1021
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1021-1.4-2.patch, NUTCH-1021-1.4.patch
>
>
> Migrate from deprecated ORO to Java util regex.





[jira] [Commented] (NUTCH-1021) Migrate OutlinkExtractor from Apache ORO to java.util.regex

2011-06-28 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056541#comment-13056541
 ] 

Markus Jelsma commented on NUTCH-1021:
--

ant test fails for OutlinkExtractor. Does anyone know why no failure or error 
is reported? Also, how can I run this test on its own, apart from the others? 
It's more than cumbersome to run all tests every time.

{code}
Testsuite: org.apache.nutch.parse.TestOutlinkExtractor
Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.018 sec
- Standard Output ---
2011-06-28 16:34:13,518 ERROR parse.OutlinkExtractor 
(OutlinkExtractor.java:getOutlinks(111)) - getOutlinks
java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1140)
at java.util.regex.Matcher.reset(Matcher.java:291)
at java.util.regex.Matcher.<init>(Matcher.java:211)
at java.util.regex.Pattern.matcher(Pattern.java:888)
at 
org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:85)
at 
org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:66)
at 
org.apache.nutch.parse.TestOutlinkExtractor.testGetNoOutlinks(TestOutlinkExtractor.java:40)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:232)
at junit.framework.TestSuite.run(TestSuite.java:227)
at 
org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:79)
at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39)
at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:420)
at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:911)
at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:768)
{code}
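
The trace above starts in Pattern.matcher, which throws a NullPointerException when it is handed a null character sequence, and the call comes from testGetNoOutlinks. Since the trace is written through the logger at ERROR level, the exception is presumably caught inside getOutlinks, which may be why JUnit still reports zero failures and errors. A minimal sketch of a null-safe extraction with java.util.regex follows. The class name, method name and the deliberately simplified URL pattern are assumptions and are not taken from the NUTCH-1021 patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OutlinkRegexSketch {
  // Simplified pattern for illustration only; real URL matching needs more care.
  private static final Pattern URL_PATTERN =
      Pattern.compile("https?://[^\\s<>\"']+");

  /** Returns all URL-like substrings, or an empty list for null input. */
  static List<String> extractUrls(String plainText) {
    List<String> urls = new ArrayList<>();
    if (plainText == null) {
      return urls; // URL_PATTERN.matcher(null) would throw NullPointerException
    }
    Matcher matcher = URL_PATTERN.matcher(plainText);
    while (matcher.find()) {
      urls.add(matcher.group());
    }
    return urls;
  }

  public static void main(String[] args) {
    System.out.println(extractUrls(null));
    System.out.println(extractUrls("see http://example.com/a for details"));
  }
}
```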

> Migrate OutlinkExtractor from Apache ORO to java.util.regex 
> 
>
> Key: NUTCH-1021
> URL: https://issues.apache.org/jira/browse/NUTCH-1021
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1021-1.4.patch
>
>
> Migrate from deprecated ORO to Java util regex.





[jira] [Updated] (NUTCH-1021) Migrate OutlinkExtractor from Apache ORO to java.util.regex

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1021:
-

Attachment: NUTCH-1021-1.4.patch

Here's a patch for 1.4. It compiles against trunk as well.

> Migrate OutlinkExtractor from Apache ORO to java.util.regex 
> 
>
> Key: NUTCH-1021
> URL: https://issues.apache.org/jira/browse/NUTCH-1021
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1021-1.4.patch
>
>
> Migrate from deprecated ORO to Java util regex.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (NUTCH-1021) Migrate OutlinkExtractor from Apache ORO to java.util.regex

2011-06-28 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056487#comment-13056487
 ] 

Markus Jelsma edited comment on NUTCH-1021 at 6/28/11 2:13 PM:
---

Hm, the class o.a.n.parse.OutlinkExtractor seems to be only in use by parse-ext 
and parse-tika, and in the latter it is only used as a fallback in case 
parse-tika cannot find outlinks itself. 

  was (Author: markus17):
Hm, the class o.a.n.parse.OutlinkExtractor seems to be only in use by 
parse-ext and parse-tika, for which in the latter it is only used as a fallback 
in case parse-tika cannot find outlinks itself.  Note to self: this may be 
useful for NUTCH-961.
  
> Migrate OutlinkExtractor from Apache ORO to java.util.regex 
> 
>
> Key: NUTCH-1021
> URL: https://issues.apache.org/jira/browse/NUTCH-1021
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1021-1.4.patch
>
>
> Migrate from deprecated ORO to Java util regex.





[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1016:
-

Attachment: NUTCH-1016-1.4-4.patch

Previous patch included debug line to stdout. Removed now.

> Strip UTF-8 non-character codepoints
> 
>
> Key: NUTCH-1016
> URL: https://issues.apache.org/jira/browse/NUTCH-1016
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1016-1.4-4.patch
>
>
> During a very large crawl I found a few documents producing non-character 
> codepoints. When indexing to Solr this will yield the following exception:
> {code}
> SEVERE: java.lang.RuntimeException: [was class 
> java.io.CharConversionException] Invalid UTF-8 character 0x at char 
> #1142033, byte #1155068)
> at 
> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> at 
> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> {code}
> Quite annoying! Here's a quick fix for SolrWriter that'll pass the value of the 
> content field to a method to strip away non-characters. I'm not too sure 
> about this implementation but the tests I've done locally with a huge dataset 
> now pass correctly. Here's a list of codepoints to strip away: 
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> Please comment!
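
As an illustration of the approach described above, the following sketch strips Unicode noncharacter code points (U+FDD0 through U+FDEF, plus the last two code points of every plane) from a string. It is not the attached patch: the class and method names are hypothetical, and the control-character check mentioned for a later patch revision is left out.

```java
final class NonCharacterStripSketch {
  /** True for the 66 code points Unicode designates as noncharacters. */
  static boolean isNonCharacter(int codePoint) {
    return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
        || (codePoint & 0xFFFE) == 0xFFFE; // U+xFFFE and U+xFFFF in every plane
  }

  /** Returns the input without noncharacter code points; null stays null. */
  static String stripNonCharacters(String input) {
    if (input == null) {
      return null;
    }
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); ) {
      int codePoint = input.codePointAt(i);
      i += Character.charCount(codePoint); // step over surrogate pairs as one unit
      if (!isNonCharacter(codePoint)) {
        out.appendCodePoint(codePoint);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(stripNonCharacters("a\uFFFFb\uFDD0c")); // prints "abc"
  }
}
```

Iterating by code point rather than by char means supplementary-plane noncharacters such as U+1FFFE are removed as a whole instead of leaving half a surrogate pair behind.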





[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1016:
-

Attachment: (was: NUTCH-1016-1.4-2.patch)

> Strip UTF-8 non-character codepoints
> 
>
> Key: NUTCH-1016
> URL: https://issues.apache.org/jira/browse/NUTCH-1016
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> During a very large crawl I found a few documents producing non-character 
> codepoints. When indexing to Solr this will yield the following exception:
> {code}
> SEVERE: java.lang.RuntimeException: [was class 
> java.io.CharConversionException] Invalid UTF-8 character 0x at char 
> #1142033, byte #1155068)
> at 
> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> at 
> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> {code}
> Quite annoying! Here's a quick fix for SolrWriter that'll pass the value of the 
> content field to a method to strip away non-characters. I'm not too sure 
> about this implementation but the tests I've done locally with a huge dataset 
> now pass correctly. Here's a list of codepoints to strip away: 
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> Please comment!





[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1016:
-

Attachment: NUTCH-1016-1.4-3.patch

New patch also includes checking for non-printable control characters.

> Strip UTF-8 non-character codepoints
> 
>
> Key: NUTCH-1016
> URL: https://issues.apache.org/jira/browse/NUTCH-1016
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> During a very large crawl I found a few documents producing non-character 
> codepoints. When indexing to Solr this will yield the following exception:
> {code}
> SEVERE: java.lang.RuntimeException: [was class 
> java.io.CharConversionException] Invalid UTF-8 character 0x at char 
> #1142033, byte #1155068)
> at 
> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> at 
> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> {code}
> Quite annoying! Here's a quick fix for SolrWriter that'll pass the value of the 
> content field to a method to strip away non-characters. I'm not too sure 
> about this implementation but the tests I've done locally with a huge dataset 
> now pass correctly. Here's a list of codepoints to strip away: 
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> Please comment!





[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1016:
-

Attachment: (was: NUTCH-1016-1.4-3.patch)

> Strip UTF-8 non-character codepoints
> 
>
> Key: NUTCH-1016
> URL: https://issues.apache.org/jira/browse/NUTCH-1016
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> During a very large crawl I found a few documents producing non-character 
> codepoints. When indexing to Solr this will yield the following exception:
> {code}
> SEVERE: java.lang.RuntimeException: [was class 
> java.io.CharConversionException] Invalid UTF-8 character 0x at char 
> #1142033, byte #1155068)
> at 
> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> at 
> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> {code}
> Quite annoying! Here's a quick fix for SolrWriter that'll pass the value of the 
> content field to a method to strip away non-characters. I'm not too sure 
> about this implementation but the tests I've done locally with a huge dataset 
> now pass correctly. Here's a list of codepoints to strip away: 
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> Please comment!

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (NUTCH-1022) Upgrade version number of Nutch agent in conf

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1022.
--

Resolution: Fixed

Committed in rev. 1140619.

> Upgrade version number of Nutch agent in conf
> -
>
> Key: NUTCH-1022
> URL: https://issues.apache.org/jira/browse/NUTCH-1022
> Project: Nutch
> Issue Type: Task
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Trivial
> Fix For: 1.4
>
> Attachments: NUTCH-1022-1.4.patch
>
>






[jira] [Closed] (NUTCH-1022) Upgrade version number of Nutch agent in conf

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-1022.



> Upgrade version number of Nutch agent in conf
> -
>
> Key: NUTCH-1022
> URL: https://issues.apache.org/jira/browse/NUTCH-1022
> Project: Nutch
> Issue Type: Task
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Trivial
> Fix For: 1.4
>
> Attachments: NUTCH-1022-1.4.patch
>
>






[jira] [Created] (NUTCH-1022) Upgrade version number of Nutch agent in conf

2011-06-28 Thread Markus Jelsma (JIRA)
Upgrade version number of Nutch agent in conf
-

Key: NUTCH-1022
URL: https://issues.apache.org/jira/browse/NUTCH-1022
Project: Nutch
Issue Type: Task
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
Fix For: 1.4
Attachments: NUTCH-1022-1.4.patch







[jira] [Updated] (NUTCH-1022) Upgrade version number of Nutch agent in conf

2011-06-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1022:
-

Attachment: NUTCH-1022-1.4.patch

> Upgrade version number of Nutch agent in conf
> -
>
> Key: NUTCH-1022
> URL: https://issues.apache.org/jira/browse/NUTCH-1022
> Project: Nutch
> Issue Type: Task
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Trivial
> Fix For: 1.4
>
> Attachments: NUTCH-1022-1.4.patch
>
>






[jira] [Commented] (NUTCH-1021) Migrate OutlinkExtractor from Apache ORO to java.util.regex

2011-06-28 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056487#comment-13056487
 ] 

Markus Jelsma commented on NUTCH-1021:
--

Hm, the class o.a.n.parse.OutlinkExtractor seems to be used only by parse-ext 
and parse-tika; in the latter it serves only as a fallback in case 
parse-tika cannot find outlinks itself. Note to self: this may be useful for 
NUTCH-961.

> Migrate OutlinkExtractor from Apache ORO to java.util.regex 
> 
>
> Key: NUTCH-1021
> URL: https://issues.apache.org/jira/browse/NUTCH-1021
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.3
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> Migrate from the deprecated Apache ORO library to java.util.regex.
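
For illustration, a rough sketch of what URL extraction from plain text looks like with java.util.regex alone. It is not the OutlinkExtractor code: the class name, method name and the URL pattern are assumptions made for this example, and the pattern is deliberately simple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexOutlinkSketch {

    // Hypothetical pattern: a scheme followed by anything up to whitespace,
    // angle brackets or quotes. Not the expression used in Nutch.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "\\b(?:https?|ftp)://[^\\s<>\"']+", Pattern.CASE_INSENSITIVE);

    private RegexOutlinkSketch() {}

    /** Returns every substring of the text that matches URL_PATTERN, in order. */
    static List<String> extractUrls(String plainText) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }
}
```

A compiled Pattern is immutable and can be shared between threads, while each Matcher is created per input. Note that this simple pattern keeps trailing punctuation (for example a full stop directly after a URL) as part of the match.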





[jira] [Created] (NUTCH-1021) Migrate OutlinkExtractor from Apache ORO to java.util.regex

2011-06-28 Thread Markus Jelsma (JIRA)
Migrate OutlinkExtractor from Apache ORO to java.util.regex 


Key: NUTCH-1021
URL: https://issues.apache.org/jira/browse/NUTCH-1021
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.4, 2.0


Migrate from the deprecated Apache ORO library to java.util.regex.





[jira] [Commented] (NUTCH-1020) Create or locate class for org.apache.nutch.tools.compat.CrawlDbConverter

2011-06-28 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056470#comment-13056470
 ] 

Markus Jelsma commented on NUTCH-1020:
--

I'm not sure we need to reincorporate this converter. At least, we still hope 
to launch 2.0 this year, right? 

> Create or locate class for org.apache.nutch.tools.compat.CrawlDbConverter
> -
>
> Key: NUTCH-1020
> URL: https://issues.apache.org/jira/browse/NUTCH-1020
> Project: Nutch
> Issue Type: Task
> Components: linkdb
> Affects Versions: 1.3, 1.4, 2.0
> Reporter: Lewis John McGibbney
> Labels: features
> Fix For: 1.4, 2.0
>
>
> Whilst updating the CommandLineOptions for release 1.3 on the wiki, I noticed 
> that the above class does not exist in the expected location in the /src folder. 
> Having looked further afield, it appears that this class (which is meant to 
> convert a Nutch 0.9 WebDB to the 1.3 WebDB format) does not exist.





[jira] [Commented] (NUTCH-1019) Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy

2011-06-28 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056451#comment-13056451
 ] 

Markus Jelsma commented on NUTCH-1019:
--

Can you provide a patch?

> Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy
> -
>
> Key: NUTCH-1019
> URL: https://issues.apache.org/jira/browse/NUTCH-1019
> Project: Nutch
> Issue Type: Improvement
> Components: documentation
> Affects Versions: 1.4, 2.0
> Reporter: Lewis John McGibbney
> Priority: Trivial
> Fix For: 1.4, 2.0
>
>
> When updating the wiki documentation for command line options, I noticed that 
> the comment on line 51 of the above class is inaccurate and needs to be 
> updated to reflect changes. Although this is a trivial task, I won't be able 
> to commit until the 2nd week of July. Can I ask someone else to do it, please?
