date:20070509

Build failed in Hudson: Nutch-Nightly #80

2007-05-09 Thread hudson

See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/80/

--
started
Checking out http://svn.apache.org/repos/asf/lucene/nutch/trunk
A NOTICE.txt
A default.properties
A LICENSE.txt
A contrib
A contrib/web2
A contrib/web2/plugins
A contrib/web2/plugins/web-keymatch
A contrib/web2/plugins/web-keymatch/lib
A contrib/web2/plugins/web-keymatch/src
A contrib/web2/plugins/web-keymatch/src/test
A contrib/web2/plugins/web-keymatch/src/test/org
A contrib/web2/plugins/web-keymatch/src/test/org/apache
A contrib/web2/plugins/web-keymatch/src/test/org/apache/nutch
A contrib/web2/plugins/web-keymatch/src/test/org/apache/nutch/keymatch
A 
contrib/web2/plugins/web-keymatch/src/test/org/apache/nutch/keymatch/TestViewCountSorter.java
A 
contrib/web2/plugins/web-keymatch/src/test/org/apache/nutch/keymatch/TestSimpleKeyMatcher.java
A contrib/web2/plugins/web-keymatch/src/java
A contrib/web2/plugins/web-keymatch/src/java/org
A contrib/web2/plugins/web-keymatch/src/java/org/apache
A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch
A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch
A 
contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/ViewCountSorter.java
A 
contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/KeyMatch.java
A 
contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/SimpleKeyMatcher.java
A 
contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/AbstractFilter.java
A 
contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/KeyMatchFilter.java
A 
contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/CountFilter.java
A 
contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/package.html
A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/webapp
A 
contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/webapp/controller
A 
contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/webapp/controller/KeyMatchController.java
A contrib/web2/plugins/web-keymatch/src/conf
A contrib/web2/plugins/web-keymatch/src/conf/tiles-defs.xml
A contrib/web2/plugins/web-keymatch/src/resources
A contrib/web2/plugins/web-keymatch/src/web
A contrib/web2/plugins/web-keymatch/src/web/web-keymatch
A contrib/web2/plugins/web-keymatch/src/web/web-keymatch/keymatch.jsp
A contrib/web2/plugins/web-keymatch/README.txt
A contrib/web2/plugins/web-keymatch/keymatches.xml
A contrib/web2/plugins/web-keymatch/plugin.xml
A contrib/web2/plugins/web-keymatch/build.xml
A contrib/web2/plugins/web-query-propose-spellcheck
A contrib/web2/plugins/web-query-propose-spellcheck/src
A contrib/web2/plugins/web-query-propose-spellcheck/src/test
A contrib/web2/plugins/web-query-propose-spellcheck/src/java
A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org
A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache
A 
contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch
A 
contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/spell
A 
contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/spell/SpellCheckerTerms.java
A 
contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/spell/SpellCheckerBean.java
A 
contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/spell/NGramSpeller.java
A 
contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/spell/SpellCheckerTerm.java
A 
contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/webapp
A 
contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/webapp/controller
A 
contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/webapp/controller/SpellCheckController.java
A contrib/web2/plugins/web-query-propose-spellcheck/src/conf
A 
contrib/web2/plugins/web-query-propose-spellcheck/src/conf/tiles-defs.xml
A contrib/web2/plugins/web-query-propose-spellcheck/src/resources
A contrib/web2/plugins/web-query-propose-spellcheck/src/web
A 
contrib/web2/plugins/web-query-propose-spellcheck/src/web/web-query-propose-spellcheck
A 
contrib/web2/plugins/web-query-propose-spellcheck/src/web/web-query-propose-spellcheck/propose.jsp
A contrib/web2/plugins/web-query-propose-spellcheck/plugin.xml
A contrib/web2/plugins/web-query-propose-spellcheck/build.xml
A contrib/web2/plugins/web-subcollection
A

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-09 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-443:


Attachment: NUTCH-443.08052007.patch

Patch updated to latest trunk.

 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
 Assigned To: Chris A. Mattmann
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
 NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
 NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
 NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, 
 NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, 
 parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff


 allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
 can return multiple parse objects, that will all be indexed separately. 
 Advantage: no need to fetch all feed-items separately.
 see the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
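
 The idea in this description can be sketched roughly as follows. This is 
 only an illustration, not the code from the attached patches: FeedItem, 
 SimpleParse and parseFeed are hypothetical stand-ins, not Nutch classes, and 
 the real Parser/Parse signatures may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one entry of an RSS feed (not a Nutch class).
final class FeedItem {
    final String url;
    final String text;
    FeedItem(String url, String text) { this.url = url; this.text = text; }
}

// Hypothetical stand-in for a parse result (not Nutch's Parse interface).
final class SimpleParse {
    final String text;
    SimpleParse(String text) { this.text = text; }
}

final class MultiParseSketch {
    // One pass over the already-fetched feed yields one parse per item,
    // keyed by the item's URL, so each item can be indexed separately
    // without fetching it again.
    static Map<String, SimpleParse> parseFeed(List<FeedItem> items) {
        Map<String, SimpleParse> result = new LinkedHashMap<>();
        for (FeedItem item : items) {
            result.put(item.url, new SimpleParse(item.text));
        }
        return result;
    }
}
```

 The map key plays the role of the document URL, which is what lets the 
 indexer treat every feed item as its own document.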

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-470) Adding optional terms to a query

2007-05-09 Thread JIRA


[ 
https://issues.apache.org/jira/browse/NUTCH-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494496
 ] 

Ronny Næss commented on NUTCH-470:
--

Hi, Trond. 

What does "optional" mean here?
I would like more Lucene-based queries, with the possibility of queries like 
fieldname1:term1 fieldname2:term2 ... (see 
http://lucene.apache.org/java/docs/queryparsersyntax.html). Is that what this 
is?

 Adding optional terms to a query
 

 Key: NUTCH-470
 URL: https://issues.apache.org/jira/browse/NUTCH-470
 Project: Nutch
  Issue Type: Wish
  Components: searcher
Affects Versions: 0.9.0
 Environment: Any
Reporter: Trond Andersen
Priority: Minor
 Attachments: optional.patch


 I'm missing an API to add optional terms in the Query class. I made a small 
 adjustment to the API to support this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2007-05-09 Thread Mike Schwartz


Hi,

It's been a couple of weeks since I uploaded my patches to make the 
GeoPosition plugin work on nutch 0.9.  I'm wondering whether there's 
something I can do to help the process along to get these changes 
accepted - or whether there was a problem with the code?


Thanks,
 - Mike Schwartz

At 01:15 PM 4/24/2007, Mike Schwartz (JIRA) wrote:

 [ 
https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel 
]


Mike Schwartz updated NUTCH-469:


Attachment: geoPosition0.6_cdiff.zip

I've attached the context diff from geoPosition 0.5 that I'm 
calling geoPosition 0.6, which makes it work with nutch 0.9.


 changes to geoPosition plugin to make it work on nutch 0.9
 --

 Key: NUTCH-469
 URL: https://issues.apache.org/jira/browse/NUTCH-469
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Mike Schwartz
 Fix For: 0.7.3

 Attachments: geoPosition0.6_cdiff.zip


 I have modified the geoPosition plugin 
(http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 
0.9.  (The code was built originally using nutch 0.7.)  I'd like to 
contribute my changes back to the nutch project.  I already 
communicated with the code's author (Matthias Jaekle), and he 
agrees with my mods.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-470) Adding optional terms to a query

2007-05-09 Thread Trond Andersen (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494499
 ] 

Trond Andersen commented on NUTCH-470:
--

The reason for this patch is that I don't know the whole query at once and 
would like to add more elements to the Query object as I explore relevant 
search terms. In practice, if I create a Query object with java as a term, I 
would then like to add weblogic. With this patch, the toString() method 
returns java weblogic as the string representation of the Query. I don't 
think this is the same as the Lucene search terms. 
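
A rough sketch of the behaviour described in this comment, for readers who 
have not opened the patch. SketchQuery and its method names are hypothetical; 
the actual method added by optional.patch is not shown in this thread and may 
look different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical query holder for illustration; this is not Nutch's Query class.
final class SketchQuery {
    private final List<String> terms = new ArrayList<>();

    // Term supplied when the query is first built.
    void addRequiredTerm(String term) { terms.add(term); }

    // Term added later, as more relevant search terms are discovered.
    void addOptionalTerm(String term) { terms.add(term); }

    // String form lists the terms in insertion order, separated by spaces.
    @Override
    public String toString() { return String.join(" ", terms); }
}
```

Building the query with java and later adding weblogic gives the string 
form java weblogic, which matches the result described above.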

 Adding optional terms to a query
 

 Key: NUTCH-470
 URL: https://issues.apache.org/jira/browse/NUTCH-470
 Project: Nutch
  Issue Type: Wish
  Components: searcher
Affects Versions: 0.9.0
 Environment: Any
Reporter: Trond Andersen
Priority: Minor
 Attachments: optional.patch


 I'm missing an API to add optional terms in the Query class. I made a small 
 adjustment to the API to support this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

how is crawl-urlfilter.txt taken care of?

2007-05-09 Thread Manoharam Reddy


I find four url-filters

automaton-urlfilter.txt
regex-urlfilter.txt
suffix-urlfilter.txt
crawl-urlfilter.txt

I can see plugins for the first 3 in the nutch-site.xml file but not for
the 4th one. So, how is crawl-urlfilter.txt considered by Nutch?

[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2007-05-09 Thread Sami Siren (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-469:
-

Attachment: NUTCH-469-2007-05-09.txt.gz

Thanks for putting this together. I briefly checked through the .gz and patch:

- Please use diffs against trunk in future; they're easier to check (svn diff > 
  file).
- There are no JUnit tests at all. However, there is a tiny piece of test code in 
  class GeoIndexingFilter; at least this code could perhaps be moved to a JUnit 
  test class.
- I replaced System.out.prints with logging statements.
- I changed some formatting.
- Would it make sense to move the zip folder from conf to under the plugin's 
  src/java and change the load mechanism to use the (context) class loader, as I 
  believe they are quite static pieces of information once generated?

I am attaching the patch as it is now.

 changes to geoPosition plugin to make it work on nutch 0.9
 --

 Key: NUTCH-469
 URL: https://issues.apache.org/jira/browse/NUTCH-469
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Mike Schwartz
 Fix For: 0.7.3

 Attachments: geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip, 
 NUTCH-469-2007-05-09.txt.gz


 I have modified the geoPosition plugin 
 (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9.  (The 
 code was built originally using nutch 0.7.)  I'd like to contribute my 
 changes back to the nutch project.  I already communicated with the code's 
 author (Matthias Jaekle), and he agrees with my mods.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2007-05-09 Thread Sami Siren (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-469:
-

Fix Version/s: (was: 0.7.3)
   1.0.0

 changes to geoPosition plugin to make it work on nutch 0.9
 --

 Key: NUTCH-469
 URL: https://issues.apache.org/jira/browse/NUTCH-469
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Mike Schwartz
 Fix For: 1.0.0

 Attachments: geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip, 
 NUTCH-469-2007-05-09.txt.gz


 I have modified the geoPosition plugin 
 (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9.  (The 
 code was built originally using nutch 0.7.)  I'd like to contribute my 
 changes back to the nutch project.  I already communicated with the code's 
 author (Matthias Jaekle), and he agrees with my mods.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains

2007-05-09 Thread Sami Siren (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494531
 ] 

Sami Siren commented on NUTCH-477:
--

I don't feel strongly about this, but could enums be used instead of static 
Strings/ints, since they give us type safety?

+1

 Extend URLFilters to support different filtering chains
 ---

 Key: NUTCH-477
 URL: https://issues.apache.org/jira/browse/NUTCH-477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
 Assigned To: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: urlfilters.patch


 I propose to make the following changes to URLFilters:
 * extend URLFilters so that they support different filtering rules depending 
 on the context where they are executed. This functionality mirrors the one 
 that URLNormalizers already support.
 * change their return value to an int code, in order to support early 
 termination of long filtering chains.
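
 The comment above asks whether enums could replace int codes. A minimal 
 sketch of how an enum result could still support early termination of a 
 filtering chain is given here. FilterDecision, its constants and 
 FilterChainSketch are hypothetical names for illustration; they are not 
 taken from urlfilters.patch.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical result type standing in for the proposed int code.
enum FilterDecision { CONTINUE, ACCEPT, REJECT }

final class FilterChainSketch {
    // Runs the filters in order. ACCEPT and REJECT stop the chain at once;
    // CONTINUE hands the URL to the next filter.
    static boolean accepts(String url, List<Function<String, FilterDecision>> filters) {
        for (Function<String, FilterDecision> filter : filters) {
            switch (filter.apply(url)) {
                case ACCEPT:
                    return true;   // early termination, URL kept
                case REJECT:
                    return false;  // early termination, URL dropped
                case CONTINUE:
                    break;         // leave the switch, try the next filter
            }
        }
        return true; // assumption: a URL no filter objected to is kept
    }
}
```

 With an enum the compiler rejects any value outside the three constants, 
 which is the type safety the comment refers to.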

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-472) NullPointerException in ZipTextExtractor if no MIME type for zipped file

2007-05-09 Thread Sami Siren (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494534
 ] 

Sami Siren commented on NUTCH-472:
--

have a patch?

 NullPointerException in ZipTextExtractor if no MIME type for zipped file
 

 Key: NUTCH-472
 URL: https://issues.apache.org/jira/browse/NUTCH-472
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.9.0
 Environment: Any
Reporter: Antony Bowesman

 extractText throws a NPE in
   String contentType = MIME.getMimeType(fname).getName();
 if the file in the zip has no configured mime type which breaks the parsing 
 of the zip.
 Code should do:
   public String extractText(InputStream input, String url, List outLinksList)
       throws IOException {
     String resultText = "";
     byte temp;

     ZipInputStream zin = new ZipInputStream(input);

     ZipEntry entry;

     while ((entry = zin.getNextEntry()) != null) {

       if (!entry.isDirectory()) {
         int size = (int) entry.getSize();
         byte[] b = new byte[size];
         for (int x = 0; x < size; x++) {
           int err = zin.read();
           if (err != -1) {
             b[x] = (byte) err;
           }
         }
         String newurl = url + "/";
         String fname = entry.getName();
         newurl += fname;
         URL aURL = new URL(newurl);
         String base = aURL.toString();
         int i = fname.lastIndexOf('.');
         if (i != -1) {
           // Trying to resolve the Mime-Type
           MimeType mt = MIME.getMimeType(fname);
           if (mt != null) {
             String contentType = mt.getName();
             try {
               Metadata metadata = new Metadata();
               metadata.set(Response.CONTENT_LENGTH,
                   Long.toString(entry.getSize()));
               metadata.set(Response.CONTENT_TYPE, contentType);
               Content content = new Content(newurl, base, b, contentType,
                   metadata, this.conf);
               Parse parse = new ParseUtil(this.conf).parse(content);
               ParseData theParseData = parse.getData();
               Outlink[] theOutlinks = theParseData.getOutlinks();

               for (int count = 0; count < theOutlinks.length; count++) {
                 outLinksList.add(new Outlink(theOutlinks[count].getToUrl(),
                     theOutlinks[count].getAnchor(), this.conf));
               }

               resultText += entry.getName() + " " + parse.getText() + " ";
             } catch (ParseException e) {
               if (LOG.isInfoEnabled()) {
                 LOG.info("fetch okay, but can't parse " + fname + ", reason: "
                     + e.getMessage());
               }
             }
           } else {
             resultText += entry.getName();
           }
         }
       }
     }

     return resultText;
   }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-476) Would like to add a field to the document class for its MD5 signature

2007-05-09 Thread Sami Siren (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494537
 ] 

Sami Siren commented on NUTCH-476:
--

md5 sum (or any other configurable digest) is already calculated in fetcher 
or parser and dedup can be used to remove duplicates.
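
For readers unfamiliar with how such a digest is computed, here is a minimal 
sketch using only the JDK. It is not Nutch's signature code; 
Md5SignatureSketch and md5Hex are hypothetical names, and hashing the raw 
content bytes is an assumption made for the example.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5SignatureSketch {
    // Returns the MD5 digest of the content as 32 lowercase hex characters.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            // BigInteger(1, ...) treats the digest as unsigned; %032x keeps leading zeros.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Two documents with identical bytes get identical signatures, which is the 
property duplicate removal relies on.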

 Would like to add a field to the document class for its MD5 signature 
 --

 Key: NUTCH-476
 URL: https://issues.apache.org/jira/browse/NUTCH-476
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
 Environment: all
Reporter: Linh Pham
Priority: Minor

 During indexing a file, if an MD5 signature was calculated and stored along 
 with the document  as a default,
 it could then be used to remove duplicates from the results on retrieval.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: how is crawl-urlfilter.txt taken care of?

2007-05-09 Thread Sami Siren

Manoharam Reddy wrote:
 I find four url-filters
 
 automaton-urlfilter.txt
 regex-urlfilter.txt
 suffix-urlfilter.txt
 crawl-urlfilter.txt
 
 I can see plugins for the first 3 in the nutch-site.xml file but not for
 the 4th one. So, how is crawl-urlfilter.txt considered by Nutch?

This question is more suitable for the user list.

crawl-urlfilter is used by the crawl command by default (see crawl-tool.xml)

--
 Sami Siren

[jira] Resolved: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-09 Thread Andrzej Bialecki (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-443.
-

Resolution: Fixed

Committed in rev. 536606. Big thanks to all who contributed to this patch!

 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
 Assigned To: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
 NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
 NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
 NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, 
 NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, 
 parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff


 allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
 can return multiple parse objects, that will all be indexed separately. 
 Advantage: no need to fetch all feed-items separately.
 see the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (NUTCH-467) DeleteDuplicate fails if Segment index directory has 0 documents

2007-05-09 Thread Andrzej Bialecki (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-467.
-

Resolution: Fixed
  Assignee: Andrzej Bialecki 

Patch applied in rev. 532105.

 DeleteDuplicate fails if Segment index directory has 0 documents
 

 Key: NUTCH-467
 URL: https://issues.apache.org/jira/browse/NUTCH-467
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.9.0
 Environment: all
Reporter: Dennis Kubes
 Assigned To: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: nutch-467.patch


 If any of the segment indexes have 0 documents, then the DDRecordReader in 
 DeleteDuplicates throws an IndexOutOfBoundsException.  The record reader 
 needs to check for empty document segment indexes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: SIGSEGV

2007-05-09 Thread Brian Whitman


On May 7, 2007, at 6:34 PM, Brian Whitman wrote:
OK. I got the crash again today on different urls. It's strange  
because I've been crawling quite regularly with the same nutch  
setup for a while. It's possible that a recent system-level change  
is getting in the way (I'm running debian with a recent full upgrade.)


After googling the culprit for a while I found this trick:

-Djava.net.preferIPv4Stack=true

I'm running a large crawl with it now and will let you know if I  
don't see it in a while!


Just a note I've crawled 500K pages over a couple of days on the same  
start URL set that has been crashing it without a problem after  
adding that flag in bin/nutch.


So if anyone else gets the segfault it might be that.
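
To confirm that the flag actually reached the JVM, a small check along these 
lines may help. It is only a sketch: Ipv4FlagCheck is a hypothetical helper, 
not part of Nutch, and it reads the standard java.net.preferIPv4Stack system 
property.

```java
final class Ipv4FlagCheck {
    // True only when the JVM was started with -Djava.net.preferIPv4Stack=true
    // or the property was otherwise set to "true".
    static boolean ipv4StackPreferred() {
        return Boolean.getBoolean("java.net.preferIPv4Stack");
    }
}
```

The networking classes read the property when they are first loaded, so 
passing it on the command line, as done in bin/nutch above, is more reliable 
than setting it from code later on.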

-Brian

Re: svn commit: r536606 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/metadata/ src/java/org/apache/nutch/parse/ src/java/org/apache/nutch/util/ src/plugin/

2007-05-09 Thread Sami Siren

[EMAIL PROTECTED] wrote:
 Author: ab
 Date: Wed May  9 11:00:56 2007
 New Revision: 536606
 
 URL: http://svn.apache.org/viewvc?view=rev&rev=536606
 Log:
 NUTCH-443 - Allow parsers to return multiple Parse objects.

Did you forget to add something (ParseResult), or is it just me?

-- 
 Sami Siren

[jira] Closed: (NUTCH-418) Fixes parsing of XHTML (e.g. title)

2007-05-09 Thread Andrzej Bialecki (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-418.
---

   Resolution: Fixed
Fix Version/s: 0.9.0

Already applied.

 Fixes parsing of XHTML (e.g. title)
 ---

 Key: NUTCH-418
 URL: https://issues.apache.org/jira/browse/NUTCH-418
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.2
 Environment: Ubuntu Linux
Reporter: Michael Wechner
 Fix For: 0.9.0

 Attachments: parse-xhtml-patch.txt


 Fixes parsing of XHTML (e.g. title)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Closed: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.

2007-05-09 Thread Andrzej Bialecki (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-417.
---

   Resolution: Fixed
Fix Version/s: 0.9.0
 Assignee: Andrzej Bialecki 

Fixed as a part of upgrade to Hadoop 0.12.2

 After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.
 -

 Key: NUTCH-417
 URL: https://issues.apache.org/jira/browse/NUTCH-417
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Doğacan Güney
 Assigned To: Andrzej Bialecki 
 Fix For: 0.9.0

 Attachments: index.patch


 If you parse while fetching then it is fine, but if you run parse as a 
 different job, it creates an essentially empty parse_data directory(which has 
 index files, but doesn't have data files). I am not sure why this is 
 happening.
 Also, indexing fails at Indexer.OutputFormat.getRecordWriter. The parameter 
 fs seems to be an instance of PhasedFileSystem which throws exceptions on 
 delete and {start,complete}LocalOutput.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters

2007-05-09 Thread Andrzej Bialecki (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494552
 ] 

Andrzej Bialecki  commented on NUTCH-393:
-

I agree with that - either all filters should run or the document should be 
discarded. If it's acceptable to tolerate exceptions in some indexing filters, 
such exceptions should be caught there.
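
The "all filters run or the document is discarded" rule can be sketched as 
below. This is an illustration only, not the code committed for this issue: 
IndexingChainSketch and runFilters are hypothetical names, and the document 
type is left generic.

```java
import java.util.List;
import java.util.function.UnaryOperator;

final class IndexingChainSketch {
    // Applies every filter in order. A filter that returns null discards the
    // document, and the caller is expected to skip indexing it.
    static <D> D runFilters(D doc, List<UnaryOperator<D>> filters) {
        for (UnaryOperator<D> filter : filters) {
            doc = filter.apply(doc);
            if (doc == null) {
                return null; // discarded: do not run the remaining filters
            }
        }
        return doc;
    }
}
```

A filter that wants to tolerate its own failures would catch them inside its 
apply method and return the document unchanged, as the comment suggests.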

 Indexer doesn't handle null documents returned by filters
 -

 Key: NUTCH-393
 URL: https://issues.apache.org/jira/browse/NUTCH-393
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.8.1, 0.9.0
Reporter: Eelco Lempsink
 Attachments: NUTCH-393.patch


 Plugins (like IndexingFilter) may return a null value, but this isn't handled 
 by the Indexer.  A trivial adjustment is all it takes:
 @@ -237,6 +237,7 @@
if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
return;
  }
 +if (doc == null) return;
  
  float boost = 1.0f;
  // run scoring filters

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: svn commit: r536606 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/metadata/ src/java/org/apache/nutch/parse/ src/java/org/apache/nutch/util/ src/plugin/

2007-05-09 Thread Andrzej Bialecki


Sami Siren wrote:

[EMAIL PROTECTED] wrote:

Author: ab
Date: Wed May  9 11:00:56 2007
New Revision: 536606

URL: http://svn.apache.org/viewvc?view=rev&rev=536606
Log:
NUTCH-443 - Allow parsers to return multiple Parse objects.


Did you forget to add something (ParseResult), or is it just me?


Indeed. Thanks for spotting this - it's fixed.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

[jira] Resolved: (NUTCH-393) Indexer doesn't handle null documents returned by filters

2007-05-09 Thread Andrzej Bialecki (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-393.
-

   Resolution: Fixed
Fix Version/s: 1.0.0
 Assignee: Andrzej Bialecki 

Both places (Indexer and IndexingFilters) fixed in rev. 536629, plus some 
javadoc clarification has been added. Thank you!

 Indexer doesn't handle null documents returned by filters
 -

 Key: NUTCH-393
 URL: https://issues.apache.org/jira/browse/NUTCH-393
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.8.1, 0.9.0
Reporter: Eelco Lempsink
 Assigned To: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: NUTCH-393.patch


 Plugins (like IndexingFilter) may return a null value, but this isn't handled 
 by the Indexer.  A trivial adjustment is all it takes:
 @@ -237,6 +237,7 @@
if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
return;
  }
 +if (doc == null) return;
  
  float boost = 1.0f;
  // run scoring filters

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Recrawl help

2007-05-09 Thread karthik085


Hi,

I crawled a website. Around 500 out of 5000 pages generated
errors/exceptions. I would like to recrawl only these 500 pages. The errors
appear to be something similar to this:

Segment#1: 0 errors
Segment#2: 120 errors
Segment#3: 10 errors
Segment#4: 370 errors
Segment#5: 0 errors

Q1: To recrawl the 500 URLs, do I just have to recrawl all the URLs in
Segment#2, #3 and #4? How do I do this?

Q2: Say Segment#3 has around 1000 URLs and only 10 of them generated errors. If
I ask Nutch to recrawl the same segment, will it simply recrawl all the URLs?
That would be inefficient. Does it have a way to check whether the content was
modified, for example via the Last-Modified HTTP header? Does anybody have
suggestions on how to get around this problem?

Thanks,
Karthik
-- 
View this message in context: 
http://www.nabble.com/Recrawl-help-tf3717887.html#a10401361
Sent from the Nutch - Dev mailing list archive at Nabble.com.

[jira] Commented: (NUTCH-479) Support for OR queries

2007-05-09 Thread Andrzej Bialecki (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494582
 ] 

Andrzej Bialecki  commented on NUTCH-479:
-

Correct - the only syntax element added in this patch is an OR clause. Nested 
queries like that are probably not high on the priority list, because they may 
be expensive to run, and they would also complicate the implementation of 
QueryFilter plugins. Anyway, improvements are welcome ;)

 Support for OR queries
 --

 Key: NUTCH-479
 URL: https://issues.apache.org/jira/browse/NUTCH-479
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
 Assigned To: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: or.patch


 There have been many requests from users to extend Nutch query syntax to add 
 support for OR queries, in addition to the implicit AND and NOT queries 
 supported now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
