Build failed in Hudson: Nutch-Nightly #80
See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/80/ -- started Checking out http://svn.apache.org/repos/asf/lucene/nutch/trunk A NOTICE.txt A default.properties A LICENSE.txt A contrib A contrib/web2 A contrib/web2/plugins A contrib/web2/plugins/web-keymatch A contrib/web2/plugins/web-keymatch/lib A contrib/web2/plugins/web-keymatch/src A contrib/web2/plugins/web-keymatch/src/test A contrib/web2/plugins/web-keymatch/src/test/org A contrib/web2/plugins/web-keymatch/src/test/org/apache A contrib/web2/plugins/web-keymatch/src/test/org/apache/nutch A contrib/web2/plugins/web-keymatch/src/test/org/apache/nutch/keymatch A contrib/web2/plugins/web-keymatch/src/test/org/apache/nutch/keymatch/TestViewCountSorter.java A contrib/web2/plugins/web-keymatch/src/test/org/apache/nutch/keymatch/TestSimpleKeyMatcher.java A contrib/web2/plugins/web-keymatch/src/java A contrib/web2/plugins/web-keymatch/src/java/org A contrib/web2/plugins/web-keymatch/src/java/org/apache A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/ViewCountSorter.java A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/KeyMatch.java A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/SimpleKeyMatcher.java A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/AbstractFilter.java A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/KeyMatchFilter.java A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/CountFilter.java A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/keymatch/package.html A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/webapp A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/webapp/controller A contrib/web2/plugins/web-keymatch/src/java/org/apache/nutch/webapp/controller/KeyMatchController.java A 
contrib/web2/plugins/web-keymatch/src/conf A contrib/web2/plugins/web-keymatch/src/conf/tiles-defs.xml A contrib/web2/plugins/web-keymatch/src/resources A contrib/web2/plugins/web-keymatch/src/web A contrib/web2/plugins/web-keymatch/src/web/web-keymatch A contrib/web2/plugins/web-keymatch/src/web/web-keymatch/keymatch.jsp A contrib/web2/plugins/web-keymatch/README.txt A contrib/web2/plugins/web-keymatch/keymatches.xml A contrib/web2/plugins/web-keymatch/plugin.xml A contrib/web2/plugins/web-keymatch/build.xml A contrib/web2/plugins/web-query-propose-spellcheck A contrib/web2/plugins/web-query-propose-spellcheck/src A contrib/web2/plugins/web-query-propose-spellcheck/src/test A contrib/web2/plugins/web-query-propose-spellcheck/src/java A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/spell A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/spell/SpellCheckerTerms.java A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/spell/SpellCheckerBean.java A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/spell/NGramSpeller.java A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/spell/SpellCheckerTerm.java A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/webapp A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/webapp/controller A contrib/web2/plugins/web-query-propose-spellcheck/src/java/org/apache/nutch/webapp/controller/SpellCheckController.java A contrib/web2/plugins/web-query-propose-spellcheck/src/conf A contrib/web2/plugins/web-query-propose-spellcheck/src/conf/tiles-defs.xml A contrib/web2/plugins/web-query-propose-spellcheck/src/resources A 
contrib/web2/plugins/web-query-propose-spellcheck/src/web A contrib/web2/plugins/web-query-propose-spellcheck/src/web/web-query-propose-spellcheck A contrib/web2/plugins/web-query-propose-spellcheck/src/web/web-query-propose-spellcheck/propose.jsp A contrib/web2/plugins/web-query-propose-spellcheck/plugin.xml A contrib/web2/plugins/web-query-propose-spellcheck/build.xml A contrib/web2/plugins/web-subcollection A
[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-443: Attachment: NUTCH-443.08052007.patch Patch updated to latest trunk. allow parsers to return multiple Parse object, this will speed up the rss parser Key: NUTCH-443 URL: https://issues.apache.org/jira/browse/NUTCH-443 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.9.0 Reporter: Renaud Richardet Assigned To: Chris A. Mattmann Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately. see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
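The idea in NUTCH-443 (one fetched feed yielding several separately indexable parses, keyed by URL) can be sketched roughly as follows. This is only an illustration, not the attached patch: SimpleParse, FeedItem and MultiParseSketch are hypothetical stand-ins and are not Nutch's Parse or Parser types.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a parse result; NOT Nutch's Parse interface. */
final class SimpleParse {
    final String title;
    final String text;

    SimpleParse(String title, String text) {
        this.title = title;
        this.text = text;
    }
}

/** Hypothetical feed entry that was already extracted from an RSS document. */
final class FeedItem {
    final String url;
    final String title;
    final String text;

    FeedItem(String url, String title, String text) {
        this.url = url;
        this.title = title;
        this.text = text;
    }
}

final class MultiParseSketch {
    /**
     * Returns one parse per feed item, keyed by the item's URL, so that each
     * item can be indexed on its own without being fetched separately.
     */
    static Map<String, SimpleParse> parseFeed(List<FeedItem> items) {
        Map<String, SimpleParse> parses = new LinkedHashMap<>();
        for (FeedItem item : items) {
            parses.put(item.url, new SimpleParse(item.title, item.text));
        }
        return parses;
    }
}
```

A caller would iterate over the returned map and hand each entry to the indexer as if it were an independently fetched page.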
[jira] Commented: (NUTCH-470) Adding optional terms to a query
[ https://issues.apache.org/jira/browse/NUTCH-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494496 ] Ronny Næss commented on NUTCH-470: -- Hi, Trond. What does "optional" mean here? I would like more Lucene-based queries, with the possibility of queries like fieldname1:term1 fieldname2:term2 ... (see http://lucene.apache.org/java/docs/queryparsersyntax.html). Is that what this is? Adding optional terms to a query Key: NUTCH-470 URL: https://issues.apache.org/jira/browse/NUTCH-470 Project: Nutch Issue Type: Wish Components: searcher Affects Versions: 0.9.0 Environment: Any Reporter: Trond Andersen Priority: Minor Attachments: optional.patch I'm missing API to add optional terms in the query class. Made a small adjustment to the API to support this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9
Hi, It's been a couple of weeks since I uploaded my patches to make the GeoPosition plugin work on nutch 0.9. I'm wondering whether there's something I can do to help the process along to get these changes accepted - or whether there was a problem with the code? Thanks, - Mike Schwartz At 01:15 PM 4/24/2007, Mike Schwartz (JIRA) wrote: [ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Schwartz updated NUTCH-469: Attachment: geoPosition0.6_cdiff.zip I've attached the context diff from geoPosition 0.5 that I'm calling geoPosition 0.6, which makes it work with nutch 0.9. changes to geoPosition plugin to make it work on nutch 0.9 -- Key: NUTCH-469 URL: https://issues.apache.org/jira/browse/NUTCH-469 Project: Nutch Issue Type: Improvement Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Mike Schwartz Fix For: 0.7.3 Attachments: geoPosition0.6_cdiff.zip I have modified the geoPosition plugin (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9. (The code was built originally using nutch 0.7.) I'd like to contribute my changes back to the nutch project. I already communicated with the code's author (Matthias Jaekle), and he agrees with my mods. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-470) Adding optional terms to a query
[ https://issues.apache.org/jira/browse/NUTCH-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494499 ] Trond Andersen commented on NUTCH-470: -- The reason for this patch is that I don't know the whole query at once and would like to add more elements to the Query object as I explore relevant search terms. The practical result is that if I create a Query object with java as a term, I can then add weblogic to it. With this patch, the toString() method returns java weblogic as the string representation of the Query. I don't think this is equivalent to the Lucene search terms. Adding optional terms to a query Key: NUTCH-470 URL: https://issues.apache.org/jira/browse/NUTCH-470 Project: Nutch Issue Type: Wish Components: searcher Affects Versions: 0.9.0 Environment: Any Reporter: Trond Andersen Priority: Minor Attachments: optional.patch I'm missing API to add optional terms in the query class. Made a small adjustment to the API to support this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
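The behaviour described in this comment (building a query up term by term, with toString() reflecting the terms added so far) might look roughly like the sketch below. IncrementalQuery and addOptionalTerm are hypothetical names chosen for illustration; they are not taken from optional.patch or from Nutch's Query class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical query holder; NOT Nutch's org.apache.nutch.searcher.Query. */
final class IncrementalQuery {
    private final List<String> terms = new ArrayList<>();

    /** Adds one more term after the query object has been created. */
    IncrementalQuery addOptionalTerm(String term) {
        terms.add(term);
        return this;
    }

    /** Space-separated terms in insertion order, e.g. "java weblogic". */
    @Override
    public String toString() {
        return String.join(" ", terms);
    }
}
```

Returning this from addOptionalTerm is just a convenience so that calls can be chained while search terms are being explored.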
how is crawl-urlfilter.txt taken care of?
I find four url-filters: automaton-urlfilter.txt, regex-urlfilter.txt, suffix-urlfilter.txt and crawl-urlfilter.txt. I can see plugins for the first 3 in the nutch-site.xml file but not for the 4th one. So, how is crawl-urlfilter.txt considered by Nutch?
[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9
[ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-469: - Attachment: NUTCH-469-2007-05-09.txt.gz Thanks for putting this together. I briefly checked through the .gz and patch: - please use diffs against trunk in future, they're easier to check (svn diff file) - there are no junit tests at all; however, there is a tiny piece of test code in class GeoIndexingFilter, and at least this code could perhaps be moved to a junit test class - I replaced System.out.prints with logging statements - I changed some formatting - would it make sense to move the zip folder from conf to under the plugin's src/java and change the load mechanism to use the (context) class loader, as I believe they are quite static pieces of information once generated? I am attaching the patch as it is now. changes to geoPosition plugin to make it work on nutch 0.9 -- Key: NUTCH-469 URL: https://issues.apache.org/jira/browse/NUTCH-469 Project: Nutch Issue Type: Improvement Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Mike Schwartz Fix For: 0.7.3 Attachments: geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip, NUTCH-469-2007-05-09.txt.gz I have modified the geoPosition plugin (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9. (The code was built originally using nutch 0.7.) I'd like to contribute my changes back to the nutch project. I already communicated with the code's author (Matthias Jaekle), and he agrees with my mods. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9
[ https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sami Siren updated NUTCH-469: - Fix Version/s: (was: 0.7.3) 1.0.0 changes to geoPosition plugin to make it work on nutch 0.9 -- Key: NUTCH-469 URL: https://issues.apache.org/jira/browse/NUTCH-469 Project: Nutch Issue Type: Improvement Components: indexer, searcher Affects Versions: 0.9.0 Reporter: Mike Schwartz Fix For: 1.0.0 Attachments: geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip, NUTCH-469-2007-05-09.txt.gz I have modified the geoPosition plugin (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9. (The code was built originally using nutch 0.7.) I'd like to contribute my changes back to the nutch project. I already communicated with the code's author (Matthias Jaekle), and he agrees with my mods. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494531 ] Sami Siren commented on NUTCH-477: -- I don't feel strongly about this but could enums be used instead of static Strings/ints because it gives us typesafety? +1 Extend URLFilters to support different filtering chains --- Key: NUTCH-477 URL: https://issues.apache.org/jira/browse/NUTCH-477 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assigned To: Andrzej Bialecki Fix For: 1.0.0 Attachments: urlfilters.patch I propose to make the following changes to URLFilters: * extend URLFilters so that they support different filtering rules depending on the context where they are executed. This functionality mirrors the one that URLNormalizers already support. * change their return value to an int code, in order to support early termination of long filtering chains. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
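To illustrate the suggestion of enums instead of int codes, together with the early termination mentioned in the issue, here is a minimal sketch. FilterResult, ContextUrlFilter and FilterChainSketch are hypothetical names and do not come from urlfilters.patch; the per-context selection of chains is left out for brevity.

```java
import java.util.List;

/** Hypothetical result codes; an enum gives type safety over raw ints. */
enum FilterResult {
    ACCEPT,          // keep the URL and continue with the next filter
    REJECT,          // drop the URL immediately
    ACCEPT_AND_STOP  // keep the URL and skip the rest of the chain
}

/** Hypothetical filter interface; NOT Nutch's URLFilter. */
interface ContextUrlFilter {
    FilterResult filter(String url);
}

final class FilterChainSketch {
    /** Returns the URL if the chain accepts it, or null if any filter rejects it. */
    static String run(List<ContextUrlFilter> chain, String url) {
        for (ContextUrlFilter f : chain) {
            FilterResult result = f.filter(url);
            if (result == FilterResult.REJECT) {
                return null;
            }
            if (result == FilterResult.ACCEPT_AND_STOP) {
                return url; // early termination of a long chain
            }
        }
        return url;
    }
}
```

With an enum, a filter cannot return an undefined code, and the compiler can warn about unhandled cases in a switch, which is the type-safety benefit raised in the comment.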
[jira] Commented: (NUTCH-472) NullPointerException in ZipTextExtractor if no MIME type for zipped file
[ https://issues.apache.org/jira/browse/NUTCH-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494534 ] Sami Siren commented on NUTCH-472: -- have a patch? NullPointerException in ZipTextExtractor if no MIME type for zipped file Key: NUTCH-472 URL: https://issues.apache.org/jira/browse/NUTCH-472 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.9.0 Environment: Any Reporter: Antony Bowesman extractText throws a NPE in String contentType = MIME.getMimeType(fname).getName(); if the file in the zip has no configured mime type which breaks the parsing of the zip. Code should do:

    public String extractText(InputStream input, String url, List outLinksList) throws IOException {
      String resultText = "";
      byte temp;
      ZipInputStream zin = new ZipInputStream(input);
      ZipEntry entry;
      while ((entry = zin.getNextEntry()) != null) {
        if (!entry.isDirectory()) {
          int size = (int) entry.getSize();
          byte[] b = new byte[size];
          for (int x = 0; x < size; x++) {
            int err = zin.read();
            if (err != -1) {
              b[x] = (byte) err;
            }
          }
          String newurl = url + "/";
          String fname = entry.getName();
          newurl += fname;
          URL aURL = new URL(newurl);
          String base = aURL.toString();
          int i = fname.lastIndexOf('.');
          if (i != -1) {
            // Trying to resolve the Mime-Type
            MimeType mt = MIME.getMimeType(fname);
            // Only parse the entry when a MIME type is configured for it
            if (mt != null) {
              String contentType = mt.getName();
              try {
                Metadata metadata = new Metadata();
                metadata.set(Response.CONTENT_LENGTH, Long.toString(entry.getSize()));
                metadata.set(Response.CONTENT_TYPE, contentType);
                Content content = new Content(newurl, base, b, contentType, metadata, this.conf);
                Parse parse = new ParseUtil(this.conf).parse(content);
                ParseData theParseData = parse.getData();
                Outlink[] theOutlinks = theParseData.getOutlinks();
                for (int count = 0; count < theOutlinks.length; count++) {
                  outLinksList.add(new Outlink(theOutlinks[count].getToUrl(), theOutlinks[count].getAnchor(), this.conf));
                }
                resultText += entry.getName() + " " + parse.getText() + " ";
              } catch (ParseException e) {
                if (LOG.isInfoEnabled()) {
                  LOG.info("fetch okay, but can't parse " + fname + ", reason: " + e.getMessage());
                }
              }
            } else {
              // No MIME type known: fall back to recording just the entry name
              resultText += entry.getName();
            }
          }
        }
      }
      return resultText;
    }

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-476) Would like to add a field to the document class for its MD5 signature
[ https://issues.apache.org/jira/browse/NUTCH-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494537 ] Sami Siren commented on NUTCH-476: -- md5 sum (or any other configurable digest) is already calculated in fetcher or parser and dedup can be used to remove duplicates. Would like to add a field to the document class for its MD5 signature -- Key: NUTCH-476 URL: https://issues.apache.org/jira/browse/NUTCH-476 Project: Nutch Issue Type: Improvement Components: indexer Environment: all Reporter: Linh Pham Priority: Minor During indexing a file, if an MD5 signature was calculated and stored along with the document as a default, it could then be used to remove duplicates from the results on retrieval. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
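For readers unfamiliar with signature-based deduplication, the general technique can be sketched with the JDK's MessageDigest as below. This is not Nutch's own signature or dedup code; SignatureSketch, md5Hex and dedup are hypothetical names used only for illustration.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SignatureSketch {
    /** Returns the MD5 digest of the content as a 32-character lowercase hex string. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            // %032x keeps leading zeros that BigInteger would otherwise drop
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Keeps the first URL seen for each distinct content signature.
     * Iteration order follows the map passed in.
     */
    static List<String> dedup(Map<String, byte[]> pagesByUrl) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, byte[]> page : pagesByUrl.entrySet()) {
            if (seen.add(md5Hex(page.getValue()))) {
                kept.add(page.getKey());
            }
        }
        return kept;
    }
}
```

Because identical bytes always produce the same digest, two URLs serving the same content collapse to a single kept entry.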
Re: how is crawl-urlfilter.txt taken care of?
Manoharam Reddy wrote: I find four url-filters: automaton-urlfilter.txt, regex-urlfilter.txt, suffix-urlfilter.txt and crawl-urlfilter.txt. I can see plugins for the first 3 in the nutch-site.xml file but not for the 4th one. So, how is crawl-urlfilter.txt considered by Nutch? This question is more suitable for the user list. crawl-urlfilter is used by the crawl command by default (see crawl-tool.xml) -- Sami Siren
[jira] Resolved: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-443. - Resolution: Fixed Committed in rev. 536606. Big thanks to all who contributed to this patch! allow parsers to return multiple Parse object, this will speed up the rss parser Key: NUTCH-443 URL: https://issues.apache.org/jira/browse/NUTCH-443 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.9.0 Reporter: Renaud Richardet Assigned To: Andrzej Bialecki Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately. see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-467) DeleteDuplicate fails if Segment index directory has 0 documents
[ https://issues.apache.org/jira/browse/NUTCH-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-467. - Resolution: Fixed Assignee: Andrzej Bialecki Patch applied in rev. 532105. DeleteDuplicate fails if Segment index directory has 0 documents Key: NUTCH-467 URL: https://issues.apache.org/jira/browse/NUTCH-467 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.9.0 Environment: all Reporter: Dennis Kubes Assigned To: Andrzej Bialecki Fix For: 1.0.0 Attachments: nutch-467.patch If any of the segment indexes have 0 documents, then the DDRecordReader in DeleteDuplicates throws an IndexOutOfBoundsException. The record reader needs to check for empty document segment indexes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: SIGSEGV
On May 7, 2007, at 6:34 PM, Brian Whitman wrote: OK. I got the crash again today on different urls. It's strange because I've been crawling quite regularly with the same nutch setup for a while. It's possible that a recent system-level change is getting in the way (I'm running debian with a recent full upgrade.) After googling the culprit for a while I found this trick: -Djava.net.preferIPv4Stack=true I'm running a large crawl with it now and will let you know if I don't see it in a while! Just a note I've crawled 500K pages over a couple of days on the same start URL set that has been crashing it without a problem after adding that flag in bin/nutch. So if anyone else gets the segfault it might be that. -Brian
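For anyone trying the same workaround, a small check like the one below can confirm that the flag actually reached the JVM. It is only a sketch: Ipv4PreferenceCheck is a hypothetical helper class, and it relies on nothing beyond the standard java.net.preferIPv4Stack system property named above.

```java
final class Ipv4PreferenceCheck {
    /** True when the JVM was started with -Djava.net.preferIPv4Stack=true. */
    static boolean prefersIpv4Stack() {
        // Boolean.getBoolean reads the named system property and compares it to "true"
        return Boolean.getBoolean("java.net.preferIPv4Stack");
    }

    public static void main(String[] args) {
        System.out.println("java.net.preferIPv4Stack=" + prefersIpv4Stack());
    }
}
```

The property is best passed on the java command line (as done here in bin/nutch) rather than set from code, since the networking classes may read it before application code runs.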
Re: svn commit: r536606 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/metadata/ src/java/org/apache/nutch/parse/ src/java/org/apache/nutch/util/ src/plugin/
[EMAIL PROTECTED] wrote: Author: ab Date: Wed May 9 11:00:56 2007 New Revision: 536606 URL: http://svn.apache.org/viewvc?view=rev&rev=536606 Log: NUTCH-443 - Allow parsers to return multiple Parse objects. Did you forget to add something (ParseResult) or is it just me? -- Sami Siren
[jira] Closed: (NUTCH-418) Fixes parsing of XHTML (e.g. title)
[ https://issues.apache.org/jira/browse/NUTCH-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-418. --- Resolution: Fixed Fix Version/s: 0.9.0 Already applied. Fixes parsing of XHTML (e.g. title) --- Key: NUTCH-418 URL: https://issues.apache.org/jira/browse/NUTCH-418 Project: Nutch Issue Type: Bug Affects Versions: 0.8.2 Environment: Ubuntu Linux Reporter: Michael Wechner Fix For: 0.9.0 Attachments: parse-xhtml-patch.txt Fixes parsing of XHTML (e.g. title) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.
[ https://issues.apache.org/jira/browse/NUTCH-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-417. --- Resolution: Fixed Fix Version/s: 0.9.0 Assignee: Andrzej Bialecki Fixed as a part of upgrade to Hadoop 0.12.2 After upgrade to hadoop-0.9.1, parsing and indexing doesn't work. - Key: NUTCH-417 URL: https://issues.apache.org/jira/browse/NUTCH-417 Project: Nutch Issue Type: Bug Affects Versions: 0.9.0 Reporter: Doğacan Güney Assigned To: Andrzej Bialecki Fix For: 0.9.0 Attachments: index.patch If you parse while fetching then it is fine, but if you run parse as a different job, it creates an essentially empty parse_data directory(which has index files, but doesn't have data files). I am not sure why this is happening. Also, indexing fails at Indexer.OutputFormat.getRecordWriter. The parameter fs seems to be an instance of PhasedFileSystem which throws exceptions on delete and {start,complete}LocalOutput. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters
[ https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494552 ] Andrzej Bialecki commented on NUTCH-393: - I agree with that - either all filters should run or the document should be discarded. If it's acceptable to tolerate exceptions in some indexing filters, such exceptions should be caught there. Indexer doesn't handle null documents returned by filters - Key: NUTCH-393 URL: https://issues.apache.org/jira/browse/NUTCH-393 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.8.1, 0.9.0 Reporter: Eelco Lempsink Attachments: NUTCH-393.patch Plugins (like IndexingFilter) may return a null value, but this isn't handled by the Indexer. A trivial adjustment is all it takes: @@ -237,6 +237,7 @@ if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); } return; } +if (doc == null) return; float boost = 1.0f; // run scoring filters -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: svn commit: r536606 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/metadata/ src/java/org/apache/nutch/parse/ src/java/org/apache/nutch/util/ src/plugin/
Sami Siren wrote: [EMAIL PROTECTED] wrote: Author: ab Date: Wed May 9 11:00:56 2007 New Revision: 536606 URL: http://svn.apache.org/viewvc?view=rev&rev=536606 Log: NUTCH-443 - Allow parsers to return multiple Parse objects. Did you forget to add something (ParseResult) or is it just me? Indeed. Thanks for spotting this - it's fixed. -- Best regards, Andrzej Bialecki Information Retrieval, Semantic Web Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
[jira] Resolved: (NUTCH-393) Indexer doesn't handle null documents returned by filters
[ https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-393. - Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Andrzej Bialecki Both places (Indexer and IndexingFilters) fixed in rev. 536629, plus some javadoc clarification has been added. Thank you! Indexer doesn't handle null documents returned by filters - Key: NUTCH-393 URL: https://issues.apache.org/jira/browse/NUTCH-393 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.8.1, 0.9.0 Reporter: Eelco Lempsink Assigned To: Andrzej Bialecki Fix For: 1.0.0 Attachments: NUTCH-393.patch Plugins (like IndexingFilter) may return a null value, but this isn't handled by the Indexer. A trivial adjustment is all it takes: @@ -237,6 +237,7 @@ if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); } return; } +if (doc == null) return; float boost = 1.0f; // run scoring filters -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
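The policy discussed in this issue (a null returned by any indexing filter means the document is discarded, and the caller must check for that) can be sketched as follows. DocFilter and IndexingChainSketch are hypothetical names, and a plain Map stands in for the document; this is not the code committed in rev. 536629.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical filter interface; NOT Nutch's IndexingFilter. Returning null discards the document. */
interface DocFilter {
    Map<String, String> filter(Map<String, String> doc);
}

final class IndexingChainSketch {
    /**
     * Runs every filter in order. As soon as one returns null the document
     * is dropped and null is returned, so later filters never see a null.
     */
    static Map<String, String> runFilters(List<DocFilter> filters, Map<String, String> doc) {
        for (DocFilter f : filters) {
            doc = f.filter(doc);
            if (doc == null) {
                return null; // caller must skip indexing this document
            }
        }
        return doc;
    }
}
```

The caller would then mirror the one-line fix from the issue: check the result for null and return before computing boosts or writing to the index.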
Recrawl help
Hi, I crawled a website. Around 500 out of 5000 pages generated errors/exceptions. I would like to recrawl only these 500 pages. The errors appear to be distributed something like this: Segment#1: 0 errors, Segment#2: 120 errors, Segment#3: 10 errors, Segment#4: 370 errors, Segment#5: 0 errors. Q1: If I want to recrawl the 500 URLs, do I just have to re-crawl all the URLs in Segment#2, #3 and #4? How do I do this? Q2: Say Segment#3 has around 1000 URLs and only 10 of them generated errors. If I ask Nutch to recrawl the same segment, will it just recrawl all the URLs? In that case, it might be inefficient. Does it have a way to check whether the content was modified, such as using the Last-Modified HTTP header? Does anybody have suggestions on how to get around this problem? Thanks, Karthik -- View this message in context: http://www.nabble.com/Recrawl-help-tf3717887.html#a10401361 Sent from the Nutch - Dev mailing list archive at Nabble.com.
[jira] Commented: (NUTCH-479) Support for OR queries
[ https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494582 ] Andrzej Bialecki commented on NUTCH-479: - Correct - the only syntax element added in this patch is an OR clause. Nested queries like that are probably not high on the priority list, because they may be expensive to run, and they would also complicate the implementation of QueryFilter plugins. Anyway, improvements are welcome ;) Support for OR queries -- Key: NUTCH-479 URL: https://issues.apache.org/jira/browse/NUTCH-479 Project: Nutch Issue Type: Improvement Components: searcher Affects Versions: 1.0.0 Reporter: Andrzej Bialecki Assigned To: Andrzej Bialecki Fix For: 1.0.0 Attachments: or.patch There have been many requests from users to extend Nutch query syntax to add support for OR queries, in addition to the implicit AND and NOT queries supported now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.