Re: IndexSorter optimizer
Doug Cutting wrote: I have committed this, along with the LuceneQueryOptimizer changes. I could only find one place where I was using numDocs() instead of maxDoc(). Right, I confused two bugs from different files. The other bug still exists in the committed version: in the LuceneQueryOptimizer.LimitedCollector constructor, it should be super(numHits) instead of super(maxHits). This was the bug that was causing the mysterious slowdown for higher values of MAX_HITS. -- Best regards, Andrzej Bialecki, http://www.sigram.com Contact: info at sigram dot com
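For reference, a sketch of the constructor being described (the surrounding collector logic, including the LimitExceeded helper, is assumed for illustration; only the super(...) argument comes from the report above):

    private static class LimitedCollector extends TopDocCollector {
      private int maxHits;

      public LimitedCollector(int numHits, int maxHits) {
        super(numHits);          // was super(maxHits): the collector then kept
        this.maxHits = maxHits;  // a MAX_HITS-sized queue instead of numHits,
      }                          // which slowed searches for large MAX_HITS

      public void collect(int doc, float score) {
        if (getTotalHits() >= maxHits)
          throw new LimitExceeded(doc);  // assumed helper for early stop
        super.collect(doc, score);
      }
    }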
NullPointerException (new as of Dec 31st)
During a fetch I have recently started getting these (pretty consistently).

  task_r_5m9ybr 0.15 reduce > copy > java.lang.NullPointerException
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:991)
    at java.lang.Float.parseFloat(Float.java:394)
    at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:84)
    at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:80)
    at org.apache.nutch.mapred.ReduceTask$2.collect(ReduceTask.java:247)
    at org.apache.nutch.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
    at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
    at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:604)

  task_r_8d8tt5 0.0 java.lang.NullPointerException
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:991)
    at java.lang.Float.parseFloat(Float.java:394)
    at ...

-- Rod Taylor <[EMAIL PROTECTED]>
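The trace shows Float.parseFloat being handed a null string inside ParseOutputFormat's writer. A defensive guard would look something like this (the "score" metadata key, the accessor, and the default value are assumptions for illustration, not the actual source at line 84):

    // Hypothetical guard around the failing parseFloat call:
    String scoreStr = parse.getData().get("score");  // may come back null
    float score = (scoreStr == null)
        ? 1.0f                         // assumed safe default
        : Float.parseFloat(scoreStr);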
[jira] Created: (NUTCH-161) Plain text parser should use parser.character.encoding.default property for fall back encoding
Plain text parser should use parser.character.encoding.default property for fall back encoding -- Key: NUTCH-161 URL: http://issues.apache.org/jira/browse/NUTCH-161 Project: Nutch Type: Bug Components: indexer Environment: any Reporter: KuroSaka TeruHiko Priority: Minor

The value of the property parser.character.encoding.default is used as a fallback character encoding (charset) when the HTML parser cannot find the charset information in the HTTP Content-Type header or in a META HTTP-EQUIV tag. But the plain text parser behaves differently: it just uses the system encoding (the Java VM's file.encoding, which in turn derives from the OS and the locale of the environment from which the JVM was spawned). This is not pretty. To guarantee consistent behavior, the plain text parser should use the value of the same property. Though not tested, these changes in ./src/plugin/parse-text/src/java/org/apache/nutch/parse/text/TextParser.java should do it.

Insert this statement in the class definition:

  private static String defaultCharEncoding =
    NutchConf.get().get("parser.character.encoding.default", "windows-1252");

Replace this:

  text = new String(content.getContent()); // use default encoding

with this:

  text = new String(content.getContent(), defaultCharEncoding); // use configured fallback encoding
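One note on the proposal above: String(byte[], String) throws the checked java.io.UnsupportedEncodingException, so the one-liner will not compile as-is. A compilable variant of the sketch (still untested, per the report; the decode() helper is hypothetical):

    private static final String defaultCharEncoding =
        NutchConf.get().get("parser.character.encoding.default", "windows-1252");

    private static String decode(byte[] raw) {
      try {
        return new String(raw, defaultCharEncoding);
      } catch (java.io.UnsupportedEncodingException e) {
        return new String(raw);  // fall back to the old platform-default behavior
      }
    }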
Re: IndexSorter optimizer
Andrzej Bialecki wrote: Sounds like tf/idf might be de-emphasized in scoring. Perhaps NutchSimilarity.tf() should use log() instead of sqrt() when field == content? I don't think it's that simple: the OPIC score is what determined this behaviour, and it doesn't correspond to tf/idf at all, but to a human judgement. If we think that high OPIC is more valuable than high content tf, then we should use different functions to damp them. Currently both are damped with sqrt(). I've updated the version of Lucene included with Nutch to have the required patch. Would you like me to commit IndexSorter.java, or would you rather do it? Please do it. There are two typos in your version of IndexSorter: you used numDocs() in two places instead of maxDoc(), which for indexes with deleted docs (after dedup) leads to exceptions. I have committed this, along with the LuceneQueryOptimizer changes. I could only find one place where I was using numDocs() instead of maxDoc(). Cheers, Doug
Re: IndexSorter optimizer
Doug Cutting wrote: Andrzej Bialecki wrote: Using the original index, it was possible for pages with a high tf/idf for a term, but a low "boost" value (the OPIC score), to outrank pages with a high "boost" but a lower tf/idf for the term. This phenomenon quite often leads to results that are perceived as "junk", e.g. pages with a lot of repeated terms but little other real content, such as navigation bars. Sounds like tf/idf might be de-emphasized in scoring. Perhaps NutchSimilarity.tf() should use log() instead of sqrt() when field == content? I don't think it's that simple: the OPIC score is what determined this behaviour, and it doesn't correspond to tf/idf at all, but to a human judgement. To conclude, I will add IndexSorter.java to the core classes, and I suggest we continue the experiments ... I've updated the version of Lucene included with Nutch to have the required patch. Would you like me to commit IndexSorter.java, or would you rather do it? Please do it. There are two typos in your version of IndexSorter: you used numDocs() in two places instead of maxDoc(), which for indexes with deleted docs (after dedup) leads to exceptions. -- Best regards, Andrzej Bialecki, http://www.sigram.com Contact: info at sigram dot com
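The numDocs()/maxDoc() distinction matters because Lucene document numbers run from 0 to maxDoc()-1 even when some documents are deleted, while numDocs() counts only live documents; looping up to numDocs() on an index with deletions reads the wrong range. A sketch of the safe pattern (Lucene 1.9-era API; the class, index path, and "url" field are illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    public class DumpLiveDocs {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);
        int bound = reader.maxDoc();            // includes deleted slots
        for (int i = 0; i < bound; i++) {
          if (reader.isDeleted(i)) continue;    // skip docs removed by dedup
          Document doc = reader.document(i);    // safe: i is a live doc
          System.out.println(doc.get("url"));
        }
        reader.close();
      }
    }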
Re: IndexSorter optimizer
Andrzej Bialecki wrote: Using the original index, it was possible for pages with a high tf/idf for a term, but a low "boost" value (the OPIC score), to outrank pages with a high "boost" but a lower tf/idf for the term. This phenomenon quite often leads to results that are perceived as "junk", e.g. pages with a lot of repeated terms but little other real content, such as navigation bars. Sounds like tf/idf might be de-emphasized in scoring. Perhaps NutchSimilarity.tf() should use log() instead of sqrt() when field == content? To conclude, I will add IndexSorter.java to the core classes, and I suggest we continue the experiments ... I've updated the version of Lucene included with Nutch to have the required patch. Would you like me to commit IndexSorter.java, or would you rather do it? Doug
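As far as the 1.9-era API goes, Similarity.tf(float) is not passed a field name, so damping the content field differently would need some restructuring; but the effect of the suggested change itself is easy to see (illustrative numbers only):

    // How the two damping functions treat a term repeated 100 times:
    float freq = 100f;
    float sqrtTf = (float) Math.sqrt(freq);        // 10.0  -- current damping
    float logTf  = (float) (1 + Math.log(freq));   // ~5.6  -- proposed damping
    // log() flattens heavy repetition (e.g. keyword-stuffed navigation bars)
    // much more aggressively than sqrt() does.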
[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361552 ] KuroSaka TeruHiko commented on NUTCH-138: Sorry, my oversight: useBodyEncodingForURI did not work as I expected. Setting URIEncoding is the only way. I'll write this up in the Wiki.

> non-Latin-1 characters cannot be submitted for search
>
> Key: NUTCH-138
> URL: http://issues.apache.org/jira/browse/NUTCH-138
> Project: Nutch
> Type: Bug
> Components: web gui
> Versions: 0.7.1
> Environment: Windows XP, Tomcat 5.5.12
> Reporter: KuroSaka TeruHiko
> Priority: Minor
>
> The search.html currently specifies the GET method for query submission.
> Tomcat 5.x only allows the ISO-8859-1 (aka Latin-1) code set to be submitted
> over GET, because of some restrictions of the HTML or HTTP spec they
> discovered. (If my memory is correct, non-ISO-8859-1 characters were working
> OK over GET with older versions of Tomcat, as long as setCharacterEncoding()
> was called properly.) To allow proper transmission of non-ISO-8859-1 text,
> the POST method should be used. Here's a proposed patch:
> *** search.html Tue Dec 13 15:02:15 2005
> --- search-org.html Tue Dec 13 15:02:07 2005
> [patch hunks for lines 59-65 lost: the HTML form markup was stripped in transit]
> BTW, I am aware that Nutch and Lucene won't handle non-Western languages well
> as packaged.
Re: svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src
Doug Cutting wrote: [EMAIL PROTECTED] wrote: Now users can select their own page signature implementation, possibly with better properties than the old one. Two implementations are provided:

* MD5Signature: backward-compatible with the old schema.
* TextProfileSignature: an example implementation of a signature which gives the same values for near-duplicate pages.

Please see the Javadoc for more information. This looks great! Thanks! Shouldn't this also be used in DeleteDuplicates.java? Yes, I missed that. No harm done (yet), because the two existing implementations both produce an MD5 digest, just computed differently. I'll fix it. -- Best regards, Andrzej Bialecki, http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361549 ] Piotr Kosiorowski commented on NUTCH-138: BTW - just create a user for yourself in the Nutch Wiki and you should be able to add a new page with the information without problems. Thanks for checking and documenting it.
[jira] Closed: (NUTCH-138) non-Latin-1 characters cannot be submitted for search
[ http://issues.apache.org/jira/browse/NUTCH-138?page=all ] Piotr Kosiorowski closed NUTCH-138: Resolution: Invalid. Setting URIEncoding in the Tomcat config file fixes the problem.
[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361546 ] KuroSaka TeruHiko commented on NUTCH-138: You are right. With this Tomcat config, UTF-8 characters can be passed. Setting useBodyEncodingForURI="true" on the <Connector> element in $TOMCAT/conf/server.xml also works. This is documented in: http://issues.apache.org/bugzilla/show_bug.cgi?id=29900 What I suggest is to add this note to: http://lucene.apache.org/nutch/i18n.html (which currently explains only the GUI localization issue, rather than internationalization proper), or perhaps to create a new page: http://wiki.apache.org/nutch/GettingNutchRunningUTF8Tomcat5 I am willing to write a draft if someone tells me where to submit it. Feel free to close this bug.
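For reference, the two Connector variants being discussed look like this in server.xml (the port and other attributes are illustrative, not from the thread):

    <!-- Decode request URIs as UTF-8: -->
    <Connector port="8080" URIEncoding="UTF-8" />

    <!-- Or reuse the request body's encoding for the URI (reported
         elsewhere in this thread as not working as expected): -->
    <Connector port="8080" useBodyEncodingForURI="true" />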
Re: IndexSorter optimizer
Andrzej Bialecki wrote: I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually better. Great news! I will submit the Lucene patches ASAP, now that we know they're useful. Doug
Re: [bug?] RPC called method requires parameter
Stefan Groschupf wrote: I also noted this line in Client.java:

  public Writable[] call(Writable[] params, InetSocketAddress[] addresses)
    throws IOException {
    if (params.length == 0) return new Writable[0];

Do I understand correctly that no remote call is made when the remote method does not need any parameters? Different parameters are sent to each address. So params.length should equal addresses.length, and if params.length == 0 then addresses.length == 0 and there's no call to be made. Make sense? It might be clearer if the test were changed to addresses.length == 0. Doug
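In other words, params and addresses are parallel arrays: params[i] is the argument sent to addresses[i]. A sketch of the guard with Doug's suggested clarification (only the changed test is from the thread; the rest mirrors the quoted fragment):

    public Writable[] call(Writable[] params, InetSocketAddress[] addresses)
      throws IOException {
      // params[i] is sent to addresses[i]; an empty address list therefore
      // means there is nothing to call, regardless of method arity.
      if (addresses.length == 0) return new Writable[0];
      // ... issue the parallel calls ...
    }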
Re: Bug in DeleteDuplicates.java ?
Andrzej Bialecki wrote: Gal Nitzan wrote: This function throws IOException - why?

  public long getPos() throws IOException {
    return (doc*INDEX_LENGTH)/maxDoc;
  }

It should be throwing ArithmeticException. The IOException is required by the API of RecordReader. What happens when maxDoc is zero? Ka-boom! ;-) You're right, this should be wrapped in an IOException and rethrown. No, it should really just be fixed to not cause an ArithmeticException. This is called to report progress. In this case the input "file" for the map is a Lucene index whose documents we iterate through. To simplify the construction of input splits (without opening each index), a constant "length" is used for each "file", so we have to scale the document numbers into this range. The problem is that progress may be reported even when there are no documents in the index. So the call is valid and no exception should be thrown. Doug
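A guarded version along the lines Doug describes (the names come from the quoted snippet; the zero check is the illustrative part):

    public long getPos() throws IOException {
      // Progress is the fraction of the nominal split length consumed so far.
      // An empty index (maxDoc == 0) must report 0 rather than divide by zero.
      return (maxDoc == 0) ? 0 : (doc * INDEX_LENGTH) / maxDoc;
    }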
Re: java.io.IOException: Job failed
Gal Nitzan wrote: I am using trunk. While trying to crawl I get the following:

  [...]
  050825 100222 task_m_ns3ehv Error running child
  050825 100222 task_m_ns3ehv java.lang.ArithmeticException: / by zero
  050825 100222 task_m_ns3ehv at org.apache.nutch.indexer.DeleteDuplicates$1.getPos(DeleteDuplicates.java:193)

I just fixed this. Doug
Re: svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src
[EMAIL PROTECTED] wrote: Now users can select their own page signature implementation, possibly with better properties than the old one. Two implementations are provided:

* MD5Signature: backward-compatible with the old schema.
* TextProfileSignature: an example implementation of a signature which gives the same values for near-duplicate pages.

Please see the Javadoc for more information. This looks great! Thanks! Shouldn't this also be used in DeleteDuplicates.java? Doug
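A minimal sketch of what a custom implementation looks like under this scheme (the package locations and the exact abstract method signature are assumptions based on the description; the committed Javadoc is authoritative):

    import org.apache.nutch.crawl.Signature;   // package assumed
    import org.apache.nutch.io.MD5Hash;        // location assumed
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;

    // Hypothetical Signature that hashes the raw content, like MD5Signature:
    public class MyRawSignature extends Signature {
      public byte[] calculate(Content content, Parse parse) {
        return MD5Hash.digest(content.getContent()).getDigest();
      }
    }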
[jira] Commented: (NUTCH-159) Specify temp/working directory for crawl
[ http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12361545 ] byron miller commented on NUTCH-159: While it's from the mapred trunk, it is a non-NDFS/local instance only. mapred.temp.dir was left at its default (which didn't exist):

  <property>
    <name>mapred.temp.dir</name>
    <value>/tmp/nutch/mapred/temp</value>
    <description>A shared directory for temporary files.</description>
  </property>

I'm going to modify this and re-run my fetch and let you know how that works.

> Specify temp/working directory for crawl
>
> Key: NUTCH-159
> URL: http://issues.apache.org/jira/browse/NUTCH-159
> Project: Nutch
> Type: Bug
> Components: fetcher, indexer
> Versions: 0.8-dev
> Environment: Linux/Debian
> Reporter: byron miller
>
> I ran a crawl of 100k web pages and got:
> org.apache.nutch.fs.FSError: java.io.IOException: No space left on device
>   at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:149)
>   at org.apache.nutch.fs.FileUtil.copyContents(FileUtil.java:65)
>   at org.apache.nutch.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:178)
>   at org.apache.nutch.fs.NutchFileSystem.rename(NutchFileSystem.java:224)
>   at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
> Caused by: java.io.IOException: No space left on device
>   at java.io.FileOutputStream.writeBytes(Native Method)
>   at java.io.FileOutputStream.write(FileOutputStream.java:260)
>   at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:147)
>   ... 4 more
> Exception in thread "main" java.io.IOException: Job failed!
>   at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
>   at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:335)
>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:107)
> [EMAIL PROTECTED]:/data/nutch$ df -k
> It appears the crawl created a /tmp/nutch directory that filled up, even though I
> specified a db directory. We need to add a parameter to the command line, or make
> a globally configurable /tmp (work area) for the Nutch instance, so that crawls
> won't fail.
[jira] Commented: (NUTCH-159) Specify temp/working directory for crawl
[ http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12361541 ] Doug Cutting commented on NUTCH-159: mapred.local.dir is the thing to set. If that fails, then there is a bug. What did you have this set to?
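A minimal override along those lines in nutch-site.xml (the path is illustrative, not from the thread):

    <property>
      <name>mapred.local.dir</name>
      <value>/data/nutch/mapred/local</value>
      <description>Directory for map/reduce working files, kept off /tmp.</description>
    </property>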
Re: Trunk is broken
Hi Andrzej, Gal Nitzan wrote: > It seems that Trunk is now broken... > DmozParser seems to be broken, too. Its package declaration is still org.apache.nutch.crawl instead of org.apache.nutch.tools. TJ
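The implied one-line fix (the file path is inferred from the package names):

    // src/java/org/apache/nutch/tools/DmozParser.java
    - package org.apache.nutch.crawl;
    + package org.apache.nutch.tools;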
Re: Mega-cleanup in trunk/
Piotr Kosiorowski wrote: Andrzej Bialecki wrote: Hi, I just committed a large patch to clean trunk/ of the obsolete and broken classes remaining from the 0.7.x development line. Please test that things still work as they should ... Hi, I am not sure what is wrong, but a lot of the JUnit tests simply do not compile - I did an svn checkout to a new directory to be sure I didn't have anything left over from my experiments. Yes, you are right - I would welcome any help, I'm a bit tight on time... I am looking at it right now, but I would suggest a quick temporary cleanup to make trunk testable: Agreed. 3) Remove the unused import in: src/test/org/apache/nutch/parse/TestParseText.java Ok. 4) Fix (as it looks simple to fix - I will look at it in the meantime):

  src/plugin/parse-msword/src/test/org/apache/nutch/parse/msword/TestMSWordParser.java
  src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip/TestZipParser.java
  src/plugin/parse-rss/src/test/org/apache/nutch/parse/rss/TestRSSParser.java
  src/plugin/parse-pdf/src/test/org/apache/nutch/parse/pdf/TestPdfParser.java
  src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
  src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/TestMSPowerPointParser.java
  src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/AllTests.java

Yes, they are just one-line fixes: I removed the getProtocolContent(urlString) methods, so you need to replace them with getProtocolContent(new UTF8(urlString), new CrawlDatum()). After removal of all these non-compiling classes, the trunk tests complete successfully on my machine (JDK 1.4.2). If no objections are raised - especially from Andrzej - I can do the cleanup tomorrow. Your help would be most welcome, no objections here. -- Best regards, Andrzej Bialecki, http://www.sigram.com Contact: info at sigram dot com
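Per Andrzej's note, each listed test needs a change of this shape (the method name is from the message; "protocol" and "urlString" are assumed local names in the tests):

    // was: Content content = protocol.getProtocolContent(urlString);
    Content content = protocol.getProtocolContent(
        new UTF8(urlString), new CrawlDatum());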
[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search
[ http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361520 ] Piotr Kosiorowski commented on NUTCH-138: I am not sure, but I would suspect this is a problem of bad Tomcat configuration. To handle special characters in query URLs one has to change the default Tomcat configuration - specifically, the URIEncoding attribute should be set to UTF-8. See: http://tomcat.apache.org/faq/connectors.html#utf8 Please check if this helps in your particular case so we can close the issue.