Re: IndexSorter optimizer

2006-01-02 Thread Andrzej Bialecki

Doug Cutting wrote:


I have committed this, along with the LuceneQueryOptimizer changes.

I could only find one place where I was using numDocs() instead of 
maxDoc().



Right, I confused two bugs from different files - the other bug still 
exists in the committed version of the 
LuceneQueryOptimizer.LimitedCollector constructor: instead of 
super(maxHits) it should be super(numHits). This was actually the bug 
that was causing that mysterious slowdown for higher values of MAX_HITS.
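
For reference, the intended constructor would look roughly like this (a 
sketch only - it assumes LimitedCollector extends Lucene's TopDocCollector 
and keeps maxHits around for its own limit check):

  private static class LimitedCollector extends TopDocCollector {
    private int maxHits;

    public LimitedCollector(int numHits, int maxHits) {
      super(numHits);          // size the hit queue by numHits, not maxHits
      this.maxHits = maxHits;  // only used to detect when the limit is exceeded
    }
  }

Passing maxHits to super() sizes the collector's hit queue by MAX_HITS 
instead, which would explain why the slowdown grew with that setting.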


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




NullPointerException (new as of Dec 31st)

2006-01-02 Thread Rod Taylor
During a fetch I have recently started getting these (pretty
consistently).

   task_r_5m9ybr 0.15 reduce > copy > java.lang.NullPointerException
     at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:991)
     at java.lang.Float.parseFloat(Float.java:394)
     at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:84)
     at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:80)
     at org.apache.nutch.mapred.ReduceTask$2.collect(ReduceTask.java:247)
     at org.apache.nutch.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:41)
     at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
     at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:604)

   task_r_8d8tt5 0.0 java.lang.NullPointerException
     at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:991)
     at java.lang.Float.parseFloat(Float.java:394)
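
Looking at the trace, Float.parseFloat is apparently being handed a null 
string at ParseOutputFormat.java:84. A null-safe guard along these lines 
would avoid the NPE (a sketch only; the "score" key, the Properties type 
and the fallback value are illustrative stand-ins, not the actual code):

  import java.util.Properties;

  public class ScoreParseGuard {
    static float readScore(Properties metadata) {
      String s = metadata.getProperty("score"); // may be null when the entry is missing
      if (s == null) return 1.0f;               // neutral fallback instead of an NPE
      try {
        return Float.parseFloat(s);
      } catch (NumberFormatException e) {
        return 1.0f;                            // malformed value: keep the fallback
      }
    }
  }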
-- 
Rod Taylor <[EMAIL PROTECTED]>



[jira] Created: (NUTCH-161) Plain text parser should use parser.character.encoding.default property for fall back encoding

2006-01-02 Thread KuroSaka TeruHiko (JIRA)
Plain text parser should use parser.character.encoding.default property for 
fall back encoding
--

 Key: NUTCH-161
 URL: http://issues.apache.org/jira/browse/NUTCH-161
 Project: Nutch
Type: Bug
  Components: indexer  
 Environment: any
Reporter: KuroSaka TeruHiko
Priority: Minor


The value of the property parser.character.encoding.default is used as a 
fallback character encoding (charset) when the HTML parser cannot find the 
charset information in the HTTP Content-Type header or in a META HTTP-EQUIV 
tag.  But the plain text parser behaves differently.  It just uses the system 
encoding (the Java VM's file.encoding, which in turn derives from the OS and 
the locale of the environment from which the JVM was spawned).  This is not 
pretty.  To guarantee consistent behavior, the plain text parser should use 
the value of the same property.

Though not tested, these changes in 
./src/plugin/parse-text/src/java/org/apache/nutch/parse/text/TextParser.java 
should do it:
Insert this statement in the class definition:
  private static String defaultCharEncoding =
NutchConf.get().get("parser.character.encoding.default", "windows-1252");

Replace this:
  text = new String(content.getContent()); // uses the platform default encoding
with this:
  text = new String(content.getContent(), defaultCharEncoding); // uses the configured default
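
A slightly fuller, still untested sketch of the same change: new 
String(byte[], String) throws the checked UnsupportedEncodingException, so 
the constructor call needs a fallback (the surrounding TextParser code is 
assumed):

  private static final String defaultCharEncoding =
    NutchConf.get().get("parser.character.encoding.default", "windows-1252");

  String text;
  try {
    text = new String(content.getContent(), defaultCharEncoding);
  } catch (java.io.UnsupportedEncodingException e) {
    text = new String(content.getContent()); // last resort: platform default
  }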


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: IndexSorter optimizer

2006-01-02 Thread Doug Cutting

Andrzej Bialecki wrote:
Sounds like tf/idf might be de-emphasized in scoring.  Perhaps 
NutchSimilarity.tf() should use log() instead of sqrt() when 
field==content?


I don't think it's that simple; the OPIC score is what determined this 
behaviour, and it doesn't correspond to tf/idf at all, but to a human 
judgement.


If we think that high-OPIC is more valuable than high-content-tf, then 
we should use different functions to damp these.  Currently both are 
damped with sqrt().
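
For illustration, the log-damped version would look something like this (a 
sketch only; Lucene's Similarity.tf(float) does not actually see the field 
name, so a per-field variant needs a per-field Similarity or a different 
hook):

  public float tf(float freq) {
    // grows much more slowly than Math.sqrt(freq), so term-stuffed pages gain less
    return (float) (1.0 + Math.log(1.0 + freq));
  }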


I've updated the version of Lucene included with Nutch to have the 
required patch.  Would you like me to commit IndexSorter.java or would 
you?


Please do it. There are two typos in your version of IndexSorter: you 
used numDocs() in two places instead of maxDoc(), which for indexes with 
deleted docs (after dedup) leads to exceptions.


I have committed this, along with the LuceneQueryOptimizer changes.

I could only find one place where I was using numDocs() instead of maxDoc().

Cheers,

Doug


Re: IndexSorter optimizer

2006-01-02 Thread Andrzej Bialecki

Doug Cutting wrote:


Andrzej Bialecki wrote:

Using the original index, it was possible for pages with high tf/idf 
of a term, but with a low "boost" value (the OPIC score), to outrank 
pages with high "boost" but lower tf/idf of a term. This phenomenon 
quite often leads to results that are perceived as "junk", e.g. pages 
with a lot of repeated terms but little other real content, such as 
navigation bars.



Sounds like tf/idf might be de-emphasized in scoring.  Perhaps 
NutchSimilarity.tf() should use log() instead of sqrt() when 
field==content?



I don't think it's that simple; the OPIC score is what determined this 
behaviour, and it doesn't correspond to tf/idf at all, but to a human 
judgement.




To conclude, I will add the IndexSorter.java to the core classes, and 
I suggest continuing the experiments ...



I've updated the version of Lucene included with Nutch to have the 
required patch.  Would you like me to commit IndexSorter.java or would 
you?



Please do it. There are two typos in your version of IndexSorter: you 
used numDocs() in two places instead of maxDoc(), which for indexes with 
deleted docs (after dedup) leads to exceptions.
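
The distinction matters because Lucene document numbers run from 0 to 
maxDoc()-1 and may contain gaps for deleted documents, while numDocs() 
counts only the live ones. A minimal illustration (not IndexSorter itself; 
uses org.apache.lucene.index.IndexReader, and indexDir is the index path):

  IndexReader reader = IndexReader.open(indexDir);
  for (int id = 0; id < reader.maxDoc(); id++) {   // not numDocs()
    if (reader.isDeleted(id)) continue;            // skip holes left by dedup
    Document doc = reader.document(id);
    // ... process doc ...
  }
  reader.close();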


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: IndexSorter optimizer

2006-01-02 Thread Doug Cutting

Andrzej Bialecki wrote:
Using the original index, it was possible for pages with high tf/idf of 
a term, but with a low "boost" value (the OPIC score), to outrank pages 
with high "boost" but lower tf/idf of a term. This phenomenon quite 
often leads to results that are perceived as "junk", e.g. pages with a 
lot of repeated terms but little other real content, such as 
navigation bars.


Sounds like tf/idf might be de-emphasized in scoring.  Perhaps 
NutchSimilarity.tf() should use log() instead of sqrt() when field==content?


To conclude, I will add the IndexSorter.java to the core classes, and I 
suggest continuing the experiments ...


I've updated the version of Lucene included with Nutch to have the 
required patch.  Would you like me to commit IndexSorter.java or would you?


Doug


[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread KuroSaka TeruHiko (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361552 ] 

KuroSaka TeruHiko commented on NUTCH-138:
-

Sorry, my oversight: useBodyEncodingForURI did not work as I expected.  Setting 
URIEncoding is the only way.  I'll write this up in the Wiki.


> non-Latin-1 characters cannot be submitted for search
> -
>
>  Key: NUTCH-138
>  URL: http://issues.apache.org/jira/browse/NUTCH-138
>  Project: Nutch
> Type: Bug
>   Components: web gui
> Versions: 0.7.1
>  Environment: Windows XP, Tomcat 5.5.12
> Reporter: KuroSaka TeruHiko
> Priority: Minor

>
> The search.html currently specifies the GET method for query submission.
> Tomcat 5.x only allows the ISO-8859-1 (aka Latin-1) character set to be submitted over 
> GET because of some restrictions of the HTML or HTTP spec they discovered. (If my 
> memory is correct, non-ISO-8859-1 characters were working OK over GET with 
> older versions of Tomcat as long as setCharacterEncoding() was called properly.)
> To allow proper transmission of non-ISO-8859-1, the POST method should be used.  
> Here's a proposed patch:
> *** search.html   Tue Dec 13 15:02:15 2005
> --- search-org.html   Tue Dec 13 15:02:07 2005
> ***
> *** 59,65 
>   
>   
>   
> !  
>    
>   help
>   
> --- 59,65 
>   
>   
>   
> !  
>    
>   help
>   
> BTW, I am aware that Nutch and Lucene won't handle non-Western languages well 
> as packaged.




Re: svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src

2006-01-02 Thread Andrzej Bialecki

Doug Cutting wrote:


[EMAIL PROTECTED] wrote:


Now users can select their own page signature implementation, possibly
with better properties than the old one.

Two implementations are provided:

* MD5Signature: backward-compatible with the old schema.

* TextProfileSignature: an example implementation of a signature, which
  gives the same values for near-duplicate pages. Please see Javadoc for
  more information.



This looks great!  Thanks!

Shouldn't this also be used in DeleteDuplicates.java?



Yes, I missed that. No harm done (yet), because the two existing 
implementations both produce an MD5 digest, just differently. I'll fix it.
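
As a rough illustration of the new extension point, a custom signature 
might look like this (a sketch under assumptions - the 0.8-dev Signature 
base class exposing a calculate(Content, Parse) hook and the MD5Hash 
helper; the class and method names are not quoted from the commit):

  import org.apache.nutch.crawl.Signature;
  import org.apache.nutch.io.MD5Hash;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.protocol.Content;

  public class LowercaseTextSignature extends Signature {
    public byte[] calculate(Content content, Parse parse) {
      // hash a normalized form of the parsed text so trivially different
      // pages (case, surrounding whitespace) collapse to one signature
      String text = parse.getText().trim().toLowerCase();
      return MD5Hash.digest(text.getBytes()).getDigest();
    }
  }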


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread Piotr Kosiorowski (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361549 ] 

Piotr Kosiorowski commented on NUTCH-138:
-

BTW - just create a user for yourself in the Nutch Wiki and you should be able to add 
a new page with the information without problems. Thanks for checking and 
documenting it.

> non-Latin-1 characters cannot be submitted for search
> -
>
>  Key: NUTCH-138
>  URL: http://issues.apache.org/jira/browse/NUTCH-138
>  Project: Nutch
> Type: Bug
>   Components: web gui
> Versions: 0.7.1
>  Environment: Windows XP, Tomcat 5.5.12
> Reporter: KuroSaka TeruHiko
> Priority: Minor

>
> The search.html currently specifies the GET method for query submission.
> Tomcat 5.x only allows the ISO-8859-1 (aka Latin-1) character set to be submitted over 
> GET because of some restrictions of the HTML or HTTP spec they discovered. (If my 
> memory is correct, non-ISO-8859-1 characters were working OK over GET with 
> older versions of Tomcat as long as setCharacterEncoding() was called properly.)
> To allow proper transmission of non-ISO-8859-1, the POST method should be used.  
> Here's a proposed patch:
> *** search.html   Tue Dec 13 15:02:15 2005
> --- search-org.html   Tue Dec 13 15:02:07 2005
> ***
> *** 59,65 
>   
>   
>   
> !  
>    
>   help
>   
> --- 59,65 
>   
>   
>   
> !  
>    
>   help
>   
> BTW, I am aware that Nutch and Lucene won't handle non-Western languages well 
> as packaged.




[jira] Closed: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-138?page=all ]
 
Piotr Kosiorowski closed NUTCH-138:
---

Resolution: Invalid

Setting URIEncoding in the Tomcat config file fixes the problem.
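
Concretely, that means setting the URIEncoding attribute on the HTTP 
<Connector> element in Tomcat's conf/server.xml (the other attributes stay 
whatever the existing connector already uses):

  <Connector port="8080" ... URIEncoding="UTF-8" />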


> non-Latin-1 characters cannot be submitted for search
> -
>
>  Key: NUTCH-138
>  URL: http://issues.apache.org/jira/browse/NUTCH-138
>  Project: Nutch
> Type: Bug
>   Components: web gui
> Versions: 0.7.1
>  Environment: Windows XP, Tomcat 5.5.12
> Reporter: KuroSaka TeruHiko
> Priority: Minor

>
> The search.html currently specifies the GET method for query submission.
> Tomcat 5.x only allows the ISO-8859-1 (aka Latin-1) character set to be submitted over 
> GET because of some restrictions of the HTML or HTTP spec they discovered. (If my 
> memory is correct, non-ISO-8859-1 characters were working OK over GET with 
> older versions of Tomcat as long as setCharacterEncoding() was called properly.)
> To allow proper transmission of non-ISO-8859-1, the POST method should be used.  
> Here's a proposed patch:
> *** search.html   Tue Dec 13 15:02:15 2005
> --- search-org.html   Tue Dec 13 15:02:07 2005
> ***
> *** 59,65 
>   
>   
>   
> !  
>    
>   help
>   
> --- 59,65 
>   
>   
>   
> !  
>    
>   help
>   
> BTW, I am aware that Nutch and Lucene won't handle non-Western languages well 
> as packaged.




[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread KuroSaka TeruHiko (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361546 ] 

KuroSaka TeruHiko commented on NUTCH-138:
-

You are right.  With this Tomcat config, UTF-8 characters can be passed.
Setting useBodyEncodingForURI="true" in the <Connector> tag within 
$TOMCAT/conf/server.xml also works.
This is documented in:
http://issues.apache.org/bugzilla/show_bug.cgi?id=29900

What I suggest is to add this note to:
http://lucene.apache.org/nutch/i18n.html
(which currently explains the GUI localization issue only, rather than 
internationalization proper),
or perhaps to create a new page:
http://wiki.apache.org/nutch/GettingNutchRunningUTF8Tomcat5

I am willing to write a draft if someone tells me where to submit it.

Feel free to close this bug.


> non-Latin-1 characters cannot be submitted for search
> -
>
>  Key: NUTCH-138
>  URL: http://issues.apache.org/jira/browse/NUTCH-138
>  Project: Nutch
> Type: Bug
>   Components: web gui
> Versions: 0.7.1
>  Environment: Windows XP, Tomcat 5.5.12
> Reporter: KuroSaka TeruHiko
> Priority: Minor

>
> The search.html currently specifies the GET method for query submission.
> Tomcat 5.x only allows the ISO-8859-1 (aka Latin-1) character set to be submitted over 
> GET because of some restrictions of the HTML or HTTP spec they discovered. (If my 
> memory is correct, non-ISO-8859-1 characters were working OK over GET with 
> older versions of Tomcat as long as setCharacterEncoding() was called properly.)
> To allow proper transmission of non-ISO-8859-1, the POST method should be used.  
> Here's a proposed patch:
> *** search.html   Tue Dec 13 15:02:15 2005
> --- search-org.html   Tue Dec 13 15:02:07 2005
> ***
> *** 59,65 
>   
>   
>   
> !  
>    
>   help
>   
> --- 59,65 
>   
>   
>   
> !  
>    
>   help
>   
> BTW, I am aware that Nutch and Lucene won't handle non-Western languages well 
> as packaged.




Re: IndexSorter optimizer

2006-01-02 Thread Doug Cutting

Andrzej Bialecki wrote:
I'm happy to report that further tests performed on a larger index seem 
to show that the overall impact of the IndexSorter is definitely 
positive: performance improvements are significant, and the overall 
quality of results seems at least comparable, if not actually better.


Great news!

I will submit the Lucene patches ASAP, now that we know they're useful.

Doug


Re: [bug?] RPC called method requires parameter

2006-01-02 Thread Doug Cutting

Stefan Groschupf wrote:

I also noted this line in Client.java:

  public Writable[] call(Writable[] params, InetSocketAddress[] addresses)
      throws IOException {
    if (params.length == 0) return new Writable[0];

Do I understand it correctly that, in case the remote method does not need 
any parameters, no remote call is done?


Different parameters are sent to each address.  So params.length should 
equal addresses.length, and if params.length==0 then addresses.length==0 
and there's no call to be made.  Make sense?  It might be clearer if the 
test were changed to addresses.length==0.
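
In other words, the parallel call pairs params[i] with addresses[i]; 
something like the following captures the invariant (a sketch only - the 
real org.apache.nutch.ipc.Client issues the calls in parallel rather than 
in a loop):

  public Writable[] call(Writable[] params, InetSocketAddress[] addresses)
      throws IOException {
    // params.length is expected to equal addresses.length
    if (addresses.length == 0) return new Writable[0];  // nothing to send anywhere
    Writable[] results = new Writable[params.length];
    for (int i = 0; i < params.length; i++) {
      results[i] = call(params[i], addresses[i]);       // one request per (param, address) pair
    }
    return results;
  }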


Doug


Re: Bug in DeleteDuplicates.java ?

2006-01-02 Thread Doug Cutting

Andrzej Bialecki wrote:

Gal Nitzan wrote:


this function throws IOException. Why?

public long getPos() throws IOException {
   return (doc*INDEX_LENGTH)/maxDoc;
 }

It should be throwing ArithmeticException
 



The IOException is required by the API of RecordReader.


What happens when maxDoc is zero?
 



Ka-boom! ;-) You're right, this should be wrapped in an IOException and 
rethrown.


No, it should really just be fixed to not cause an ArithmeticException. 
 This is called to report progress.  In this case the input "file" for 
the map is a Lucene index whose documents we iterate through.  To 
simplify the construction of input splits (without opening each index) a 
constant "length" is used for each "file".  So we have to scale the 
document numbers to give progress in this range.


The problem is that progress may be reported even when there are no 
documents in the index.  So the call is valid and no exception should be 
thrown.
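
So the straightforward fix is just to guard the division (a sketch of the 
idea, not necessarily the committed change):

  public long getPos() throws IOException {
    if (maxDoc == 0) return 0;              // empty index: report no progress, no exception
    return (doc * INDEX_LENGTH) / maxDoc;   // scale the doc id into the fixed split length
  }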


Doug


Re: java.io.IOException: Job failed

2006-01-02 Thread Doug Cutting

Gal Nitzan wrote:

I am using trunk. while trying to crawl I get the following:


[ ...]


050825 100222 task_m_ns3ehv  Error running child
050825 100222 task_m_ns3ehv java.lang.ArithmeticException: / by zero
050825 100222 task_m_ns3ehv at
org.apache.nutch.indexer.DeleteDuplicates$1.getPos(DeleteDuplicates.java:193)


I just fixed this.

Doug


Re: svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src

2006-01-02 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

Now users can select their own page signature implementation, possibly
with better properties than the old one.

Two implementations are provided:

* MD5Signature: backward-compatible with the old schema.

* TextProfileSignature: an example implementation of a signature, which
  gives the same values for near-duplicate pages. Please see Javadoc for
  more information.


This looks great!  Thanks!

Shouldn't this also be used in DeleteDuplicates.java?

Doug


[jira] Commented: (NUTCH-159) Specify temp/working directory for crawl

2006-01-02 Thread byron miller (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12361545 ] 

byron miller commented on NUTCH-159:


While it's from the mapred trunk, it is a non-NDFS/local instance only.  
mapred.temp.dir was left at its default (which didn't exist):



<property>
  <name>mapred.temp.dir</name>
  <value>/tmp/nutch/mapred/temp</value>
  <description>A shared directory for temporary files.</description>
</property>


I'm going to modify this and re-run my fetch and let you know how that works.  


> Specify temp/working directory for crawl
> 
>
>  Key: NUTCH-159
>  URL: http://issues.apache.org/jira/browse/NUTCH-159
>  Project: Nutch
> Type: Bug
>   Components: fetcher, indexer
> Versions: 0.8-dev
>  Environment: Linux/Debian
> Reporter: byron miller

>
> I ran a crawl of 100k web pages and got:
> org.apache.nutch.fs.FSError: java.io.IOException: No space left on device
> at 
> org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:149)
> at org.apache.nutch.fs.FileUtil.copyContents(FileUtil.java:65)
> at 
> org.apache.nutch.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:178)
> at 
> org.apache.nutch.fs.NutchFileSystem.rename(NutchFileSystem.java:224)
> at 
> org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
> Caused by: java.io.IOException: No space left on device
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:260)
> at 
> org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:147)
> ... 4 more
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
> at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:335)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:107)
> [EMAIL PROTECTED]:/data/nutch$ df -k
> It appears crawl created a /tmp/nutch directory that filled up even though I 
> specified a db directory.
> Need to add a parameter to the command line or make a globally configurable 
> /tmp (work area) for the nutch instance so that crawls won't fail.




[jira] Commented: (NUTCH-159) Specify temp/working directory for crawl

2006-01-02 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12361541 ] 

Doug Cutting commented on NUTCH-159:


mapred.local.dir is the thing to set.  If that fails, then there is a bug.  
What did you have it set to?

> Specify temp/working directory for crawl
> 
>
>  Key: NUTCH-159
>  URL: http://issues.apache.org/jira/browse/NUTCH-159
>  Project: Nutch
> Type: Bug
>   Components: fetcher, indexer
> Versions: 0.8-dev
>  Environment: Linux/Debian
> Reporter: byron miller

>
> I ran a crawl of 100k web pages and got:
> org.apache.nutch.fs.FSError: java.io.IOException: No space left on device
> at 
> org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:149)
> at org.apache.nutch.fs.FileUtil.copyContents(FileUtil.java:65)
> at 
> org.apache.nutch.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:178)
> at 
> org.apache.nutch.fs.NutchFileSystem.rename(NutchFileSystem.java:224)
> at 
> org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
> Caused by: java.io.IOException: No space left on device
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:260)
> at 
> org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:147)
> ... 4 more
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
> at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:335)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:107)
> [EMAIL PROTECTED]:/data/nutch$ df -k
> It appears crawl created a /tmp/nutch directory that filled up even though I 
> specified a db directory.
> Need to add a parameter to the command line or make a globally configurable 
> /tmp (work area) for the nutch instance so that crawls won't fail.





Re: Trunk is broken

2006-01-02 Thread Thomas Jaeger
Hi Andrzej,

Gal Nitzan wrote:
> It seems that Trunk is now broken...
> 

DmozParser seems to be broken, too. Its package declaration is still
org.apache.nutch.crawl instead of org.apache.nutch.tools.


TJ


Re: Mega-cleanup in trunk/

2006-01-02 Thread Andrzej Bialecki

Piotr Kosiorowski wrote:


Andrzej Bialecki wrote:


Hi,

I just committed a large patch to clean up the trunk/ of obsolete and 
broken classes remaining from the 0.7.x development line. Please test 
that things still work as they should ...



Hi,
I am not sure what is wrong, but a lot of the JUnit tests simply do not 
compile - I did an svn checkout to a new directory to be sure I did not 
have anything left over from my experiments.



Yes, you are right - I would welcome any help, I'm a bit tight on time...



I am looking at it right now, but I would suggest temporarily doing a 
quick cleanup to make trunk testable:




Agreed.



3) Remove unused import in:
src/test/org/apache/nutch/parse/TestParseText.java



Ok.


4) Fix (as it looks simple to fix - I will look at it in the meantime):

src/plugin/parse-msword/src/test/org/apache/nutch/parse/msword/TestMSWordParser.java 

src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip/TestZipParser.java 

src/plugin/parse-rss/src/test/org/apache/nutch/parse/rss/TestRSSParser.java 

src/plugin/parse-pdf/src/test/org/apache/nutch/parse/pdf/TestPdfParser.java 

src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java 

src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/TestMSPowerPointParser.java 

src/plugin/parse-mspowerpoint/src/test/org/apache/nutch/parse/mspowerpoint/AllTests.java 




Yes, they are just one-line fixes. I removed the 
getProtocolContent(urlString) methods; you need to replace them with 
getProtocolContent(new UTF8(urlString), new CrawlDatum()).
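
For example, each failing test changes along these lines (variable names 
are illustrative):

  // before (removed in the cleanup):
  //   content = protocol.getProtocolContent(urlString);
  // after:
  content = protocol.getProtocolContent(new UTF8(urlString), new CrawlDatum());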




After removal of all these non-compiling classes, the trunk tests 
complete successfully on my machine (JDK 1.4.2).


If no objections are raised - especially from Andrzej - I can do 
the cleanup tomorrow.



Your help would be most welcome, no objections here.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Commented: (NUTCH-138) non-Latin-1 characters cannot be submitted for search

2006-01-02 Thread Piotr Kosiorowski (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-138?page=comments#action_12361520 ] 

Piotr Kosiorowski commented on NUTCH-138:
-

I am not sure, but I suspect it is a problem of bad Tomcat configuration. 
To handle special characters in query URLs one has to change the default Tomcat 
configuration - in particular, set the URIEncoding attribute to UTF-8. See:

http://tomcat.apache.org/faq/connectors.html#utf8

Please check if it helps in your particular case so we can close the issue.


> non-Latin-1 characters cannot be submitted for search
> -
>
>  Key: NUTCH-138
>  URL: http://issues.apache.org/jira/browse/NUTCH-138
>  Project: Nutch
> Type: Bug
>   Components: web gui
> Versions: 0.7.1
>  Environment: Windows XP, Tomcat 5.5.12
> Reporter: KuroSaka TeruHiko
> Priority: Minor

>
> The search.html currently specifies the GET method for query submission.
> Tomcat 5.x only allows the ISO-8859-1 (aka Latin-1) character set to be submitted over 
> GET because of some restrictions of the HTML or HTTP spec they discovered. (If my 
> memory is correct, non-ISO-8859-1 characters were working OK over GET with 
> older versions of Tomcat as long as setCharacterEncoding() was called properly.)
> To allow proper transmission of non-ISO-8859-1, the POST method should be used.  
> Here's a proposed patch:
> *** search.html   Tue Dec 13 15:02:15 2005
> --- search-org.html   Tue Dec 13 15:02:07 2005
> ***
> *** 59,65 
>   
>   
>   
> !  
>    
>   help
>   
> --- 59,65 
>   
>   
>   
> !  
>    
>   help
>   
> BTW, I am aware that Nutch and Lucene won't handle non-Western languages well 
> as packaged.
