RE: implement thai language analyzer in nutch

2006-11-10 Thread Teruhiko Kurosaka
Oh, Thai words are not space delimited?
OK, in that case, you'd need to study how ThaiAnalyzer works and
then modify the rules in NutchAnalysis.jj (if you are going to use
the web search GUI from Nutch).  This is because the search
expressions are parsed by the parser generated from NutchAnalysis.jj
before each term is handed to the language-specific analyzer,
and currently, if a character belongs to the CJK category, each
character is treated as though it were a word.  If ThaiAnalyzer does
not do the same, you can index the Thai docs, but you won't be able
to find any doc unless the search term is a single Unicode character.
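For context, Lucene's ThaiWordFilter leans on the JDK's dictionary-based word BreakIterator to find Thai word boundaries. A minimal, standalone sketch of that mechanism (plain JDK code, not Nutch or Lucene code; the letter-only token filtering is my own simplification):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Demonstrates the JDK mechanism behind ThaiWordFilter: a word
// BreakIterator for the Thai locale segments Thai text by dictionary
// lookup, while space-delimited text still splits on whitespace.
public class ThaiBreakDemo {
    public static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(new Locale("th"));
        it.setText(text);
        List<String> out = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String token = text.substring(start, end);
            // Keep only tokens containing at least one letter
            // (drops the whitespace "tokens" the iterator yields).
            if (token.codePoints().anyMatch(Character::isLetter)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("hello world")); // [hello, world]
        System.out.println(words("สวัสดีครับ"));    // Thai splits without spaces
    }
}
```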


-kuro

> -Original Message-
> From: sanjeev [mailto:[EMAIL PROTECTED] 
> Sent: 2006-11-08 19:28
> To: nutch-dev@lucene.apache.org
> Subject: Re: implement thai lanaguage analyzer in nutch
> 
> 
> I need a Thai Analyzer for Nutch. I want the crawler to be 
> intelligent enough to split Thai words correctly since Thai 
> doesn't have spaces between words.
> :-(
> 
> 
> 
> 
> ogjunk-nutch wrote:
> > 
> > Regarding Thai, there is a Thai Analyzer in Lucene already:
> > 
> > $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
> > total 24
> > drwxrwxr-x  7 otis otis 4096 Oct 27 02:08 .svn/
> > -rw-rw-r--  1 otis otis 1528 Jun  5 14:27 ThaiAnalyzer.java
> > -rw-rw-r--  1 otis otis 2437 Jun  5 14:27 ThaiWordFilter.java
> > 
> > Otis
> > 
> > - Original Message 
> > From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
> > To: sanjeev <[EMAIL PROTECTED]>; 
> nutch-dev@lucene.apache.org
> > Sent: Wednesday, November 8, 2006 2:16:38 PM
> > Subject: RE: implement thai lanaguage analyzer in nutch
> > 
> > Sanjay,
> > I don't think you should follow the Chinese example and 
> extend the CJK
> > range. 
> This was needed because Chinese and Japanese don't use spaces 
> to separate words.  I believe Thai uses spaces, right? If so, 
> you should extend the LETTER range to include Thai characters 
> rather than CJK.
> > 
> > Another place you would need to change is the LanguageIdentifier. 
> > You would either train it, or implement some hack, in order 
> > for it to be able to detect Thai-language documents that are 
> > not HTML with a lang="th" attribute.
> > 
> > -kuro
> > 
> > 
> > 
> > 
> > 
> 
> -- 
> View this message in context: 
> http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nut
> ch-tf2587282.html#a7251826
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
> 
> 


RE: implement thai language analyzer in nutch

2006-11-08 Thread Teruhiko Kurosaka
Sanjay,
I don't think you should follow the Chinese example and extend the CJK
range. 
This was needed because Chinese and Japanese don't use spaces to
separate words.  I believe Thai uses spaces, right? If so, you should
extend the LETTER range to include Thai characters rather than CJK.
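To illustrate the LETTER-vs-CJK distinction, Thai characters occupy their own Unicode block, separate from the CJK ideographs. A small standalone check (plain JDK code, not the NutchAnalysis.jj grammar itself):

```java
// Thai letters live in the Thai Unicode block (U+0E00-U+0E7F), not in
// the CJK ideograph range, so a grammar can treat them as ordinary
// letters while keeping the CJK single-character handling.
public class BlockCheck {
    public static boolean isThai(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.THAI;
    }

    public static boolean isCjkIdeograph(char c) {
        return Character.UnicodeBlock.of(c)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    public static void main(String[] args) {
        System.out.println(isThai('ก'));          // true  (Thai letter)
        System.out.println(isCjkIdeograph('中')); // true  (CJK ideograph)
        System.out.println(isThai('a'));          // false (Latin letter)
    }
}
```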

Another place you would need to change is the LanguageIdentifier.
You would either train it, or implement some hack, in order for it
to be able to detect Thai-language documents that are not HTML with
a lang="th" attribute.

-kuro


RE: What javacc options should I use to compile NutchAnalysis.jj?

2006-10-19 Thread Teruhiko Kurosaka
Please disregard this posting.  It was my oversight.  build.xml does
have a javacc rule.
So is this just a javacc version difference?
-kuro

> -Original Message-
> From: Teruhiko Kurosaka 
> Sent: 2006-10-18 17:42
> To: nutch-dev@lucene.apache.org
> Cc: Teruhiko Kurosaka
> Subject: What javacc options should I use to compile NutchAnalysis.jj?
> 
> I am trying to modify the JavaCC rules in NutchAnalysis.jj.
> As a preparation, I ran javacc (ver 3.2) to "compile" 
> NutchAnalysis.jj of Nutch 0.8, but the generated
> Java files are a little different from those
> found in the src/java directory.  Am I supposed to use 
> some javacc command-line options? 
> 
> BTW, shouldn't build.xml have rules that can build
> the .java files from the .jj file, to be complete?
> 
> 
> Below are the diffs:
> 
> $ diff -bw CharStream.java
> /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
> 19c19
> < public interface CharStream {
> ---
> > interface CharStream {
> 
> $ diff -bw NutchAnalysis.java
> /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
> 911a912
> > try {
> 923a925
> >   } catch(LookaheadSuccess ls) { }
> 
> $ diff -bw NutchAnalysisTokenManager.java
> /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
> 319,320c319
> < public NutchAnalysisTokenManager(CharStream stream)
> < {
> ---
> > public NutchAnalysisTokenManager(CharStream stream){
> 323,324c322
> < public NutchAnalysisTokenManager(CharStream stream, int lexState)
> < {
> ---
> > public NutchAnalysisTokenManager(CharStream stream, int lexState){
> 442,443c440
> < image = new StringBuffer(new
> String(input_stream.GetSuffix(jjimageLen + (lengthOfMatch = 
> jjmatchedPos
> + 1;
> <  else
> ---
> > image = new StringBuffer();
> 449,450c446
> < image = new StringBuffer(new
> String(input_stream.GetSuffix(jjimageLen + (lengthOfMatch = 
> jjmatchedPos
> + 1;
> <  else
> ---
> > image = new StringBuffer();
> 
> $ diff -bw ParseException.java
> /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
> 13c13
> < public class ParseException extends Exception {
> ---
> > class ParseException extends java.io.IOException  {
> 
> $ diff -bw Token.java
> /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
> 8c8
> < public class Token {
> ---
> > class Token {
> 
> 
> $ diff -bw TokenMgrError.java
> /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
> 4c4
> < public class TokenMgrError extends Error
> ---
> > class TokenMgrError extends Error
> 
> 
> 
> -kuro
> 


RE: I modify NutchAnalysis.jj and NutchDocumentTokenizer.java to let nutch support chinese word.

2006-10-19 Thread Teruhiko Kurosaka
 
> From: heack [mailto:[EMAIL PROTECTED] 
> Sent: 2006-9-13 7:03
> To: nutch-dev@lucene.apache.org
> Subject: I modify NutchAnalysis.jj and 
> NutchDocumentTokenizer.java to let nutch support chinese word.
> 
> After that I test it, and I use luke to see the index, The 
> word is parsed in my way,  but I cannot search any results if 
> my keyword is chinese, but not english words.

Heack,
I was having the same experience. I guessed that NutchAnalysis.jj
needs to be modified so that it does not break CJK words into
individual characters; that is, getting rid of SIGRAM and making
the CJK characters part of LETTER. Is this what you did, and you
still didn't get the result you want?
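To see why the query then finds nothing, remember that Lucene matches terms by exact equality: if indexing produced single-character SIGRAM terms but the query side emits whole words, the two term sets never intersect. A toy simulation of just that matching logic (not Nutch code):

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Toy illustration (not Nutch code) of why an index/query analyzer
// mismatch returns no hits: terms match by exact equality, so
// per-character index terms never equal a whole-word query term.
public class MismatchDemo {
    // SIGRAM-style: each character becomes its own term.
    public static List<String> sigram(String text) {
        return text.chars()
                   .mapToObj(c -> String.valueOf((char) c))
                   .collect(Collectors.toList());
    }

    // Word-style: the whole word is one term.
    public static List<String> word(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        List<String> indexed = sigram("中文"); // [中, 文]
        List<String> query = word("中文");     // [中文]
        // No query term appears among the indexed terms -> zero hits.
        System.out.println(Collections.disjoint(indexed, query)); // true
    }
}
```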


-kuro


What javacc options should I use to compile NutchAnalysis.jj?

2006-10-18 Thread Teruhiko Kurosaka
I am trying to modify the JavaCC rules in NutchAnalysis.jj.
As a preparation, I ran javacc (ver 3.2) to "compile"
NutchAnalysis.jj of Nutch 0.8, but the generated
Java files are a little different from those
found in the src/java directory.  Am I supposed to use
some javacc command-line options?

BTW, shouldn't build.xml have rules that can build
the .java files from the .jj file, to be complete?
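For what it's worth, Ant ships an optional javacc task, so such a rule might look roughly like this (the paths and the javacc.home property here are assumptions, not Nutch's actual build.xml):

```xml
<!-- Hypothetical Ant target regenerating the parser from the grammar. -->
<target name="javacc">
  <javacc target="src/java/org/apache/nutch/analysis/NutchAnalysis.jj"
          outputdirectory="src/java/org/apache/nutch/analysis"
          javacchome="${javacc.home}"/>
</target>
```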


Below are the diffs:

$ diff -bw CharStream.java
/c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
19c19
< public interface CharStream {
---
> interface CharStream {

$ diff -bw NutchAnalysis.java
/c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
911a912
> try {
923a925
>   } catch(LookaheadSuccess ls) { }

$ diff -bw NutchAnalysisTokenManager.java
/c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
319,320c319
< public NutchAnalysisTokenManager(CharStream stream)
< {
---
> public NutchAnalysisTokenManager(CharStream stream){
323,324c322
< public NutchAnalysisTokenManager(CharStream stream, int lexState)
< {
---
> public NutchAnalysisTokenManager(CharStream stream, int lexState){
442,443c440
< image = new StringBuffer(new
String(input_stream.GetSuffix(jjimageLen + (lengthOfMatch = jjmatchedPos
+ 1;
<  else
---
> image = new StringBuffer();
449,450c446
< image = new StringBuffer(new
String(input_stream.GetSuffix(jjimageLen + (lengthOfMatch = jjmatchedPos
+ 1;
<  else
---
> image = new StringBuffer();

$ diff -bw ParseException.java
/c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
13c13
< public class ParseException extends Exception {
---
> class ParseException extends java.io.IOException  {

$ diff -bw Token.java
/c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
8c8
< public class Token {
---
> class Token {


$ diff -bw TokenMgrError.java
/c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
4c4
< public class TokenMgrError extends Error
---
> class TokenMgrError extends Error



-kuro


Why "nutch plugin" says the plugin is "not present or inactive"?

2006-09-05 Thread Teruhiko Kurosaka
I developed a plugin and tried to run it using the "nutch plugin"
command of Nutch 0.8.

But it says my plugin is not present or inactive.

I tried the "nutch plugin" command with a known plugin
"language-identifier" as:

./nutch plugin languageidentifier
org.apache.nutch.analysis.lang.NGramProfile

and got the same result:
Plugin 'language-identifier' not present or inactive.

This log message suggests that the plugin is recognized by the nutch
command:

2006-09-01 17:05:46,772 DEBUG plugin.PluginRepository
(PluginManifestParser.java:parsePluginFolder(93)) - parsing:
C:\opt\nutch-0.8\plugins\language-identifier\plugin.xml

Is the "nutch plugin" command working for any of you?

-kuro


Why are lib- plugins needed?

2006-08-31 Thread Teruhiko Kurosaka
Hello,
I see many plugins named lib- which are wrappers around 
other non-plugin .jar files.

For example, the analysis-de plugin uses the lib-lucene-analyzers
plugin, which in turn references the jar file that contains
GermanAnalyzer.

What is the reason for this indirection? Can't the plugins called
by Nutch directly reference a non-plugin .jar?
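For reference, the indirection is expressed in each plugin's plugin.xml; a sketch of what an analysis plugin depending on lib-lucene-analyzers might declare (the attribute values here are illustrative, not copied from the real file):

```xml
<plugin id="analysis-de" name="German Analysis Plug-in"
        version="1.0.0" provider-name="nutch.org">
  <runtime>
    <library name="analysis-de.jar">
      <export name="*"/>
    </library>
  </runtime>
  <!-- The lib- plugin, not the raw jar, is what gets imported. -->
  <requires>
    <import plugin="lib-lucene-analyzers"/>
  </requires>
</plugin>
```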

-kuro


RE: 0.8 release

2006-07-07 Thread Teruhiko Kurosaka
May I suggest someone take a look at NUTCH-266 before releasing 0.8?
The Nutch build as of half a month ago was not working for me and
another person.

-kuro 

> -Original Message-
> From: Stefan Groschupf [mailto:[EMAIL PROTECTED] 
> Sent: 2006-7-05 11:53
> To: nutch-dev@lucene.apache.org
> Subject: Re: 0.8 release
> 
> +1, but I really would love to see NUTCH-293 as part of nutch .8  
> since this all about being more polite.
> Thanks.
> Stefan
> 
> On 05.07.2006, at 03:46, Doug Cutting wrote:
> 
> > +1
> >
> > Piotr Kosiorowski wrote:
> >> +1.
> >> P.
> >> Andrzej Bialecki wrote:
> >>> Sami Siren wrote:
>  How would folks feel about releasing 0.8 now, there has been  
>  quite a lot of improvements/new features
>  since 0.7 series and I strongly feel that we should push the  
>  first 0.8 series release (alfa/beta)
>  out the door now. It would IMO lower the barrier to 
> first timers  
>  try the 0.8 series and that would
>  give us more feedback about the overall quality.
> >>>
> >>> Definitely +1. Let's do some testing, however, after the upgrade  
> >>> to hadoop 0.3.2 - hadoop had many, many changes, so we just need  
> >>> to make sure it's stable when used with Nutch ...
> >>>
> >>> We should also check JIRA and apply any trivial fixes before the  
> >>> release.
> >>>
> 
>  If there is a consensus about this I can volunteer to be the RM.
> >>>
> >>> That would be great, thanks!
> >>>
> >
> 
> 


RE: [jira] Commented: (NUTCH-266) hadoop bug when doing updatedb

2006-06-22 Thread Teruhiko Kurosaka
Thank you for your reply, Sami.

> > I do not intend to run hadoop at all, so this hadoop-site.xml
> > is empty.
...
> You should at least set values for 'mapred.system.dir' and
> 'mapred.local.dir' and point them to a dir that has enough space
> available (I think they default to under /tmp, at least on my
> system, which is far too small for larger jobs)

OK, I just copied the definitions for these properties from
hadoop-default.xml and prepended "C:" to each value so that they
really refer to C:\tmp.  C: has 65 GB of free space, and this
practice crawl crawls a directory that contains 20 documents with a
total byte count of less than 10 MB, so I figure C: has more than
adequate free space.
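For concreteness, the overrides described above would look roughly like this in hadoop-site.xml (the exact path values are an assumption inferred from the error message that follows):

```xml
<configuration>
  <property>
    <name>mapred.system.dir</name>
    <value>C:/tmp/hadoop/mapred/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>C:/tmp/hadoop/mapred/local</value>
  </property>
</configuration>
```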

But I've still got the same error:
2006-06-22 10:54:01,548 WARN  mapred.LocalJobRunner
(LocalJobRunner.java:run(119)) - job_x5jmir
java.io.IOException: Couldn't rename
C:/tmp/hadoop/mapred/local/map_ye7oza/part-0.out
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:102)
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:342)
at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:55)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)

After nutch exited, I checked the directory;
C:/tmp/hadoop/mapred/local/map_ye7oza/
does exist, but there was no file called part-0.out.  The directory
was empty.

I'd appreciate any other suggestions you might have.

-kuro





RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread Teruhiko Kurosaka
How about introducing these changes in an effort to force nutch
admins to properly edit the bot identity strings?
1. Add the http.agent.* entries to nutch-site.xml with the value
   "EDITME". The description should clearly state that these values
   *must* be edited to reflect the true identity of the site.
2. Add a piece of code to the HTTP crawler that checks the
   configuration. If any of the http.agent.* entries are EDITME,
   the code would log the error and exit.
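A rough sketch of check 2, modeled with a plain map instead of Nutch's Configuration class (the property names follow the http.agent.* convention from above; nothing here is actual Nutch code):

```java
import java.util.Map;

// Hypothetical startup check: refuse to crawl while any http.agent.*
// property still holds the unedited EDITME placeholder.
public class AgentCheck {
    public static final String PLACEHOLDER = "EDITME";

    // Throws if any agent property is still the placeholder value.
    public static void validate(Map<String, String> conf) {
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith("http.agent.")
                    && PLACEHOLDER.equals(e.getValue())) {
                throw new IllegalStateException(
                    "Configuration error: " + e.getKey()
                    + " must be edited to identify your crawler.");
            }
        }
    }
}
```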

-kuro
p.s. I'm subscribing to the digest version of the ML.  If the same
or a better idea has been raised already, please ignore this.




i18n in nutch home page is a misnomer

2006-06-01 Thread Teruhiko Kurosaka
Dear Webmaster of
http://lucene.apache.org/nutch/

In the menu bar, under the Documentation heading,
there is an item called "i18n".  The web page
linked from "i18n" talks about how to translate
(localize) the search GUI.  This is not i18n
(internationalization), which means designing
and implementing a program so that it works with the
different character encodings of the world and can be
localized.  Localization tasks should not be
confused with internationalization tasks.

I suggest "i18n" be renamed to "l10n", short for
localization.

-kuro


how to turn on logging, exercise the analyzer, tips on debugging plugins?

2006-06-01 Thread Teruhiko Kurosaka
Nutch developers,
I'm writing a language analyzer and have three questions.
Any pointers will be appreciated.

1. How do I turn on the logging facility?

2. Is there an easy way to run just an analyzer plugin, rather than
running "nutch crawl"?

3. How do I run a debugger (Eclipse, in my case) over the plugin
code that will be loaded later? I.e., where do I set the breakpoint
before the plugin gets loaded?

Thank you in advance.

-kuro


Do analyzer plugins have access to the Configuration?

2006-05-30 Thread Teruhiko Kurosaka
Jérôme, or anybody familiar with language plugin architecture,

I am writing a language analyzer plugin. This plugin has configurable
parameters, which I am hoping I can add to nutch-site.xml. But
the German and French plugin examples don't access the
Configuration object.  Does the current analyzer plugin architecture
allow each plugin implementation to access the Configuration
object? If not, what would it take to allow such access? It would be
best if it were allowed at plugin class-loading time and at
instantiation time separately.

-kuro


Status of language plugin

2006-05-19 Thread Teruhiko Kurosaka
Hello Jérôme,
Because of other issues at work, I was away from Nutch.
Now I'm back, and I see you are making progress according
to your notes in Jira.

Is there an API doc or design doc that I can read to 
understand where you are? Is the language plugin architecture
already in the main trunk? 

Here are some issues that I've been worried about:
* Support for multilingual plugins?
** If one plugin can support more than one language,
   the language needs to be passed in at each analysis.
** This assumes language identification is done before
   analysis.  Is that the case?

* Support for a different analyzer for query than for indexing?
** The analyzer for queries may need to behave differently from
   the analyzer for indexing.  Can your architecture
   specify different analyzers for indexing and query?

Thanks.

-kuro


RE: Content-Type inconsistency?

2006-04-27 Thread Teruhiko Kurosaka
Jérôme,

>>  Why should Nutch treat it as HTML? 
>
>   Simply because it is a HTML file, with a strange name, of course, but 
> it is a HTML file.
>   My example is a kind of "caricature". But some more real case could be 
> : a HTML file with a text/plain content-type, or with an text/xml 

These cases don't sound "real" to me either.  
In the first case (text/plain), the page would be displayed with all
HTML tags visible; only very patient readers would try to decipher it.
In the second case (text/xml), the document would most likely not be
displayed at all, because most HTML documents are not well formed as XML.

The site admins, not Nutch, must fix this inconsistency; I don't think
Nutch needs to be "smarter" than browsers.
It's actually better for Nutch to miss these pages. I don't want to see
a hit that leads me to a page that cannot be viewed.

-kuro



RE: Content-Type inconsistency?

2006-04-27 Thread Teruhiko Kurosaka
Jérôme,
Thank you for the explanation.

Here is an easy way to reproduce what I mean by content-type 
inconsistency:
1. Perform a crawl of the following URL : 
http://jerome.charron.free.fr/nutch/fake.zip
(fake.zip is a fake zip file, in fact it is a html one)
2. While crawling, you can see that the content-type returned by the 
server is application/zip 
3. But you can see that Nutch correctly guesses the content-type as
text/html (it uses the HtmlParser)
4. At this step, all is ok.
5. Then start your tomcat and try the following search : zip
6. You can see the fake.zip file in results. Click on details ; if the 
index-more plugin was activated then you can see that the stored content-type 
is application/zip and not text/html

What I suggest is simply to use the content-type guessed by nutch to
find which parser to use, instead of the one returned by the server.
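That preference could be sketched like this (a toy stand-in, not the actual Nutch parser-selection code):

```java
// Toy sketch of the suggestion above: prefer the guessed content-type
// over the server-advertised one when choosing a parser, falling back
// to the header value when no guess exists.
public class ParserChooser {
    public static String effectiveContentType(String headerType,
                                              String guessedType) {
        return (guessedType != null && !guessedType.isEmpty())
                ? guessedType : headerType;
    }

    public static void main(String[] args) {
        // Server says application/zip, sniffing says text/html.
        System.out.println(
            effectiveContentType("application/zip", "text/html")); // text/html
        System.out.println(
            effectiveContentType("application/zip", null)); // application/zip
    }
}
```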

I'm not sure that is the right thing.
If the site administrator did a poor job and a wrong media type is
advertised, it's the site's problem, and Nutch shouldn't be fixing it,
in my opinion.  Those sites would not work properly with browsers
anyway, and Nutch doesn't need to work properly with them either,
except that it should protect itself from crashing.  I tried to visit
your fake.zip page with IE and Firefox, and both faithfully trusted
the media type advertised by the server and asked me whether I wanted
to open it with WinZip or save it; there was no option to open it as
HTML.  Why should Nutch treat it as HTML? Sorry, but I don't see a
practical value here.

-kuro


RE: Content-Type inconsistency?

2006-04-17 Thread Teruhiko Kurosaka
Jérôme,
Are you mainly concerned with the charset in Content-Type?
Currently, what happens when Content-Type exists both in the HTTP
layer and in a META tag (if the content is HTML)?
How does Nutch guess the Content-Type, and when does it need to do that?
Is there a situation where the guessed content-type differs from the
content-type in the metadata?
If so, which class uses which?
-kuro

> -Original Message-
> From: Jérôme Charron [mailto:[EMAIL PROTECTED] 
> Sent: 2006-4-13 12:57
> To: nutch-dev@lucene.apache.org
> Subject: Re: Content-Type inconsistency?
> 
> I would like to come back on this issue:
> The Content object holds two content-types:
> 1. The raw content-type from the protocol layer (http header 
> in case of
> http) in the Content's metadata
> 2. The guessed content-type in a private field content-type.
> 
> When a ParseData object is created, it takes only the 
> Content's metadata.
> So, the ParseData can only access the raw content type and not the one
> guessed.
> 
> What I suggest is :
> 1. add a content-type parameter in the ParseData constructors (so that
> Parsers  can pass the guessed content-type to ParseData).
> 2. The Content object stores the guessed content-type in it's 
> metadata in a
> special attribute named for instance GUESSED_CONTENT_TYPE, so that the
> ParseData can access it
> 
> I think 1. is really cleanest way to implement this, but 
> there is a lot of
> code impacted => all the parsers.
> Solution 2. have no impact on APIs, so the code changes are 
> very small.
> 
> Suggestions? Comments?
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/
> 

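Jérôme's solution 2 above could be modeled roughly as follows; the concrete metadata key string and the map-based stand-in for Nutch's Content class are both assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Rough model of solution 2: keep both content-types in the metadata
// so ParseData can read the guessed one. The GUESSED_CONTENT_TYPE
// idea comes from the proposal; everything else is a stand-in for
// Nutch's Content/ParseData classes.
public class ContentModel {
    public static final String CONTENT_TYPE = "Content-Type";
    public static final String GUESSED_CONTENT_TYPE = "X-Guessed-Content-Type";

    private final Map<String, String> metadata = new HashMap<String, String>();

    public ContentModel(String rawType, String guessedType) {
        metadata.put(CONTENT_TYPE, rawType);             // from the HTTP header
        metadata.put(GUESSED_CONTENT_TYPE, guessedType); // from content sniffing
    }

    public String getMetadata(String key) {
        return metadata.get(key);
    }
}
```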

RE: Authentication / Content-type

2006-02-10 Thread Teruhiko Kurosaka
Sorry for a late response.
 
Do you mean there are two kinds of headers, one with a lowercase "t"
and the other with an uppercase "T"?  If you mean that, there are
more possibilities, such as "CONTENT-TYPE", "content-type", or even
"cONtenT-tYPe", because the HTTP spec says the header field names
are case-insensitive:

4.2 Message Headers

HTTP header fields,...  follow the same generic format as that given
in Section 3.1 of RFC 822 [9]. ...Field names are case-insensitive.


So the right way seems to change the getHeader method implementation
to compare names in a case-insensitive manner.
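A sketch of that fix, using a map that compares keys case-insensitively (a stand-in for the actual HttpResponse header storage, not Nutch code):

```java
import java.util.Map;
import java.util.TreeMap;

// Store headers in a map with a case-insensitive key comparator, as
// the case-insensitivity of HTTP field names (RFC 2616 sec. 4.2)
// requires.
public class Headers {
    private final Map<String, String> headers =
        new TreeMap<String, String>(String.CASE_INSENSITIVE_ORDER);

    public void add(String name, String value) {
        headers.put(name, value);
    }

    // Lookup succeeds regardless of the capitalization the server used.
    public String getHeader(String name) {
        return headers.get(name);
    }

    public static void main(String[] args) {
        Headers h = new Headers();
        h.add("Content-type", "text/html");
        System.out.println(h.getHeader("Content-Type")); // text/html
        System.out.println(h.getHeader("cONtenT-tYPe")); // text/html
    }
}
```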

Sorry if I missed your point.

-Kuro

> -Original Message-
> From: Thushara Wijeratna [mailto:[EMAIL PROTECTED] 
> Sent: 2006-1-19 14:08
> To: nutch-dev@lucene.apache.org
> Subject: Authentication / Content-type 
> 
> Hi,
> 
> I used nutch-0.7.1 to index an intranet. It is a really great tool,
> thanks for developing it! I had to hack something quick for
> Authentication (somehow couldn't get the crawler to accept the
> http.auth.basic.user etc). I also found an issue where parsing an html
> page returned an error "Content type is xml not html". Turns out that
> sometimes the string "Content-Type" is used instead of "Content-type".
> So I hacked HttpResponse.java - toContent method like this:
> 
>  
> 
> String contentType = getHeader("Content-type");
> 
> if (contentType == null) {
> 
> contentType = getHeader("Content-Type");
> 
> }
> 
> Just thought I'll share with you all.