RE: implement thai language analyzer in nutch
Oh, Thai words are not space delimited? OK, in that case, you'd need to study how ThaiAnalyzer works and then modify the rules in NutchAnalysis.jj (if you are going to use the web search GUI from Nutch). This is because search expressions are parsed by the parser generated from NutchAnalysis.jj before each term is handed to the language-specific analyzer, and currently, if a character belongs to the CJK category, each character is treated as though it were a word. If ThaiAnalyzer does not do the same, you can index the Thai docs, but you won't be able to find any doc unless the search term is a single Unicode character.

-kuro

> -----Original Message-----
> From: sanjeev [mailto:[EMAIL PROTECTED]
> Sent: 2006-11-08 19:28
> To: nutch-dev@lucene.apache.org
> Subject: Re: implement thai language analyzer in nutch
>
> I need a Thai Analyzer for Nutch. I want the crawler to be intelligent
> enough to split Thai words correctly, since Thai doesn't have spaces
> between words. :-(
>
> ogjunk-nutch wrote:
> >
> > Regarding Thai, there is a Thai Analyzer in Lucene already:
> >
> > $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
> > total 24
> > drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/
> > -rw-rw-r-- 1 otis otis 1528 Jun 5 14:27 ThaiAnalyzer.java
> > -rw-rw-r-- 1 otis otis 2437 Jun 5 14:27 ThaiWordFilter.java
> >
> > Otis
> >
> > ----- Original Message -----
> > From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
> > To: sanjeev <[EMAIL PROTECTED]>; nutch-dev@lucene.apache.org
> > Sent: Wednesday, November 8, 2006 2:16:38 PM
> > Subject: RE: implement thai language analyzer in nutch
> >
> > Sanjay,
> > I don't think you should follow the Chinese example and extend the
> > CJK range. This was needed because Chinese and Japanese don't use
> > space to separate words. I believe Thai uses spaces, right? If so,
> > you should extend the LETTER range to include Thai characters
> > rather than CJK.
> >
> > Another place you would need to change is the LanguageIdentifier.
> > You would either train it, or implement some hack, in order for it
> > to be able to detect Thai-language documents that are not HTML with
> > a lang="th" attribute.
> >
> > -kuro
>
> --
> View this message in context:
> http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7251826
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
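The change being discussed — keeping Thai text in the LETTER class rather than the single-character CJK/SIGRAM treatment — might look roughly like this in JavaCC grammar syntax. This is only a sketch, not the actual contents of NutchAnalysis.jj: the token names and the Thai character range (U+0E01–U+0E5B) shown here are assumptions.

```
// Hypothetical excerpt of a NutchAnalysis.jj token definition.
// Adding the Thai block to LETTER (instead of the CJK class) keeps a
// run of Thai text together as one token, to be split later by a
// Thai-aware analyzer such as Lucene's ThaiAnalyzer.
TOKEN : {
  <#LETTER:
    [ "a"-"z", "A"-"Z",
      "\u0e01"-"\u0e5b"   // Thai letters, vowels, and tone marks (assumed range)
    ]
  >
}
```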
RE: implement thai language analyzer in nutch
Sanjay,

I don't think you should follow the Chinese example and extend the CJK range. This was needed because Chinese and Japanese don't use space to separate words. I believe Thai uses spaces, right? If so, you should extend the LETTER range to include Thai characters rather than CJK.

Another place you would need to change is the LanguageIdentifier. You would either train it, or implement some hack, in order for it to be able to detect Thai-language documents that are not HTML with a lang="th" attribute.

-kuro
RE: What javacc options should I use to compile NutchAnalysis.jj?
Please disregard this posting. It was my oversight; build.xml does have a javacc rule. So is this just a version difference of javacc?

-kuro

> -----Original Message-----
> From: Teruhiko Kurosaka
> Sent: 2006-10-18 17:42
> To: nutch-dev@lucene.apache.org
> Cc: Teruhiko Kurosaka
> Subject: What javacc options should I use to compile NutchAnalysis.jj?
>
> I am trying to modify the JavaCC rules in NutchAnalysis.jj.
> As a preparation, I ran javacc (ver 3.2) to "compile"
> NutchAnalysis.jj of Nutch 0.8, but the generated Java files are a
> little bit different than those found in the src/java directory.
> Am I supposed to use some javacc command line options?
>
> BTW, shouldn't build.xml have rules that can build the .java files
> from the .jj file, to be complete?
>
> Below are the diffs:
>
> $ diff -bw CharStream.java /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
> 19c19
> < public interface CharStream {
> ---
> > interface CharStream {
>
> $ diff -bw NutchAnalysis.java /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
> 911a912
> > try {
> 923a925
> > } catch(LookaheadSuccess ls) { }
>
> $ diff -bw NutchAnalysisTokenManager.java /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
> 319,320c319
> < public NutchAnalysisTokenManager(CharStream stream)
> < {
> ---
> > public NutchAnalysisTokenManager(CharStream stream){
> 323,324c322
> < public NutchAnalysisTokenManager(CharStream stream, int lexState)
> < {
> ---
> > public NutchAnalysisTokenManager(CharStream stream, int lexState){
> 442,443c440
> < image = new StringBuffer(new String(input_stream.GetSuffix(jjimageLen + (lengthOfMatch = jjmatchedPos + 1))));
> < else
> ---
> > image = new StringBuffer();
> 449,450c446
> < image = new StringBuffer(new String(input_stream.GetSuffix(jjimageLen + (lengthOfMatch = jjmatchedPos + 1))));
> < else
> ---
> > image = new StringBuffer();
>
> $ diff -bw ParseException.java /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
> 13c13
> < public class ParseException extends Exception {
> ---
> > class ParseException extends java.io.IOException {
>
> $ diff -bw Token.java /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
> 8c8
> < public class Token {
> ---
> > class Token {
>
> $ diff -bw TokenMgrError.java /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
> 4c4
> < public class TokenMgrError extends Error
> ---
> > class TokenMgrError extends Error
>
> -kuro
RE: I modify NutchAnalysis.jj and NutchDocumentTokenizer.java to let nutch support chinese word.
> From: heack [mailto:[EMAIL PROTECTED]
> Sent: 2006-9-13 7:03
> To: nutch-dev@lucene.apache.org
> Subject: I modify NutchAnalysis.jj and NutchDocumentTokenizer.java to
> let nutch support chinese word.
>
> After that I test it, and I use luke to see the index. The word is
> parsed in my way, but I cannot search any results if my keyword is
> chinese, but not english words.

Heack, I was having the same experience. I guessed that NutchAnalysis.jj needs to be modified so that it does not break CJK words into individual characters — that is, getting rid of SIGRAM and making the CJK characters a part of LETTER. Is this what you did, and you didn't get the result you wanted?

-kuro
What javacc options should I use to compile NutchAnalysis.jj?
I am trying to modify the JavaCC rules in NutchAnalysis.jj. As a preparation, I ran javacc (ver 3.2) to "compile" NutchAnalysis.jj of Nutch 0.8, but the generated Java files are a little bit different than those found in the src/java directory. Am I supposed to use some javacc command line options?

BTW, shouldn't build.xml have rules that can build the .java files from the .jj file, to be complete?

Below are the diffs:

$ diff -bw CharStream.java /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
19c19
< public interface CharStream {
---
> interface CharStream {

$ diff -bw NutchAnalysis.java /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
911a912
> try {
923a925
> } catch(LookaheadSuccess ls) { }

$ diff -bw NutchAnalysisTokenManager.java /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
319,320c319
< public NutchAnalysisTokenManager(CharStream stream)
< {
---
> public NutchAnalysisTokenManager(CharStream stream){
323,324c322
< public NutchAnalysisTokenManager(CharStream stream, int lexState)
< {
---
> public NutchAnalysisTokenManager(CharStream stream, int lexState){
442,443c440
< image = new StringBuffer(new String(input_stream.GetSuffix(jjimageLen + (lengthOfMatch = jjmatchedPos + 1))));
< else
---
> image = new StringBuffer();
449,450c446
< image = new StringBuffer(new String(input_stream.GetSuffix(jjimageLen + (lengthOfMatch = jjmatchedPos + 1))));
< else
---
> image = new StringBuffer();

$ diff -bw ParseException.java /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
13c13
< public class ParseException extends Exception {
---
> class ParseException extends java.io.IOException {

$ diff -bw Token.java /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
8c8
< public class Token {
---
> class Token {

$ diff -bw TokenMgrError.java /c/opt/nutch-0.8/src/java/org/apache/nutch/analysis
4c4
< public class TokenMgrError extends Error
---
> class TokenMgrError extends Error

-kuro
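For reference, an Ant rule that regenerates the parser from the .jj file could be sketched as below. This is a hypothetical build.xml fragment, not Nutch's actual target: the target name, paths, and the javacc.home property are assumptions.

```xml
<!-- Hypothetical Ant fragment: regenerate parser sources from the JavaCC
     grammar. Requires javacc.home to point at a JavaCC installation. -->
<target name="javacc" description="Regenerate parser from NutchAnalysis.jj">
  <javacc target="src/java/org/apache/nutch/analysis/NutchAnalysis.jj"
          outputdirectory="src/java/org/apache/nutch/analysis"
          javacchome="${javacc.home}"
          static="false"/>
</target>
```

Note also that the visibility differences in the diffs (public vs. package-private classes) could come from post-generation hand edits to the checked-in sources rather than from a javacc version difference.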
Why does "nutch plugin" say the plugin is "not present or inactive"?
I developed a plugin and tried to run it using "nutch plugin " of Nutch 0.8, but it says my plugin is not present or inactive. I tried the "nutch plugin" command with a known plugin, "language-identifier", as:

./nutch plugin languageidentifier org.apache.nutch.analysis.lang.NGramProfile

and got the same result:

Plugin 'language-identifier' not present or inactive.

This log message suggests that the plugin is recognized by the nutch command:

2006-09-01 17:05:46,772 DEBUG plugin.PluginRepository (PluginManifestParser.java:parsePluginFolder(93)) - parsing: C:\opt\nutch-0.8\plugins\language-identifier\plugin.xml

Is the "nutch plugin" command working for any of you?

-kuro
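One common cause of "not present or inactive" is that the plugin id does not match the plugin.includes filter, which gates activation even for plugins whose plugin.xml parses correctly. If that is what is happening here, the fix would be an override along these lines in conf/nutch-site.xml — the regex shown is illustrative, not the stock 0.8 default:

```xml
<!-- Illustrative nutch-site.xml fragment: a plugin id must match this
     regex, or the plugin stays inactive even though its plugin.xml was
     parsed by the PluginRepository. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier</value>
</property>
```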
Why are lib- plugins needed?
Hello,

I see many plugins named lib- which are wrappers around other non-plugin .jar files. For example, the analysis-de plugin uses the lib-lucene-analyzers plugin, which in turn references the jar file that contains GermanAnalyzer. What is the reason for this indirection? Can't the plugins called by Nutch reference non-plugin .jar files directly?

-kuro
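For context, the indirection shows up in the dependent plugin's plugin.xml as a plugin-level import rather than a direct classpath reference. The fragment below is abridged and hypothetical; the real descriptor's ids and attributes may differ.

```xml
<!-- Abridged, hypothetical plugin.xml for analysis-de: the dependency is
     declared on the lib-lucene-analyzers plugin, which exports the
     lucene-analyzers jar, rather than on the jar itself. -->
<plugin id="analysis-de" name="German Analysis Plug-in" version="1.0.0">
  <requires>
    <import plugin="lib-lucene-analyzers"/>
  </requires>
</plugin>
```

My understanding of the rationale: each plugin gets its own classloader and can only see classes exported by the plugins it imports, so wrapping a third-party jar in a lib- plugin lets several plugins share a single exported copy instead of each bundling the jar.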
RE: 0.8 release
May I suggest someone take a look at NUTCH-266 before releasing 0.8? The Nutch build as of half a month ago was not working for me and another person.

-kuro

> -----Original Message-----
> From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
> Sent: 2006-7-05 11:53
> To: nutch-dev@lucene.apache.org
> Subject: Re: 0.8 release
>
> +1, but I really would love to see NUTCH-293 as part of nutch .8
> since this is all about being more polite.
> Thanks.
> Stefan
>
> On 05.07.2006, at 03:46, Doug Cutting wrote:
>
> > +1
> >
> > Piotr Kosiorowski wrote:
> >> +1.
> >> P.
> >> Andrzej Bialecki wrote:
> >>> Sami Siren wrote:
> >>>> How would folks feel about releasing 0.8 now? There have been
> >>>> quite a lot of improvements/new features since the 0.7 series,
> >>>> and I strongly feel that we should push the first 0.8 series
> >>>> release (alpha/beta) out the door now. It would IMO lower the
> >>>> barrier for first timers to try the 0.8 series, and that would
> >>>> give us more feedback about the overall quality.
> >>>
> >>> Definitely +1. Let's do some testing, however, after the upgrade
> >>> to hadoop 0.3.2 - hadoop had many, many changes, so we just need
> >>> to make sure it's stable when used with Nutch ...
> >>>
> >>> We should also check JIRA and apply any trivial fixes before the
> >>> release.
> >>>
> >>>> If there is a consensus about this I can volunteer to be the RM.
> >>>
> >>> That would be great, thanks!
RE: [jira] Commented: (NUTCH-266) hadoop bug when doing updatedb
Thank you for your reply, Sami.

> > I do not intend to run hadoop at all, so this hadoop-site.xml is empty.
...
> You should at least set values for 'mapred.system.dir' and
> 'mapred.local.dir' and point them to a dir that has enough space
> available (I think they default to under /tmp, at least on my system,
> which is far too small for larger jobs)

OK, I just copied the definitions for these properties from hadoop-default.xml and prepended "C:" to each value so that they really refer to C:\tmp. C: has 65 GB of free space, and this practice crawl crawls a directory that contains 20 documents with a total byte count of less than 10 MB, so I figure C: has more than adequate free space. But I still get the same error:

2006-06-22 10:54:01,548 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(119)) - job_x5jmir
java.io.IOException: Couldn't rename C:/tmp/hadoop/mapred/local/map_ye7oza/part-0.out
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:102)
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:342)
        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:55)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)

After nutch exited, I checked the directory; C:/tmp/hadoop/mapred/local/map_ye7oza/ does exist, but there was no file called part-0.out. The directory was empty.

I'd appreciate any other suggestions you might have.

-kuro
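The overrides described above would look something like this in hadoop-site.xml. The exact values are assumptions (the hadoop-default.xml defaults with "C:" prepended, as described); only the two properties mentioned are shown.

```xml
<property>
  <name>mapred.system.dir</name>
  <value>C:/tmp/hadoop/mapred/system</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>C:/tmp/hadoop/mapred/local</value>
</property>
```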
RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
How about introducing these changes in an effort to force nutch admins to properly edit the bot identity strings?

1. Add the http.agent.* entries to nutch-site.xml with the value "EDITME". The description should clearly state that these values *must* be edited to reflect the true identity of the site.

2. Add a piece of code to the HTTP crawler that checks the configuration. If any of the http.agent.* entries are EDITME, the code would log the error and exit.

-kuro

p.s. I subscribe to the digest version of the ML. If the same or a better idea has been raised already, please ignore this.
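Item 2 could be sketched as below. This is not Nutch code: a plain Map stands in for the Nutch Configuration class, and only the property names follow the http.agent.* convention from nutch-default.xml; the checking logic is the point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Sketch of the proposed startup check. A plain Map stands in for the
 * Nutch Configuration object here.
 */
public class AgentCheck {

  static final String PLACEHOLDER = "EDITME";

  /** Returns the http.agent.* keys whose value was never edited. */
  static List<String> uneditedAgentKeys(Map<String, String> conf) {
    List<String> bad = new ArrayList<String>();
    for (Map.Entry<String, String> e : conf.entrySet()) {
      if (e.getKey().startsWith("http.agent.")
          && PLACEHOLDER.equals(e.getValue())) {
        bad.add(e.getKey());
      }
    }
    return bad;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new TreeMap<String, String>();
    conf.put("http.agent.name", "EDITME");
    conf.put("http.agent.url", "http://example.com/bot.html");
    List<String> bad = uneditedAgentKeys(conf);
    if (!bad.isEmpty()) {
      // A real crawler would log this and exit instead of printing.
      System.out.println("Unedited agent properties: " + bad);
    }
  }
}
```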
i18n in nutch home page is a misnomer
Dear Webmaster of http://lucene.apache.org/nutch/,

In the menu bar, under the Documentation heading, there is an item called "i18n". The web page linked from "i18n" talks about how to translate (localize) the search GUI. This is not i18n (internationalization), which should mean designing and implementing a program so that it works with the different character encodings of the world and can be localized. Localization tasks should not be confused with internationalization tasks.

I suggest "i18n" be renamed to "l10n", short for localization.

-kuro
how to turn on logging, exercise the analyzer, tips on debugging plugins?
Nutch developers,

I'm writing a language analyzer and have three questions. Any pointers will be appreciated.

1. How do I turn on the logging facility?
2. Is there an easy way to run just an analyzer plugin, rather than running "nutch crawl"?
3. How do I run a debugger (Eclipse, in my case) over plugin code that would be loaded later? I.e., where do I set the breakpoint before the plugin gets loaded?

Thank you in advance.

-kuro
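On question 1: Nutch 0.8 logs through commons-logging, typically backed by log4j, so log levels can be raised in conf/log4j.properties. A sketch, with illustrative logger names:

```
# Illustrative log4j.properties fragment: raise the analysis package to
# DEBUG while keeping everything else at INFO on the console.
log4j.rootLogger=INFO,stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %-5p %c{2} - %m%n
log4j.logger.org.apache.nutch.analysis=DEBUG
```

On question 3, one common approach is to start the JVM with remote-debug options (e.g. -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000 added to the java invocation in the nutch script) and attach Eclipse; breakpoints set in the plugin source are then hit once the plugin classloader loads the class.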
Do analyzer plugins have access to the Configuration?
Jérôme, or anybody familiar with the language plugin architecture,

I am writing a language analyzer plugin. This plugin has configurable parameters, which I am hoping I can add to nutch-site.xml. But the German and French plugin examples don't access the Configuration object. Does the current analyzer plugin architecture allow each plugin implementation to access the Configuration object? If not, what would it take to allow such access? It would be best if it were allowed at plugin class loading time and at instantiation time separately.

-kuro
Status of language plugin
Hello Jérôme,

Because of other issues at work, I was away from Nutch. Now I'm back, and I see you are making progress according to your notes in JIRA. Is there an API doc or design doc that I can read to understand where you are? Is the language plugin architecture already in the main trunk?

Here are some issues that I've been worried about:

* Support for multilingual plugins?
** If one plugin can support more than one language, the language needs to be passed at each analysis.
** This assumes language identification is done before analysis. Is that the case?
* Support for a different analyzer for query than for index
** An analyzer for query may need to behave differently than an analyzer for indexing. Can your architecture specify different analyzers for indexing and query?

Thanks.

-kuro
RE: Content-Type inconsistency?
Jérôme,

>> Why should Nutch treat it as HTML?
>
> Simply because it is an HTML file, with a strange name, of course,
> but it is an HTML file.
> My example is a kind of "caricature". But some more real cases could
> be: an HTML file with a text/plain content-type, or with a text/xml

These cases don't sound "real" to me either. In the first case (text/plain), the page would be displayed with all HTML tags visible; only very patient readers would try to decipher it. In the second case (text/xml), the document would most likely not be displayed at all, because most HTML documents are not well formed as XML. The site admins, not Nutch, must fix this inconsistency; I don't think Nutch needs to be "smarter" than browsers. It's actually better for Nutch to miss these pages. I don't want to see a hit that leads me to a page that cannot be viewed.

-kuro
RE: Content-Type inconsistency?
Jérôme,

Thank you for the explanation.

Here is an easy way to reproduce what I mean by content-type inconsistency:

1. Perform a crawl of the following URL: http://jerome.charron.free.fr/nutch/fake.zip (fake.zip is a fake zip file; in fact it is an HTML one)
2. While crawling, you can see that the content-type returned by the server is application/zip
3. But you can see that Nutch correctly guesses the content-type to be text/html (it uses the HtmlParser)
4. At this step, all is ok.
5. Then start your tomcat and try the following search: zip
6. You can see the fake.zip file in the results. Click on details; if the index-more plugin was activated, then you can see that the stored content-type is application/zip and not text/html

What I suggest is simply to use the content-type guessed by nutch to find which parser to use, instead of the one returned by the server.

I'm not sure that is the right thing. If the site administrator did a poor job and a wrong media type is advertised, it's the site's problem, and Nutch shouldn't be fixing it, in my opinion. Those sites would not work properly with browsers anyway, and Nutch doesn't need to work properly either, except that it should protect itself from crashing. I tried to visit your fake.zip page with IE and Firefox, and both faithfully trusted the media type as advertised by the server and asked me if I wanted to open it with WinZip or save it; there was no option to open it as HTML. Why should Nutch treat it as HTML? Sorry, but I don't see a practical value here.

-kuro
RE: Content-Type inconsistency?
Jérôme,

Are you mainly concerned with the charset in Content-Type? Currently, what happens when Content-Type exists both in the HTTP layer and in a META tag (if the content is HTML)? How does Nutch guess the Content-Type, and when does it need to do that? Is there a situation where the guessed content-type differs from the content-type in the metadata? If so, what class uses which?

-kuro

> -----Original Message-----
> From: Jérôme Charron [mailto:[EMAIL PROTECTED]
> Sent: 2006-4-13 12:57
> To: nutch-dev@lucene.apache.org
> Subject: Re: Content-Type inconsistency?
>
> I would like to come back on this issue:
> The Content object holds two content-types:
> 1. The raw content-type from the protocol layer (http header in case
> of http) in the Content's metadata
> 2. The guessed content-type in a private field content-type.
>
> When a ParseData object is created, it takes only the Content's
> metadata. So, the ParseData can only access the raw content type and
> not the one guessed.
>
> What I suggest is:
> 1. add a content-type parameter in the ParseData constructors (so
> that Parsers can pass the guessed content-type to ParseData).
> 2. The Content object stores the guessed content-type in its metadata
> in a special attribute named for instance GUESSED_CONTENT_TYPE, so
> that the ParseData can access it
>
> I think 1. is really the cleanest way to implement this, but there is
> a lot of code impacted => all the parsers.
> Solution 2. has no impact on APIs, so the code changes are very small.
>
> Suggestions? Comments?
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
RE: Authentication / Content-type
Sorry for a late response.

Do you mean there are two kinds of headers, one with a lowercase "t" and the other with an uppercase "T"? If so, there are more possibilities, such as "CONTENT-TYPE", "content-type", or even "cONtenT-tYPe", because the HTTP spec says header field names are case-insensitive:

  4.2 Message Headers
  HTTP header fields, ... follow the same generic format as that given
  in Section 3.1 of RFC 822 [9]. ... Field names are case-insensitive.

So the right fix seems to be to change the getHeader method implementation to compare names in a case-insensitive manner. Sorry if I missed your point.

-Kuro

> -----Original Message-----
> From: Thushara Wijeratna [mailto:[EMAIL PROTECTED]
> Sent: 2006-1-19 14:08
> To: nutch-dev@lucene.apache.org
> Subject: Authentication / Content-type
>
> Hi,
>
> I used nutch-0.7.1 to index an intranet. It is a really great tool,
> thanks for developing it! I had to hack something quick for
> Authentication (somehow couldn't get the crawler to accept the
> http.auth.basic.user etc). I also found an issue where parsing an
> html page returned an error "Content type is xml not html". Turns
> out that sometimes the string "Content-Type" is used instead of
> "Content-type". So I hacked HttpResponse.java - toContent method
> like this:
>
> String contentType = getHeader("Content-type");
> if (contentType == null) {
>   contentType = getHeader("Content-Type");
> }
>
> Just thought I'll share with you all.
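A minimal sketch of that case-insensitive lookup, with a plain Map standing in for HttpResponse's real header storage (the class and field names here are hypothetical, not Nutch's actual code):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of a case-insensitive header lookup. A plain Map stands in
 * for HttpResponse's real header storage.
 */
public class Headers {

  private final Map<String, String> headers = new HashMap<String, String>();

  void add(String name, String value) {
    headers.put(name, value);
  }

  /** RFC 822 / HTTP header names are case-insensitive, so compare ignoring case. */
  String getHeader(String name) {
    for (Map.Entry<String, String> e : headers.entrySet()) {
      if (e.getKey().equalsIgnoreCase(name)) {
        return e.getValue();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    Headers h = new Headers();
    h.add("Content-type", "text/html");
    // Every casing of the name now resolves to the same header value.
    System.out.println(h.getHeader("Content-Type"));   // text/html
    System.out.println(h.getHeader("cONtenT-tYPe"));   // text/html
  }
}
```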