Hi again,
Sorry for the slew of messages. I think I figured out the answer to
my question. It looks like the basic point of entry for all token
parsing, in the default config, is NutchAnalysis.jj, lines 112-193.
These token regular expressions (written in JavaCC's BNF-like
notation) seem to be the backbone for all the other parsing and
analyzing that's going on - please correct me if I'm wrong.
nutch-0.8.1/src/java/org/apache/nutch/analysis/NutchAnalysis.jj (lines 112-193):
TOKEN : { // token regular expressions
// basic word -- lowercase it
<WORD: ((<LETTER>|<DIGIT>|<WORD_PUNCT>)+ | <IRREGULAR_WORD>)>
{ matchedToken.image = matchedToken.image.toLowerCase(); }
// special handling for acronyms: U.S.A., I.B.M., etc: dots are removed
| <ACRONYM: <LETTER> "." (<LETTER> ".")+ >
{ // remove dots
for (int i = 0; i < image.length(); i++) {
if (image.charAt(i) == '.')
image.deleteCharAt(i--);
}
matchedToken.image = image.toString().toLowerCase();
}
// chinese, japanese and korean characters
| <SIGRAM: <CJK> >
// irregular words
| <#IRREGULAR_WORD: (<C_PLUS_PLUS>|<C_SHARP>)>
| <#C_PLUS_PLUS: ("C"|"c") "++" >
| <#C_SHARP: ("C"|"c") "#" >
// query syntax characters
| <PLUS: "+" >
| <MINUS: "-" >
| <QUOTE: "\"" >
| <COLON: ":" >
| <SLASH: "/" >
| <DOT: "." >
| <ATSIGN: "@" >
| <APOSTROPHE: "'" >
| <WHITE: ~[] > // treat unrecognized chars
// as whitespace
// primitive, non-token patterns
| <#WORD_PUNCT: ("_"|"&")> // allowed anywhere in words
| < #LETTER: // alphabets
[
"\u0041"-"\u005a",
"\u0061"-"\u007a",
"\u00c0"-"\u00d6",
"\u00d8"-"\u00f6",
"\u00f8"-"\u00ff",
"\u0100"-"\u1fff"
]
>
| <#CJK: // non-alphabets
[
"\u3040"-"\u318f",
"\u3300"-"\u337f",
"\u3400"-"\u3d2d",
"\u4e00"-"\u9fff",
"\uf900"-"\ufaff"
]
>
| < #DIGIT: // unicode digits
[
"\u0030"-"\u0039",
"\u0660"-"\u0669",
"\u06f0"-"\u06f9",
"\u0966"-"\u096f",
"\u09e6"-"\u09ef",
"\u0a66"-"\u0a6f",
"\u0ae6"-"\u0aef",
"\u0b66"-"\u0b6f",
"\u0be7"-"\u0bef",
"\u0c66"-"\u0c6f",
"\u0ce6"-"\u0cef",
"\u0d66"-"\u0d6f",
"\u0e50"-"\u0e59",
"\u0ed0"-"\u0ed9",
"\u1040"-"\u1049"
]
>
}
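
To make the rules above a bit more concrete, here is a rough Java
sketch of how I read the WORD, ACRONYM and SIGRAM definitions. It is
not Nutch code: the class name TokenSketch and the tokenize() method
are made up for illustration, and it uses java.util.regex rather than
the JavaCC-generated token manager. It also assumes that every
character not covered by those three rules (query syntax characters
and anything matched by WHITE) is simply skipped, and it lowercases
with Locale.ROOT where the grammar calls plain toLowerCase().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of the token rules; not part of Nutch. */
public class TokenSketch {
    // Ranges copied from the #LETTER definition in the grammar.
    private static final String LETTER =
        "\\u0041-\\u005a\\u0061-\\u007a\\u00c0-\\u00d6"
        + "\\u00d8-\\u00f6\\u00f8-\\u00ff\\u0100-\\u1fff";
    // Only ASCII digits are listed here: the other #DIGIT ranges
    // (U+0660 through U+1049) already fall inside U+0100 to U+1FFF above.
    private static final String DIGIT = "0-9";
    // Ranges copied from the #CJK definition in the grammar.
    private static final String CJK =
        "\\u3040-\\u318f\\u3300-\\u337f\\u3400-\\u3d2d"
        + "\\u4e00-\\u9fff\\uf900-\\ufaff";

    // JavaCC prefers the longest match. Trying ACRONYM first, then the
    // irregular words (C++, C#), then ordinary word runs gives the same
    // result with a first-match regex alternation for these rules.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<acronym>[" + LETTER + "]\\.(?:[" + LETTER + "]\\.)+)"
        + "|(?<word>[Cc]\\+\\+|[Cc]#|[" + LETTER + DIGIT + "_&]+)"
        + "|(?<sigram>[" + CJK + "])");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        // find() skips characters that match none of the three rules.
        while (m.find()) {
            if (m.group("acronym") != null) {
                // ACRONYM: remove the dots, then lowercase.
                tokens.add(m.group("acronym").replace(".", "")
                            .toLowerCase(Locale.ROOT));
            } else if (m.group("word") != null) {
                // WORD: lowercase only.
                tokens.add(m.group("word").toLowerCase(Locale.ROOT));
            } else {
                // SIGRAM: each CJK character is its own token, unchanged.
                tokens.add(m.group("sigram"));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The U.S.A. uses C++ and AT&T"));
    }
}
```

Running main should print [the, usa, uses, c++, and, at&t]: the
acronym loses its dots, "C++" survives as an irregular word, and
"AT&T" stays in one piece because "&" is in WORD_PUNCT.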
On 11/4/06, Josef Novak <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I was wondering if anyone knew of a resource, or could concisely
> explain, how the javacc-generated default nutch analyzer goes about
> tokenizing text. What I'm really looking for is a plain, nuts'n'bolts
> explanation of what gets tokenized, and what doesn't. I searched the
> web for a while but found no good resource. (I'm not looking for the
> Javadocs.)
>
> Unfortunately, NutchAnalysis.jj, and NutchAnalysis.java are somewhat
> opaque to me, and documentation in these files is minimal.
>
>
>
> Files are located here (I'm using v.0.8.1):
> nutch-0.8.1/src/java/org/apache/nutch/analysis/
>
> Any input will be greatly appreciated!
>
> joe
>
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general