Hi again,

  Sorry for the slew of messages.  I think I figured out the answer to
my question.  It looks like the basic point of entry for all token
parsing, in the default config, is in NutchAnalysis.jj, lines 112-193.
These token regular expressions (JavaCC's TOKEN specifications) seem
to be the backbone for all the other parsing and analyzing that's
going on - please correct me if I'm wrong.


nutch-0.8.1/src/java/org/apache/nutch/analysis/NutchAnalysis.jj (lines 112-193):

TOKEN : {       // token regular expressions

  // basic word -- lowercase it
<WORD: ((<LETTER>|<DIGIT>|<WORD_PUNCT>)+ | <IRREGULAR_WORD>)>
  { matchedToken.image = matchedToken.image.toLowerCase(); }

  // special handling for acronyms: U.S.A., I.B.M., etc: dots are removed
| <ACRONYM: <LETTER> "." (<LETTER> ".")+ >
    {                                             // remove dots
      for (int i = 0; i < image.length(); i++) {
        if (image.charAt(i) == '.')
          image.deleteCharAt(i--);
      }
      matchedToken.image = image.toString().toLowerCase();
    }

  // chinese, japanese and korean characters
| <SIGRAM: <CJK> >

   // irregular words
| <#IRREGULAR_WORD: (<C_PLUS_PLUS>|<C_SHARP>)>
| <#C_PLUS_PLUS: ("C"|"c") "++" >
| <#C_SHARP: ("C"|"c") "#" >

  // query syntax characters
| <PLUS: "+" >
| <MINUS: "-" >
| <QUOTE: "\"" >
| <COLON: ":" >
| <SLASH: "/" >
| <DOT: "." >
| <ATSIGN: "@" >
| <APOSTROPHE: "'" >

| <WHITE: ~[] >                                   // treat unrecognized chars
                                                  // as whitespace
// primitive, non-token patterns

| <#WORD_PUNCT: ("_"|"&")>                        // allowed anywhere in words

| < #LETTER:                                      // alphabets
    [
        "\u0041"-"\u005a",
        "\u0061"-"\u007a",
        "\u00c0"-"\u00d6",
        "\u00d8"-"\u00f6",
        "\u00f8"-"\u00ff",
        "\u0100"-"\u1fff"
    ]
    >

|  <#CJK:                                        // non-alphabets
      [
       "\u3040"-"\u318f",
       "\u3300"-"\u337f",
       "\u3400"-"\u3d2d",
       "\u4e00"-"\u9fff",
       "\uf900"-"\ufaff"
      ]
    >

| < #DIGIT:                                       // unicode digits
      [
       "\u0030"-"\u0039",
       "\u0660"-"\u0669",
       "\u06f0"-"\u06f9",
       "\u0966"-"\u096f",
       "\u09e6"-"\u09ef",
       "\u0a66"-"\u0a6f",
       "\u0ae6"-"\u0aef",
       "\u0b66"-"\u0b6f",
       "\u0be7"-"\u0bef",
       "\u0c66"-"\u0c6f",
       "\u0ce6"-"\u0cef",
       "\u0d66"-"\u0d6f",
       "\u0e50"-"\u0e59",
       "\u0ed0"-"\u0ed9",
       "\u1040"-"\u1049"
      ]
  >

}
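For anyone else trying to trace this, here is a rough illustration of
what the WORD and ACRONYM rules above do to input text.  This is a
hypothetical sketch using plain java.util.regex, NOT the
JavaCC-generated scanner, and it skips the CJK, C++/C#, and
query-syntax rules - it only mimics the lowercasing and
acronym-dot-stripping behavior:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch (not Nutch code): approximates the WORD and
// ACRONYM token rules with regular expressions.
public class TokenSketch {
    // ACRONYM: a letter followed by one or more dot-terminated letters,
    // e.g. "U.S.A." - mirrors <LETTER> "." (<LETTER> ".")+
    static final Pattern ACRONYM = Pattern.compile("\\p{L}\\.(?:\\p{L}\\.)+");
    // WORD: letters, digits, or the WORD_PUNCT characters "_" and "&"
    static final Pattern WORD = Pattern.compile("[\\p{L}\\p{Nd}_&]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Try ACRONYM before WORD, like the grammar's longest-match order
        Matcher m = Pattern.compile(ACRONYM.pattern() + "|" + WORD.pattern())
                           .matcher(text);
        while (m.find()) {
            String tok = m.group();
            if (tok.indexOf('.') >= 0) {
                tok = tok.replace(".", "");   // acronym: remove the dots
            }
            tokens.add(tok.toLowerCase());    // every token is lowercased
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "U.S.A." becomes "usa"; "AT&T" survives because "&" is WORD_PUNCT
        System.out.println(tokenize("The U.S.A. has AT&T offices"));
    }
}
```

Anything not matched by one of these patterns (the WHITE rule's ~[])
is effectively treated as whitespace, which is why punctuation like
commas simply disappears between tokens.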

On 11/4/06, Josef Novak <[EMAIL PROTECTED]> wrote:
> Hi,
>
>   I was wondering if anyone knew of a resource, or could concisely
> explain, how the javacc-generated default nutch analyzer goes about
> tokenizing text.  What I'm really looking for is a plain, nuts'n'bolts
> explanation of what gets tokenized, and what doesn't.  I searched the
> web for a while but found no good resource.  (I'm not looking for the
> JAVA docs)
>
>   Unfortunately, NutchAnalysis.jj, and NutchAnalysis.java are somewhat
> opaque to me, and documentation in these files is minimal.
>
>
>
>   Files are located here (I'm using v.0.8.1):
>   nutch-0.8.1/src/java/org/apache/nutch/analysis/
>
>   Any input will be greatly appreciated!
>
>           joe
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general