[ 
https://issues.apache.org/jira/browse/CONNECTORS-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549428#comment-13549428
 ] 

Karl Wright edited comment on CONNECTORS-601 at 1/10/13 8:29 AM:
-----------------------------------------------------------------

Odd, I crawled your site just fine here.

Can you check that you have the new code for isStrange()?

{code}
  /** Check if character is not typical ASCII or utf-8. */
  protected static boolean isStrange(byte x)
  {
    return (x < 32) && (!isWhiteSpace(x));
  }
{code}

This new isStrange() should work well for both UTF-8 and Shift_JIS.  If you are 
getting the same number as before, I suspect you are not running the new code.
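
One way to check which behavior is in effect is to measure the strange-byte ratio on a small sample. The following is only a sketch, not code from ManifoldCF: the isWhiteSpace() stand-in, the class name, and the isStrangeUnsigned() variant are all assumptions made for illustration. It also makes one Java detail visible: byte is signed, so bytes 0x80 to 0xFF are negative and satisfy x < 32.

```java
class StrangeRatioCheck {
  // Assumption: isWhiteSpace() is not shown in this thread; this stand-in
  // accepts tab, LF, VT, FF, CR and space.
  static boolean isWhiteSpace(byte x) {
    return (x >= 9 && x <= 13) || x == 32;
  }

  // isStrange() exactly as quoted above.
  static boolean isStrange(byte x) {
    return (x < 32) && (!isWhiteSpace(x));
  }

  // Hypothetical variant (not from the thread): compare the unsigned value,
  // so bytes 0x80..0xFF are not treated as being below 32.
  static boolean isStrangeUnsigned(byte x) {
    int v = x & 0xFF;
    return (v < 32) && (!isWhiteSpace(x));
  }

  // Fraction of strange bytes; isText() compares this value against 0.30.
  static double strangeRatio(byte[] chunk, boolean unsigned) {
    if (chunk.length == 0) return 0.0;
    int count = 0;
    for (byte x : chunk) {
      if (unsigned ? isStrangeUnsigned(x) : isStrange(x)) count++;
    }
    return ((double) count) / ((double) chunk.length);
  }

  public static void main(String[] args) {
    // Three Japanese characters written as Unicode escapes, encoded as UTF-8.
    byte[] japanese = "\u65E5\u672C\u8A9E".getBytes(java.nio.charset.StandardCharsets.UTF_8);
    byte[] ascii = "hello world".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
    System.out.println("ascii, quoted:       " + strangeRatio(ascii, false));
    System.out.println("japanese, quoted:    " + strangeRatio(japanese, false));
    System.out.println("japanese, unsigned:  " + strangeRatio(japanese, true));
  }
}
```

On this sample the quoted version counts every UTF-8 byte of the Japanese text as strange, while the masked variant counts none. If the numbers do not change after updating, signedness may be one thing worth ruling out.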


                


                  
> make the thresholds of isText() input-able
> ------------------------------------------
>
>                 Key: CONNECTORS-601
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-601
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.0.1
>            Reporter: Shinichiro Abe
>            Assignee: Karl Wright
>            Priority: Minor
>
> Currently the threshold used by isText() is 0.30 by default.
> This value is too strict for Japanese sites, because those sites often 
> contain few ASCII characters.
> As a result, some sites are judged as not-text, and MCF can't extract links 
> from those documents.
> I'd like to make this value configurable on the Repository connection. 
> I don't have a patch yet.
> {code:title=WebcrawlerConnector.java|borderStyle=solid}
>   /** Test to see if a document is text or not.  The first n bytes are passed
>   * in, and this code returns "true" if it thinks they represent text.  The code
>   * has been lifted algorithmically from products/Sharecrawler/Fingerprinter.pas,
>   * which was based on "perldoc -f -T".
>   */
>   protected static boolean isText(byte[] beginChunk, int chunkLength)
>   {
>     if (chunkLength == 0)
>       return true;
>     int i = 0;
>     int count = 0;
>     while (i < chunkLength)
>     {
>       byte x = beginChunk[i++];
>       if (x == 0)
>         return false;
>       if (isStrange(x))
>         count++;
>     }
>     return ((double)count)/((double)chunkLength) < 0.30;
>   }
>   /** Check if character is not typical ASCII. */
>   protected static boolean isStrange(byte x)
>   {
>     return (x > 127 || x < 32) && (!isWhiteSpace(x));
>   }
> {code} 
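
A minimal sketch of what a configurable threshold might look like, assuming the value is simply passed in as an argument. The threshold parameter, the class name, and the isWhiteSpace() stand-in are hypothetical; how the value would actually be read from the Repository connection is not covered here.

```java
class ConfigurableIsText {
  // Assumption: isWhiteSpace() is not shown in the issue; this stand-in
  // accepts tab, LF, VT, FF, CR and space.
  static boolean isWhiteSpace(byte x) {
    return (x >= 9 && x <= 13) || x == 32;
  }

  // isStrange() as quoted in the issue description.
  static boolean isStrange(byte x) {
    return (x > 127 || x < 32) && (!isWhiteSpace(x));
  }

  // Same loop as the quoted isText(); the hard-coded 0.30 is replaced by a
  // hypothetical threshold parameter.
  static boolean isText(byte[] beginChunk, int chunkLength, double threshold) {
    if (chunkLength == 0)
      return true;
    int count = 0;
    for (int i = 0; i < chunkLength; i++) {
      byte x = beginChunk[i];
      if (x == 0)
        return false; // a NUL byte always means "not text"
      if (isStrange(x))
        count++;
    }
    return ((double) count) / ((double) chunkLength) < threshold;
  }

  public static void main(String[] args) {
    // 4 control bytes out of 10: ratio 0.4.
    byte[] chunk = {1, 2, 3, 4, 'a', 'b', 'c', 'd', 'e', 'f'};
    System.out.println("threshold 0.30: " + isText(chunk, chunk.length, 0.30));
    System.out.println("threshold 0.50: " + isText(chunk, chunk.length, 0.50));
  }
}
```

With a ratio of 0.4, the sample chunk is rejected at the default 0.30 and accepted at 0.50, which is the kind of adjustment the issue asks for.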
