[ https://issues.apache.org/jira/browse/CONNECTORS-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549428#comment-13549428 ]
Karl Wright edited comment on CONNECTORS-601 at 1/10/13 8:29 AM:
-----------------------------------------------------------------

Odd, I crawled your site just fine here. Can you check that you have the new code for isStrange()?

{code}
  /** Check if character is not typical ASCII or utf-8. */
  protected static boolean isStrange(byte x)
  {
    return (x < 32) && (!isWhiteSpace(x));
  }
{code}

This new isStrange() should work well for both UTF-8 and SJIS. If you are getting the same number as before, I am suspicious.

> make the thresholds of isText() input-able
> ------------------------------------------
>
>                 Key: CONNECTORS-601
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-601
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 1.0.1
>            Reporter: Shinichiro Abe
>            Assignee: Karl Wright
>            Priority: Minor
>
> Currently the threshold of isText() is 0.30 by default.
> This value is too strict for Japanese sites because those sites often
> contain few ASCII characters.
> As a result, some sites are judged as not-text, and MCF can't extract links
> from those documents.
> I'd like to make this value configurable on the Repository connection.
> There is no patch from me now.
> {code:title=WebcrawlerConnector.java|borderStyle=solid}
>   /** Test to see if a document is text or not.  The first n bytes are passed
>   * in, and this code returns "true" if it thinks they represent text.  The code
>   * has been lifted algorithmically from products/Sharecrawler/Fingerprinter.pas,
>   * which was based on "perldoc -f -T".
>   */
>   protected static boolean isText(byte[] beginChunk, int chunkLength)
>   {
>     if (chunkLength == 0)
>       return true;
>     int i = 0;
>     int count = 0;
>     while (i < chunkLength)
>     {
>       byte x = beginChunk[i++];
>       if (x == 0)
>         return false;
>       if (isStrange(x))
>         count++;
>     }
>     return ((double)count)/((double)chunkLength) < 0.30;
>   }
>   /** Check if character is not typical ASCII. */
>   protected static boolean isStrange(byte x)
>   {
>     return (x > 127 || x < 32) && (!isWhiteSpace(x));
>   }
> {code}
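
A note on why the count might come out the same as before: a Java byte is signed (-128..127), so the bytes 0x80..0xFF that make up UTF-8 and SJIS multi-byte characters are negative and still satisfy x < 32. The sketch below is not the ManifoldCF source. It copies the isStrange() quoted in the comment, adds a hypothetical variant (isStrangeUnsigned) that widens the byte to 0..255 first, and passes the threshold in as a parameter as the issue requests. The class name, the unsignedCheck flag, and the whitespace set in isWhiteSpace() are assumptions made for illustration; the real isWhiteSpace() is not shown in this thread.

```java
import java.nio.charset.StandardCharsets;

/** Sketch only, not the ManifoldCF source. */
final class IsTextSketch {

    /** Assumed whitespace set; the real isWhiteSpace() is not shown in this thread. */
    static boolean isWhiteSpace(byte x) {
        return x == 0x09 || x == 0x0A || x == 0x0C || x == 0x0D || x == 0x20;
    }

    /** The check as quoted in the comment. */
    static boolean isStrangeAsQuoted(byte x) {
        // A Java byte ranges over -128..127, so bytes 0x80..0xFF are negative
        // here and satisfy x < 32.
        return (x < 32) && (!isWhiteSpace(x));
    }

    /** Hypothetical variant: only control characters 0x00..0x1F can be strange. */
    static boolean isStrangeUnsigned(byte x) {
        int u = x & 0xFF; // widen to the unsigned range 0..255
        return (u < 32) && (!isWhiteSpace(x));
    }

    /** isText() from the issue, with the threshold and the byte check as parameters. */
    static boolean isText(byte[] chunk, int chunkLength, double threshold, boolean unsignedCheck) {
        if (chunkLength == 0)
            return true;
        int count = 0;
        for (int i = 0; i < chunkLength; i++) {
            byte x = chunk[i];
            if (x == 0)
                return false; // a NUL byte is treated as binary content
            if (unsignedCheck ? isStrangeUnsigned(x) : isStrangeAsQuoted(x))
                count++;
        }
        return ((double) count) / ((double) chunkLength) < threshold;
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes to avoid source-encoding issues.
        byte[] japanese = "\u65e5\u672c\u8a9e".getBytes(StandardCharsets.UTF_8);
        System.out.println("as quoted: " + isText(japanese, japanese.length, 0.30, false));
        System.out.println("unsigned:  " + isText(japanese, japanese.length, 0.30, true));
    }
}
```

If the unsigned variant is the intended behavior, a page consisting mostly of multi-byte characters would pass at the default 0.30 without any change to the threshold, and making the threshold configurable would remain a separate improvement.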