[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-09-29 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531290 ] Hudson commented on NUTCH-25: Integrated in Nutch-Nightly #222 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-09-27 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530796 ] Hudson commented on NUTCH-25: Integrated in Nutch-Nightly #219 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-08-02 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517170 ] Doğacan Güney commented on NUTCH-25: > At a very quick look, one potential drawback of the private EncodingClue +

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-08-01 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517066 ] Doug Cook commented on NUTCH-25: Cool -- will take a look at the new patch (and will try to make stripGarbage more rob

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515461 ] Doug Cook commented on NUTCH-25: > Can you provide a link on icu4j's language detection? http://www.icu-project.org/a

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515365 ] Doğacan Güney commented on NUTCH-25: [snip snip] > Internal to guessEncoding, we could certainly add the clue valu

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515342 ] Doug Cook commented on NUTCH-25: Doğacan, Thanks for the quick feedback. > * EncodingDetector api is way too open. IM

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-25 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515230 ] Doğacan Güney commented on NUTCH-25: Overall I think the idea behind EncodingDetector is very solid. I will take a

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-24 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515026 ] Doug Cook commented on NUTCH-25: OK, I've got more data, and a proposed solution. I created a test set with a number o

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514438 ] Doug Cook commented on NUTCH-25: As far as the problem cases, I'm running a test now on my test DB (the ~60K doc one),

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-21 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514433 ] Doğacan Güney commented on NUTCH-25: Doug, thanks for the (very) detailed feedback! This is incredibly helpful. >

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426 ] Doug Cook commented on NUTCH-25: Not sure where this belongs architecturally and aesthetically -- will think about tha

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382 ] Doug Cook commented on NUTCH-25: Oops, spoke too soon. On running a more extensive test, I saw quite a few ArrayIndexOu

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514377 ] Doug Cook commented on NUTCH-25: I should also add that a significant number of the URLs seem to have been fixed by th

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-07-20 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375 ] Doug Cook commented on NUTCH-25: Hi, Doğacan. My sincere apologies for the slow response, especially given the alacrit

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-06-23 Thread Doğacan Güney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507593 ] Doğacan Güney commented on NUTCH-25: Doug, have you been able to look at my patch? > needs 'character encoding' de

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-22 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498041 ] Doug Cook commented on NUTCH-25: Thanks! I'll take a look at your proposed patch... (that was fast! ask and ye shall r

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525 ] Ken Krugler commented on NUTCH-25: I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like this. The

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Doug Cook (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507 ] Doug Cook commented on NUTCH-25: We might want to think about raising the priority of this. I've seen encoding problem

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2006-04-26 Thread Chris Fellows (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-25?page=comments#action_12376611 ] Chris Fellows commented on NUTCH-25: This was last updated May '05. Has this charset and language detection been integrated into Nutch yet? If not, at what point should th