[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531290
]
Hudson commented on NUTCH-25:
Integrated in Nutch-Nightly #222 (See
[http://lucene.zones.apache.org:8080/hudson/job/Nutch
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530796
]
Hudson commented on NUTCH-25:
Integrated in Nutch-Nightly #219 (See
[http://lucene.zones.apache.org:8080/hudson/job/Nutch
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517170
]
Doğacan Güney commented on NUTCH-25:
> At a very quick look, one potential drawback of the private EncodingClue +
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517066
]
Doug Cook commented on NUTCH-25:
Cool -- will take a look at the new patch (and will try to make stripGarbage
more rob
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515461
]
Doug Cook commented on NUTCH-25:
> Can you provide a link on icu4j's language detection?
http://www.icu-project.org/a
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515365
]
Doğacan Güney commented on NUTCH-25:
[snip snip]
> Internal to guessEncoding, we could certainly add the clue valu
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515342
]
Doug Cook commented on NUTCH-25:
Doğacan,
Thanks for the quick feedback.
> * EncodingDetector api is way too open. IM
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515230
]
Doğacan Güney commented on NUTCH-25:
Overall I think the idea behind EncodingDetector is very solid. I will take a
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515026
]
Doug Cook commented on NUTCH-25:
OK, I've got more data, and a proposed solution.
I created a test set with a number o
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514438
]
Doug Cook commented on NUTCH-25:
As far as the problem cases, I'm running a test now on my test DB (the ~60K doc
one),
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514433
]
Doğacan Güney commented on NUTCH-25:
Doug, thanks for the (very) detailed feedback! This is incredibly helpful.
>
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514426
]
Doug Cook commented on NUTCH-25:
Not sure where this belongs architecturally and aesthetically -- will think
about tha
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514382
]
Doug Cook commented on NUTCH-25:
Oops, spoke too soon. On running a more extensive test, I saw quite a few
ArrayIndexOu
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514377
]
Doug Cook commented on NUTCH-25:
I should also add that a significant number of the URLs seem to have been fixed
by th
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514375
]
Doug Cook commented on NUTCH-25:
Hi, Doğacan.
My sincere apologies for the slow response, especially given the alacrit
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507593
]
Doğacan Güney commented on NUTCH-25:
Doug, have you been able to look at my patch?
> needs 'character encoding' de
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498041
]
Doug Cook commented on NUTCH-25:
Thanks! I'll take a look at your proposed patch... (that was fast! ask and ye
shall r
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525
]
Ken Krugler commented on NUTCH-25:
I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like this.
The
[
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507
]
Doug Cook commented on NUTCH-25:
We might want to think about raising the priority of this. I've seen encoding
problem
[
http://issues.apache.org/jira/browse/NUTCH-25?page=comments#action_12376611 ]
Chris Fellows commented on NUTCH-25:
This was last updated May '05. Has this charset and language detection been
integrated into Nutch yet?
If not, at what point should th