Re: Garbage with languageidentifier

lewis john mcgibbney Sun, 17 Jul 2011 09:08:21 -0700

Hi Markus,

I think this is a good shout, and the points you make are easy to follow.
Accurate and useful language information (like other types of information)
in HTTP headers is clearly not the norm in practice, and we shouldn't
pretend otherwise.

One thing to note, though: I only found out yesterday that language
detection in trunk has been passed to Tika, but this is not the case in
branch 1.4. I don't mean to put words into people's mouths, but judging by
the conversation in NUTCH-657, I expect that delegating language
identification to Tika and making branch-1.4 consistent with trunk would be
the next move. Am I correct here? Please say so if this is not the case.

If this is the plan, is there any requirement for Nutch to have an
independent language detection plugin? If we can establish why the decision
was made for trunk to rely on Tika for language detection, we can weigh
that against the comments you make. To be honest, I am seeing a
medium-sized grey area here, though that is down to my inexperience with
the language detection plugin and with the problems you mention.

On Sun, Jul 17, 2011 at 2:04 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> The proposal is to configure the order of detection: meta,header,identifier
> (which is the current order).
>
> > Hi,
> >
> > I've found a lot of garbage produced by the language identifier, most
> > likely caused by it relying on the HTTP header as the first hint for
> > the language.
> >
> > Instead of a nice tight list of ISO codes, I've got an index full of
> > garbage, making me unable to select a language. The lang field now
> > contains a mess including ISO codes of various types (nl | ned, nl-NL |
> > nederlands | Nederlands | dutch | Dutch etc.) and even comma-separated
> > combinations. It's impossible to do a simple fq:lang:nl due to this
> > indeterminate set of language identifiers. Apart from language
> > identifiers that we as humans understand, the headers also contain
> > values such as {$plugin.meta.language} | Weerribben zuivel | Array, or
> > complete sentences, and even MIME types and more nonsense you can
> > laugh about.
> >
> > Why do we rely on the HTTP header at all? Isn't it well known that only
> > very few developers and content management systems actually care about
> > returning proper information in HTTP headers? This also goes for
> > finding out the content type, which is a similar problem in the index.
> >
> > I know work is going on in Tika to improve MIME-type detection; I'm not
> > sure if this is true for language identification. We still have to rely
> > on the Nutch plugin to do this work, right? If so, I propose to make it
> > configurable so we can choose whether we want to rely on the current
> > behaviour or do N-gram detection straight away.
> >
> > Comments?
> >
> > Thanks
>
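
To make the proposal above a bit more concrete, here is a rough sketch in
Java of what a configurable detection order could look like. It is not the
existing Nutch plugin or Tika code: the class LanguageResolver, the methods
normalize and resolve, and the source names used as map keys are all
hypothetical, and the n-gram identifier is stubbed out as a plain Supplier.
The one idea it illustrates is that every hint is checked against the ISO
639-1 codes known to the JDK before it is trusted, so values such as
"Weerribben zuivel" fall through to the next source in the configured order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch only; none of these names come from Nutch or Tika.
class LanguageResolver {

    // Two-letter ISO 639 language codes known to the JDK.
    private static final List<String> ISO_639_1 =
            Arrays.asList(Locale.getISOLanguages());

    /**
     * Reduces a raw hint such as "nl-NL" to its primary subtag ("nl").
     * Returns empty if the result is not a known two-letter code.
     */
    static Optional<String> normalize(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String primary = raw.trim().toLowerCase(Locale.ROOT).split("[-_]", 2)[0];
        return ISO_639_1.contains(primary) ? Optional.of(primary) : Optional.empty();
    }

    /**
     * Tries each named source in the configured order and returns the first
     * value that normalizes to a valid code, or "unknown" if none does.
     */
    static String resolve(List<String> order, Map<String, Supplier<String>> sources) {
        for (String name : order) {
            Supplier<String> source = sources.get(name);
            if (source == null) {
                continue; // source not configured, skip it
            }
            Optional<String> lang = normalize(source.get());
            if (lang.isPresent()) {
                return lang.get();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        Map<String, Supplier<String>> sources = new HashMap<>();
        sources.put("meta", () -> "{$plugin.meta.language}"); // garbage, rejected
        sources.put("header", () -> "nl-NL");                 // reduced to "nl"
        sources.put("identifier", () -> "nl");                // stand-in for n-gram detection

        // Order taken from the proposal: meta, header, identifier.
        System.out.println(resolve(Arrays.asList("meta", "header", "identifier"), sources));
    }
}
```

Passing a list containing only "identifier" as the order would correspond to
doing n-gram detection straight away. Note that this sketch simply rejects
spelled-out names such as "nederlands" or "dutch" rather than trying to map
them to a code.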

-- 
*Lewis*
