Tika has a language classifier which, I believe, has been updated to use
competitive techniques.

See https://issues.apache.org/jira/browse/TIKA-369 for details.

On Fri, Jan 14, 2011 at 1:04 AM, Lance Norskog <[email protected]> wrote:

> Here's the use case: deciding the language of a mid-size document like
> a newspaper article or a technical report. The problem has been
> tackled fairly successfully by pulling 2- and 3-letter sequences from
> bodies of text in various languages, and comparing them against the
> set of 2- and 3-letter sequences from the document.
>
> This would be for text indexing in Lucene, so it should be
> memory-resident. The implementation should have a small dataset. It is
> better if the computation is front-loaded, like video compression: the
> heavy lifting happens in a model preparation phase, and then working
> from the model is fast. A confidence rating for the classification
> would be nice.
>
> Open license (Apache-compatible) code would be great, as would
> non-patented algorithms.
>
> Any suggestions?
>
> --
> Lance Norskog
> [email protected]
>
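
For readers of the archive, the 2- and 3-letter-sequence approach described in the quoted message can be sketched roughly as follows. This is a minimal illustration, not Tika's implementation: the function names (ngrams, normalize, train, classify), the cosine-similarity scoring, the margin-based confidence value, and the tiny training sentences are all assumptions made for the example. A real model would be trained on far larger corpora.

```python
from collections import Counter
import math


def ngrams(text, sizes=(2, 3)):
    """Count the 2- and 3-letter sequences in text (lowercased, whitespace collapsed)."""
    cleaned = " ".join(text.lower().split())
    counts = Counter()
    for n in sizes:
        for i in range(len(cleaned) - n + 1):
            counts[cleaned[i:i + n]] += 1
    return counts


def normalize(counts):
    """Scale a count vector to unit length so texts of different sizes compare fairly."""
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {g: c / norm for g, c in counts.items()} if norm else {}


def train(samples):
    """Model preparation phase: build one normalized n-gram profile per language."""
    return {lang: normalize(ngrams(text)) for lang, text in samples.items()}


def classify(model, text):
    """Return (best_language, confidence).

    The score is the cosine similarity between the document profile and each
    language profile; confidence is the margin between the best score and the
    runner-up (an assumed, simple choice of confidence measure).
    """
    doc = normalize(ngrams(text))
    scores = {
        lang: sum(w * profile.get(g, 0.0) for g, w in doc.items())
        for lang, profile in model.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_lang, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_lang, best - runner_up


# Toy training data (hypothetical; far too small for real use).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat while it was raining",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze saß auf der matte während es regnete",
}
model = train(samples)  # heavy lifting happens once, up front
print(classify(model, "the dog was sitting on the mat"))
print(classify(model, "die katze und der hund"))
```

The model here is just a dictionary of n-gram weights per language, so it stays memory-resident, and classification is a single pass over the document's n-grams. Truncating each profile to its few hundred most frequent n-grams is one way to keep the dataset small.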
