The spaces are only used to mark where words should be cut, which
he already does by looking at one domain at a time.

To go back to his example:
boysandgirls.com

You can easily turn this into training data like so:
boys<SPLIT>and<SPLIT>girls<SPLIT>.<SPLIT>com

I would try to get a good amount of English text,
perform tokenization on it, and then just assume every
token is written together without a space in between.
Then you should be able to generate training strings like
the one above. The TLD can easily be attached randomly.
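As a rough sketch of that data preparation (plain Python; the token
lists and TLDs here are just illustrative, and a real run would
tokenize a large corpus first):

```python
import random

# Illustrative TLD list; extend as needed.
TLDS = [".com", ".net", ".org"]

def make_training_string(tokens, rng=random):
    """Join tokens without spaces, marking every boundary with <SPLIT>,
    and attach a randomly chosen TLD as ".<SPLIT>tld"."""
    tld = rng.choice(TLDS).lstrip(".")
    return "<SPLIT>".join(tokens + [".", tld])

# Tokens as they would come out of a tokenizer run on English text:
make_training_string(["boys", "and", "girls"])
# e.g. "boys<SPLIT>and<SPLIT>girls<SPLIT>.<SPLIT>com"
```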

I guess that might already work well.

To evaluate it, you should make a file with real domains
and split them manually. The tokenizer has an evaluator
which can calculate how accurate it is for you.
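I won't quote the evaluator's exact API from memory, but the idea
behind such an evaluation is simple enough to sketch by hand: compare
the tokenizer's output against your manual splits and count exact
matches (the domains below are just illustrative):

```python
def split_accuracy(gold, predicted):
    """Fraction of domains whose predicted split exactly matches
    the manually created (gold) split."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

gold      = [["boys", "and", "girls", ".", "com"],
             ["have", "a", "nice", "day", ".", "net"]]
predicted = [["boys", "and", "girls", ".", "com"],
             ["hav", "eanice", "day", ".", "net"]]
split_accuracy(gold, predicted)  # 0.5
```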

Hope this helps,
Jörn


On 11/16/11 8:02 PM, Aliaksandr Autayeu wrote:
Hi Ryan,

The learnable tokenizer is trained on standard text, where words are
separated by spaces. Your data looks different, and one way to tackle
it is to tag a fair number of samples, creating your own corpus, and
then train a model on it. Tagging might take some time, though.
Another approach might be to use a dictionary, like WordNet, and look
up potential tokens there. A fairly simple version would be to start
from an empty string, add characters to it one by one, and look the
result up in WordNet; if the lookup returns something, make that
string a token and start again from an empty string. The suffixes
(.com, .net, etc.) are well known and can be cut off beforehand. With
this approach you'll run into difficulties with something like
"hotelchain": "hot" is a word and is present in WordNet. These might
not be the only approaches out there; this is just what came to mind
quickly.
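A minimal sketch of that char-by-char lookup, with a small in-memory
set standing in for WordNet (the dictionary contents are made up for
illustration):

```python
# Tiny stand-in for a WordNet lookup.
DICTIONARY = {"have", "a", "nice", "day", "hot", "hotel", "chain"}

def greedy_split(name):
    """Grow a buffer one character at a time and emit a token as soon
    as the buffer is found in the dictionary (shortest match wins)."""
    tokens, buf = [], ""
    for ch in name:
        buf += ch
        if buf in DICTIONARY:
            tokens.append(buf)
            buf = ""
    if buf:
        tokens.append(buf)  # leftover that never matched anything
    return tokens

greedy_split("haveaniceday")  # ['have', 'a', 'nice', 'day']
greedy_split("hotelchain")    # ['hot', 'elchain'] -- the pitfall
```

Because the shortest match wins, "hot" swallows the start of
"hotelchain" and the remainder never matches; preferring longer
matches or backtracking would be needed to recover "hotel" + "chain".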

Aliaksandr

On Wed, Nov 16, 2011 at 7:44 PM, Ryan L. Sun<[email protected]>  wrote:

Hi all,

I'm facing a problem splitting concatenated English text, more
specifically, domain names.
For example:
boysandgirls.com ->  boy(s)|and|girl(s)|.com
haveaniceday.net ->  have|a|nice|day|.net

Can I use opennlp to do this? I checked the opennlp documentation and
it looks like the "Learnable Tokenizer" is promising, but I couldn't
get it to work.
Any help is appreciated.

