The average Chinese word has 1.5 ideograms. Multi-character words
cannot be captured by a simple dictionary lookup; characters "clump"
into words in context-sensitive ways. The "Smart Chinese" package in
Lucene uses hidden Markov models to decide where to break word groups.
Is this the kind of multi-word finder you had in mind?
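
For reference, a minimal sketch of running it over a sentence (this assumes
the Lucene 3.6 contrib API; the field name and the example sentence are
arbitrary placeholders):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class SmartChineseDemo {
  public static void main(String[] args) throws Exception {
    // The HMM-based segmenter decides where the character "clumps" end.
    SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_36);
    TokenStream ts = analyzer.tokenStream("text",
        new StringReader("我们正在讨论中文分词"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());  // one segmented word per line
    }
    ts.end();
    ts.close();
  }
}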

On Wed, Jun 27, 2012 at 1:03 AM, Jörn Kottmann <[email protected]> wrote:
> On 06/22/2012 11:13 AM, Nicolas Hernandez wrote:
>>
>> On Thu, Jun 21, 2012 at 5:55 PM, Jörn Kottmann <[email protected]> wrote:
>>>
>>> To make this reliable the tokenization of new, unseen
>>> text must be done correctly. For the Spanish data we had
>>> a special chunker to put the multi-word units into one token.
>>>
>>> Do you use something like that?
>>
>> Kind of.
>>
>> A POS tagger should always be used with the same tokenizer that was
>> used when training it.
>> The more multi-word units you manage to recognize automatically, the
>> more you should include them in your training data; it makes the
>> subsequent syntax analysis easier.
>> (The two processes can interact, i.e. recognizing multi-word units can
>> require POS tagging of the simple words first, but I won't go into
>> that here.)
>
>
> At some point we need multi-word-unit detection. If you do that
> in the tokenizer then it will affect all components which rely on the
> tokenization, e.g. also NER.
>
>
>
>>> What do you think about outputting a special POS tag to indicate
>>> that a token is a multi-word unit?
>>
>> I do not see the point at the POS analysis level. We must try to
>> re-use what already exists to preserve compatibility.
>> Here I should mention that [Green:2011:MEI:2145432.2145516] (see
>> below) introduced multi-word tags (one for each POS label: mw noun, mw
>> verb, mw preposition, ...) at the chunk and syntax levels. This is
>> probably a nice idea.
>
>
> That would move the responsibility to detect multi-word-units
> to the POS Tagger.
>
> A simple transformation step could convert the output to the
> non-multi-word-tags format.
>
> This has the advantage that a user can detect a multi-word unit.
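>
> A rough sketch of such a transformation step (the "MW-" tag prefix and the
> policy of giving every part the plain tag are just assumptions for
> illustration, not an existing OpenNLP convention):
>
> import java.util.ArrayList;
> import java.util.List;
>
> class MultiWordUnwrapper {
>   // Split tokens that carry a hypothetical multi-word tag ("MW-IN",
>   // "MW-NN", ...) back into single whitespace-separated tokens, so that
>   // downstream components see the usual one-tag-per-token format.
>   static String[][] unwrap(String[] tokens, String[] tags) {
>     List<String> outTokens = new ArrayList<String>();
>     List<String> outTags = new ArrayList<String>();
>     for (int i = 0; i < tokens.length; i++) {
>       if (tags[i].startsWith("MW-")) {
>         String plainTag = tags[i].substring("MW-".length());
>         for (String part : tokens[i].split(" ")) {
>           outTokens.add(part);
>           outTags.add(plainTag);  // naive: every part inherits the plain tag
>         }
>       } else {
>         outTokens.add(tokens[i]);
>         outTags.add(tags[i]);
>       }
>     }
>     return new String[][] { outTokens.toArray(new String[0]),
>                             outTags.toArray(new String[0]) };
>   }
> }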
>
>
>>
>> The CLI should offer a way to specify which expressions in the data
>> are multi-word expressions.
>> This can be done by using a parameter to set the token separator
>> character.
>>
>> Models built from the CLI or the API should be the same.
>> One way to do that is to use a parameter to set the multi-word
>> separator character and to turn this separator character into
>> whitespace before training the model.
>> For example, with " " as the token separator character, "_" as the
>> multi-word separator character and "/" as the POS tag separator, the
>> following sentence
>> Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN earlier/RB in_order_to/IN
>> sleep/VB longer/RB
>> should be turned into
>> String[] sentence = {"Nico", "wants", "to", "get", "to", "bed",
>> "earlier", "in order to", "sleep", "longer"};
>> (note "in order to")
>> What do you think?
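>>
>> A rough sketch of the reading step I have in mind (the separator
>> characters would be parameters, nothing is fixed here):
>>
>> class MultiWordTrainingReader {
>>   // " " is the token separator, "/" the POS tag separator; mwSeparator
>>   // (e.g. '_') marks a multi-word unit and is turned back into a space,
>>   // so the whole expression stays one token.
>>   static String[][] parseLine(String line, char mwSeparator) {
>>     String[] parts = line.split(" ");
>>     String[] tokens = new String[parts.length];
>>     String[] tags = new String[parts.length];
>>     for (int i = 0; i < parts.length; i++) {
>>       int slash = parts[i].lastIndexOf('/');
>>       tokens[i] = parts[i].substring(0, slash).replace(mwSeparator, ' ');
>>       tags[i] = parts[i].substring(slash + 1);
>>     }
>>     return new String[][] { tokens, tags };
>>   }
>> }
>>
>> With the sentence above, parseLine(line, '_')[0] would contain
>> "in order to" as a single token.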
>>
>>
>
> I think that could be problematic if your training or test data contains
> the multi-word separator character. In this case you might treat
> something as a multi-word unit which should not be one.
> What do you think about using SGML-style tags as we do in the NER training
> format?
> For example: <MW>in order to</MW>/IN.
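>
> A quick sketch of how such a line could be read (just an illustration, the
> exact format is of course still open):
>
> import java.util.List;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
>
> class MwSgmlReader {
>   // Matches either "<MW>some words</MW>/TAG" or a plain "word/TAG" element.
>   private static final Pattern ELEMENT =
>       Pattern.compile("<MW>([^<]+)</MW>/(\\S+)|(\\S+)/(\\S+)");
>
>   static void parse(String line, List<String> tokens, List<String> tags) {
>     Matcher m = ELEMENT.matcher(line);
>     while (m.find()) {
>       if (m.group(1) != null) {   // multi-word unit kept as one token
>         tokens.add(m.group(1));
>         tags.add(m.group(2));
>       } else {                    // ordinary single token
>         tokens.add(m.group(3));
>         tags.add(m.group(4));
>       }
>     }
>   }
> }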
>
> Would you prefer dealing with multi-word units at the tokenizer level
> or at the POS Tagger level?
> Or do we need support for both?
>
> Jörn
>



-- 
Lance Norskog
[email protected]
