The average Chinese word has 1.5 ideograms. Multi-character words are not listed in a simple dictionary; they "clump" together in context-sensitive ways. The "Smart Chinese" package in Lucene uses hidden Markov models to decide where to break word groups. Is this the kind of multi-word finder you had in mind?
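For reference, here is a minimal sketch of running that segmenter directly (assuming lucene-analyzers-smartcn is on the classpath; the Version constant is only illustrative for the 3.x line):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class SmartChineseDemo {
  public static void main(String[] args) throws Exception {
    // The analyzer's word segmenter uses an HMM to decide where one
    // multi-character word ends and the next begins.
    SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_36);
    TokenStream ts = analyzer.tokenStream("text", new StringReader("我是中国人"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());  // one segmented word per line
    }
    ts.end();
    ts.close();
  }
}

Each multi-character "clump" comes out as a single token, which is roughly what a multi-word detector in front of the POS tagger would need to produce.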
On Wed, Jun 27, 2012 at 1:03 AM, Jörn Kottmann <[email protected]> wrote:
> On 06/22/2012 11:13 AM, Nicolas Hernandez wrote:
>>
>> On Thu, Jun 21, 2012 at 5:55 PM, Jörn Kottmann <[email protected]> wrote:
>>>
>>> To make this reliable, the tokenization of new, unseen text must be
>>> done correctly. For the Spanish data we had a special chunker to put
>>> the multi-word units into one token.
>>>
>>> Do you use something like that?
>>
>> Kind of.
>>
>> A POS tagger should always be used with the tokenizer that was used
>> to train it.
>> The more multi-word units you manage to recognize automatically, the
>> more you should consider them for your training. It makes the
>> subsequent syntax analysis easier.
>> (The two processes can interact, i.e. recognizing multi-word units
>> can require a POS tagging of the simple words, but I won't go into
>> that here.)
>
> Somewhere we need multi-word-unit detection. If you do that in our
> tokenizer, then it will affect all components which rely on the
> tokenization, e.g. also NER.
>
>>> What do you think about outputting a special POS tag to indicate
>>> that it is a multi-word tag?
>>
>> I do not see the point at the POS analysis level. We should try to
>> re-use what exists to preserve compatibility.
>> Here I should mention that [Green:2011:MEI:2145432.2145516] (see
>> below) introduced multi-word tags (one for each POS label: MW noun,
>> MW verb, MW preposition, ...) at the chunk and syntax levels. This is
>> probably a nice idea.
>
> That would move the responsibility for detecting multi-word units to
> the POS tagger.
>
> A simple transformation step could convert the output to the
> non-multi-word-tags format.
>
> This has the advantage that a user can detect a multi-word unit.
>
>> The CLI should offer a way to specify what the multi-word expressions
>> in the data are.
>> This can be done with a parameter that sets the token separator
>> character.
>>
>> Models built from the CLI or the API should be the same.
>> One way to do that is to use a parameter to set the multi-word
>> separator character and to turn this separator character into
>> whitespace before training the model.
>> For example, with " " as the token separator character, "_" as the
>> multi-word separator character, and "/" as the POS tag separator, the
>> following sentence
>> Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN earlier/RB in_order_to/IN
>> sleep/VB longer/RB
>> should be turned into
>> String[] sentence = {"Nico", "wants", "to", "get", "to", "bed",
>> "earlier", "in order to", "sleep", "longer"};
>> (note "in order to")
>> What do you think?
>
> I think that could be problematic if your training or test data
> contains the multi-word separator character. In that case you might
> treat something as a multi-word unit which should not be one.
> What do you think about using SGML-style tags, as we do in the NER
> training format?
> For example: <MW>in order to</MW>/IN.
>
> Would you prefer dealing with multi-word units at the tokenizer level
> or at the POS tagger level?
> Or do we need support for both?
>
> Jörn

--
Lance Norskog
[email protected]
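For concreteness, a rough sketch of the underscore-to-whitespace conversion Nicolas describes above; the class and variable names are illustrative only, not an existing OpenNLP API:

import java.util.Arrays;

public class MultiWordFormatDemo {
  public static void main(String[] args) {
    // Proposed training format: "/" separates token from tag, "_" joins
    // the words of a multi-word unit.
    String line = "Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN "
        + "earlier/RB in_order_to/IN sleep/VB longer/RB";

    String[] pairs = line.split(" ");
    String[] tokens = new String[pairs.length];
    String[] tags = new String[pairs.length];

    for (int i = 0; i < pairs.length; i++) {
      int sep = pairs[i].lastIndexOf('/');
      // Turn the multi-word separator back into whitespace inside the token.
      tokens[i] = pairs[i].substring(0, sep).replace('_', ' ');
      tags[i] = pairs[i].substring(sep + 1);
    }

    System.out.println(Arrays.toString(tokens));
    // -> [Nico, wants, to, get, to, bed, earlier, in order to, sleep, longer]
    System.out.println(Arrays.toString(tags));
    // -> [NNP, VBP, TO, VB, TO, NN, RB, IN, VB, RB]
  }
}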

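And, for comparison, a sketch of reading the SGML-style alternative Jörn suggests (<MW>in order to</MW>/IN); the regex and class name are hypothetical, not an existing OpenNLP format reader:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MwTagFormatDemo {
  // Matches either "<MW>some words</MW>/TAG" or a plain "word/TAG".
  private static final Pattern ITEM =
      Pattern.compile("<MW>(.+?)</MW>/(\\S+)|(\\S+)/(\\S+)");

  public static void main(String[] args) {
    String line = "Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN "
        + "earlier/RB <MW>in order to</MW>/IN sleep/VB longer/RB";

    Matcher m = ITEM.matcher(line);
    while (m.find()) {
      String token = m.group(1) != null ? m.group(1) : m.group(3);
      String tag = m.group(1) != null ? m.group(2) : m.group(4);
      System.out.println(token + "\t" + tag);  // e.g. "in order to\tIN"
    }
  }
}

This keeps whitespace free for ordinary token separation, which avoids the clash Jörn points out when the training data itself contains the multi-word separator character.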