Suggestions Needed : Developing application using Mahout

Paritosh Ranjan Mon, 23 Jan 2012 23:35:14 -0800

Hi,

I need some suggestions regarding the possibility of developing an application 
using Mahout.


The application deals with person names. For each name part, we know which 
types it can have and how often it is used as each type (known as its 
frequency).

For example:

Dr - TitlePreceding (frequency = 1100)
Dr - FamilyName (frequency = 200)
Señor - Salutation (frequency = 500)
Paritosh - GivenName (frequency = 900)
Ranjan - FamilyName (frequency = 800)
Ranjan - GivenName (frequency = 200)

As you can see, the same name part can occur as different types, but its 
frequency (relevance) differs for each type.
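
One way to picture this background data is a nested lookup table keyed by 
name part. The Python sketch below is only an illustration: the dictionary 
layout, the type_distribution helper and the CamelCase type labels are 
assumptions, not an existing API. The counts are the ones listed above.

```python
# Frequencies copied from the examples above; the table layout is an assumption.
FREQ = {
    "Dr": {"TitlePreceding": 1100, "FamilyName": 200},
    "Señor": {"Salutation": 500},
    "Paritosh": {"GivenName": 900},
    "Ranjan": {"FamilyName": 800, "GivenName": 200},
}


def type_distribution(part):
    """Return {type: relative frequency} for one name part (hypothetical helper)."""
    counts = FREQ.get(part, {})
    total = sum(counts.values())
    if total == 0:
        return {}  # unknown name part: no type information available
    return {name_type: count / total for name_type, count in counts.items()}
```

For "Ranjan" this gives a share of 0.8 for FamilyName and 0.2 for GivenName, 
which is the per-part relevance described above.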

We also know which name type patterns are commonly found.

For example:

Paritosh Ranjan can be interpreted as :
a) Paritosh [GivenName], and Ranjan [FamilyName]
b) Paritosh [GivenName], and Ranjan [GivenName]

But we know that [GivenName,FamilyName] is more common than 
[GivenName,GivenName].

Similarly, there are many other patterns involving other types such as 
Salutation, TitlePreceding, TitleSucceeding, MiddleName, etc.
The patterns can also use regex-style quantifiers, e.g. 
[GivenName+][FamilyName], meaning one or more [GivenName] followed by a 
[FamilyName].

These patterns have priorities: some patterns are more popular than others.
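
To make the pattern notation concrete, here is a minimal Python sketch of how 
a pattern such as [GivenName+][FamilyName] could be checked against a 
sequence of types. The compile_pattern and matches helpers and the 
space-separated encoding of the type sequence are assumptions made for 
illustration. Only the "+" quantifier is handled.

```python
import re


def compile_pattern(pattern):
    """Translate '[GivenName+][FamilyName]' into a regex over 'Type Type ' strings."""
    # Each bracket holds a type name, optionally followed by '+'.
    parts = re.findall(r"\[\s*(\w+)\s*(\+?)\s*\]", pattern)
    regex = "".join(f"(?:{name} ){plus}" for name, plus in parts)
    return re.compile(regex)


def matches(pattern, types):
    """Return True if the whole type sequence fits the pattern."""
    # Every type is followed by one space so that tokens cannot run together.
    encoded = "".join(name_type + " " for name_type in types)
    return compile_pattern(pattern).fullmatch(encoded) is not None
```

With this encoding, ["GivenName", "GivenName", "FamilyName"] fits 
[GivenName+][FamilyName], while ["GivenName", "GivenName"] does not.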

The user enters a name, e.g. Mr. Paritosh Ranjan, and the output is:

Mr.[Salutation], Paritosh[GivenName], Ranjan[FamilyName]
Mr.[TitlePreceding], Paritosh[GivenName], Ranjan[FamilyName]
Mr.[Salutation], Paritosh[GivenName], Ranjan[GivenName]

The pattern priority and the name part frequencies together form a combined 
score for each interpretation, and the results are sorted by that score.
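
The ranking step could look roughly like the Python sketch below. The exact 
scoring formula is not fixed here, so the sketch assumes one: the pattern 
weight multiplied by the relative frequency of each chosen type. The pattern 
weights, DEFAULT_WEIGHT and the interpretations function are hypothetical. 
The counts are the ones from the "Dr" examples, so the input is 
"Dr Paritosh Ranjan".

```python
from itertools import product

# Counts from the examples in this mail; the table layout is an assumption.
FREQ = {
    "Dr": {"TitlePreceding": 1100, "FamilyName": 200},
    "Paritosh": {"GivenName": 900},
    "Ranjan": {"FamilyName": 800, "GivenName": 200},
}

# Hypothetical pattern weights, invented for this sketch only.
PATTERN_WEIGHT = {
    ("TitlePreceding", "GivenName", "FamilyName"): 0.6,
    ("TitlePreceding", "GivenName", "GivenName"): 0.1,
}
DEFAULT_WEIGHT = 0.01  # assumed weight for type sequences with no known pattern


def interpretations(name):
    """Return (score, [(part, type), ...]) tuples, best score first."""
    parts = name.split()
    # All candidate (type, count) pairs for each name part.
    options = [list(FREQ.get(part, {}).items()) for part in parts]
    results = []
    for combo in product(*options):
        types = tuple(name_type for name_type, _ in combo)
        # Assumed score: pattern weight times each part's relative frequency.
        score = PATTERN_WEIGHT.get(types, DEFAULT_WEIGHT)
        for part, (_, count) in zip(parts, combo):
            score *= count / sum(FREQ[part].values())
        results.append((score, list(zip(parts, types))))
    results.sort(key=lambda item: item[0], reverse=True)
    return results
```

With these assumed weights, the top result for "Dr Paritosh Ranjan" is 
Dr[TitlePreceding], Paritosh[GivenName], Ranjan[FamilyName]. A name part 
that is missing from the table yields no interpretations at all in this 
sketch.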

In total, we store around 50 million names with their type information.

Question:

Can Mahout help in building such an application? The application is expected 
to be fast and scalable.
If yes, which techniques and algorithms should be used?

Thanks and Regards,
Paritosh Ranjan
