Re: Question about clustering

Robert Ehrlich Thu, 13 Mar 2003 10:18:19 -0800

Saisat:

I am sorry to say that there is no reliable way to accomplish your goals 
accurately and quickly.  IMHO you have to "live" with the data for 
a while.  Construct cross-tabulation tables (e.g. contingency tables), 
etc.  Think about what deviations from uniformity mean in your 
context.  Consider the possibility that the data might not form 
distinct clusters but that a subset of two or more variables may have 
non-random relationships among a subset of samples.  Hierarchical 
clustering, decision trees, etc. are possibilities but IMHO are not 
guaranteed to be a royal road to success.  In my dumb experience, it 
will take maybe a month's hard work to winkle out the patterns that you 
seek and accept them with any confidence.  Remember that if you throw 
data at a computer program, you will almost always get something back 
regardless of whether your data has relevant non-random structure or 
not.  The fact that you have 5 million samples should not be a 
problem unless you try some sort of nearest-neighbor approach to 
clustering.  Also remember that there exists the "Cluster Validity 
Problem": there is no objective way to know the true number of clusters, 
nor will there ever be.
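
As a starting point for the cross-tabulation step, here is a minimal 
sketch in Python using only the standard library.  It counts customers 
per (customer type, country) cell for the 23 sample rows in the quoted 
message.  The names ROWS and crosstab are made up for illustration; on 
the real database you would more likely use whatever tabulation your 
database or statistics package provides.

```python
from collections import Counter

# Sample rows copied from the quoted question:
# (id, country, customer type, daily transaction amount)
ROWS = [
    (1, "IQ", "CType 1", 2000), (2, "IQ", "CType 1", 3000),
    (3, "IQ", "CType 1", 4000), (4, "IQ", "CType 1", 3000),
    (5, "IQ", "CType 1", 10000), (6, "IQ", "CType 1", 11000),
    (7, "IQ", "CType 1", 12000), (8, "IQ", "CType 1", 11000),
    (9, "IN", "CType 1", 10000), (10, "IN", "CType 1", 15000),
    (11, "IN", "CType 1", 55000), (12, "IN", "CType 1", 60000),
    (13, "IN", "CType 1", 70000), (14, "IQ", "CType 2", 85000),
    (15, "IQ", "CType 2", 75000), (16, "IQ", "CType 2", 90000),
    (17, "IQ", "CType 2", 10000), (18, "IQ", "CType 2", 3500),
    (19, "IQ", "CType 2", 3000), (20, "IQ", "CType 2", 4000),
    (21, "IQ", "CType 2", 4000), (22, "IN", "CType 2", 1100),
    (23, "IN", "CType 2", 1000),
]


def crosstab(rows):
    """Count customers in each (customer type, country) cell."""
    return Counter((ctype, ctry) for _cid, ctry, ctype, _amt in rows)


counts = crosstab(ROWS)
types = sorted({ctype for ctype, _ in counts})
countries = sorted({ctry for _, ctry in counts})

# Print the contingency table: one row per customer type,
# one column per country.
print("type".ljust(10) + "".join(c.rjust(6) for c in countries))
for ctype in types:
    cells = "".join(str(counts[(ctype, c)]).rjust(6) for c in countries)
    print(ctype.ljust(10) + cells)
```

Cells that are much fuller or emptier than the row and column totals 
would suggest are the deviations from uniformity worth thinking about.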


Good Luck.

saisat wrote:

>All:
>
>I have a customer database of close to 5 million customers.
>Each customer has different category variables (like Customer Type,
>Country of Origin etc) and different range variables (like Daily
>Transaction amount, Daily Transaction Count etc). I need to segment
>these customers into different groups or clusters, where the
>members of a group share common characteristics.
>
>For example, if I have the data set
>
>Id     Ctry    CustomerType    DailyTransactionAmt     
>1      IQ      CType 1         2000
>2      IQ      CType 1         3000
>3      IQ      CType 1         4000
>4      IQ      CType 1         3000
>5      IQ      CType 1         10000
>6      IQ      CType 1         11000
>7      IQ      CType 1         12000
>8      IQ      CType 1         11000
>9      IN      CType 1         10000
>10     IN      CType 1         15000
>11     IN      CType 1         55000
>12     IN      CType 1         60000
>13     IN      CType 1         70000
>14     IQ      CType 2         85000
>15     IQ      CType 2         75000
>16     IQ      CType 2         90000
>17     IQ      CType 2         10000
>18     IQ      CType 2         3500
>19     IQ      CType 2         3000
>20     IQ      CType 2         4000
>21     IQ      CType 2         4000
>22     IN      CType 2         1100
>23     IN      CType 2         1000            
>
>
>I need an output like
>
>
>CType1 --- IQ -- (2000 <= amt<= 4000)  [Members: 1,2,3,4]
>CType1 ---- IQ -- (10000 <= amt <=12000)   [Members: 5,6,7,8]
>CType1 ---- IN -- (10000 <= amt <=15000) [Members: 9,10]
>CType1 ---- IN -- (55000 <= amt <=70000) [Members: 11,12,13]
>CType2 ---- IQ -- (75000 <= amt <=100000) [Members: 14,15,16,17]
>CType2 ---- IQ -- (3000 <= amt <=40000) [Members: 18,19,20,21]
>CType2 ---- IN -- (1000 <= amt <=1100) [Members: 22,23]
>
>
>Please note that I don't know the number of clusters beforehand.
>I am new to this area and am reading up on different material, and I
>would appreciate any suggestions you can provide.
>
>Thanks
>Satish
>  
>
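
To make the cluster validity point concrete on the sample above, here is 
a rough sketch in Python (standard library only).  It is not a 
recommended method, just an illustration: within each (customer type, 
country) group it sorts the amounts and starts a new segment whenever an 
amount is more than `ratio` times the previous one.  The function name 
split_on_gaps and the ratio thresholds are arbitrary assumptions made up 
for this example.

```python
from collections import defaultdict

# Sample rows copied from the quoted question:
# (id, country, customer type, daily transaction amount)
ROWS = [
    (1, "IQ", "CType 1", 2000), (2, "IQ", "CType 1", 3000),
    (3, "IQ", "CType 1", 4000), (4, "IQ", "CType 1", 3000),
    (5, "IQ", "CType 1", 10000), (6, "IQ", "CType 1", 11000),
    (7, "IQ", "CType 1", 12000), (8, "IQ", "CType 1", 11000),
    (9, "IN", "CType 1", 10000), (10, "IN", "CType 1", 15000),
    (11, "IN", "CType 1", 55000), (12, "IN", "CType 1", 60000),
    (13, "IN", "CType 1", 70000), (14, "IQ", "CType 2", 85000),
    (15, "IQ", "CType 2", 75000), (16, "IQ", "CType 2", 90000),
    (17, "IQ", "CType 2", 10000), (18, "IQ", "CType 2", 3500),
    (19, "IQ", "CType 2", 3000), (20, "IQ", "CType 2", 4000),
    (21, "IQ", "CType 2", 4000), (22, "IN", "CType 2", 1100),
    (23, "IN", "CType 2", 1000),
]


def split_on_gaps(rows, ratio):
    """Return a list of ((type, country), [customer ids]) segments.

    Rows are grouped by (type, country) and sorted by amount; a new
    segment starts whenever an amount exceeds `ratio` times the
    previous amount in the same group.
    """
    groups = defaultdict(list)
    for cid, ctry, ctype, amt in rows:
        groups[(ctype, ctry)].append((amt, cid))

    segments = []
    for key in sorted(groups):
        current = []
        prev = None
        for amt, cid in sorted(groups[key]):
            if prev is not None and amt > ratio * prev:
                segments.append((key, current))
                current = []
            current.append(cid)
            prev = amt
        segments.append((key, current))
    return segments


# The "number of clusters" is whatever the threshold makes it.
for ratio in (1.4, 2.0, 8.0):
    print(ratio, len(split_on_gaps(ROWS, ratio)))
```

With these 23 rows the three thresholds give 10, 8 and 4 segments 
respectively, and none of them matches the 7 groups asked for: at 
ratio 2.0, for instance, customer 17 (amount 10000) ends up alone 
rather than with customers 14 to 16.  Nothing in the data says which 
threshold is the right one.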

.
.
=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
.                  http://jse.stat.ncsu.edu/                    .
=================================================================
