Saisat:

I am sorry to say that there is no reliable way to accomplish your goals accurately and quickly. IMHO you have to "live" with the data for a while. Construct cross-tabulation tables (i.e. contingency tables), etc. Think about what deviations from uniformity mean in your context. Consider the possibility that the data might not form distinct clusters, but that a subset of two or more variables may have non-random relationships among a subset of the samples.

Hierarchical clustering, decision trees, etc. are possibilities, but IMHO they are not guaranteed to be a royal road to success. In my dumb experience, it will take maybe a month's hard work to winkle out the patterns that you seek and to accept them with any confidence. Remember that if you throw data at a computer program, you will almost always get something back, regardless of whether your data has relevant non-random structure or not.

The fact that you have 5 million samples should not be a problem unless you try some sort of nearest neighbor approach to clustering, since a naive version of that compares every pair of samples. Also remember that there exists the "Cluster Validity Problem": there is no objective way to know the true number of clusters, nor will there ever be.
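To make the cross-tabulation suggestion concrete, here is a minimal Python sketch that tallies the 23 sample rows from the question quoted below into a contingency table and computes a Pearson chi-square statistic as a rough measure of deviation from independence. The amount bands and their cut points (5000 and 20000), and the helper names `amount_band`, `crosstab` and `chi_square`, are my own assumptions for illustration and are not part of the original question.

```python
from collections import Counter

# (Id, Ctry, CustomerType, DailyTransactionAmt), copied from the quoted question.
ROWS = [
    (1, "IQ", "CType 1", 2000), (2, "IQ", "CType 1", 3000),
    (3, "IQ", "CType 1", 4000), (4, "IQ", "CType 1", 3000),
    (5, "IQ", "CType 1", 10000), (6, "IQ", "CType 1", 11000),
    (7, "IQ", "CType 1", 12000), (8, "IQ", "CType 1", 11000),
    (9, "IN", "CType 1", 10000), (10, "IN", "CType 1", 15000),
    (11, "IN", "CType 1", 55000), (12, "IN", "CType 1", 60000),
    (13, "IN", "CType 1", 70000), (14, "IQ", "CType 2", 85000),
    (15, "IQ", "CType 2", 75000), (16, "IQ", "CType 2", 90000),
    (17, "IQ", "CType 2", 10000), (18, "IQ", "CType 2", 3500),
    (19, "IQ", "CType 2", 3000), (20, "IQ", "CType 2", 4000),
    (21, "IQ", "CType 2", 4000), (22, "IN", "CType 2", 1100),
    (23, "IN", "CType 2", 1000),
]

# Hypothetical amount bands: the cut points are arbitrary assumptions,
# chosen only to keep the table small and readable.
BANDS = ["<5k", "5k-20k", ">=20k"]


def amount_band(amt):
    """Map a daily transaction amount to one of the assumed bands."""
    if amt < 5000:
        return BANDS[0]
    if amt < 20000:
        return BANDS[1]
    return BANDS[2]


def crosstab(rows):
    """Count rows per ((CustomerType, Ctry), amount band) cell."""
    return Counter(
        ((ctype, ctry), amount_band(amt)) for _id, ctry, ctype, amt in rows
    )


def chi_square(counts, groups, bands):
    """Pearson chi-square statistic against the independence model."""
    n = sum(counts[(g, b)] for g in groups for b in bands)
    row_tot = {g: sum(counts[(g, b)] for b in bands) for g in groups}
    col_tot = {b: sum(counts[(g, b)] for g in groups) for b in bands}
    stat = 0.0
    for g in groups:
        for b in bands:
            expected = row_tot[g] * col_tot[b] / n
            if expected > 0:  # skip cells whose row or column is empty
                stat += (counts[(g, b)] - expected) ** 2 / expected
    return stat


counts = crosstab(ROWS)
groups = sorted({g for g, _band in counts})

# Print the contingency table, one row per (CustomerType, Ctry) group.
print(f"{'group':<12}" + "".join(f"{b:>8}" for b in BANDS))
for g in groups:
    label = f"{g[0]}/{g[1]}"
    print(f"{label:<12}" + "".join(f"{counts[(g, b)]:>8}" for b in BANDS))
print("chi-square:", round(chi_square(counts, groups, BANDS), 2))
```

With only 23 rows the expected cell counts are far too small for the statistic to be trusted as a formal test, so treat it as a pointer to which cells deserve a closer look. The tally itself is a single pass over the rows, so the same approach should scale to the full customer table.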
Good Luck.

saisat wrote:
>All:
>
>I have a customer database of close to 5 million customers.
>Each customer has different category variables (like Customer Type,
>Country of Origin etc) and different range variables (like Daily
>Transaction amount, Daily Transaction Count etc). I need to segment
>these customers into different groups or clusters wherein the
>members of a group share common characteristics.
>
>For example if I have the data set
>
>Id  Ctry  CustomerType  DailyTransactionAmt
>1   IQ    CType 1       2000
>2   IQ    CType 1       3000
>3   IQ    CType 1       4000
>4   IQ    CType 1       3000
>5   IQ    CType 1       10000
>6   IQ    CType 1       11000
>7   IQ    CType 1       12000
>8   IQ    CType 1       11000
>9   IN    CType 1       10000
>10  IN    CType 1       15000
>11  IN    CType 1       55000
>12  IN    CType 1       60000
>13  IN    CType 1       70000
>14  IQ    CType 2       85000
>15  IQ    CType 2       75000
>16  IQ    CType 2       90000
>17  IQ    CType 2       10000
>18  IQ    CType 2       3500
>19  IQ    CType 2       3000
>20  IQ    CType 2       4000
>21  IQ    CType 2       4000
>22  IN    CType 2       1100
>23  IN    CType 2       1000
>
>
>I need an output like
>
>
>CType1 --- IQ -- (2000 <= amt<= 4000) [Members: 1,2,3,4]
>CType1 ---- IQ -- (10000 <= amt <=12000) [Members: 5,6,7,8]
>CType1 ---- IN -- (10000 <= amt <=15000) [Members: 9,10]
>CType1 ---- IN -- (55000 <= amt <=70000) [Members: 11,12,13]
>CType2 ---- IQ -- (75000 <= amt <=100000) [Members: 14,15,16,17]
>CType2 ---- IQ -- (3000 <= amt <=40000) [Members: 18,19,20,21]
>CType2 ---- IN -- (1000 <= amt <=1100) [Members: 22,23]
>
>
>Please note that I don't know the number of clusters beforehand.
>I am new to this area and am reading up on different material, and I
>would appreciate any suggestions you can provide.
>
>Thanks
>Satish

=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
                  http://jse.stat.ncsu.edu/
=================================================================
