Hello, thanks for the reply!
I'll try to be more specific (which is a bit difficult at the moment). The contract example you gave comes quite close. For example: if the cost of a contract this month is not within a 5% range of what it was last month, raise an alarm. Similar things could be done on phone numbers or on numbers of connections.

The thing I know no one has an idea of: is it 5% or 25%? These numbers only become clear after at least a year, and I don't assume anyone will tweak the system continuously. On the other hand, from December to January a jump of 25% may be absolutely normal (some companies 'stop' working in December as the whole staff must go on vacation). So besides the change in cost, the time of year should also be reflected in the alarm limits.

Looking at a single contract seems a bad idea. So I could look at all contracts of the same company and try to figure out some average change in costs. If 80% of the contracts show an increase, the alarm limit could be higher. But here again: why 80% (it's an arbitrary number right now)? This would get easier once a full 12-month data set is stored. Until then... trial and error. But that's hard to sell (especially because the time required for tweaking is hard to specify).

Besides the price of the contract there is information like: number of connections, time of the individual connections (people work between 8 and 17h), number of products used (SMS, telephony to foreign countries, data options, etc.). Right now I don't know how to figure out the numbers needed to declare behaviour 'unusual' based on limits that differ from company to company (a plumber may have different limits than a consulting company).

Of course there is the option: let the user choose the limits. This has two drawbacks:
- where should the user know the limits from? and
- some users have to look at thousands of contracts.
So I would prefer the system to work on its own (as much as possible). I don't know whether there are some 'weighting' algorithms out there that do similar things, so any hint may help a lot.
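To make the idea a bit more concrete, here is a rough sketch of one way a limit could be derived from the data instead of being picked by hand. It is only an illustration, not something we have built: it compares a contract's month-over-month change against the median change of all contracts of the same company and measures the distance in units of the median absolute deviation (MAD). The class and method names, the factor k and the minSpread floor are all made up for this example.

```java
import java.util.Arrays;

/** Hypothetical helper: flags a contract whose monthly change differs strongly from its peers. */
final class ContractAlarm {

    /** Relative month-over-month change, e.g. 0.25 for a 25% increase.
     *  Callers should skip contracts whose previous cost is 0. */
    static double relativeChange(double previous, double current) {
        return (current - previous) / previous;
    }

    /** Median of the values; works on a sorted copy, the input is not modified. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Median absolute deviation of the values around the given center. */
    static double mad(double[] values, double center) {
        double[] deviations = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            deviations[i] = Math.abs(values[i] - center);
        }
        return median(deviations);
    }

    /**
     * True if 'change' lies more than k spreads away from the median peer change.
     * k and minSpread are assumptions: minSpread only prevents a zero-width band
     * when all peers changed by the same amount.
     */
    static boolean isUnusual(double change, double[] peerChanges, double k, double minSpread) {
        double center = median(peerChanges);
        double spread = Math.max(mad(peerChanges, center), minSpread);
        return Math.abs(change - center) > k * spread;
    }
}
```

With peer changes of roughly +20% to +27% (the December to January case), a contract that rises by 26% would not be flagged, while one that rises by 60% or does not rise at all would be. The limit then follows the company's own behaviour for that month, so nobody has to decide between 5% and 25% up front. Whether k = 3 is a sensible factor is exactly the kind of thing that would still need tuning.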
The whole alarming feature will be part of the web application, so whatever we need to do on the cluster must be done via JDBC (anything from doing it ourselves to calling a stored procedure). I don't know if we are allowed to run Java applications inside the DB directly. I know we execute some tools to import data that run on the DB hosts, but in a different JVM. But actually my system requirements question goes more in this direction: do I need a computing cloud and MapReduce, or are algorithms that learn from data independent of such things?

Thanks :)

Kind regards,
werner

Grant Ingersoll wrote:
>
> On Sep 28, 2008, at 6:26 PM, werner mueller wrote:
>
>> Hello all,
>>
>> finally I found some time to ask boring questions :)
>>
>> I more or less stumbled across the Mahout project at ApacheCon08 in
>> Amsterdam, but I haven't found the time to look into it deeply.
>>
>> I would like to ask for some hints / links / directions for a
>> 'predictions' feature. I read through the Mahout wiki and found some
>> interesting links, but since I come more from the applications side and
>> am not that much into databases, I need some help getting started.
>>
>> We develop a reporting application for a telecommunication company.
>> We mainly store data in an Oracle cluster; it consists of a star schema.
>> The application mainly offers to create reports on two data sources:
>> costs and traffic. The data amount is about 1-2 terabytes.
>>
>> The idea came up to implement some 'alarming' features, so customers
>> could set up some limits for contracts, phone numbers etc. to get
>> notified once the limits are reached or the data 'behaves strangely' (too
>> strong increases for a period, other ideas to come...).
>
> Can you give an example? It sounds like you simply want the user to say
> "if contracts > X, then alarm", but I gather not, since you are asking
> here. Or are you looking for the user
> to not be involved in setting the thresholds, but instead to learn from
> past examples where there was a problem? For instance, you have
> failures from before, but you don't particularly know why it failed
> (i.e. what features caused the problem).
>
>>
>>
>> I would like to ask if there is something of use in Mahout, or whether
>> you would recommend keeping such features 'simple' on a statistical
>> basis and not using learning techniques at all?
>
> Well, simple is usually better, if it solves your problem.
>
>>
>>
>> On the other hand, the more boring questions: do I need a Hadoop cluster
>> for your implementations, or could I run them on Oracle-based clusters as
>> well?
>
> I don't know enough about Oracle clusters to render an opinion. If you're
> asking if Mahout will run inside the Oracle JVM, I'm guessing that would
> be a stretch at this point, but I don't have anything to base that on.
>
> -Grant
>
>
>
