From my experience (merging machine learning with business goals), I'll offer a few pieces of advice that may help guide you.

1. First determine what data you have (and how much of it), and how you want to store/query it.
   - If you have 1.5 TB of log data, you are in the realm of Hadoop. If you find, however, that you only need to operate on a subset of this data (~100 MB), you may just want to load it into memory and run algorithms against it with something like Octave, R, Matlab, or Python. That is probably the easiest route. In fact, I'd say do that first before you go whole-hog on the distributed system.

2. Second, come up with questions about your data that you want to answer (or have someone give those questions to you). Make those questions as specific as possible.
   - The type of question will tell you what tool you need to use. Sometimes this means querying with Hive (e.g., how many unique users viewed this type of page?) if the data is too much/too sparse to put into MySQL. Sometimes this means just writing a Python/Ruby script with a few regexes and hunting through the data. If the questions are predictive in nature, you may need to use some machine learning tools.

3. Simple techniques often will get you 80% of the way to your goal. Machine learning gets you the other 20% (or sometimes only 5%!).
   - I would say to use machine learning only once you know the domain of the problem you're trying to solve extremely well, because it will take effort and you should be immediately skeptical of any result you get back. It's a black box whose inner workings you really should know, so my advice is to exhaust all non-machine-learning options first, then go for that extra accuracy if it's warranted.

Good luck!

On Tue, Aug 31, 2010 at 6:03 PM, Sean Owen <[email protected]> wrote:

> On Tue, Aug 31, 2010 at 10:55 PM, hdev ml <[email protected]> wrote:
> > Per my understanding of hive, we can do some statistical reporting, like
> > frequency of user sessions, which geographical region, which device he is
> > using the most etc.
>
> Yes that's about what Hive is good for, if you're looking for some
> open-source libraries along those lines.
>
> > But we also want to mine this data to get some predictive capabilities like
> > what is the likelihood that the user will use the same device again or if we
> > get sales/marketing data (on the roadmap for future), we want to possibly
> > predict which region to put more marketing/sales efforts. What is the
> > pattern for growth of user base, in which geographical regions etc. What is
> > the pattern of user requests failing and a number of requirements like these
> > from the business.
>
> This is pretty broad but I can try to give you the names of problems
> this sounds like, to guide your search.
>
> Predicting user usage of device sounds like a classification problem,
> like developing a probabilistic model of behavior.
>
> Deciding where to put marketing dollars sounds like a business
> problem, not machine learning. I don't think a computer can tell you
> that. Some techniques might help you identify trends in sales, but
> this is simple regression, not really machine learning.
>
> Looking for patterns in failure sounds a bit like frequent pattern
> mining -- trying to find events that go together unusually often.
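
P.S. To make the "script with a few regexes" idea from point 2 concrete, here is a minimal Python sketch that answers "how many unique users viewed each type of page?" on a subset small enough to fit in memory. The log format, the field names (`user=`, `page=`), and the function name are all assumptions for illustration; your logs will almost certainly look different, so treat the regex as a placeholder.

```python
import re
from collections import defaultdict

# ASSUMED (hypothetical) log format, one request per line, e.g.:
#   2010-08-31 18:03:12 user=alice page=/product/42 status=200
# The "page type" is taken to be the first path segment ("product").
LINE_RE = re.compile(r"user=(?P<user>\S+)\s+page=/(?P<page_type>[^/\s]+)")


def unique_users_by_page_type(lines):
    """Return {page_type: number of distinct users who viewed it}."""
    users = defaultdict(set)
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # silently skip lines that don't fit the assumed format
            users[match.group("page_type")].add(match.group("user"))
    return {page_type: len(seen) for page_type, seen in users.items()}


# Made-up sample lines, only to show the shape of the input.
sample = [
    "2010-08-31 18:03:12 user=alice page=/product/42 status=200",
    "2010-08-31 18:03:15 user=bob page=/product/7 status=200",
    "2010-08-31 18:04:01 user=alice page=/product/7 status=200",
    "2010-08-31 18:04:09 user=alice page=/search status=500",
    "malformed line",
]
counts = unique_users_by_page_type(sample)
print(counts)
```

Since the function takes any iterable of lines, you could pass it an open file handle instead of the sample list. If a script like this answers the question on a representative subset, that is a good sign you may not need the distributed system (or machine learning) for it at all.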
