I agree with you that preparation is needed for Mahout processing. I was just trying to save on that effort by re-using the data already in Hive instead of processing it twice.
I may have some more questions when I actually dive into the mining part (possibly a couple of months down the line). Thanks for your inputs.

On Wed, Sep 1, 2010 at 12:58 AM, Sean Owen <[email protected]> wrote:

> Hive does something fairly unrelated to Mahout. It's an indexing and
> query system. Both might start from the same source data, but to do
> different things. There is no common format, no. Mahout generally
> operates on text files or "Vectors" in SequenceFiles. So there's some
> translation there at least.
>
> But I think a message here is that there's more preparation and
> thought necessary to start data mining. It's not like you point a data
> mining tool at some data and answers start flowing automatically.
> You'd have to be deliberately extracting and preparing data anyhow.
>
> On Tue, Aug 31, 2010 at 11:41 PM, hdev ml <[email protected]> wrote:
> > Thanks Sean for the answers. Thanks to Ted for validation.
> >
> > Now my question is, since I want to do both reporting of large data/
> > datawarehouse, let's assume I choose Hive for that.
> >
> > Now can Mahout integrate with Hive to make use of this data for learning,
> > mining etc.? or do I have to export the hive data into text files which can
> > be hosted by Hadoop/HDFS which later on Mahout can use for data mining.
> >
> > In short, can data warehousing part be done by Hive and then can data mining
> > part be done by Mahout on this hive data?
> >
> > -H
> >
> > On Tue, Aug 31, 2010 at 3:03 PM, Sean Owen <[email protected]> wrote:
> >
> >> On Tue, Aug 31, 2010 at 10:55 PM, hdev ml <[email protected]> wrote:
> >> > Per my understanding of hive, we can do some statistical reporting, like
> >> > frequency of user sessions, which geographical region, which device he is
> >> > using the most etc.
> >>
> >> Yes that's about what Hive is good for, if you're looking for some
> >> open-source libraries along those lines.
> >>
> >> > But we also want to mine this data to get some predictive capabilities like
> >> > what is the likelihood that the user will use the same device again or if we
> >> > get sales/marketing data (on the roadmap for future), we want to possibly
> >> > predict which region to put more marketing/sales efforts. What is the
> >> > pattern for growth of user base, in which geographical regions etc. What is
> >> > the pattern of user requests failing and a number of requirements like these
> >> > from the business.
> >>
> >> This is pretty broad but I can try to give you the names of problems
> >> this sounds like, to guide your search.
> >>
> >> Predicting user usage of device sounds like a classification problem,
> >> like developing a probabilistic model of behavior.
> >>
> >> Deciding where to put marketing dollars sounds like a business
> >> problem, not machine learning. I don't think a computer can tell you
> >> that. Some techniques might help you identify trends in sales, but
> >> this is simple regression, not really machine learning.
> >>
> >> Looking for patterns in failure sounds a bit like frequent pattern
> >> mining -- trying to find events that go together unusually often.
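
For reference, the "translation" step mentioned in the thread could look roughly like the sketch below. It is only an illustration, not something from Hive or Mahout: it assumes the Hive table is stored as plain text with Hive's default Ctrl-A (`\x01`) field delimiter, and it rewrites each row as a comma-separated line that a downstream preparation job could read. The function name `hive_rows_to_csv` and the sample rows are hypothetical, and any further conversion into Mahout's own input formats is left out.

```python
import csv
import io

# Assumption: the table uses Hive's default text-format field delimiter (Ctrl-A).
HIVE_DELIM = "\x01"


def hive_rows_to_csv(lines, out):
    """Rewrite Ctrl-A-delimited rows as CSV on `out`; return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        writer.writerow(line.split(HIVE_DELIM))
        count += 1
    return count


# Hypothetical sample rows: user id, region, device.
sample = ["u1\x01US\x01phone\n", "u2\x01DE\x01tablet\n"]
buf = io.StringIO()
n = hive_rows_to_csv(sample, buf)
```

In practice `lines` would be an open file exported from the Hive warehouse directory, and `out` a file that the Mahout preparation step reads.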
