Thanks Sean, that is exactly what I want.

On Mon, Sep 30, 2013 at 3:09 PM, Sean Busbey <bus...@cloudera.com> wrote:

> S,
>
> Check out these presentations from Data Science Maryland back in May[1].
>
> 1. working with Tweets in Hive:
>
> http://www.slideshare.net/JoeyEcheverria/analyzing-twitter-data-with-hadoop-20929978
>
> 2. then pulling stuff out of Hive to use with Mahout:
>
> http://files.meetup.com/6195792/Working%20With%20Mahout.pdf
>
> The Mahout talk didn't have a directly useful outcome (largely because it
> tried to work with the tweets as individual text documents), but it does
> get through all the mechanics of exactly what you state you want.
>
> The meetup page also has links to video, if the slides don't give enough
> context.
>
> HTH
>
> [1]: http://www.meetup.com/Data-Science-MD/events/111081282/
>
>
> On Mon, Sep 30, 2013 at 11:54 AM, Saurabh B <saurabh.wri...@gmail.com> wrote:
>
>> Hi Nitin,
>>
>> No offense taken. Thank you for your response. Part of this is also
>> trying to find the right tool for the job.
>>
>> I am doing queries to determine the cuts of tweets that I want, then
>> doing some modest normalization (through a python script) and then I want
>> to create sequenceFiles from that.
>>
>> So far Hive seems to be the most convenient way to do this. But I can
>> take a look at Pig too. It looked like the "STORED AS SEQUENCEFILE" gets me
>> 99% of the way there. So I was wondering if there was a way to get those
>> ids in there as well. The last piece is always the stumbler :)
>>
>> Thanks again,
>>
>> S
>>
>>
>> On Mon, Sep 30, 2013 at 2:41 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>>> Are you using Hive just to convert your text files to sequence files?
>>> If that's the case then you may want to look at the purpose why Hive was
>>> developed.
>>>
>>> If you want to modify data or process data which does not involve any
>>> kind of analytics functions on a routine basis.
>>>
>>> If you want to do data manipulation or enrichment and do not want to
>>> code a lot of map reduce jobs, you can take a look at Pig scripts.
>>> Basically what you want to do is generate a UUID for each of your
>>> tweets and then feed it to Mahout algorithms.
>>>
>>> Sorry if I understood it wrong or it sounds rude.
>>>
>
> --
> Sean
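
For the "generate a UUID for each tweet" step mentioned in the thread, here is a minimal Python sketch of a streaming filter that could sit alongside the normalization script. It assumes each input line holds exactly one tweet's text, and the names `add_ids`, `make_id`, and `main` are made up for illustration; none of this comes from the slides linked above. It only prepends an id column, so whether that column ends up where Mahout expects the SequenceFile key is a separate question this sketch does not answer.

```python
import sys
import uuid


def add_ids(lines, make_id=lambda: str(uuid.uuid4())):
    """Yield 'id<TAB>text' for every non-empty input line.

    Assumption: one tweet per line. make_id is a hypothetical hook so the
    id source can be swapped out (for example, in tests).
    """
    for line in lines:
        # Collapse tabs and repeated whitespace so the text stays one column.
        text = " ".join(line.split())
        if not text:
            continue  # skip blank lines
        yield f"{make_id()}\t{text}"


def main():
    # Read tweets from stdin and write tab-separated (id, text) rows to stdout,
    # which is the row format streaming scripts are usually fed and expected to emit.
    for row in add_ids(sys.stdin):
        print(row)
```

To use it as a standalone script, call `main()` at the bottom of the file and pipe the normalized tweets through it. Random UUIDs are not reproducible across runs, so if the same tweet must keep the same id, a deterministic key (such as the tweet's own id, if available in the data) would be a better fit.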