Thanks Sean, that is exactly what I want.
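
For the archive, here is a minimal sketch of the "id per tweet" step
discussed in the quoted messages below. It assumes each input line holds a
single column of tweet text (for example, rows streamed to a script through
Hive's TRANSFORM clause or Hadoop streaming). The function names and the
normalization itself are placeholders, not what I actually run. It puts the
id in the row as a first column; whether that id ends up as the SequenceFile
key Mahout expects depends on how the file is written, which this sketch
does not address.

```python
import sys
import uuid


def normalize(text):
    """Collapse runs of whitespace and lowercase the text.

    This is a stand-in for the real normalization step.
    """
    return " ".join(text.split()).lower()


def add_ids(lines, make_id=lambda: uuid.uuid4().hex):
    """Yield 'id<TAB>normalized text' for each non-empty input line.

    make_id is injectable so the id scheme can be swapped (or tested).
    """
    for line in lines:
        text = normalize(line.rstrip("\n"))
        if text:
            yield "%s\t%s" % (make_id(), text)


if __name__ == "__main__":
    # Read rows from stdin and write id-prefixed rows to stdout.
    for row in add_ids(sys.stdin):
        print(row)
```

Saved under a name of your choice, the script reads tweets on stdin and
writes tab-separated (id, text) rows on stdout, so it can sit wherever the
existing Python normalization script already runs.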

On Mon, Sep 30, 2013 at 3:09 PM, Sean Busbey <bus...@cloudera.com> wrote:

> S,
>
> Check out these presentations from Data Science Maryland back in May[1].
>
> 1. Working with tweets in Hive:
>
> http://www.slideshare.net/JoeyEcheverria/analyzing-twitter-data-with-hadoop-20929978
>
> 2. Then pulling stuff out of Hive to use with Mahout:
>
> http://files.meetup.com/6195792/Working%20With%20Mahout.pdf
>
> The Mahout talk didn't have a directly useful outcome (largely because it
> tried to work with the tweets as individual text documents), but it does
> get through all the mechanics of exactly what you state you want.
>
> The meetup page also has links to video, if the slides don't give enough
> context.
>
> HTH
>
> [1]: http://www.meetup.com/Data-Science-MD/events/111081282/
>
>
> On Mon, Sep 30, 2013 at 11:54 AM, Saurabh B <saurabh.wri...@gmail.com> wrote:
>
>> Hi Nitin,
>>
>> No offense taken. Thank you for your response. Part of this is also
>> trying to find the right tool for the job.
>>
>> I am running queries to select the cuts of tweets that I want, then
>> doing some modest normalization (through a Python script), and then I
>> want to create SequenceFiles from the result.
>>
>> So far Hive seems to be the most convenient way to do this, but I can
>> take a look at Pig too. It looked like "STORED AS SEQUENCEFILE" gets me
>> 99% of the way there, so I was wondering if there was a way to get those
>> ids in there as well. The last piece is always the stumbler :)
>>
>> Thanks again,
>>
>> S
>>
>>
>>
>>
>> On Mon, Sep 30, 2013 at 2:41 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>>> Are you using Hive just to convert your text files to sequence files?
>>> If that's the case, then you may want to look at the purpose for which
>>> Hive was developed.
>>>
>>> Hive may not be the right fit if you want to modify or process data on
>>> a routine basis without involving any kind of analytics functions.
>>>
>>> If you want to do data manipulation or enrichment and do not want to
>>> code a lot of MapReduce jobs, you can take a look at Pig scripts.
>>> Basically, what you want to do is generate a UUID for each of your
>>> tweets and then feed them to Mahout algorithms.
>>>
>>> Sorry if I misunderstood or if this sounds rude.
>>>
>>
>>
>
>
> --
> Sean
>
