Converting from textfile to sequencefile using Hive

2013-09-30 Thread Saurabh Bhatnagar (Business Intelligence)
Hi,

I have a lot of tweets saved as text. I created an external table on top of
them to access them as a textfile. I need to convert these to SequenceFiles,
with each tweet as its own record. To do this, I created another table as a
SequenceFile table, like so -

CREATE EXTERNAL TABLE tweetseq (
  tweet STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS SEQUENCEFILE
LOCATION '/user/hdfs/tweetseq';
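
For illustration, a minimal sketch of the insert that performs the
conversion, assuming the original text table is named tweets_text (a
hypothetical name, not given in the thread):

INSERT OVERWRITE TABLE tweetseq
SELECT tweet
FROM tweets_text;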


Now when I insert into this table from my original tweets table, each line
gets its own record as expected. This is great. However, I don't have any
record ids here. Short of writing my own UDF to make that happen, are there
any obvious solutions I am missing here?
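
One UDF-free possibility, sketched under the assumptions that Hive's
built-in reflect() UDF is available and that the source table is the
hypothetical tweets_text (the target table and location names here are
illustrative too): add an id column and generate a UUID per row.

CREATE EXTERNAL TABLE tweetseq_with_id (
  id STRING,
  tweet STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS SEQUENCEFILE
LOCATION '/user/hdfs/tweetseq_with_id';

INSERT OVERWRITE TABLE tweetseq_with_id
SELECT reflect('java.util.UUID', 'randomUUID') AS id, tweet
FROM tweets_text;

Note that Hive serializes the selected columns into the SequenceFile
value; the key remains BytesWritable, so the id lands in the value rather
than in the key that seq2sparse reads the document id from.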

PS, I need the ids to be there because Mahout seq2sparse expects each
record's document id as the Text key of the SequenceFile. Without ids, it
fails with -

java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text
    at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:37)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
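
The cast failure above indicates that the SequenceFile keys Hive wrote are
BytesWritable, while Mahout's SequenceFileTokenizerMapper expects Text keys
carrying the document ids. A minimal standalone sketch of producing
Text-keyed files with the plain Hadoop API, assuming the selected tweets
were first exported from Hive as one tweet per line (the class name,
argument paths, and the UUID-as-id scheme are illustrative assumptions,
not from the thread):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TweetsToSeq {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);   // plain text, one tweet per line
    Path out = new Path(args[1]);  // e.g. a part file under /user/hdfs/tweetseq
    // Text key (document id) and Text value (tweet body) is the record
    // layout seq2sparse expects.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(in), "UTF-8"));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        writer.append(new Text(UUID.randomUUID().toString()), new Text(line));
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}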

Regards,
S


Re: Converting from textfile to sequencefile using Hive

2013-09-30 Thread Nitin Pawar
Are you using Hive just to convert your text files to sequence files?
If that's the case, then you may want to look at the purpose Hive was
developed for; it's not really meant for routinely modifying or processing
data in ways that don't involve any kind of analytics functions.

If you want to do data manipulation or enrichment and do not want to code
a lot of MapReduce jobs, you can take a look at Pig scripts.
Basically, what you want to do is generate a UUID for each of your tweets
and then feed them to the Mahout algorithms.

Sorry if I understood it wrong or it sounds rude.


Re: Converting from textfile to sequencefile using Hive

2013-09-30 Thread Saurabh B
Hi Nitin,

No offense taken. Thank you for your response. Part of this is also trying
to find the right tool for the job.

I am doing queries to determine the cuts of tweets that I want, then doing
some modest normalization (through a Python script), and then I want to
create SequenceFiles from that.

So far Hive seems to be the most convenient way to do this. But I can take
a look at Pig too. It looked like STORED AS SEQUENCEFILE gets me 99% of the
way there, so I was wondering if there was a way to get those ids in there
as well. The last piece is always the stumbler :)
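
For illustration: the Python normalization step could also run inside Hive
via TRANSFORM. A sketch, assuming a hypothetical script normalize.py and
the hypothetical source table tweets_text, writing into the single-column
tweetseq table from the original post:

ADD FILE normalize.py;

INSERT OVERWRITE TABLE tweetseq
SELECT TRANSFORM (tweet)
  USING 'python normalize.py'
  AS (tweet STRING)
FROM tweets_text;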

Thanks again,

S


Re: Converting from textfile to sequencefile using Hive

2013-09-30 Thread Sean Busbey
S,

Check out these presentations from Data Science Maryland back in May [1].

1. working with Tweets in Hive:

http://www.slideshare.net/JoeyEcheverria/analyzing-twitter-data-with-hadoop-20929978

2. then pulling stuff out of Hive to use with Mahout:

http://files.meetup.com/6195792/Working%20With%20Mahout.pdf

The Mahout talk didn't have a directly useful outcome (largely because it
tried to work with the tweets as individual text documents), but it does
walk through the mechanics of exactly what you say you want.

The meetup page also has links to video, if the slides don't give enough
context.

HTH

[1]: http://www.meetup.com/Data-Science-MD/events/111081282/

-- 
Sean


Re: Converting from textfile to sequencefile using Hive

2013-09-30 Thread Saurabh B
Thanks Sean, that is exactly what I want.

