Re: HBase and Hive integration

2015-06-05 Thread Sean Busbey
+user@hive
-user@hbase to bcc

Hi!

This question is better handled by the hive user list, so I've copied them
in and moved the hbase user list to bcc.

On Fri, Jun 5, 2015 at 12:54 PM, Buntu Dev buntu...@gmail.com wrote:

 Hi -

 Newbie question: I have Hive and HBase on different clusters. Assuming all
 the appropriate ports are open for Hive to connect to HBase, how do I
 create a Hive-managed HBase table?

 Thanks!
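A minimal sketch of the standard HBaseStorageHandler syntax, for reference;
the table, column family, and ZooKeeper host names below are illustrative,
not from the thread:

    -- point Hive at the HBase cluster's ZooKeeper quorum
    SET hbase.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com;

    -- a Hive-managed table backed by HBase; dropping it in Hive
    -- also drops the underlying HBase table
    CREATE TABLE hbase_table_1 (key INT, value STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
    TBLPROPERTIES ("hbase.table.name" = "xyz");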




-- 
Sean


Re: testing subscription

2014-05-11 Thread Sean Busbey
ASF infra had a mail outage for several days last week, impacting all
mailing lists.

It's fixed now, but it is still burning through the backlog, so please be patient.

More info: https://blogs.apache.org/infra/entry/mail_outage

-- 
Sean
On May 11, 2014 4:18 AM, Lefty Leverenz leftylever...@gmail.com wrote:

 Same here.  The archives
 (http://mail-archives.apache.org/mod_mbox/hive-user/201405.mbox/date)
 only have a few messages recently, but at least one failed to reach me,
 and a reply I sent on the thread Re: build Hive-0,13
 (http://mail-archives.apache.org/mod_mbox/hive-user/201405.mbox/%3cCALr1C9pSZS454wKbx5_c7kGvQuZVO=y7gtcvryem4w5ruo2...@mail.gmail.com%3e)
 isn't in the archives.

 The dev@hive list seems to have a similar problem.  I'm checking its
 archives now to see what I'm missing.

 At least I got your message, Peyman.  Thanks for testing.

 -- Lefty


 On Sat, May 10, 2014 at 5:15 PM, Peyman Mohajerian mohaj...@gmail.com wrote:

 I have stopped receiving any email from this list!





Re: Converting from textfile to sequencefile using Hive

2013-09-30 Thread Sean Busbey
S,

Check out these presentations from Data Science Maryland back in May[1].

1. working with Tweets in Hive:

http://www.slideshare.net/JoeyEcheverria/analyzing-twitter-data-with-hadoop-20929978

2. then pulling stuff out of Hive to use with Mahout:

http://files.meetup.com/6195792/Working%20With%20Mahout.pdf

The Mahout talk didn't have a directly useful outcome (largely because it
tried to work with the tweets as individual text documents), but it does
walk through all the mechanics of exactly what you say you want to do.

The meetup page also has links to video, if the slides don't give enough
context.

HTH

[1]: http://www.meetup.com/Data-Science-MD/events/111081282/

On Mon, Sep 30, 2013 at 11:54 AM, Saurabh B saurabh.wri...@gmail.com wrote:

 Hi Nitin,

 No offense taken. Thank you for your response. Part of this is also trying
 to find the right tool for the job.

 I am running queries to determine the cuts of tweets that I want, then
 doing some modest normalization (through a Python script), and then I want
 to create SequenceFiles from that.

 So far Hive seems to be the most convenient way to do this, but I can take
 a look at Pig too. It looks like STORED AS SEQUENCEFILE gets me 99% of the
 way there, so I was wondering if there is a way to get those ids in there
 as well. The last piece is always the stumbler :)

 Thanks again,

 S




 On Mon, Sep 30, 2013 at 2:41 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 Are you using Hive just to convert your text files to sequence files? If
 that's the case, then you may want to look at the purpose Hive was
 developed for: routine data processing that doesn't involve any kind of
 analytics functions isn't really it.

 If you want to do data manipulation or enrichment and do not want to code
 a lot of map reduce jobs, you can take a look at Pig scripts. Basically,
 what you want to do is generate a UUID for each of your tweets and then
 feed them to the Mahout algorithms.

 Sorry if I understood it wrong or if this sounds rude.
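A minimal HiveQL sketch of that suggestion, assuming Hive's built-in
reflect() UDF is available; the table and column names here are
hypothetical:

    -- write the normalized tweets to a SequenceFile-backed table,
    -- tagging each row with a generated UUID
    CREATE TABLE tweets_with_ids
    STORED AS SEQUENCEFILE
    AS
    SELECT reflect('java.util.UUID', 'randomUUID') AS tweet_uuid,
           tweet_text
    FROM normalized_tweets;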





-- 
Sean


Re: Want query to use more reducers

2013-09-30 Thread Sean Busbey
Hey Keith,

It sounds like you should tweak the settings for how Hive handles query
execution[1]:

1) Tune the guessed number of reducers based on input size

=> hive.exec.reducers.bytes.per.reducer

Defaults to 1G. Based on your description, it sounds like this is probably
still at default.

In this case, you should also set a max # of reducers based on your cluster
size.

=> hive.exec.reducers.max

I usually set this to the # reduce slots, if there's a decent chance I'll
get to saturate the cluster. If not, don't worry about it.
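
For example (the values here are illustrative, not recommendations):

    -- aim for one reducer per 256 MB of input instead of the 1 GB default
    SET hive.exec.reducers.bytes.per.reducer=256000000;

    -- cap the reducer count at the cluster's number of reduce slots
    SET hive.exec.reducers.max=32;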

2) Hard code a number of reducers

=> mapred.reduce.tasks

Setting this will cause Hive to always use that number. It defaults to -1,
which tells Hive to guess based on the input-size heuristic above.
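
For example (again, an illustrative value):

    -- force exactly 16 reducers for every job in this session
    SET mapred.reduce.tasks=16;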

In either of the above cases, you should look at the options to merge small
files (search for "merge" in the configuration property list) to avoid
getting lots of little outputs.
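
A sketch of those merge settings, using property names from the
configuration list linked below (values illustrative):

    -- merge the small files left behind by the final reduce stage
    SET hive.merge.mapredfiles=true;

    -- target size for the merged files
    SET hive.merge.size.per.task=256000000;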

HTH

[1]:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryExecution

-Sean

On Mon, Sep 30, 2013 at 11:31 AM, Keith Wiley kwi...@keithwiley.com wrote:

 I have a query that doesn't use reducers as efficiently as I would hope.
  If I run it on a large table, it uses more reducers, even saturating the
 cluster, as I desire.  However, on smaller tables it uses as few as a
 single reducer.  While I understand there is a logic in this (not using
 multiple reducers until the data size is larger), it is nevertheless
 inefficient to run a query for thirty minutes leaving the entire cluster
 vacant when the query could distribute the work evenly and wrap things up
 in a fraction of the time.  The query is shown below (abstracted to its
 basic form).  As you can see, it is a little atypical: it is a nested query
 which obviously implies two map-reduce jobs and it uses a script for the
 reducer stage that I am trying to speed up.  I thought the "distribute by"
 clause should make it use the reducers more evenly, but as I said, that is
 not the behavior I am seeing.

 Any ideas how I could improve this situation?

 Thanks.

 CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' AS
 SELECT * FROM (
     FROM (
         SELECT * FROM input_table
         DISTRIBUTE BY input_column_1
         SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC
     ) q
     SELECT TRANSFORM(*)
     USING 'python my_reducer_script.py' AS (
         output_column_1,
         output_column_2,
         output_column_etc
     )
 ) s
 ORDER BY output_column_1;


 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com

 Luminous beings are we, not this crude matter.
--  Yoda

 




-- 
Sean