Re: HBase and Hive integration
+user@hive, -user@hbase (to bcc)

Hi! This question is better handled by the Hive user list, so I've copied them in and moved the HBase user list to bcc.

On Fri, Jun 5, 2015 at 12:54 PM, Buntu Dev <buntu...@gmail.com> wrote:
> Hi - Newbie question: I have Hive and HBase on different clusters, and assuming all the appropriate ports are open to connect Hive to HBase, how do I create a Hive-managed HBase table? Thanks!

-- Sean
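For reference, a Hive-managed HBase table is created with the HBaseStorageHandler. A minimal sketch follows; the table names, column family, and ZooKeeper hostnames are placeholders for illustration, and pointing Hive at the remote cluster's ZooKeeper quorum is one way to bridge the two clusters (assuming, as the question states, that the ports are open):

```sql
-- Point Hive at the remote HBase cluster's ZooKeeper quorum
-- (hostnames below are hypothetical placeholders).
SET hbase.zookeeper.quorum=zk1.hbase-cluster.example.com,zk2.hbase-cluster.example.com;
SET hbase.zookeeper.property.clientPort=2181;

-- A Hive-managed HBase table: Hive creates (and on DROP, deletes)
-- the underlying HBase table.
CREATE TABLE hbase_managed_table (key STRING, val STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "my_hbase_table");
```

The quorum settings can also go in hive-site.xml instead of being SET per session.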
Re: testing subscription
ASF infra had a mail outage for several days last week, impacting all mailing lists. It's fixed now and is still burning through the backlog, so please be patient. More info: https://blogs.apache.org/infra/entry/mail_outage

-- Sean

On May 11, 2014 4:18 AM, Lefty Leverenz <leftylever...@gmail.com> wrote:
> Same here. The archives <http://mail-archives.apache.org/mod_mbox/hive-user/201405.mbox/date> only have a few messages recently, but at least one failed to reach me, and a reply I sent on the thread "Re: build Hive-0,13" <http://mail-archives.apache.org/mod_mbox/hive-user/201405.mbox/%3cCALr1C9pSZS454wKbx5_c7kGvQuZVO=y7gtcvryem4w5ruo2...@mail.gmail.com%3e> isn't in the archives. The dev@hive list seems to have a similar problem. I'm checking its archives now to see what I'm missing. At least I got your message, Peyman. Thanks for testing.
>
> -- Lefty
>
> On Sat, May 10, 2014 at 5:15 PM, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>> I have stopped receiving any email from this list!
Re: Converting from textfile to sequencefile using Hive
S,

Check out these presentations from Data Science Maryland back in May [1]:

1. Working with tweets in Hive: http://www.slideshare.net/JoeyEcheverria/analyzing-twitter-data-with-hadoop-20929978
2. Then pulling stuff out of Hive to use with Mahout: http://files.meetup.com/6195792/Working%20With%20Mahout.pdf

The Mahout talk didn't have a directly useful outcome (largely because it tried to work with the tweets as individual text documents), but it does walk through all the mechanics of exactly what you describe. The meetup page also has links to video, if the slides don't give enough context. HTH

[1]: http://www.meetup.com/Data-Science-MD/events/111081282/

On Mon, Sep 30, 2013 at 11:54 AM, Saurabh B <saurabh.wri...@gmail.com> wrote:
> Hi Nitin,
>
> No offense taken. Thank you for your response. Part of this is also trying to find the right tool for the job. I am doing queries to determine the cuts of tweets that I want, then doing some modest normalization (through a Python script), and then I want to create SequenceFiles from that. So far Hive seems to be the most convenient way to do this, but I can take a look at Pig too. It looked like STORED AS SEQUENCEFILE gets me 99% of the way there, so I was wondering if there was a way to get those ids in there as well. The last piece is always the stumbler :)
>
> Thanks again,
> S
>
> On Mon, Sep 30, 2013 at 2:41 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>> Are you using Hive just to convert your text files to sequence files? If that's the case, you may want to look at the purpose Hive was developed for. If you want to do data manipulation or enrichment on a routine basis, without analytics functions and without coding a lot of MapReduce jobs, you can take a look at Pig scripts. Basically what you want to do is generate a UUID for each of your tweets and then feed them to Mahout algorithms. Sorry if I understood it wrong or it sounds rude.

-- Sean
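The text-to-SequenceFile conversion with an id per row that this thread discusses can be sketched in HiveQL. This is illustrative only: the table and column names are invented, and using Hive's reflect() UDF to call java.util.UUID.randomUUID() is one possible way to attach the ids, not necessarily what the original poster settled on:

```sql
-- Hypothetical target table, stored as SequenceFile.
CREATE TABLE tweets_seq (id STRING, tweet STRING)
STORED AS SEQUENCEFILE;

-- Generate a UUID per row while copying from a (hypothetical) text table.
INSERT OVERWRITE TABLE tweets_seq
SELECT reflect('java.util.UUID', 'randomUUID'), tweet
FROM tweets_text;
```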
Re: Want query to use more reducers
Hey Keith,

It sounds like you should tweak the settings for how Hive handles query execution [1]:

1) Tune the guessed number of reducers based on input size:
   => hive.exec.reducers.bytes.per.reducer
   Defaults to 1G. Based on your description, it sounds like this is probably still at the default. In this case, you should also set a max number of reducers based on your cluster size:
   => hive.exec.reducers.max
   I usually set this to the number of reduce slots, if there's a decent chance I'll get to saturate the cluster. If not, don't worry about it.

2) Hard-code a number of reducers:
   => mapred.reduce.tasks
   Setting this will cause Hive to always use that number. It defaults to -1, which tells Hive to use the input-size heuristic to guess.

In either of the above cases, you should look at the options to merge small files (search for "merge" in the configuration property list) to avoid getting lots of little outputs. HTH

[1]: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryExecution

-Sean

On Mon, Sep 30, 2013 at 11:31 AM, Keith Wiley <kwi...@keithwiley.com> wrote:
> I have a query that doesn't use reducers as efficiently as I would hope. If I run it on a large table, it uses more reducers, even saturating the cluster, as I desire. However, on smaller tables it uses as few as a single reducer. While I understand the logic in this (not using multiple reducers until the data size is larger), it is nevertheless inefficient to run a query for thirty minutes leaving the entire cluster vacant when the query could distribute the work evenly and wrap things up in a fraction of the time. The query is shown below (abstracted to its basic form). As you can see, it is a little atypical: it is a nested query (which implies two MapReduce jobs) and it uses a script for the reducer stage that I am trying to speed up.
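The two options above can be sketched as session-level SET commands; the specific values here are examples only, to be tuned to the actual cluster:

```sql
-- Option 1: lower the bytes-per-reducer threshold so smaller inputs
-- still fan out, and cap the total (example values).
SET hive.exec.reducers.bytes.per.reducer=134217728;  -- ~128 MB instead of the 1 GB default
SET hive.exec.reducers.max=40;                       -- e.g. the cluster's reduce slots

-- Option 2: pin the reducer count directly, overriding the heuristic.
SET mapred.reduce.tasks=20;
```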
> I thought the DISTRIBUTE BY clause would make it use the reducers more evenly, but as I said, that is not the behavior I am seeing. Any ideas how I could improve this situation? Thanks.
>
> CREATE TABLE output_table
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' AS
> SELECT * FROM (
>     FROM (
>         SELECT * FROM input_table
>         DISTRIBUTE BY input_column_1
>         SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC
>     ) q
>     SELECT TRANSFORM(*)
>     USING 'python my_reducer_script.py'
>     AS (output_column_1, output_column_2, output_column_etc)
> ) s
> ORDER BY output_column_1;
>
> Keith Wiley
> kwi...@keithwiley.com
> keithwiley.com
> music.keithwiley.com
> "Luminous beings are we, not this crude matter." -- Yoda

-- Sean