Unable to merge 3 tables in hive

2012-08-01 Thread iwannaplay games
Hi all, I have 3 tables in MySQL and I want to combine the data of the 3 tables into 1 Hive table (to create a data warehouse). I created a table with all the columns of the 3 tables, but I am unable to push data into the Hive table. After running a Sqoop import statement I pulled all the records into HDFS but At a

RE: Unable to merge 3 tables in hive

2012-08-01 Thread Matouk Iftissen
Hi, are you using the import-all-tables tool? If so, make sure that you follow the instructions for this Sqoop tool. See the Sqoop user guide for more information: http://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1766722 -Original message- From: iwannaplay games

Re: Unable to merge 3 tables in hive

2012-08-01 Thread iwannaplay games
Thanks, I did it by creating 3 external tables and then using this query to update createddate from the users table for a particular userid: insert overwrite table userinfo select u.userid,a.createddate from users a join userinfo u on u.userid=a.userid; I can also use the query option, I'll try that now :)

Re: Best Report Generating tools for hive/hadoop file system

2012-08-01 Thread Artem Ervits
The latest Eclipse BIRT release has Hive and Hadoop connectors. Artem Ervits Data Analyst New York Presbyterian Hospital From: Techy Teck [mailto:comptechge...@gmail.com] Sent: Tuesday, July 31, 2012 08:46 PM To: user@hive.apache.org user@hive.apache.org Subject: Best Report Generating tools for

mapper is slower than hive' mapper

2012-08-01 Thread Yue Guan
Hi there, I'm writing MapReduce code to replace some Hive queries, and I find that my mapper is slower than Hive's mapper. The Hive query looks like: select sum(column1) from table group by column2, column3; My MapReduce program looks like this: public static class HiveTableMapper extends

RE: mapper is slower than hive' mapper

2012-08-01 Thread Connell, Chuck
This is actually not surprising. Hive is essentially a MapReduce compiler. It is common for regular compilers (C, C#, Fortran) to emit faster assembler code than you write yourself. Compilers know the tricks of their target language. Chuck Connell Nuance RD Data Team Burlington, MA

Re: mapper is slower than hive' mapper

2012-08-01 Thread Bertrand Dechoux
One hint would be to reduce the number of Writable instances you need. Create the object once and reuse it. By the way, Hive does not use Writable. ;) Bertrand On Wed, Aug 1, 2012 at 4:35 PM, Connell, Chuck chuck.conn...@nuance.com wrote: This is actually not surprising. Hive is essentially a

Re: mapper is slower than hive' mapper

2012-08-01 Thread Yue Guan
Hive doesn't use Writable?! Could you please give me a pointer to the Hive code so I can see how they do the job? I checked the map output records and found this: my case: total mapper input record: 23091348 total mapper output record: 23091348 avg mapper output bytes/record: 34.819994 total combiner output

Re: mapper is slower than hive' mapper

2012-08-01 Thread Edward Capriolo
As mentioned, if you avoid using new by re-using objects, and possibly use buffer objects, you may be able to match or beat the speed. But in the general case Hive saves you time by letting you not worry about low-level details like this. On Wed, Aug 1, 2012 at 10:35 AM, Connell, Chuck
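
The object-reuse advice in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: LongHolder and Collector are hypothetical stand-ins for a Hadoop Writable and the mapper's output collector, not real Hadoop classes, and the poster's actual mapper is not shown in the archive. The point is simply that the output object is allocated once, outside the per-record loop.

```java
// Sketch only: LongHolder and Collector are hypothetical stand-ins for a
// Hadoop Writable and the mapper's output collector, not real Hadoop classes.
class WritableReuseSketch {

    // Mutable value holder, playing the role of something like a Writable.
    static final class LongHolder {
        private long value;
        void set(long newValue) { value = newValue; }
        long get() { return value; }
    }

    // Consumes a value immediately (as serialization would), so the caller
    // is free to overwrite the holder afterwards.
    interface Collector {
        void collect(LongHolder value);
    }

    // Emits every input through one holder allocated before the loop,
    // instead of calling `new` once per record.
    static void mapAll(long[] inputs, Collector out) {
        LongHolder reused = new LongHolder();
        for (long input : inputs) {
            reused.set(input);
            out.collect(reused);
        }
    }

    // Helper for the demo: sums whatever mapAll emits.
    static long sumOf(long[] inputs) {
        long[] total = {0L};
        mapAll(inputs, value -> total[0] += value.get());
        return total[0];
    }

    public static void main(String[] args) {
        System.out.println(sumOf(new long[] {1L, 2L, 3L})); // prints 6
    }
}
```

Reuse like this is only safe when the consumer reads the value before the next set() call; a collector that kept references to the holder would see every entry overwritten.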

Re: mapper is slower than hive' mapper

2012-08-01 Thread Bertrand Dechoux
I am not sure about Hive, but if you look at Cascading they use a pseudo combiner instead of the standard (I mean Hadoop's) combiner. I guess Hive has a similar strategy. The point is that when you use a compiler, the compiler does smart things that you don't need to think about (like loop

Re: mapper is slower than hive' mapper

2012-08-01 Thread Edward Capriolo
Hive does not use combiners; it uses map-side aggregation. Hive does use Writables: sometimes it uses ones from Hadoop, sometimes it uses its own custom Writables for things like timestamps. On Wed, Aug 1, 2012 at 11:40 AM, Bertrand Dechoux decho...@gmail.com wrote: I am not sure about Hive but
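
A minimal sketch of what map-side aggregation could look like for the query from this thread (select sum(column1) from table group by column2, column3). This is not Hive's implementation: the Row class and the aggregate method are hypothetical names introduced for illustration, and the sketch assumes all partial sums fit in memory, whereas a real implementation would need to flush the hash table when memory runs low. The mapper keeps one running sum per group key and would emit each group once at the end of its input, instead of emitting one record per input row.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: Row and aggregate are hypothetical, not Hive or Hadoop APIs.
class MapSideAggregationSketch {

    // One input record with the three columns used by the query.
    static final class Row {
        final long column1;
        final String column2;
        final String column3;

        Row(long column1, String column2, String column3) {
            this.column1 = column1;
            this.column2 = column2;
            this.column3 = column3;
        }
    }

    // Builds partial sums of column1 per (column2, column3) group in a hash
    // table, so each group is emitted once rather than once per row.
    static Map<String, Long> aggregate(List<Row> rows) {
        Map<String, Long> partialSums = new HashMap<>();
        for (Row row : rows) {
            // Tab-joined key; assumes the column values contain no tabs.
            String groupKey = row.column2 + "\t" + row.column3;
            partialSums.merge(groupKey, row.column1, Long::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(1, "a", "x"),
                new Row(2, "a", "x"),
                new Row(5, "b", "y"));
        Map<String, Long> sums = aggregate(rows);
        System.out.println(sums.get("a\tx")); // prints 3
        System.out.println(sums.get("b\ty")); // prints 5
    }
}
```

In the numbers quoted earlier in the thread, the hand-written mapper emits as many records as it reads (23091348 in, 23091348 out); aggregating inside the mapper is one way to shrink that output before it reaches the shuffle.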

cli timeouts

2012-08-01 Thread Travis Crawford
Hey Hive gurus - Does anyone know how the CLI handles metastore connection timeouts? It seems if I leave a CLI session idle more than hive.metastore.client.socket.timeout seconds then run show tables, the cli hangs for the timeout then throws a SocketTimeoutException. Restarting the CLI and

Re: mapper is slower than hive' mapper

2012-08-01 Thread Bertrand Dechoux
My bad. I wasn't sure, at least I know now. But other solutions may use other 'Serialization' strategies like Thrift (which is the only other customisation point of Hadoop). Bertrand On Wed, Aug 1, 2012 at 5:49 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Hive does not use combiners it uses map

Re: mapper is slower than hive' mapper

2012-08-01 Thread Yue Guan
The story here is that we have a workflow based on Hive queries. It takes several stages to get to the final data. For each stage, we have a Hive table. We are trying to write the whole workflow in MapReduce. Ideally, it will remove all the intermediate processing and take two rounds of MapReduce

Re: Best Report Generating tools for hive/hadoop file system

2012-08-01 Thread Anurag Tangri
Cloudera has connectors for MicroStrategy and Tableau. It looks like Cloudera might have better working versions in 4.x releases. Worth checking. Datameer is another tool that also connects to Hive in their new release and lets you analyse data and generate reports and graphs. Thanks, Anurag

Re: cli timeouts

2012-08-01 Thread Edward Capriolo
Are you communicating with a thrift metastore or a JDBC metastore? I have had connections open for long periods of time and do not remember ever seeing them time out. Edward On Wed, Aug 1, 2012 at 12:01 PM, Travis Crawford traviscrawf...@gmail.com wrote: Hey Hive gurus - Does anyone know

Difference between storing data as a TextFile and SequenceFile

2012-08-01 Thread Techy Teck
What is the difference between storing the data as a TextFile and as a SequenceFile? And which will be faster when running Hive queries? I am creating a table like this- create table quality ( id bigint, total_chkout bigint, total_errpds bigint ) partitioned by (ds string) row format delimited

Re: cli timeouts

2012-08-01 Thread Travis Crawford
I'm using the thrift metastore via TFramedTransport. What value do you specify for hive.metastore.client.socket.timeout? I'm using 60. If I open the CLI, run show tables, wait the timeout period, then run show tables the CLI hangs in: main prio=10 tid=0x4151a000 nid=0x448 runnable

Re: cli timeouts

2012-08-01 Thread Edward Capriolo
I feel that that interface is very rarely used in the wild. The only use case I can figure out for it is people with very in-depth Hive experience who do not wish to interact with Hive through the QL language. That being said, I would think the coverage might be a little weak there. With the local

Efficiently Store data in Hive

2012-08-01 Thread Techy Teck
How can I efficiently store data in Hive, and also store and retrieve compressed data in Hive? Currently I am storing it as a TextFile. I was going through Bejoy's article ( http://kickstarthadoop.blogspot.com/2011/10/how-to-efficiently-store-data-in-hive.html) and I found that LZO compression will

Re: cli timeouts

2012-08-01 Thread Travis Crawford
Oh interesting - you're saying instead of running a single HiveMetaStore thrift service, most users use the embedded HiveMetaStore mode and have each CLI instance connect to the DB directly? --travis On Wed, Aug 1, 2012 at 11:47 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I feel that that

Re: cli timeouts

2012-08-01 Thread Edward Capriolo
The two setup options are: cli-thriftmetastore-jdbc and cli-jdbc (used to be called local mode). Local mode has fewer moving parts, so I prefer it. On Wed, Aug 1, 2012 at 2:54 PM, Travis Crawford traviscrawf...@gmail.com wrote: Oh interesting - you're saying instead of running a single HiveMetaStore

Re: cli timeouts

2012-08-01 Thread Travis Crawford
Interesting - this issue would certainly go away with local mode as there's no thrift call to fail. I'd very much prefer to run HMS as a centralized service though. Thanks for the info - I'll have to take a look at how the thrift client handles timeouts/reconnects/etc. --travis On Wed, Aug 1,

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

2012-08-01 Thread Techy Teck
I am trying to load data into the date partition. My data was successfully loaded for 20120709, but when I tried to load the data for 20120710, I saw the exception below. Can anyone suggest why this is happening? *Loading data to table data_quality partition

Re: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

2012-08-01 Thread Gabriel Eisbruch
Hi Techy, this error usually appears when the user executing the query does not have permissions on the source or target folder. If you created a plain (non-external) table, you probably do not have permission to write to /user/hive. Regarding your earlier question, I am using Snappy to compress the