Re: how to let hive support lzo

2013-07-22 Thread bejoy_ks

Hi,

Along with the mapred.compress* properties, try setting
hive.exec.compress.output to true.
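
For example, a sketch of the relevant session settings (the codec class below is the one from your earlier mail and is only illustrative; adjust to your setup):

set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;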

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: ch huang 
Date: Mon, 22 Jul 2013 13:41:01 
To: 
Reply-To: user@hive.apache.org
Subject: Re: how to let hive support lzo

# hbase org.apache.hadoop.hbase.util.CompressionTest
hdfs://CH22:9000/alex/my.txt lzo
13/07/22 13:27:58 WARN conf.Configuration: hadoop.native.lib is deprecated.
Instead, use io.native.lib.available
13/07/22 13:27:59 INFO util.ChecksumType: Checksum using
org.apache.hadoop.util.PureJavaCrc32
13/07/22 13:27:59 INFO util.ChecksumType: Checksum can use
org.apache.hadoop.util.PureJavaCrc32C
13/07/22 13:27:59 ERROR metrics.SchemaMetrics: Inconsistent configuration.
Previous configuration for using table name in metrics: true, new
configuration: false
13/07/22 13:27:59 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/07/22 13:27:59 INFO lzo.LzoCodec: Successfully loaded & initialized
native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
13/07/22 13:27:59 INFO compress.CodecPool: Got brand-new compressor
[.lzo_deflate]
13/07/22 13:28:00 INFO compress.CodecPool: Got brand-new decompressor
[.lzo_deflate]
SUCCESS




# hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.15.jar
com.hadoop.compression.lzo.LzoIndexer /alex
13/07/22 09:39:04 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/07/22 09:39:04 INFO lzo.LzoCodec: Successfully loaded & initialized
native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
13/07/22 09:39:04 INFO lzo.LzoIndexer: LZO Indexing directory /alex...
13/07/22 09:39:04 INFO lzo.LzoIndexer:   LZO Indexing directory
hdfs://CH22:9000/alex/alex_t...
13/07/22 09:39:04 INFO lzo.LzoIndexer:   [INDEX] LZO Indexing file
hdfs://CH22:9000/alex/sqoop-1.99.2-bin-hadoop200.tar.gz.lzo, size 0.02 GB...
13/07/22 09:39:05 WARN conf.Configuration: hadoop.native.lib is deprecated.
Instead, use io.native.lib.available
13/07/22 09:39:06 INFO lzo.LzoIndexer:   Completed LZO Indexing in 1.16
seconds (13.99 MB/s).  Index size is 0.52 KB.

13/07/22 09:39:06 INFO lzo.LzoIndexer:   [INDEX] LZO Indexing file
hdfs://CH22:9000/alex/test1.lzo, size 0.00 GB...
13/07/22 09:39:06 INFO lzo.LzoIndexer:   Completed LZO Indexing in 0.08
seconds (0.00 MB/s).  Index size is 0.01 KB.


On Mon, Jul 22, 2013 at 1:37 PM, ch huang  wrote:

> hi, all:
>  I have already installed and tested LZO in Hadoop and HBase, all successfully, but
> when I try it in Hive it fails. What can I do to make Hive recognize LZO?
>
>
> hive> set mapred.map.output.compression.codec;
>
> mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
> hive> set
> mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
> hive> select count(*) from test;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=
> Starting Job = job_1374463239553_0003, Tracking URL =
> http://CH22:8088/proxy/application_1374463239553_0003/
> Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1374463239553_0003
> Hadoop job information for Stage-1: number of mappers: 1; number of
> reducers: 1
> 2013-07-22 13:33:27,243 Stage-1 map = 0%,  reduce = 0%
> 2013-07-22 13:33:45,403 Stage-1 map = 100%,  reduce = 0%
> Ended Job = job_1374463239553_0003 with errors
> Error during job, obtaining debugging information...
> Job Tracking URL: 
> http://CH22:8088/proxy/application_1374463239553_0003/
> Examining task ID: task_1374463239553_0003_m_00 (and more) from job
> job_1374463239553_0003
> Task with the most failures(4):
> -
> Task ID:
>   task_1374463239553_0003_m_00
> URL:
>
> http://CH22:8088/taskdetails.jsp?jobid=job_1374463239553_0003&tipid=task_1374463239553_0003_m_00
> -
> Diagnostic Messages for this Task:
> Error: java.lang.RuntimeException: native-lzo library not available
> at
> com.hadoop.compression.lzo.LzoCodec.getCompressorType(LzoCodec.java:155)
> at
> org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:104)
> at
> org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:118)
> at org.apache.hadoop.mapred.IFile$Writer.(IFile.java:115)
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1580)
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1457)
> at org.apache.hadoop.mapred.MapTask.runOldMapp

Re: Hive CLI

2013-07-08 Thread bejoy_ks
Hi Rahul,

The same shortcuts Ctrl+A and Ctrl+E work in the Hive shell for me (Hive 0.9).


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: rahul kavale 
Date: Tue, 9 Jul 2013 11:00:49 
To: 
Reply-To: user@hive.apache.org
Subject: Hive CLI

Hey there,
I have been using Hive (0.7) for a while now using the CLI and bash scripts.
But it's a pain to move the cursor in the CLI, i.e. once you enter a very long
query you can't go to the start of the query (like you can with
Ctrl+A/Ctrl+E in a terminal). Does anyone know how to do it?

Thanks & Regards,
Rahul



Re: Need help in Hive

2013-07-08 Thread bejoy_ks
Hi Maheedhar

As I understand it, you have a column with data of the form MM:SS in your input
data set.

AFAIK this format is not a standard java.sql.Timestamp format, and it doesn't
even have a date part. Hence you may not be able to use the Timestamp data type
here.

You can define it as a string and then develop custom UDFs for any further
processing.
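
As a quick sketch of what such processing could look like without even a custom UDF (the table and column names below are made up for illustration), the built-in split() and cast functions can convert MM:SS into seconds:

select active_time,
       cast(split(active_time, ':')[0] as int) * 60
         + cast(split(active_time, ':')[1] as int) as active_seconds
from my_table;
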
Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Matouk IFTISSEN 
Date: Mon, 8 Jul 2013 09:47:11 
To: 
Reply-To: user@hive.apache.org
Subject: Re: Need help in Hive

Hello,
Try this in your Hive query:
1- transform your data (integer type) into a Unix timestamp,
then do this:
2- from_unixtime(your_date_timestamp, 'mm:ss') AS time

Hope this helps.



2013/7/8 Maheedhar Reddy 

> Hi All,
>
> I have Hive 0.8.0 version installed in my single node Apache Hadoop
> cluster.
>
> I have a time column which is in format *MM:SS* (Minutes:seconds). I
> tried the date functions to get the value in MM:SS format. But its not
> working out.
>
> Below is my column for your reference.
>
> *Active Time*
> *12:01*
> 0:20
> 2:18
>
> in the first record 12:01, 12 is the number of minutes and 01 is the
> seconds.
>
> so when the time i'm creating a table in Hive, i have to give a data type
> for this column Active Time,
> I have tried with various date type columns but none of them worked out
> for me. Please guide me.
>
> What function should I use, to get the time in *MM:SS* format?
>
>
> "You only live once, but if you do it right, once is enough."
>
>
> Cheers!!
>
> Maheedhar Reddy K V
>
>
> http://about.me/maheedhar.kv/#
>
>



Re: integration issure about hive and hbase

2013-07-08 Thread bejoy_ks
Hi

Can you try including the zookeeper quorum and port in your hive configuration 
as shown below

hive --auxpath .../hbase-handler.jar, .../hbase.jar, ...zookeeper.jar, 
.../guava.jar -hiveconf hbase.zookeeper.quorum= -hiveconf hbase.zookeeper.property.clientPort=

Substitute the placeholders in the above command with your actual values.
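
For illustration only (the jar paths, versions and hostnames below are placeholders; 2181 is just the usual ZooKeeper default port):

hive --auxpath /usr/lib/hive/lib/hive-hbase-handler-0.9.0.jar,/usr/lib/hbase/hbase.jar,/usr/lib/zookeeper/zookeeper.jar,/usr/lib/hive/lib/guava-r09.jar \
  -hiveconf hbase.zookeeper.quorum=zk1,zk2,zk3 \
  -hiveconf hbase.zookeeper.property.clientPort=2181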

Also ensure that the ZooKeeper and HBase jars specified above are the ones used
in your HBase cluster, to avoid any version mismatches.
Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: ch huang 
Date: Mon, 8 Jul 2013 16:40:59 
To: 
Reply-To: user@hive.apache.org
Subject: Re: integration issure about hive and hbase

I replaced the zookeeper jar; now the error is different:

hive> CREATE TABLE hbase_table_1(key int, value string)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
> TBLPROPERTIES ("hbase.table.name" = "xyz");
FAILED: Error in metadata:
MetaException(message:org.apache.hadoop.hbase.ZooKeeperConnectionException:
HBase is able to connect to ZooKeeper but the connection closes
immediately. This could be a sign that the server has too many connections
(30 is the default). Consider inspecting your ZK server logs for that error
and then make sure you are reusing HBaseConfiguration as often as you can.
See HTable's javadoc for more information.
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.(ZooKeeperWatcher.java:160)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:1265)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:526)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.(HConnectionManager.java:516)
at
org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:173)
at
org.apache.hadoop.hbase.client.HBaseAdmin.(HBaseAdmin.java:93)
at
org.apache.hadoop.hive.hbase.HBaseStorageHandler.getHBaseAdmin(HBaseStorageHandler.java:74)
at
org.apache.hadoop.hive.hbase.HBaseStorageHandler.preCreateTable(HBaseStorageHandler.java:158)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:344)
at
org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:470)
at
org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3176)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:213)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:131)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase
at
org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at
org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:843)
at
org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:930)
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.(ZooKeeperWatcher.java:138)
... 24 more
)
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask


On Mon, Jul 8, 2013 at 2:52 PM, Cheng Su  wrote:

>  Did you hbase cluster start up?
>
> The error message is more like that something wrong with the classpath.
> So maybe you'd better also check that.
>
>
> On Mon, Jul 8, 2013 at 1:54 PM, ch huang  wrote:
>
>> I get an error when trying to create a table on HBase using Hive. Can anyone help?
>>
>> hive> CREATE TABLE hive_hbasetable_demo(key int,value string)
>> > STORED BY 'ora.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
>> > TBLPROPERTIES ("hbase.table.name" = "hivehbasedemo");
>> Failed with exception org.apache.hadoop.hive.ql.metadata.HiveException:
>> Error in loading st

Re: Strange error in hive

2013-07-08 Thread bejoy_ks
Hi Jerome


Can you send the error log of the MapReduce task that failed? That should have 
some pointers which can help you troubleshoot the issue.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Jérôme Verdier 
Date: Mon, 8 Jul 2013 11:25:34 
To: 
Reply-To: user@hive.apache.org
Subject: Strange error in hive

Hi everybody,

I faced a strange error in hive this morning.

The error message is this one :

FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.MapRedTask

after a quick search on Google, it appears that this is a Hive bug :

https://issues.apache.org/jira/browse/HIVE-4650

Is there a way to pass through this error ?

Thanks.

NB : my hive script is in the attachment.


-- 
*Jérôme VERDIER*
06.72.19.17.31
verdier.jerom...@gmail.com



Re: When to use bucketed tables with/instead of partitioned tables

2013-06-17 Thread bejoy_ks
Hi Stephen 

In addition to join optimization, bucketing helps a lot with sampling as well. It
lets you choose the sample space (i.e. n buckets out of m).
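
For example, on a table bucketed into 32 buckets (the table name, column and bucket count below are made up), you could sample a single bucket with:

SELECT * FROM my_bucketed_table TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id) t;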

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Stephen Boesch 
Date: Sun, 16 Jun 2013 11:20:49 
To: 
Reply-To: user@hive.apache.org
Subject: When to use bucketed tables with/instead of partitioned tables

I am accustomed to using partitioned tables to obtain separate directories
for data files in each partition.

When looking at the documentation for bucketed tables it seems they are
typically used in conjunction with distribute by/sort by and an appropriate
partitioning key - and thus provide ability to do map side joins.

An explanation of when to use bucketed tables by themselves (in lieu of
partitioned tables)  as well as in conjunction with partitoined tables
would be appreciated.

thanks!

stephenb



Re: How to delete Specific date data using hive QL?

2013-06-04 Thread bejoy_ks
Adding my two cents:
If you have an unpartitioned table/data set and would like to partition it on
some specific columns of the source table, use a dynamic partition insert.
That will place the source data into separate partitions of a partitioned target
table.

http://kickstarthadoop.blogspot.com/2011/06/how-to-speed-up-your-hive-queries-in.html
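
A minimal sketch of such a dynamic partition insert (the table and column names are only illustrative; note the partition column goes last in the select list):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE events_partitioned PARTITION (event_date)
SELECT event_id, event_type, event_date
FROM events_unpartitioned;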

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Hamza Asad 
Date: Tue, 4 Jun 2013 12:52:49 
To: 
Reply-To: user@hive.apache.org
Subject: Re: How to delete Specific date data using hive QL?

Thank you so much Nitin for your help.. :)


On Tue, Jun 4, 2013 at 12:18 PM, Nitin Pawar wrote:

> 1- Does partitioning improve performance?
> --Only if you make use of partitions in your queries (mostly in where
> clause to limit data to your query for a specific value of partitioned
> column)
>
> 2- Do i have to create partition table new or i can create partition on
> existing table by renaming that date column and add partition column
> event_date (the actual column name) ?
> you can not create partitions on already existing data unless the data is
> in partitioned directories on hdfs.
> I would recommend create a new table with partitioned columns.
> load data from old table into partitioned table
> dump old table
>
> 3- can i import data directly into partition table using sqoop command?
> you can import data directly into a partition.
>
> for exported data, you don't have to worry. it remains as it is
>
>
> On Tue, Jun 4, 2013 at 12:41 PM, Hamza Asad wrote:
>
>> No i don't want to change my queries. I want that my queries work on same
>> table and partition does not change its schema.
>> and from schema i means schema on mysql (exported data).
>>
>> Few more things
>> 1- Does partitioning improve performance?
>> 2- Do i have to create partition table new or i can create partition on
>> existing table by renaming that date column and add partition column
>> event_date (the actual column name) ?
>> 3- can i import data directly into partition table using sqoop command?
>>
>>
>>
>>
>> On Tue, Jun 4, 2013 at 11:40 AM, Nitin Pawar wrote:
>>
>>> partitioning of data in hive is more for the reasons on how you layout
>>> data in a well defined manner so that when you access your data , you
>>> request only for specific data by specifying the partition columns in where
>>> clause.
>>>
>>> to answer your question,
>>> do you have to change your queries? out of the box the queries should
>>> work as it is unless and until you are changing the table schema by
>>> removing/adding new columns.
>>> does the format change when you export data? if your select statement is
>>> not changing it will not change
>>> will table schema change? do you mean schema on hive or mysql ?
>>>
>>>
>>> On Tue, Jun 4, 2013 at 11:37 AM, Hamza Asad wrote:
>>>
 that's far better :) ..
 Please tell me a few more things. Do I have to change my query if I
 create the table partitioned on date? Would the rest of the columns stay the
 same as they are? Also, if I export that partitioned table to MySQL, would the
 schema of that table be the same as it was before partitioning?


 On Tue, Jun 4, 2013 at 12:09 AM, Stephen Sprague wrote:

> there is no delete semantic.
>
> you either partition on the data you want to drop and use drop
> partition (or drop table for the whole shebang) or you can do as Nitin
> suggests by selecting the inverse of the data you want to delete and store
> it back into the table itself.  Not ideal but maybe it could work for your
> situation.
>
> Now here's another idea.  This was just _recently_ discussed on this
> group as coincidence would have it.  if you were to have scanned just a
> little of the groups messages you would have seen that and could then have
> added to the discussion! :)
>
>
> On Mon, Jun 3, 2013 at 2:19 AM, Hamza Asad wrote:
>
>> Thanx for your response nitin. Anybody else have any better solution?
>>
>>
>> On Mon, Jun 3, 2013 at 1:27 PM, Nitin Pawar 
>> wrote:
>>
>>> hive does not give you a record level deletion as of now.
>>>
>>> so unless you have partitioned, other option is you overwrite the
>>> table with data which you want
>>> please wait for others to suggest you more options. this one is just
>>> mine and can be costly too
>>>
>>>
>>> On Mon, Jun 3, 2013 at 12:36 PM, Hamza Asad 
>>> wrote:
>>>
 no, its not partitioned by date.


 On Mon, Jun 3, 2013 at 11:19 AM, Nitin Pawar <
 nitinpawar...@gmail.com> wrote:

> how is the data laid out?
> is it partitioned data by the date?
>
>
> On Mon, Jun 3, 2013 at 11:20 AM, Hamza Asad <
> hamza.asa...@gmail.com> wrote:
>
>> Dear all,
>> How can i remove data of specific dates from HDFS
>> us

Re: how does hive find where is MR job tracker

2013-05-28 Thread bejoy_ks
Hive gets the JobTracker from the mapred-site.xml specified within your 
$HADOOP_HOME/conf.

Does your $HADOOP_HOME/conf/mapred-site.xml on the node that runs Hive have the
correct value for the JobTracker?
If not, changing it to the right one should resolve your issue.
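
For example, something along these lines in mapred-site.xml (the hostname is a placeholder; 8021 is the port shown in your error):

<property>
  <name>mapred.job.tracker</name>
  <value>your-jobtracker-host:8021</value>
</property>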

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Frank Luo 
Date: Tue, 28 May 2013 16:49:01 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: how does hive find where is MR job tracker

I have a cloudera cluster, version 4.2.0.

In the hive configuration, I have "MapReduce Service" set to "mapreduce1", 
which is my MR service.

However, without setting "mapred.job.tracker", whenever I run hive command, it 
always sends the job to a wrong job tracker. Here is the error:


java.net.ConnectException: Call From hqhd01ed01.pclc0.merkle.local/10.129.2.52 
to hqhd01ed01.pclc0.merkle.local:8021 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

And Cloudera Manager doesn't allow me to manually set "mapred.job.tracker".
So my question is how to make Hive point to the right job tracker without
setting "mapred.job.tracker" every time.

PS. Not sure it matters, but I did move the job tracker from machine A to 
machine B.

Thx!



Re: Sqoop Oracle Import to Hive Table - Error in metadata: InvalidObjectException

2013-05-25 Thread bejoy_ks
Hi

Can you try doing the import again after assigning 'DS12' as the default schema
for the user doing the import? Your DB admin should be able to do this in
Oracle.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Raj Hadoop 
Date: Sat, 25 May 2013 14:54:50 
To: Hive
Reply-To: user@hive.apache.org
Subject: Sqoop Oracle Import to Hive Table - Error in metadata: 
InvalidObjectException

Hi,

I am trying to run the following to load an Oracle table to Hive table using 
Sqoop,


sqoop import --connect jdbc:oracle:thin:@//inferri.dm.com:1521/DBRM25 --table 
DS12.CREDITS --username UPX1 --password piiwer --hive-import

Note: DS12 is a schema and UPX1 is the user through which the schema and the 
table in the schema is accessed. I was able to access the table through sqlplus 
client tool.


I am getting the following error. Can any one identify the issue and let me 
know please?

ERROR exec.Task (SessionState.java:printError(400)) - FAILED: Error in 
metadata: InvalidObjectException(message:There is no database named ds12)
org.apache.hadoop.hive.ql.metadata.HiveException: 
InvalidObjectException(message:There is no database named ds12)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:544)
    at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3305)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:242)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:134)
    at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1326)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1118)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:951)
    at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
    at 
org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:439)
    at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:449)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:647)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: InvalidObjectException(message:There is no database named dw)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table(HiveMetaStore.java:852)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:402)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:538)
    ... 20 more

2013-05-25 17:37:14,276 ERROR ql.Driver (SessionState.java:printError(400)) - 
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask


Thanks,
Raj


Re: io.compression.codecs not found

2013-05-23 Thread bejoy_ks
These are the defaults; add Snappy to this list as well:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
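
With Snappy appended, the same property would look like this (assuming the stock Hadoop Snappy codec class):

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>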
  

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Sachin Sudarshana 
Date: Thu, 23 May 2013 20:01:17 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: io.compression.codecs not found

Hi Bejoy,

Thanks for the reply.
I would like to know "what" are the codecs that are available by default in
the Hadoop system, among which i can choose to set in the core-site.xml.

For ex: LZO compression codecs are not available by default and we have to
install the required libraries for it.

Thank you,
Sachin


On Thu, May 23, 2013 at 7:55 PM,  wrote:

> **
> Go to $HADOOP_HOME/conf and open core-site.xml for editing.
>
> Add a new property 'io.compression.codecs' and assign the required
> compression codecs as its value.
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Sachin Sudarshana 
> *Date: *Thu, 23 May 2013 19:46:37 +0530
> *To: *
> *ReplyTo: * user@hive.apache.org
> *Subject: *Re: io.compression.codecs not found
>
> Hi,
>
> I'm not using CM. I have installed CDH 4.2.1 using Linux packages.
>
> Thank you,
> Sachin
>
>
> On Thu, May 23, 2013 at 7:13 PM, Sanjay Subramanian <
> sanjay.subraman...@wizecommerce.com> wrote:
>
>> This property needs to be set in core-site.xml. If u r using
>> clouderamanager then ping me I will tell u how to set it there. Out of the
>> box hive works beautifully with gzip and snappy.  And if u r using lzo then
>> needs some plumbing. Depends on what ur usecase is I can provide guidance.
>>
>> Regards
>> Sanjay
>>
>> Sent from my iPhone
>>
>> On May 23, 2013, at 3:33 AM, "Sachin Sudarshana" 
>> wrote:
>>
>> > Hi,
>> >
>> > I'm trying to run some queries on compressed tables in Hive 0.10. I
>> wish to know what all compression codecs are available which i can make use
>> of.
>> > However, when i run set io.compression.codecs in the hive CLI, it
>> throws an error saying the io.compression.codecs is not found.
>> >
>> > I'm unable to figure out why its happening. Has it (the hiveconf
>> property) been removed from 0.10?
>> >
>> > Any help is greatly appreciated!
>> >
>> > Thank you,
>> > Sachin
>> >
>>
>> CONFIDENTIALITY NOTICE
>> ==
>> This email message and any attachments are for the exclusive use of the
>> intended recipient(s) and may contain confidential and privileged
>> information. Any unauthorized review, use, disclosure or distribution is
>> prohibited. If you are not the intended recipient, please contact the
>> sender by reply email and destroy all copies of the original message along
>> with any attachments, from your computer system. If you are the intended
>> recipient, please be advised that the content of this message is subject to
>> access, review and disclosure by the sender's Email System Administrator.
>>
>>
>



Re: Snappy with HIve

2013-05-23 Thread bejoy_ks
Hi

Please find responses below.

Do I have to give some INPUTFORMAT directive to make the Hive Table read Snappy 
Codec files ?
For example for LZO its
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"

Bejoy : No custom input format required. Add the snappy codec in 
io.compression.codecs.

QUESTION 2
For Hive scripts that will READ Snappy files and Output Snappy Files to Hive 
Tables are the following settings enough ?
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

Bejoy: It should be fine. If it shows any issues add 
mapred.output.compress=true as well 
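
Putting the answers together, a sketch of the full set of session settings (the first three are from your question; the last is the extra property mentioned above):

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compress=true;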

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Sanjay Subramanian 
Date: Tue, 21 May 2013 23:30:09 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Snappy with HIve

Hi guys

QUESTION 1
I have an MR job that creates Snappy Codec Output files.
My table definition is as follows
CREATE EXTERNAL TABLE IF NOT EXISTS outpdir_header_hive_only(
  hbase_pk STRING, header_servername_donotquery STRING, header_date_donotquery STRING,
  header_id STRING, header_hbpk STRING, header_channelId INT,
  header_searchAnnotation STRING, header_continuedSearchFlag INT,
  header_prodLow INT, header_prodTotal INT, header_sort INT, header_view INT,
  header_adNodes INT, header_spellingSuggestion STRING, header_queryType INT,
  header_nodeId INT, header_pinpointPtitleId INT, header_firedSearchRules STRING,
  header_rbAbsentSellers INT, header_shuffled INT, header_searchSessionId STRING,
  header_normalizationFlag STRING, header_relatedItemResultCount INT,
  header_unrankedSelectedPtitleIds INT, header_normKeyword STRING,
  header_kplEntry INT, header_isSaved STRING, header_rawProfileScore DOUBLE,
  header_normalizedProfileScore INT, header_scorerInfo STRING,
  header_contextNode INT, header_fbId STRING, norm_stem_keyword STRING,
  attrs_origNodeId INT, attrs_mfrId INT, attrs_sellerId INT,
  attrs_otherAttrs STRING, attrs_ptitleId INT, cached_date STRING,
  cached_recordId STRING, cached_visitorId STRING, cached_visit_id STRING,
  cached_appStyle STRING, cached_publisherId INT, cached_IP STRING,
  cached_source STRING, cached_refkw STRING, cached_pixeled INT,
  cached_searchRefineAttrImps STRING, cached_pageType STRING,
  cached_zipCode STRING, cached_zipType STRING, cached_perpage INT)
PARTITIONED BY (header_date STRING, header_servername STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

Do I have to give some INPUTFORMAT directive to make the Hive Table read Snappy 
Codec files ?
For example for LZO its
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"


QUESTION 2
For Hive scripts that will READ Snappy files and Output Snappy Files to Hive 
Tables are the following settings enough ?
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

Thanks

sanjay

CONFIDENTIALITY NOTICE
==
This email message and any attachments are for the exclusive use of the 
intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized review, use, disclosure or distribution is prohibited. If you 
are not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message along with any attachments, from 
your computer system. If you are the intended recipient, please be advised that 
the content of this message is subject to access, review and disclosure by the 
sender's Email System Administrator.



Re: io.compression.codecs not found

2013-05-23 Thread bejoy_ks
Go to $HADOOP_HOME/conf and open core-site.xml for editing.

Add a new property 'io.compression.codecs' and assign the required compression 
codecs as its value.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Sachin Sudarshana 
Date: Thu, 23 May 2013 19:46:37 
To: 
Reply-To: user@hive.apache.org
Subject: Re: io.compression.codecs not found

Hi,

I'm not using CM. I have installed CDH 4.2.1 using Linux packages.

Thank you,
Sachin


On Thu, May 23, 2013 at 7:13 PM, Sanjay Subramanian <
sanjay.subraman...@wizecommerce.com> wrote:

> This property needs to be set in core-site.xml. If u r using
> clouderamanager then ping me I will tell u how to set it there. Out of the
> box hive works beautifully with gzip and snappy.  And if u r using lzo then
> needs some plumbing. Depends on what ur usecase is I can provide guidance.
>
> Regards
> Sanjay
>
> Sent from my iPhone
>
> On May 23, 2013, at 3:33 AM, "Sachin Sudarshana" 
> wrote:
>
> > Hi,
> >
> > I'm trying to run some queries on compressed tables in Hive 0.10. I wish
> to know what all compression codecs are available which i can make use of.
> > However, when i run set io.compression.codecs in the hive CLI, it throws
> an error saying the io.compression.codecs is not found.
> >
> > I'm unable to figure out why its happening. Has it (the hiveconf
> property) been removed from 0.10?
> >
> > Any help is greatly appreciated!
> >
> > Thank you,
> > Sachin
> >
>
> CONFIDENTIALITY NOTICE
> ==
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
>
>



Re: Hive on Oracle

2013-05-17 Thread bejoy_ks
Hi Raj

Which jar to use depends on what version of Oracle you are using. The jar version
corresponding to each Oracle release is listed in the Oracle documentation
online.

JDBC Jars should be available from the oracle website for free download.


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Raj Hadoop 
Date: Fri, 17 May 2013 20:43:46 
To: bejoy...@yahoo.com; 
user@hive.apache.org; User
Reply-To: user@hive.apache.org
Subject: Re: Hive on Oracle


Thanks for the reply.

Can you specify which jar file needs to be used? Where can I get the jar file?
Does Oracle provide one for free? Please let me know.

Thanks,
Raj






 From: "bejoy...@yahoo.com" 
To: user@hive.apache.org; Raj Hadoop ; User 
 
Sent: Friday, May 17, 2013 11:42 PM
Subject: Re: Hive on Oracle
 


Hi

The procedure is same as setting up mysql metastore. You need to use the jdbc 
driver/jar corresponding to the oracle version/release you are intending to use.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos


From:  Raj Hadoop  
Date: Fri, 17 May 2013 17:10:07 -0700 (PDT)
To: Hive; User
ReplyTo:  user@hive.apache.org 
Subject: Hive on Oracle

Hi,

I am planning to install Hive and want to set up Meta store on Oracle. What is 
the procedure? Which driver (JDBC) do I need to use it?


Thanks,
Raj


Re: Hive on Oracle

2013-05-17 Thread bejoy_ks
Hi

The procedure is same as setting up mysql metastore. You need to use the jdbc 
driver/jar corresponding to the oracle version/release you are intending to use.
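
A sketch of the relevant hive-site.xml entries for an Oracle metastore (the host, service name, user and password below are only placeholders):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:oracle:thin:@//your-db-host:1521/YOURSERVICE</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>oracle.jdbc.OracleDriver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>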

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Raj Hadoop 
Date: Fri, 17 May 2013 17:10:07 
To: Hive; User
Reply-To: user@hive.apache.org
Subject: Hive on Oracle

Hi,

I am planning to install Hive and want to set up Meta store on Oracle. What is 
the procedure? Which driver (JDBC) do I need to use it?


Thanks,
Raj



Re: Getting Slow Query Performance!

2013-03-12 Thread bejoy_ks
Hi

Since you are on a pseudo-distributed / single-node environment, the Hadoop
MapReduce parallelism is limited.

You might have just a few map slots, and map tasks might be queued waiting for
others to complete. On a larger cluster your job should be faster.

As a side note, certain SQL queries that utilize indexing will be faster in an
SQL database than in Hive.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Gobinda Paul 
Date: Tue, 12 Mar 2013 15:09:31 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Getting Slow Query Performance!






I used Sqoop to import 30 GB of data (two tables: employee, approx. 21 GB, and
salary, approx. 9 GB) into Hadoop (single node) via Hive.
I ran a sample query like:
SELECT EMPLOYEE.ID, EMPLOYEE.NAME, EMPLOYEE.DEPT, SALARY.AMOUNT FROM EMPLOYEE
JOIN SALARY WHERE EMPLOYEE.ID=SALARY.EMPLOYEE_ID AND SALARY.AMOUNT>90;
In Hive it takes about 15 minutes, whereas MySQL takes about 4.5 minutes to
execute that query.
CPU: Pentium(R) Dual-Core CPU E5700 @ 3.00GHz
RAM: 2GB
HDD: 500GB

Here is my hive-site.xml conf:


<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
</property>
<property>
  <name>hive.hwi.listen.host</name>
  <value>0.0.0.0</value>
  <description>This is the host address the Hive Web Interface will listen on</description>
</property>
<property>
  <name>hive.hwi.listen.port</name>
  <value></value>
  <description>This is the port the Hive Web Interface will listen on</description>
</property>
<property>
  <name>hive.hwi.war.file</name>
  <value>/lib/hive-hwi-0.9.0.war</value>
  <description>This is the WAR file with the jsp content for Hive Web Interface</description>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>-1</value>
  <description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.</description>
</property>
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>10</value>
  <description>size per reducer. The default is 1G, i.e. if the input size is 10G, it will use 10 reducers.</description>
</property>
<property>
  <name>hive.exec.reducers.max</name>
  <value>999</value>
  <description>max number of reducers will be used. If the one specified in the configuration parameter mapred.reduce.tasks is negative, hive will use this one as the max number of reducers when automatically determining the number of reducers.</description>
</property>
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive-${user.name}</value>
  <description>Scratch space for Hive jobs</description>
</property>
<property>
  <name>hive.metastore.local</name>
  <value>true</value>
</property>


Any IDEA ?? 
  


Re: hive issue with sub-directories

2013-03-10 Thread bejoy_ks
Hi Suresh

AFAIK, as of now a partition cannot contain subdirectories; it can contain only
files.

You may have to move the subdirectories out of the parent dir 'a' and create
separate partitions for those.
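
For example (the table name and partition values below are placeholders, following your earlier alter table command):

hive> ALTER TABLE my_table ADD PARTITION (pt=1) LOCATION '/test/a';
hive> ALTER TABLE my_table ADD PARTITION (pt=2) LOCATION '/test/a/b';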

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Suresh Krishnappa 
Date: Mon, 11 Mar 2013 10:58:05 
To: 
Reply-To: user@hive.apache.org
Subject: Re: hive issue with sub-directories

Hi Mark,
I am using external table in HIVE.

This is how I am adding the partition

> alter table  add partition (pt=1) location '/test/a/';

I am able to run HIVE queries only if '/test/a/b' folder is deleted.

How can I retain this folder structure and still issue queries?

Thanks
Suresh

On Sun, Mar 10, 2013 at 12:48 AM, Mark Grover
wrote:

> Suresh,
> By default, the partition column name has to be appear in HDFS
> directory structure.
>
> e.g.
> /user/hive/warehouse//= value>/data1.txt
> /user/hive/warehouse//= value>/data2.txt
>
>
> On Thu, Mar 7, 2013 at 7:20 AM, Suresh Krishnappa
>  wrote:
> > Hi All,
> > I have the following directory structure in hdfs
> >
> > /test/a/
> > /test/a/1.avro
> > /test/a/2.avro
> > /test/a/b/
> > /test/a/b/3.avro
> >
> > I created an external HIVE table using Avro Serde and added /test/a as a
> > partition to this table.
> >
> > I am not able to run a select query. Always getting the error 'not a
> file'
> > on '/test/a/b'
> >
> > Is this by design, a bug or am I missing some configuration?
> > I am using HIVE 0.10
> >
> > Thanks
> > Suresh
> >
>



Re: java.lang.NoClassDefFoundError: com/jayway/jsonpath/PathUtil

2013-03-10 Thread bejoy_ks
Hi Sai

Local mode is just for trials; for any pre-prod/production environment you need
MR jobs.

Under the hood Hive stores data in HDFS (mostly), and we definitely use
Hadoop/Hive for larger data volumes, so MapReduce should be there to process them.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Ramki Palle 
Date: Sun, 10 Mar 2013 06:58:57 
To: ; Sai Sai
Reply-To: user@hive.apache.org
Subject: Re: java.lang.NoClassDefFoundError: com/jayway/jsonpath/PathUtil

Well, you get the results faster.

Please check this:

https://cwiki.apache.org/Hive/gettingstarted.html#GettingStarted-Runtimeconfiguration

Under section   "Hive, Map-Reduce and Local-Mode", it says

This can be very useful to run queries over small data sets - in such cases
local mode execution is usually significantly faster than submitting jobs
to a large cluster.

-Ramki.






On Sun, Mar 10, 2013 at 5:26 AM, Sai Sai  wrote:

> Ramki/John
> Many thanks, that really helped. I have run the add jar commands in the new
> session and it appears to be running. However, I was wondering about bypassing
> MR: why would we do it and what is the use of it? Will appreciate
> any input.
> Thanks
> Sai
>
>
>   --
> *From:* Ramki Palle 
>
> *To:* user@hive.apache.org; Sai Sai 
> *Sent:* Sunday, 10 March 2013 4:22 AM
> *Subject:* Re: java.lang.NoClassDefFoundError:
> com/jayway/jsonpath/PathUtil
>
> When you execute the following query,
>
> hive> select * from twitter limit 5;
>
> Hive runs it in local mode and not use MapReduce.
>
> For the query,
>
> hive> select tweet_id from twitter limit 5;
>
> I think you need to add JSON jars to overcome this error. You might have
> added these in a previous session. If you want these jars available for all
> sessions, insert the add jar statements to your $HOME/.hiverc file.
>
>
> To bypass MapReduce
>
> set hive.exec.mode.local.auto = true;
>
> to suggest Hive to use local mode to execute the query. If it still uses
> MR, try
>
> set hive.fetch.task.conversion = more;.
>
>
> -Ramki.
>
>
>
> On Sun, Mar 10, 2013 at 12:19 AM, Sai Sai  wrote:
>
> Just wondering if anyone has any suggestions:
>
> This executes successfully:
>
> hive> select * from twitter limit 5;
>
> This does not work:
>
> hive> select tweet_id from twitter limit 5; // I have given the exception
> info below:
>
> Here is the output of this:
>
> hive> select * from twitter limit 5;
> OK
>
> tweet_idcreated_attextuser_iduser_screen_nameuser_lang
> 122106088022745088Fri Oct 07 00:28:54 + 2011wkwkw -_- ayo saja
> mba RT @yullyunet: Sepupuuu, kita lanjalan yok.. Kita karokoe-an.. Ajak mas
> galih jg kalo dia mau.. "@Dindnf: doremifas124735434Dindnfen
> 122106088018558976Fri Oct 07 00:28:54 + 2011@egg486 특별히
> 준비했습니다!252828803CocaCola_Koreako
> 122106088026939392Fri Oct 07 00:28:54 + 2011My offer of free
> gobbies for all if @amityaffliction play Blair snitch project still
> stands.168590073SarahYoungBlooden
> 122106088035328001Fri Oct 07 00:28:54 + 2011the girl nxt to me
> in the lib got her headphones in dancing and singing loud af like she the
> only one here haha267296295MONEYyDREAMS_en
> 122106088005971968Fri Oct 07 00:28:54 + 2011@KUnYoong_B2UTY
> Bị lsao đấy269182160b2st_b2utyhpen
> Time taken: 0.154 seconds
>
> This does not work:
>
> hive> select tweet_id from twitter limit 5;
>
>
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_201303050432_0094, Tracking URL =
> http://ubuntu:50030/jobdetails.jsp?jobid=job_201303050432_0094
> Kill Command = /home/satish/work/hadoop-1.0.4/libexec/../bin/hadoop job
> -kill job_201303050432_0094
> Hadoop job information for Stage-1: number of mappers: 1; number of
> reducers: 0
> 2013-03-10 00:14:44,509 Stage-1 map = 0%,  reduce = 0%
> 2013-03-10 00:15:14,613 Stage-1 map = 100%,  reduce = 100%
> Ended Job = job_201303050432_0094 with errors
> Error during job, obtaining debugging information...
> Job Tracking URL:
> http://ubuntu:50030/jobdetails.jsp?jobid=job_201303050432_0094
> Examining task ID: task_201303050432_0094_m_02 (and more) from job
> job_201303050432_0094
>
> Task with the most failures(4):
> -
> Task ID:
>   task_201303050432_0094_m_00
>
> URL:
>
> http://ubuntu:50030/taskdetails.jsp?jobid=job_201303050432_0094&tipid=task_201303050432_0094_m_00
> -
> Diagnostic Messages for this Task:
> java.lang.RuntimeException: Error in configuring object
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
> at org.a

Re: Accessing sub column in hive

2013-03-08 Thread bejoy_ks
Hi Sai


You can do it as
Select address.country from employees;
 

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Bennie Schut 
Date: Fri, 8 Mar 2013 09:09:49 
To: user@hive.apache.org; 'Sai Sai'
Reply-To: user@hive.apache.org
Subject: RE: Accessing sub column in hive

Perhaps worth posting the error. Some might know what the error means.

Also a bit unrelated to hive but please do yourself a favor and don't use float 
to store monetary values like salary. You will get rounding issues at some 
point in time when you do arithmetic on them. Considering you are using hadoop 
you probably have a lot of data so adding it all up will get you there really 
really fast. 
http://stackoverflow.com/questions/3730019/why-not-use-double-or-float-to-represent-currency


From: Sai Sai [mailto:saigr...@yahoo.in]
Sent: Thursday, March 07, 2013 12:54 PM
To: user@hive.apache.org
Subject: Re: Accessing sub column in hive

I have a table created like this successfully:

CREATE TABLE IF NOT EXISTS employees (name STRING,salary FLOAT,subordinates 
ARRAY,deductions   MAP,address STRUCT)

I would like to access/display country column from my address struct.
I have tried this:

select address["country"] from employees;

I get an error.

Please help.

Thanks
Sai



Re: Finding maximum across a row

2013-03-01 Thread bejoy_ks
Hi Sachin

You can get the detailed steps from the Hive wiki itself:

https://cwiki.apache.org/Hive/hiveplugins.html

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Sachin Sudarshana 
Date: Fri, 1 Mar 2013 22:37:54 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: Finding maximum across a row

Hi Bejoy,

I am new to UDF in Hive. Could you send me any link/tutorials on where i
can be able to learn about writing the UDF?

Thanks!

On Fri, Mar 1, 2013 at 10:22 PM,  wrote:

> **
> Hi Sachin
>
> AFAIK There isn't one at the moment. But you can easily achieve this using
> a custom UDF.
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Sachin Sudarshana 
> *Date: *Fri, 1 Mar 2013 22:16:37 +0530
> *To: *
> *ReplyTo: * user@hive.apache.org
> *Subject: *Finding maximum across a row
>
> Hi,
>
> Is there any function/method to find the maximum across a row in hive?
>
> Suppose i have a table like this:
>
> ColA   ColB   ColC
> 2  5  7
> 3  2  1
>
> I want the function to return
>
> 7
> 1
>
>
> Its urgently required. Any help would be greatly appreciated!
>
>
>
> --
> Thanks and Regards,
> Sachin Sudarshana
>



-- 
Thanks and Regards,
Sachin Sudarshana



Re: Finding maximum across a row

2013-03-01 Thread bejoy_ks
Hi Sachin

AFAIK There isn't one at the moment. But you can easily achieve this using a 
custom UDF.
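
As a sketch of a workaround for a small, fixed set of columns (column names from your example, table name made up; this assumes the values are never NULL), a plain CASE expression can stand in for such a UDF:

SELECT CASE
         WHEN ColA >= ColB AND ColA >= ColC THEN ColA
         WHEN ColB >= ColC THEN ColB
         ELSE ColC
       END AS row_max
FROM my_table;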

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Sachin Sudarshana 
Date: Fri, 1 Mar 2013 22:16:37 
To: 
Reply-To: user@hive.apache.org
Subject: Finding maximum across a row

Hi,

Is there any function/method to find the maximum across a row in hive?

Suppose i have a table like this:

ColA   ColB   ColC
2  5  7
3  2  1

I want the function to return

7
1


Its urgently required. Any help would be greatly appreciated!



-- 
Thanks and Regards,
Sachin Sudarshana



Re: Hive queries

2013-02-25 Thread bejoy_ks
Hi Cyril

I believe you are using the Derby metastore, in which case this should be an
issue with the Hive configs.

Derby tries to create a metastore in the current directory from which you start
Hive. The tables Sqoop loaded would be registered in the metastore under
HIVE_HOME, and hence you are not able to see them when you start the Hive CLI
from other locations.

To have a single, universal metastore db, configure a specific directory in
javax.jdo.option.ConnectionURL in hive-site.xml. In your connection URL configure
the db name as "databaseName=/home/hive/metastore_db"
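
For example (using the path from above; "create=true" just tells Derby to create the database if it does not exist):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/hive/metastore_db;create=true</value>
</property>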

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Cyril Bogus 
Date: Mon, 25 Feb 2013 10:34:29 
To: 
Reply-To: user@hive.apache.org
Subject: Re: Hive queries

I do not get any errors.
It is only when I run hive and try to query the tables I imported. Let's
say I want to only get numeric tuples for a given table. I cannot find the
table (show tables; is empty) unless I go in the hive home folder and run
hive again. I would expect the state of hive to be the same everywhere I
call it.
But so far it is not the case.


On Mon, Feb 25, 2013 at 10:22 AM, Nitin Pawar wrote:

> any errors you see ?
>
>
> On Mon, Feb 25, 2013 at 8:48 PM, Cyril Bogus  wrote:
>
>> Hi everyone,
>>
>> My setup is Hadoop 1.0.4, Hive 0.9.0, Sqoop 1.4.2-hadoop 1.0.0
>> Mahout 0.7
>>
>> I have imported tables from a remote database directly into Hive using
>> Sqoop.
>>
>> Somehow when I try to run Sqoop from Hadoop, the content
>>
>> Hive is giving me trouble in bookkeeping of where the imported tables are
>> located. I have a Single Node setup.
>>
>> Thank you for any answer and you can ask question if I was not specific
>> enough about my issue.
>>
>> Cyril
>>
>
>
>
> --
> Nitin Pawar
>



Re: Security for Hive

2013-02-23 Thread bejoy_ks
Hi Austin

AFAIK, at the moment you can control permissions gracefully only at the data
level, not at the metadata level, i.e. you can play with the HDFS permissions.
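
For example, a sketch with purely illustrative paths, users and groups:

hadoop fs -chmod -R 750 /user/hive/warehouse/mydb.db/mytable
hadoop fs -chown -R etl_user:analysts /user/hive/warehouse/mydb.db/mytable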

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Austin Chungath 
Date: Fri, 22 Feb 2013 23:11:51 
To: bejoy...@yahoo.com; 
user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: RE: Security for Hive

  So that means any user can revoke or give permissions to any user for any
table in the metastore?

Sent from my Phone, please ignore typos
 --
From: bejoy...@yahoo.com
Sent: 22-02-2013 11:30 PM
To: user@hive.apache.org
Subject: Re: Security for Hive

Hi Sachin

Currently there is no such admin user concept in hive.
Regards
Bejoy KS

Sent from remote device, Please excuse typos
--
*From: * Sachin Sudarshana 
*Date: *Fri, 22 Feb 2013 16:40:49 +0530
*To: *
*ReplyTo: * user@hive.apache.org
*Subject: *Re: Security for Hive

Hi,
I have read about roles, user privileges, group privileges etc.
But these roles can be created by any user for any database/table. I would
like to know if there is a specific 'administrator' for hive who can log on
with his credentials and is the only one entitled to create roles, grant
privileges etc.

Thank you.

On Fri, Feb 22, 2013 at 4:19 PM, Jagat Singh  wrote:

> You might want to read this
>
> https://cwiki.apache.org/Hive/languagemanual-auth.html
>
>
>
>
> On Fri, Feb 22, 2013 at 9:44 PM, Sachin Sudarshana <
> sachin.sudarsh...@gmail.com> wrote:
>
>> Hi,
>>
>> I have just started learning about hive.
>> I have configured Hive to use mysql as the metastore instead of derby.
>> If I wish to use GRANT and REVOKE commands, i can use it with any user. A
>> user can issue GRANT or REVOKE commands to any other users' table since
>> both the users' tables are present in the same warehouse.
>>
>> Isn't there a concept of superuser/admin in hive who alone has the
>> authority to issue these commands ?
>>
>> Any answer is greatly appreciated!
>>
>> --
>> Thanks and Regards,
>> Sachin Sudarshana
>>
>
>


-- 
Thanks and Regards,
Sachin Sudarshana



Re: Security for Hive

2013-02-22 Thread bejoy_ks
Hi Sachin

Currently there is no such admin user concept in hive.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Sachin Sudarshana 
Date: Fri, 22 Feb 2013 16:40:49 
To: 
Reply-To: user@hive.apache.org
Subject: Re: Security for Hive

Hi,
I have read about roles, user privileges, group privileges etc.
But these roles can be created by any user for any database/table. I would
like to know if there is a specific 'administrator' for hive who can log on
with his credentials and is the only one entitled to create roles, grant
privileges etc.

Thank you.

On Fri, Feb 22, 2013 at 4:19 PM, Jagat Singh  wrote:

> You might want to read this
>
> https://cwiki.apache.org/Hive/languagemanual-auth.html
>
>
>
>
> On Fri, Feb 22, 2013 at 9:44 PM, Sachin Sudarshana <
> sachin.sudarsh...@gmail.com> wrote:
>
>> Hi,
>>
>> I have just started learning about hive.
>> I have configured Hive to use mysql as the metastore instead of derby.
>> If I wish to use GRANT and REVOKE commands, i can use it with any user. A
>> user can issue GRANT or REVOKE commands to any other users' table since
>> both the users' tables are present in the same warehouse.
>>
>> Isn't there a concept of superuser/admin in hive who alone has the
>> authority to issue these commands ?
>>
>> Any answer is greatly appreciated!
>>
>> --
>> Thanks and Regards,
>> Sachin Sudarshana
>>
>
>


-- 
Thanks and Regards,
Sachin Sudarshana



Re: Adding comment to a table for columns

2013-02-21 Thread bejoy_ks
Hi Gupta

Try out

DESCRIBE EXTENDED FORMATTED 

I vaguely recall an operation like this.
Please check hive wiki for the exact syntax.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Chunky Gupta 
Date: Thu, 21 Feb 2013 17:15:37 
To: ; ; 

Reply-To: user@hive.apache.org
Subject: Re: Adding comment to a table for columns

Hi Bejoy, Bhaskar

I tried using FORMATTED, but it will not give me comments which I have put
while creating table. Its output is like :-

col_name    data_type    comment
c           string       from deserializer
time        string       from deserializer

Thanks,
Chunky.

On Thu, Feb 21, 2013 at 4:50 PM,  wrote:

> **
> Hi Gupta
>
> You can get the describe output in a formatted way using
>
> DESCRIBE FORMATTED ;
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Chunky Gupta 
> *Date: *Thu, 21 Feb 2013 16:46:30 +0530
> *To: *
> *ReplyTo: * user@hive.apache.org
> *Subject: *Adding comment to a table for columns
>
> Hi,
>
> I am using this syntax to add comments for all columns :-
>
> CREATE EXTERNAL TABLE test ( c STRING COMMENT 'Common  class', time STRING
> COMMENT 'Common  time', url STRING COMMENT 'Site URL' ) PARTITIONED BY (dt
> STRING ) LOCATION 's3://BucketName/'
>
> Output of Describe Extended table is like :- (Output is just an example
> copied from internet)
>
> hive> DESCRIBE EXTENDED table_name;
>
> Detailed Table Information Table(tableName:table_name,
> dbName:benchmarking, owner:root, createTime:1309480053, lastAccessTime:0,
> retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:session_key,
> type:string, comment:null), FieldSchema(name:remote_address, type:string,
> comment:null), FieldSchema(name:canister_lssn, type:string, comment:null),
> FieldSchema(name:canister_session_id, type:bigint, comment:null),
> FieldSchema(name:tltsid, type:string, comment:null),
> FieldSchema(name:tltuid, type:string, comment:null),
> FieldSchema(name:tltvid, type:string, comment:null),
> FieldSchema(name:canister_server, type:string, comment:null),
> FieldSchema(name:session_timestamp, type:string, comment:null),
> FieldSchema(name:session_duration, type:string, comment:null),
> FieldSchema(name:hit_count, type:bigint, comment:null),
> FieldSchema(name:http_user_agent, type:string, comment:null),
> FieldSchema(name:extractid, type:bigint, comment:null),
> FieldSchema(name:site_link, type:string, comment:null),
> FieldSchema(name:dt, type:string, comment:null), FieldSchema(name:hour,
> type:int, comment:null)],
> location:hdfs://hadoop2/user/hive/warehouse/benchmarking.db/table_name,
> inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe)
>
> Is there any way of getting this detailed comments and column name in
> readable format, just like the output of "Describe table_name" ?.
>
>
> Thanks,
>
> Chunky.
>



Re: Adding comment to a table for columns

2013-02-21 Thread bejoy_ks
Hi Gupta

You can get the describe output in a formatted way using

DESCRIBE FORMATTED ;

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Chunky Gupta 
Date: Thu, 21 Feb 2013 16:46:30 
To: 
Reply-To: user@hive.apache.org
Subject: Adding comment to a table for columns

Hi,

I am using this syntax to add comments for all columns :-

CREATE EXTERNAL TABLE test ( c STRING COMMENT 'Common  class', time STRING
COMMENT 'Common  time', url STRING COMMENT 'Site URL' ) PARTITIONED BY (dt
STRING ) LOCATION 's3://BucketName/'

Output of Describe Extended table is like :- (Output is just an example
copied from internet)

hive> DESCRIBE EXTENDED table_name;

Detailed Table Information Table(tableName:table_name, dbName:benchmarking,
owner:root, createTime:1309480053, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:session_key, type:string,
comment:null), FieldSchema(name:remote_address, type:string, comment:null),
FieldSchema(name:canister_lssn, type:string, comment:null),
FieldSchema(name:canister_session_id, type:bigint, comment:null),
FieldSchema(name:tltsid, type:string, comment:null),
FieldSchema(name:tltuid, type:string, comment:null),
FieldSchema(name:tltvid, type:string, comment:null),
FieldSchema(name:canister_server, type:string, comment:null),
FieldSchema(name:session_timestamp, type:string, comment:null),
FieldSchema(name:session_duration, type:string, comment:null),
FieldSchema(name:hit_count, type:bigint, comment:null),
FieldSchema(name:http_user_agent, type:string, comment:null),
FieldSchema(name:extractid, type:bigint, comment:null),
FieldSchema(name:site_link, type:string, comment:null),
FieldSchema(name:dt, type:string, comment:null), FieldSchema(name:hour,
type:int, comment:null)],
location:hdfs://hadoop2/user/hive/warehouse/benchmarking.db/table_name,
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe)

Is there any way of getting this detailed comments and column name in
readable format, just like the output of "Describe table_name" ?.


Thanks,

Chunky.



Re: Running Hive on multi node

2013-02-21 Thread bejoy_ks
Hi Hamad

Fully distributed is a proper cluster where the daemons are spread across
multiple machines rather than all running on one machine.

 You can have hadoop installed in three modes
- Stand Alone
- Pseudo Distributed (all daemons in same machine) and
- Fully Distributed

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Hamza Asad 
Date: Thu, 21 Feb 2013 15:26:48 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: Running Hive on multi node

What do you mean by fully Distributed?


On Thu, Feb 21, 2013 at 2:58 PM,  wrote:

> **
> Hi
>
> Hive uses the hadoop installation specified in HADOOP_HOME. If your hadoop
> home is configured for fully distributed operation it'll utilize the
> cluster itself.
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Hamza Asad 
> *Date: *Thu, 21 Feb 2013 14:26:40 +0500
> *To: *
> *ReplyTo: * user@hive.apache.org
> *Subject: *Running Hive on multi node
>
> Does hive automatically runs on multi node as i configured hadoop on multi
> node OR i have to explicitly do its configuration??
>
> --
> *Muhammad Hamza Asad*
>



-- 
*Muhammad Hamza Asad*



Re: Running Hive on multi node

2013-02-21 Thread bejoy_ks
Hi

Hive uses the hadoop installation specified in HADOOP_HOME. If your hadoop home 
is configured for fully distributed operation it'll utilize the cluster itself.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Hamza Asad 
Date: Thu, 21 Feb 2013 14:26:40 
To: 
Reply-To: user@hive.apache.org
Subject: Running Hive on multi node

Does hive automatically runs on multi node as i configured hadoop on multi
node OR i have to explicitly do its configuration??

-- 
*Muhammad Hamza Asad*



Re: bucketing on a column with millions of unique IDs

2013-02-20 Thread bejoy_ks
Hi Li

The major consideration you should give is the size of each bucket. One
bucket corresponds to a file in hdfs, and you should ensure that every bucket is
at least a block in size, or in the worst case that at least the majority of the
buckets are.

So you should derive the number of buckets from the data size rather than from the
number of rows/records.
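
As a rough illustration of the arithmetic (the table name, data size and block size
below are assumptions, not figures from this thread): if the 110 million rows take up
around 16 GB on disk and the HDFS block size is 128 MB, then roughly
16 GB / 128 MB = 128 buckets keeps each bucket close to one block.

CREATE TABLE user_activity_bucketed (
  userid BIGINT,
  activity STRING)
CLUSTERED BY (userid) INTO 128 BUCKETS;

SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE user_activity_bucketed
SELECT userid, activity FROM user_activity;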

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Echo Li 
Date: Wed, 20 Feb 2013 16:19:43 
To: 
Reply-To: user@hive.apache.org
Subject: bucketing on a column with millions of unique IDs

Hi guys,

I plan to bucket a table by "userid" as I'm going to do intense calculation
using "group by userid". There are about 110 million rows, with 7 million
unique userids, so my question is: what is a good number of buckets for this
scenario, and how do I determine the number of buckets?

Any input is appreciated :)

Echo



Re: CREATE EXTERNAL TABLE Fails on Some Directories

2013-02-15 Thread bejoy_ks

Hi Joseph

There are differences in the following ls commands

cloudera@localhost data]$ hdfs dfs -ls /715

This would list out all the contents in /715 in hdfs, if it is a dir

Found 1 items
-rw-r--r--   1 cloudera supergroup    7853975 2013-02-14 17:03 /715

The output clearly shows that it is a file, as 'd' is missing as the first char

[cloudera@localhost data]$ hdfs dfs -ls 715

This lists the dir 715 under your user's hdfs home dir. If your user is cloudera 
then usually your home dir is /user/cloudera, so in effect the dir 
listed is /user/cloudera/715


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Joseph D Antoni 
Date: Fri, 15 Feb 2013 08:55:50 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: CREATE EXTERNAL TABLE Fails on Some Directories

Not sure--I just truncated the file list from the ls--that was the first file 
(just obfuscated the name)

The command I used to create the directories was:

hdfs dfs -mkdir 715 
then 
hdfs dfs -put myfile.csv 715

[cloudera@localhost data]$ hdfs dfs -ls /715
Found 1 items
-rw-r--r--   1 cloudera supergroup    7853975 2013-02-14 17:03 /715
[cloudera@localhost data]$ hdfs dfs -ls 715
Found 13 items
-rw-r--r--   1 cloudera cloudera    7853975 2013-02-15 00:41 715/40-file.csv

Thanks






 From: Dean Wampler 
To: user@hive.apache.org; Joseph D Antoni  
Sent: Friday, February 15, 2013 11:50 AM
Subject: Re: CREATE EXTERNAL TABLE Fails on Some Directories
 

Something's odd about this output; why is there no / in front of 715? I always 
get the full path when I run a -ls command. I would expect either:

/715/file.csv
or
/user//715/file.csv

Or is that what you meant by "(didn't leave rest of ls results)"?

dean


On Fri, Feb 15, 2013 at 10:45 AM, Joseph D Antoni  wrote:

[cloudera@localhost data]$ hdfs dfs -ls 715
>Found 13 items
>-rw-r--r--   1 cloudera cloudera    7853975 2013-02-15 00:41 715/file.csv 
>(didn't leave rest of ls results)
>
>
>Thanks on the directory--wasn't clear on that..
>
>Joey
>
>
>
>
>
>
>
>
>
> From: Dean Wampler 
>To: user@hive.apache.org; Joseph D Antoni  
>Sent: Friday, February 15, 2013 11:37 AM
>Subject: Re: CREATE EXTERNAL TABLE Fails on Some Directories
> 
>
>
>You confirmed that 715 is an actual directory? It didn't become a file by 
>accident?
>
>
>By the way, you don't need to include the file name in the LOCATION. It will 
>read all the files in the directory.
>
>
>dean
>
>
>On Fri, Feb 15, 2013 at 10:29 AM, Joseph D Antoni  wrote:
>
>I'm trying to create a series of external tables for a time series of data 
>(using the prebuilt Cloudera VM).
>>
>>
>>The directory structure in HDFS is as such:
>>
>>
>>/711
>>/712
>>/713
>>/714
>>/715
>>/716
>>/717
>>
>>
>>Each directory contains the same set of files, from a different day. They 
>>were all put into HDFS using the following script:
>>
>>
>>for i in *;do hdfs dfs -put $i in $dir;done
>>
>>
>>They all show up with the same ownership/perms in HDFS.
>>
>>
>>Going into Hive to build the tables, I built a set of scripts to do the 
>>loads--then did a sed (changing 711 to 712,713, etc) to a file for each day. 
>>All of my loads work, EXCEPT for 715 and 716. 
>>
>>
>>Script is as follows:
>>
>>
>>create external table 715_table_name
>>(col1 string,
>>col2 string)
>>row format
>>delimited fields terminated by ','
>>lines terminated by '\n'
>>stored as textfile
>>location '/715/file.csv';
>>
>>
>>This is failing with:
>>
>>
>>Error in Metadata MetaException(message:Got except: 
>>org.apache.hadoop.fs.FileAlreadyExistsException Parent Path is not a 
>>directory: /715 715...
>>
>>
>>Like I mentioned it works for all of the other directories, except 715 and 
>>716. Thoughts on troubleshooting path?
>>
>>
>>Thanks
>>
>>
>>Joey D'Antoni
>
>
>
>-- 
>Dean Wampler, Ph.D.
>thinkbiganalytics.com
>+1-312-339-1330
>
>
>
>


-- 
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330
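

If the underlying problem is that /715 already exists as a file in HDFS (as the earlier
-ls output suggests), one possible fix is to remove or rename that file and point
LOCATION at the directory that actually holds the CSVs, without the file name. A sketch,
reusing the DDL from this thread (the /user/cloudera/715 path is an assumption about
where the put command landed the data):

create external table 715_table_name
(col1 string,
col2 string)
row format
delimited fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/user/cloudera/715';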


Re: Map join optimization issue

2013-02-15 Thread bejoy_ks
Hi 

In later versions of hive you actually don't need a map join hint in your 
query. Just the following would suffice:

Set hive.auto.convert.join=true 
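
A minimal sketch of how this is typically used (the table names and the threshold
value are placeholders, not recommendations):

SET hive.auto.convert.join = true;
-- optionally raise the small-table threshold (in bytes) if the smaller table exceeds the default
SET hive.mapjoin.smalltable.filesize = 50000000;

SELECT a.id, b.name
FROM big_table a JOIN small_table b ON (a.id = b.id);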

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Mayuresh Kunjir 
Date: Fri, 15 Feb 2013 10:37:52 
To: user
Reply-To: user@hive.apache.org
Subject: Re: Map join optimization issue

Thanks Aniket. I actually had not specified the map-join hint though. Sorry
for providing the wrong information earlier. I had only
set hive.auto.convert.join=true before firing my join query.

~Mayuresh



On Thu, Feb 14, 2013 at 10:44 PM, Aniket Mokashi wrote:

> I think hive.mapjoin.smalltable.filesize parameter will be disregarded in
> that case.
>
>
> On Thu, Feb 14, 2013 at 7:25 AM, Mayuresh Kunjir <
> mayuresh.kun...@gmail.com> wrote:
>
>> Yes, the hint was specified.
>> On Feb 14, 2013 3:11 AM, "Aniket Mokashi"  wrote:
>>
>>> have you specified map-join hint in your query?
>>>
>>>
>>> On Thu, Feb 7, 2013 at 11:39 AM, Mayuresh Kunjir <
>>> mayuresh.kun...@gmail.com> wrote:
>>>

 Hello all,


 I am trying to join two tables, the smaller being of size 4GB. When I
 set hive.mapjoin.smalltable.filesize parameter above 500MB, Hive tries to
 perform a local task to read the smaller file. This of-course fails since
 the file size is greater and the backup common join is then run. What I do
 not understand is why did Hive attempt a map join when small file size was
 greater than the smalltable.filesize parameter.


 ~Mayuresh


>>>
>>>
>>> --
>>> "...:::Aniket:::... Quetzalco@tl"
>>>
>>
>
>
> --
> "...:::Aniket:::... Quetzalco@tl"
>



Re: LOAD HDFS into Hive

2013-01-25 Thread bejoy_ks
Hi Venkataraman

You can just create an external table and give its location as the hdfs dir 
where the data resides.

No need to perform an explicit LOAD operation here.
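
A minimal sketch based on the table definition in the original mail (column list
shortened; LOCATION points at the directory that already holds the streamed files,
so nothing is moved or deleted):

CREATE EXTERNAL TABLE Tweets (
  FromUserId STRING,
  Text STRING,
  CreatedAt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION '/twitter_sample';

Dropping an external table later removes only the metadata; the files stay in HDFS.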


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: venkatramanan 
Date: Fri, 25 Jan 2013 18:30:29 
To: 
Reply-To: user@hive.apache.org
Subject: LOAD HDFS into Hive

Hi,

I need to load the hdfs data into the Hive table.

For example,

I have the twitter data and it is updated daily using the streaming 
API. These twitter responses are stored in an HDFS path named like 
('TwitterData'). After that I try to load the data into Hive using 
the 'LOAD DATA' stmt. My problem is that the hdfs data is lost once I load 
the data. Is there any way to load the data without the hdfs data loss?

To Create the Table using the below stmt;

CREATE EXTERNAL TABLE Tweets (FromUserId String, Text string, 
FromUserIdString String, FromUser String, Geo String, Id BIGINT, 
IsoLangCode string, ToUserId INT, ToUserIdString string, CreatedAt 
string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED 
BY '\n';

To LOAD the data using the below stmt;

LOAD DATA INPATH '/twitter_sample' INTO TABLE tweets;

thanks in advance

Thanks,
Venkat



Re: An explanation of LEFT OUTER JOIN and NULL values

2013-01-24 Thread bejoy_ks
Hi David,

The default partitioner used in map reduce is the hash partitioner. So based on 
your keys they are sent to a particular reducer.

Maybe in your current data set, the keys that have no matching values in the other 
table are all falling in the same hash bucket and hence being processed by the same reducer.

If you are noticing a skew on a particular reducer, sometimes a simple work 
around like increasing the no of reducers explicitly might help you get past 
the hurdle.

Also please ensure you have enabled skew join optimization.
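
For reference, a sketch of the relevant settings (the threshold and reducer count are
illustrative values only, not recommendations):

SET hive.optimize.skewjoin = true;
-- join keys with more rows than this are treated as skewed
SET hive.skewjoin.key = 100000;
-- or simply raise the reducer count explicitly
SET mapred.reduce.tasks = 50;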
 
Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: "David Morel" 
Date: Thu, 24 Jan 2013 18:39:56 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: An explanation of LEFT OUTER JOIN and NULL values

On 24 Jan 2013, at 18:16, bejoy...@yahoo.com wrote:

> Hi David
>
> An explain extended would give you the exact pointer.
>
> From my understanding, this is how it could work.
>
> You have two tables then two different map reduce job would be
> processing those. Based on the join keys, combination of corresponding
> columns would be chosen as key from mapper1 and mapper2. So if the
> combination of columns having the same value those records from two
> set of mappers would go into the same reducer.
>
> On the reducer if there is a corresponding value for a key from table
> 1 to  table 2/mapper 2 that value would be populated. If no val for
> mapper 2 then those columns from table 2 are made null.
>
> If there is a key-value just from table 2/mapper 2 and no
> corresponding value from mapper 1. That value is just discarded.

Hi Bejoy,

Thanks! So schematically, something like this, right?

mapper1 (bigger table):
K1-A, V1A
K2-A, V2A
K3-A, V3A

mapper2 (joined, smaller table):
K1-B, V1B

reducer1:
K1-A, V1A 
K1-B, V1B

returns:
K1, V1A, V1B etc

reducer2:
K2-A, V2A
*no* K2-B, V so: K2-B, NULL is created, same for next row.
K3-A, V3A

returns:
K2, V2A, NULL etc
K3, V3A, NULL etc

I still don't understand why my reducer2 (and only this one, which
apparently gets all the keys for which we don't have a row on table B)
would become overloaded. Am I completely misunderstanding the whole
thing?

David

> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
>
> -Original Message-
> From: "David Morel" 
> Date: Thu, 24 Jan 2013 18:03:40
> To: user@hive.apache.org
> Reply-To: user@hive.apache.org
> Subject: An explanation of LEFT OUTER JOIN and NULL values
>
> Hi!
>
> After hitting the "curse of the last reducer" many times on LEFT OUTER
> JOIN queries, and trying to think about it, I came to the conclusion
> there's something I am missing regarding how keys are handled in
> mapred jobs.
>
> The problem shows when I have table A containing billions of rows with
> distinctive keys, that I need to join to table B that has a much lower
> number of rows.
>
> I need to keep all the A rows, populated with NULL values from the B
> side, so that's what a LEFT OUTER is for.
>
> Now, when transforming that into a mapred job, my -naive-
> understanding would be that for every key on the A table, a missing
> key on the B table would be generated with a NULL value. If that were
> the case, I fail to understand why all NULL valued B keys would end up
> on the same reducer, since the key defines which reducer is used, not
> the value.
>
> So, obviously, this is not how it works.
>
> So my question is: how is this construct handled?
>
> Thanks a lot!
>
> D.Morel



Re: An explanation of LEFT OUTER JOIN and NULL values

2013-01-24 Thread bejoy_ks
Hi David

An explain extended would give you the exact pointer.

From my understanding, this is how it could work. 

 You have two tables, so two different sets of mappers would be processing 
those. Based on the join keys, a combination of the corresponding columns would be 
chosen as the key from mapper1 and mapper2. So if the combination of columns has 
the same value, those records from the two sets of mappers would go into the same 
reducer.

On the reducer, if there is a corresponding value from table 2/mapper 2 for a key from 
table 1/mapper 1, that value would be populated. If there is no value from mapper 2, then 
those columns from table 2 are made null.

If there is a key-value just from table 2/mapper 2 and no corresponding value 
from mapper 1, that value is just discarded.


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: "David Morel" 
Date: Thu, 24 Jan 2013 18:03:40 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: An explanation of LEFT OUTER JOIN and NULL values

Hi!

After hitting the "curse of the last reducer" many times on LEFT OUTER
JOIN queries, and trying to think about it, I came to the conclusion
there's something I am missing regarding how keys are handled in mapred
jobs.

The problem shows when I have table A containing billions of rows with
distinctive keys, that I need to join to table B that has a much lower
number of rows.

I need to keep all the A rows, populated with NULL values from the B
side, so that's what a LEFT OUTER is for.

Now, when transforming that into a mapred job, my -naive- understanding
would be that for every key on the A table, a missing key on the B table
would be generated with a NULL value. If that were the case, I fail to
understand why all NULL valued B keys would end up on the same reducer,
since the key defines which reducer is used, not the value.

So, obviously, this is not how it works.

So my question is: how is this construct handled?

Thanks a lot!

D.Morel


Re: Mapping HBase table in Hive

2013-01-13 Thread bejoy_ks
Hi Ibrahim.

 SQOOP is used to import data from rdbms to hbase in your case. 

Please get the schema from hbase for your corresponding table and post it here.

We can point out how your mapping could be.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Ibrahim Yakti 
Date: Sun, 13 Jan 2013 11:22:51 
To: user
Reply-To: user@hive.apache.org
Subject: Re: Mapping HBase table in Hive

Thanks Bejoy,

what do you mean by:

> If you need to map a full CF to a hive column, the data type of the hive
> column should be a Map.
>

suppose I used sqoop to move data from mysql to hbase and used id as a
column family, all the other columns will be QF then, right?

The integration document is not clear, I think it needs more clarification
or maybe I am still missing something.

--
Ibrahim


On Tue, Jan 8, 2013 at 9:35 PM,  wrote:

> data type of



Re: View with map join fails

2013-01-08 Thread bejoy_ks
Looks like there is a bug with mapjoin + view. Please check the hive jira to see if 
there is an issue open against this, else file a new jira.

From my understanding, when you enable map join, the hive parser would create back 
up jobs. These back up jobs are executed only if the map join fails. In normal 
cases, when the map join succeeds, these jobs are filtered out and not executed.

'1116112419, job is filtered out (removed at runtime).'


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Santosh Achhra 
Date: Tue, 8 Jan 2013 17:11:18 
To: 
Reply-To: user@hive.apache.org
Subject: View with map join fails

Hello,

I have created a view  as shown below.

*CREATE VIEW V1 AS*
*select /*+ MAPJOIN(t1) ,MAPJOIN(t2)  */ t1.f1, t1.f2, t1.f3, t1.f4, t2.f1,
t2.f2, t2.f3 from TABLE1 t1 join TABLE t2 on ( t1.f2= t2.f2 and t1.f3 =
t2.f3 and t1.f4 = t2.f4 ) group by t1.f1, t1.f2, t1.f3, t1.f4, t2.f1,
t2.f2, t2.f3*

The view gets created successfully; however, when I execute the below mentioned SQL or
any SQL on the view, I get a NullPointerException error

*hive> select count (*) from V1;*
*FAILED: NullPointerException null*
*hive>*

Is there anything wrong with the view creation ?

Next I created view without MAPJOIN hints

*CREATE VIEW V1 AS*
*select  t1.f1, t1.f2, t1.f3, t1.f4, t2.f1, t2.f2, t2.f3 from TABLE1 t1
join TABLE t2 on ( t1.f2= t2.f2 and t1.f3 = t2.f3 and t1.f4 = t2.f4 ) group
by t1.f1, t1.f2, t1.f3, t1.f4, t2.f1, t2.f2, t2.f3*

Before executing select SQL I excute *set  hive.auto.convert.join=true; *

I am getting beloow mentioned warnings
java.lang.InstantiationException:
org.apache.hadoop.hive.ql.parse.ASTNodeOrigin
Continuing ...
java.lang.RuntimeException: failed to evaluate: =Class.new();
Continuing ...


And I see from the log that a total of 5 mapreduce jobs are
started; however, when I don't set auto.convert.join to true, I see only 3
mapreduce jobs getting invoked.
*Total MapReduce jobs = 5*
*Ended Job = 1116112419, job is filtered out (removed at runtime).*
*Ended Job = -33256989, job is filtered out (removed at runtime).*
*WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
files.*


Good wishes,always !
Santosh



Re: Mapping HBase table in Hive

2013-01-08 Thread bejoy_ks
Hi Ibrahim

The hive hbase integration totally depends on the hbase table schema and not 
the schema of the source table in mysql.

You need to provide the column family qualifier mapping in there.

Get the hbase table's schema from hbase shell.

suppose you have the schema as
Id
CF1.qualifier1
CF1.qualifier2
CF1.qualifier3

You need to match each of these ColumnFamily:Qualifier to corresponding columns 
in hive. 

So in hbase.columns.mapping you need to provide these CF:QL in order.

If you need to map a full CF to a hive column, the data type of the hive column 
should be a Map.

You can get the detailed hbase to hive integration document from the hive wiki.
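
Purely as an illustration, if the sqoop import had created a column family named 'd'
with one qualifier per source column (you must confirm the real family and qualifier
names with 'describe' in the hbase shell), the mapping for the orders table from the
original mail could look like:

CREATE EXTERNAL TABLE hbase_orders (
  id BIGINT, value BIGINT, date_lastchange STRING, date_inserted STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,d:value,d:date_lastchange,d:date_inserted")
TBLPROPERTIES ("hbase.table.name" = "orders");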


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Ibrahim Yakti 
Date: Tue, 8 Jan 2013 15:45:32 
To: user
Reply-To: user@hive.apache.org
Subject: Mapping HBase table in Hive

Hello,

suppose I have the following table (orders) in MySQL:

*** 1. row ***
  Field: id
   Type: int(10) unsigned
   Null: NO
Key: PRI
Default: NULL
  Extra: auto_increment
*** 2. row ***
  Field: value
   Type: int(10) unsigned
   Null: NO
Key:
Default: NULL
  Extra:
*** 3. row ***
  Field: date_lastchange
   Type: timestamp
   Null: NO
Key:
Default: CURRENT_TIMESTAMP
  Extra: on update CURRENT_TIMESTAMP
*** 4. row ***
  Field: date_inserted
   Type: timestamp
   Null: NO
Key:
Default: 0000-00-00 00:00:00

I imported it into HBase with column family "id"

I want to create an external table in Hive to query the HBase table, I am
not able to get the mapping parameters (*hbase.columns.mapping*), it is
confusing, if anybody can explain it to me please. I used the following
query:

CREATE EXTERNAL TABLE hbase_orders(id bigint, value bigint, date_lastchange
string, date_inserted string) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES
("hbase.columns.mapping" = " ? ? ? ? ? ?") TBLPROPERTIES ("hbase.table.name"
= "orders");

Is there any way to build the Hive tables automatically or I should go with
the same process with each table?


Thanks in advanced.

--
Ibrahim



Re: Map Reduce Local Task

2013-01-08 Thread bejoy_ks
Hi Santhosh

As long as the smaller table's size is in the range of a few MBs, it is a good 
candidate for a map join.

If the smaller table's size is still more than that, you can take a look at bucketed 
map joins.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Santosh Achhra 
Date: Wed, 9 Jan 2013 00:11:37 
To: 
Reply-To: user@hive.apache.org
Subject: Re: Map Reduce Local Task

Thank you Dean,

One of our tables is very small, it has only 16,000 rows, and the other big table
has 45 million plus records. Won't doing a local task help in this case?

Good wishes,always !
Santosh


On Tue, Jan 8, 2013 at 11:59 PM, Dean Wampler <
dean.wamp...@thinkbiganalytics.com> wrote:

> more aggressive about trying to convert a join to a local task, where it
> bypasses the job tracker. When you're experimenting with queries on a small
> data set, it can make things much faster, but won't be useful for large
> data sets where you need the cluster.
>



Re: External table with partitions

2013-01-06 Thread bejoy_ks
Sorry, I didn't understand your query on first look through.

Like Jagat said, you may need to go with a temp table for this.

Do a hadoop fs -cp ../../a.* 

Create a external table with location as 'destn dir'.

CREATE EXTERNAL TABLE  LIKE  LOCATION '' ;

NB: I just gave the syntax from memory. please check the syntax in hive user 
guide.
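
One way the sequence could look, with placeholder paths and a hypothetical two-column
schema (the new table is defined without the dd partition column, so no ADD PARTITION
is needed):

hadoop fs -mkdir /data/a_only
hadoop fs -cp /data/dd=2012-12-31/a*.txt /data/a_only/

CREATE EXTERNAL TABLE mytable_a_only (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/a_only';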
Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: bejoy...@yahoo.com
Date: Sun, 6 Jan 2013 14:39:45 
To: 
Reply-To: user@hive.apache.org
Subject: Re: External table with partitions

Hi Oded

If you have created the directories manually, they would become visible to the 
hive table only if the partitions/sub dirs are added to the meta data using 
'ALTER TABLE ... ADD PARTITION'. 
Partitions are not picked up implicitly into the hive table even if you have a 
proper sub dir structure.

Similarly, if you don't need a particular partition on your table permanently, 
you can always delete it using the alter table command.

If you are intending to use a particular partition alone in your query, there is no need 
to alter the partitions. Just append a where clause to the query that has scope 
only on the required partitions.

Hope this helps.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Oded Poncz 
Date: Sun, 6 Jan 2013 16:07:26 
To: 
Reply-To: user@hive.apache.org
Subject: External table with partitions

Is it possible to instruct hive to get only specific files from a
partitioned external table?
For example I have the following directory structure

data/dd=2012-12-31/a1.txt
data/dd=2012-12-31/a2.txt
data/dd=2012-12-31/a3.txt
data/dd=2012-12-31/a4.txt

data/dd=2012-12-31/b1.txt
data/dd=2012-12-31/b2.txt
data/dd=2012-12-31/b2.txt

Is it possible to add 2012-12-31 as a partition and tell hive to load only
the a* files to the table?
Thanks,



Re: External table with partitions

2013-01-06 Thread bejoy_ks
Hi Oded

If you have created the directories manually, they would become visible to the 
hive table only if the partitions/sub dirs are added to the meta data using 
'ALTER TABLE ... ADD PARTITION'. 
Partitions are not picked up implicitly into the hive table even if you have a 
proper sub dir structure.

Similarly, if you don't need a particular partition on your table permanently, 
you can always delete it using the alter table command.

If you are intending to use a particular partition alone in your query, there is no need 
to alter the partitions. Just append a where clause to the query that has scope 
only on the required partitions.

Hope this helps.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Oded Poncz 
Date: Sun, 6 Jan 2013 16:07:26 
To: 
Reply-To: user@hive.apache.org
Subject: External table with partitions

Is it possible to instruct hive to get only specific files from a
partitioned external table?
For example I have the following directory structure

data/dd=2012-12-31/a1.txt
data/dd=2012-12-31/a2.txt
data/dd=2012-12-31/a3.txt
data/dd=2012-12-31/a4.txt

data/dd=2012-12-31/b1.txt
data/dd=2012-12-31/b2.txt
data/dd=2012-12-31/b2.txt

Is it possible to add 2012-12-31 as a partition and tell hive to load only
the a* files to the table?
Thanks,



Re: Map side join

2012-12-13 Thread bejoy_ks
Hi Souvik

To have the new hdfs block size take effect on the already existing files, you 
need to re-copy them into hdfs.

To play with the number of mappers you can set a lower value, like 64 MB, for the min 
and max split size:

mapred.min.split.size and mapred.max.split.size
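
A sketch of what that looks like in the hive session, with 64 MB expressed in bytes:

SET mapred.min.split.size = 67108864;
SET mapred.max.split.size = 67108864;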

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Souvik Banerjee 
Date: Thu, 13 Dec 2012 12:00:16 
To: ; 
Subject: Re: Map side join

Hi Bejoy,

The input files are non-compressed text files.
There are enough free slots in the cluster.

Can you please let me know how I can increase the no of mappers?
I tried reducing the HDFS block size to 32 MB from 128 MB. I was expecting
to get more mappers, but it's still launching the same no of mappers as it
was while the HDFS block size was 128 MB. I have enough map slots
available, but am not able to utilize those.


Thanks and regards,
Souvik.


On Thu, Dec 13, 2012 at 11:12 AM,  wrote:

> **
> Hi Souvik
>
> Is your input files compressed using some non splittable compression codec?
>
> Do you have enough free slots while this job is running?
>
> Make sure that the job is not running locally.
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Souvik Banerjee 
> *Date: *Wed, 12 Dec 2012 14:27:27 -0600
> *To: *; 
> *ReplyTo: * user@hive.apache.org
> *Subject: *Re: Map side join
>
> Hi Bejoy,
>
> Yes I ran the pi example. It was fine.
> Regarding the HIVE Job what I found is that it took 4 hrs for the first
> map job to get completed.
> Those map tasks were doing their job and only reported status after
> completion. It is indeed taking too long time to finish. Nothing I could
> find relevant in the logs.
>
> Thanks and regards,
> Souvik.
>
> On Wed, Dec 12, 2012 at 8:04 AM,  wrote:
>
>> **
>> Hi Souvik
>>
>> Apart from hive jobs is the normal mapreduce jobs like the wordcount
>> running fine on your cluster?
>>
>> If it is working, for the hive jobs are you seeing anything skeptical in
>> task, Tasktracker or jobtracker logs?
>>
>>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> --
>> *From: * Souvik Banerjee 
>> *Date: *Tue, 11 Dec 2012 17:12:20 -0600
>> *To: *; 
>> *ReplyTo: * user@hive.apache.org
>> *Subject: *Re: Map side join
>>
>> Hello Everybody,
>>
>> Need help in for on HIVE join. As we were talking about the Map side join
>> I tried that.
>> I set the flag set hive.auto.convert.join=true;
>>
>> I saw Hive converts the same to map join while launching the job. But the
>> problem is that none of the map job progresses in my case. I made the
>> dataset smaller. Now it's only 512 MB cross 25 MB. I was expecting it to be
>> done very quickly.
>> No luck with any change of settings.
>> Failing to progress with the default setting changes these settings.
>> set hive.mapred.local.mem=1024; // Initially it was 216 I guess
>> set hive.join.cache.size=10; // Initialliu it was 25000
>>
>> Also on Hadoop side I made this changes
>>
>> mapred.child.java.opts -Xmx1073741824
>>
>> But I don't see any progress. After more than 40 minutes of run I am at
>> 0% map completion state.
>> Can you please throw some light on this?
>>
>> Thanks a lot once again.
>>
>> Regards,
>> Souvik.
>>
>>
>>
>> On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee > > wrote:
>>
>>> Hi Bejoy,
>>>
>>> That's wonderful. Thanks for your reply.
>>> What I was wondering if HIVE can do map side join with more than one
>>> condition on JOIN clause.
>>> I'll simply try it out and post the result.
>>>
>>> Thanks once again.
>>>
>>> Regards,
>>> Souvik.
>>>
>>>  On Fri, Dec 7, 2012 at 2:10 PM,  wrote:
>>>
 **
 Hi Souvik

 In earlier versions of hive you had to give the map join hint. But in
 later versions just set hive.auto.convert.join = true;
 Hive automatically selects the smaller table. It is better to give the
 smaller table as the first one in join.

 You can use a map join if you are joining a small table with a large
 one, in terms of data size. By small, better to have the smaller table size
 in range of MBs.
 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 *From: *Souvik Banerjee 
 *Date: *Fri, 7 Dec 2012 13:58:25 -0600
 *To: *
 *ReplyTo: *user@hive.apache.org
 *Subject: *Map side join

 Hello everybody,

 I have got a question. I didn't came across any post which says
 somethign about this.
 I have got two tables. Lets say A and B.
 I want to join A & B in HIVE. I am currently using HIVE 0.9 version.
 The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
 B.id2) AND (A.id3 = B.id3)

 Can I ask HIVE to use map side join in this scenario? Should I give a
 hint to HIVE by saying /*+mapjoin(B)*/

 Get back to me if you want any more information in this regard.

Re: Map side join

2012-12-13 Thread bejoy_ks
Hi Souvik

Is your input files compressed using some non splittable compression codec?

Do you have enough free slots while this job is running?

Make sure that the job is not running locally.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Souvik Banerjee 
Date: Wed, 12 Dec 2012 14:27:27 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: Map side join

Hi Bejoy,

Yes I ran the pi example. It was fine.
Regarding the HIVE Job what I found is that it took 4 hrs for the first map
job to get completed.
Those map tasks were doing their job and only reported status after
completion. It is indeed taking too long time to finish. Nothing I could
find relevant in the logs.

Thanks and regards,
Souvik.

On Wed, Dec 12, 2012 at 8:04 AM,  wrote:

> **
> Hi Souvik
>
> Apart from hive jobs is the normal mapreduce jobs like the wordcount
> running fine on your cluster?
>
> If it is working, for the hive jobs are you seeing anything skeptical in
> task, Tasktracker or jobtracker logs?
>
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Souvik Banerjee 
> *Date: *Tue, 11 Dec 2012 17:12:20 -0600
> *To: *; 
> *ReplyTo: * user@hive.apache.org
> *Subject: *Re: Map side join
>
> Hello Everybody,
>
> Need help in for on HIVE join. As we were talking about the Map side join
> I tried that.
> I set the flag set hive.auto.convert.join=true;
>
> I saw Hive converts the same to map join while launching the job. But the
> problem is that none of the map job progresses in my case. I made the
> dataset smaller. Now it's only 512 MB cross 25 MB. I was expecting it to be
> done very quickly.
> No luck with any change of settings.
> Failing to progress with the default setting changes these settings.
> set hive.mapred.local.mem=1024; // Initially it was 216 I guess
> set hive.join.cache.size=10; // Initialliu it was 25000
>
> Also on Hadoop side I made this changes
>
> mapred.child.java.opts -Xmx1073741824
>
> But I don't see any progress. After more than 40 minutes of run I am at 0%
> map completion state.
> Can you please throw some light on this?
>
> Thanks a lot once again.
>
> Regards,
> Souvik.
>
>
>
> On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee 
> wrote:
>
>> Hi Bejoy,
>>
>> That's wonderful. Thanks for your reply.
>> What I was wondering if HIVE can do map side join with more than one
>> condition on JOIN clause.
>> I'll simply try it out and post the result.
>>
>> Thanks once again.
>>
>> Regards,
>> Souvik.
>>
>>  On Fri, Dec 7, 2012 at 2:10 PM,  wrote:
>>
>>> **
>>> Hi Souvik
>>>
>>> In earlier versions of hive you had to give the map join hint. But in
>>> later versions just set hive.auto.convert.join = true;
>>> Hive automatically selects the smaller table. It is better to give the
>>> smaller table as the first one in join.
>>>
>>> You can use a map join if you are joining a small table with a large
>>> one, in terms of data size. By small, better to have the smaller table size
>>> in range of MBs.
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> --
>>> *From: *Souvik Banerjee 
>>> *Date: *Fri, 7 Dec 2012 13:58:25 -0600
>>> *To: *
>>> *ReplyTo: *user@hive.apache.org
>>> *Subject: *Map side join
>>>
>>> Hello everybody,
>>>
>>> I have got a question. I didn't came across any post which says
>>> somethign about this.
>>> I have got two tables. Lets say A and B.
>>> I want to join A & B in HIVE. I am currently using HIVE 0.9 version.
>>> The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
>>> B.id2) AND (A.id3 = B.id3)
>>>
>>> Can I ask HIVE to use map side join in this scenario? Should I give a
>>> hint to HIVE by saying /*+mapjoin(B)*/
>>>
>>> Get back to me if you want any more information in this regard.
>>>
>>> Thanks and regards,
>>> Souvik.
>>>
>>
>>
>



Re: Map side join

2012-12-12 Thread bejoy_ks
Hi Souvik

Apart from hive jobs, are normal mapreduce jobs like the wordcount running 
fine on your cluster?

If they are working, then for the hive jobs are you seeing anything suspicious in the task, 
tasktracker or jobtracker logs?


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Souvik Banerjee 
Date: Tue, 11 Dec 2012 17:12:20 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: Map side join

Hello Everybody,

Need help on a HIVE join. As we were talking about the Map side join, I
tried that.
I set the flag: set hive.auto.convert.join=true;

I saw Hive converts the same to a map join while launching the job. But the
problem is that none of the map jobs progresses in my case. I made the
dataset smaller. Now it's only 512 MB joined against 25 MB. I was expecting it to be
done very quickly.
No luck with any change of settings.
Failing to progress with the default settings, I changed these settings:
set hive.mapred.local.mem=1024; // Initially it was 216 I guess
set hive.join.cache.size=10; // Initially it was 25000

Also on Hadoop side I made this changes

mapred.child.java.opts -Xmx1073741824

But I don't see any progress. After more than 40 minutes of run I am at 0%
map completion state.
Can you please throw some light on this?

Thanks a lot once again.

Regards,
Souvik.



On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee wrote:

> Hi Bejoy,
>
> That's wonderful. Thanks for your reply.
> What I was wondering if HIVE can do map side join with more than one
> condition on JOIN clause.
> I'll simply try it out and post the result.
>
> Thanks once again.
>
> Regards,
> Souvik.
>
>  On Fri, Dec 7, 2012 at 2:10 PM,  wrote:
>
>> **
>> Hi Souvik
>>
>> In earlier versions of hive you had to give the map join hint. But in
>> later versions just set hive.auto.convert.join = true;
>> Hive automatically selects the smaller table. It is better to give the
>> smaller table as the first one in join.
>>
>> You can use a map join if you are joining a small table with a large one,
>> in terms of data size. By small, better to have the smaller table size in
>> range of MBs.
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> --
>> *From: *Souvik Banerjee 
>> *Date: *Fri, 7 Dec 2012 13:58:25 -0600
>> *To: *
>> *ReplyTo: *user@hive.apache.org
>> *Subject: *Map side join
>>
>> Hello everybody,
>>
>> I have got a question. I didn't came across any post which says somethign
>> about this.
>> I have got two tables. Lets say A and B.
>> I want to join A & B in HIVE. I am currently using HIVE 0.9 version.
>> The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
>> B.id2) AND (A.id3 = B.id3)
>>
>> Can I ask HIVE to use map side join in this scenario? Should I give a
>> hint to HIVE by saying /*+mapjoin(B)*/
>>
>> Get back to me if you want any more information in this regard.
>>
>> Thanks and regards,
>> Souvik.
>>
>
>



Re: Map side join

2012-12-07 Thread bejoy_ks
Hi Souvik

In earlier versions of hive you had to give the map join hint. But in later 
versions just set hive.auto.convert.join = true;
Hive automatically selects the smaller table. It is better to give the smaller 
table as the first  one in join.

You can use a map join if you are joining a small table with a large one, in 
terms of data size. By small, better to have the smaller table size in range of 
MBs.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Souvik Banerjee 
Date: Fri, 7 Dec 2012 13:58:25 
To: 
Reply-To: user@hive.apache.org
Subject: Map side join

Hello everybody,

I have got a question. I didn't come across any post which says something
about this.
I have got two tables. Lets say A and B.
I want to join A & B in HIVE. I am currently using HIVE 0.9 version.
The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
B.id2) AND (A.id3 = B.id3)

Can I ask HIVE to use map side join in this scenario? Should I give a hint
to HIVE by saying /*+mapjoin(B)*/

Get back to me if you want any more information in this regard.

Thanks and regards,
Souvik.



Re: Hive | HBase Integration

2012-02-28 Thread bejoy_ks
Hi Rinku
   Were you able to create a normal table within your hive without any 
issues? By Normal table I mean the one that has data dir in hdfs not in HBase. 

Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: "Garg, Rinku" 
Date: Wed, 29 Feb 2012 05:29:12 
To: user@hive.apache.org; Bejoy Ks
Reply-To: user@hive.apache.org
Subject: RE: Hive | HBase Integration

Hi ,

We tried the same by issuing the below mentioned command, but the command does 
not execute completely; a new command line is shown with an asterisk (*), 
and the table does not get created.

CREATE EXTERNAL TABLE hive_hbasetable_k(key int, value string) STORED BY 
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES 
("hbase.columns.mapping" = ":key,cf1:val") TBLPROPERTIES ("hbase.table.name" = 
"hivehbasek");

Please suggest.

Thanks
Rinku Garg

From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: 29 February 2012 10:33
To: user@hive.apache.org
Subject: Re: Hive | HBase Integration

Hi Keshav
  Make your hive table EXTERNAL, it should get things rolling. If you are 
mapping hive to a Hbase table, then as a mandatory requirement the hive table 
has to be EXTERNAL.

Hope it helps.

Regards
Bejoy.K.S


From: "Savant, Keshav" 
mailto:keshav.c.sav...@fisglobal.com>>
To: "user@hive.apache.org" 
mailto:user@hive.apache.org>>
Sent: Tuesday, February 28, 2012 5:58 PM
Subject: Hive | HBase Integration

Hi All,

We did a successful setup of hadoop-0.20.203.0 and hive-0.7.1.

In our next step we are eyeing HBase integration with Hive. As far as we 
understand from articles available on internet and apache site, we can use 
HBase instead of derby as a metastore of Hive, this gives us more flexibility 
while handling very large data.

We are using hbase-0.92.0 to integrate it with Hive, till now HBase has been 
setup and we can create sample table on it and insert sample data in it, but we 
are not able to integrate it with Hive, because when we issue the command to 
create the hive specific table on HBase (below in box) the command does not 
execute completely; a new command line is shown with an asterisk (*), and 
the table does not get created.

CREATE TABLE hive_hbasetable_k(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "hivehbasek");



Please provide us some pointers (steps to follow) for doing this integration or 
what we are not doing correctly. Till now we got these below URLs to do this, 
any help is appreciated

http://mevivs.wordpress.com/2010/11/24/hivehbase-integration/
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

Kind regards,
Keshav C Savant
_
The information contained in this message is proprietary and/or confidential. 
If you are not the intended recipient, please: (i) delete the message and all 
copies; (ii) do not disclose, distribute or use the message in any manner; and 
(iii) notify the sender immediately. In addition, please be aware that any 
message addressed to our domain is subject to archiving and review by persons 
other than the intended recipient. Thank you.

_
The information contained in this message is proprietary and/or confidential. 
If you are not the intended recipient, please: (i) delete the message and all 
copies; (ii) do not disclose, distribute or use the message in any manner; and 
(iii) notify the sender immediately. In addition, please be aware that any 
message addressed to our domain is subject to archiving and review by persons 
other than the intended recipient. Thank you.



Re: parallel inserts ?

2012-02-15 Thread bejoy_ks
Hi John
   Yes Insert is parallel in default for hive. Hive QL gets transformed to 
mapreduce jobs and hence definitely it is parallel. The only case it is not 
parallel is when you have just 1 reducer . It is just reading and processing 
the input files and in parallel using map reduce jobs from the source table 
data dir and writes the desired output files to the destination table dir.  
 
Hive is just an abstraction over map reduce and can't be compared 
against a db in terms of features. Almost every data processing operation is 
just some map reduce jobs. 
Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: John B 
Date: Wed, 15 Feb 2012 10:59:09 
To: 
Reply-To: user@hive.apache.org
Subject: parallel inserts ?

Other sql databases typically can parallelize selects but are unable to
automatically parallelize inserts.

With the most recent stable hiveql will the following statement have
the --insert-- automatically parallelized ?

 INSERT OVERWRITE TABLE pv_gender
 SELECT pv_users.gender
 FROM pv_users


I understand there is now 'insert into ..select from' syntax. Is the
insert part of that statement automatically parallelized ?

What is the highest insert speed anybody has seen - and I am not
talking about imports I mean inserts from one table to another ?



Re: Doubt in INSERT query in Hive?

2012-02-15 Thread bejoy_ks
Bhavesh
   In this case if you are not using INSERT INTO, you may need some tmp 
table write the query output to that. Load that data from there to your target 
table's data dir. 
You are not writing that to any file while doing the LOAD DATA operation. 
Rather you are just moving the files(in hdfs) from the source location to the 
table's data dir (where the previous data files are present). In hdfs move 
operation there is just a meta data operation happening at file system level. 

 Go with INSERT INTO as it is a cleaner way in hql perspective.
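
INSERT INTO is available from hive 0.8 onwards; with that in place the append is a
one-liner (table and column names are placeholders):

INSERT INTO TABLE target_table
SELECT col1, col2 FROM source_table;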
Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Bhavesh Shah 
Date: Wed, 15 Feb 2012 15:03:07 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: Doubt in INSERT query in Hive?

Hi Bejoy K S,
Thanks for your reply.
The overhead is that in the select query I have about 85 columns. Writing this
to a file and then loading it again may take some time.
For that reason I am thinking that it will be inefficient.



-- 
Regards,
Bhavesh Shah


On Wed, Feb 15, 2012 at 2:51 PM,  wrote:

> **
> Hi Bhavesh
> INSERT INTO is supported in hive 0.8 . An upgrade would get you things
> rolling.
> LOAD DATA inefficient? What was the performance overhead you were facing
> here?
> Regards
> Bejoy K S
>
> From handheld, Please excuse typos.
> --
> *From: * Bhavesh Shah 
> *Date: *Wed, 15 Feb 2012 14:33:29 +0530
> *To: *; 
> *ReplyTo: * user@hive.apache.org
> *Subject: *Doubt in INSERT query in Hive?
>
> Hello,
> Whenever we want to insert into table we use:
> INSERT OVERWRITE TABLE TBL_NAME
> (SELECT )
> Due to this, table gets overwrites everytime.
>
> I don't want to overwrite table, I want append it everytime.
> I thought about LOAD TABLE , but writing the file may take more time and I
> don't think so that it will efficient.
>
> Does Hive Support INSERT INTO TABLE TAB_NAME?
> (I am using hive-0.7.1)
> Is there any patch for it? (But I don't know How to apply patch ?)
>
> Pls suggest me as soon as possible.
> Thanks.
>
>
>
> --
> Regards,
> Bhavesh Shah
>
>



Re: Doubt in INSERT query in Hive?

2012-02-15 Thread bejoy_ks
Hi Bhavesh
   INSERT INTO is supported in hive 0.8 . An upgrade would get you things 
rolling. 
LOAD DATA inefficient? What was the performance overhead you were facing here?

Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Bhavesh Shah 
Date: Wed, 15 Feb 2012 14:33:29 
To: ; 
Reply-To: user@hive.apache.org
Subject: Doubt in INSERT query in Hive?

Hello,
Whenever we want to insert into table we use:
INSERT OVERWRITE TABLE TBL_NAME
(SELECT )
Due to this, the table gets overwritten every time.

I don't want to overwrite the table, I want to append to it every time.
I thought about LOAD TABLE, but writing the file may take more time and I
don't think it will be efficient.

Does Hive Support INSERT INTO TABLE TAB_NAME?
(I am using hive-0.7.1)
Is there any patch for it? (But I don't know How to apply patch ?)

Pls suggest me as soon as possible.
Thanks.



-- 
Regards,
Bhavesh Shah



Re: external partitioned table

2012-02-08 Thread bejoy_ks
Hi Koert
As you are creating dirs/sub dirs using mapreduce jobs outside of hive, hive 
is unaware of these sub dirs. There is no other way in such cases other than an 
add partition DDL to register the dir as a hive partition. 
If you are using oozie or shell to trigger your jobs, you can accomplish it as
-use java to come up with the correct add partition statement and write those 
statement(s) into a file 
-execute the file using hive -f 
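
A minimal sketch of what the generated file and the call could look like (table,
column and value names are placeholders):

-- contents of add_partitions.hql, produced by the java step
ALTER TABLE sometable ADD PARTITION (partitionid='20130115');
ALTER TABLE sometable ADD PARTITION (partitionid='20130116');

Then run it with: hive -f add_partitions.hql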

Hope it helps!..


Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Koert Kuipers 
Date: Wed, 8 Feb 2012 11:04:18 
To: 
Reply-To: user@hive.apache.org
Subject: external partitioned table

hello all,

we have an external partitioned table in hive.

we add to this table by having map-reduce jobs (so not from hive) create
new subdirectories with the right format (partitionid=partitionvalue).

however hive doesn't pick them up automatically. we have to go into hive
shell and run "alter table sometable add partition
(partitionid=partitionvalue)". to make matter worse hive doesnt really lend
itself to running such an add-partition-operation from java (or for that
matter: hive doesn't lend itself to any easy programmatic manipulations...
grrr. but i will stop now before i go on a a rant).

any suggestions how to approach this? thanks!

best, koert



Re: Error when Creating an UDF

2012-02-06 Thread bejoy_ks
Hi
One of your jars is not available, and maybe that one has the required UDF class or 
related methods.

Hive was not able to locate your first jar

'/scripts/hiveMd5.jar does not exist'

Just fix this with the correct location. Everything should work fine.
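
In other words, the sequence should be something like the following, where the jar path
is only a placeholder for wherever the jar that actually contains
com.autoscout24.hive.udf.Md5 lives:

add jar /path/to/jar/with/Md5.jar;
CREATE TEMPORARY FUNCTION mymd5 AS 'com.autoscout24.hive.udf.Md5';
SELECT mymd5(some_column) FROM some_table LIMIT 1;  -- some_table/some_column are placeholders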
 
Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Jean-Charles Thomas 
Date: Mon, 6 Feb 2012 16:51:58 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Error when Creating an UDF

Hi everybody,

I am trying to create a UDF following the example in the Hive Wiki.
Everything is fine except the CREATE statement (see below), where an error occurs:

hive> add jar /scripts/hiveMd5.jar;
/scripts/hiveMd5.jar does not exist
hive> add jar /scripts/hive/udf/Md5.jar;
Added /scripts/hive/udf/Md5.jar to class path
Added resource: /scripts/hive/udf/Md5.jar
hive> CREATE TEMPORARY FUNCTION mymd5 AS 'com.autoscout24.hive.udf.Md5';
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.FunctionTask
hive>

in the Hive log, there is not much more:

2012-02-06 16:16:36,096 ERROR ql.Driver (SessionState.java:printError(343)) - 
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.FunctionTask

Any Help is welcome,

Thanks a lot for the help,

Jean-Charles





Re: Important Question

2012-01-25 Thread bejoy_ks
Real time? Definitely not hive. Go in for HBase, but don't expect HBase to be 
as flexible as an RDBMS. You need to choose your Row Key and Column Families 
wisely as per your requirements.
For data mining and analytics you can mount a Hive table over the corresponding 
HBase table and play with SQL-like queries.



Regards
Bejoy K S

-Original Message-
From: Dalia Sobhy 
Date: Wed, 25 Jan 2012 17:01:08 
To: ; 
Reply-To: user@hive.apache.org
Subject: Important Question


Dear all,
I am developing an API for medical use i.e Hospital admissions and all about 
patients, thus transactions and queries and realtime data is important here...
Therefore both real-time and analytical processing is a must..
Therefore which best suits my application Hbase or Hive or another method ??
Please reply quickly bec this is critical thxxx a million ;)
  


Re: Question on bucketed map join

2012-01-19 Thread bejoy_ks
Corrected a few typos in previous mail

Hi Avrila
   AFAIK the bucketed map join is not the default in hive and it happens only 
when the configuration parameter hive.optimize.bucketmapjoin is set to true. 
You may be getting the same execution plan because hive.optimize.bucketmapjoin 
is set to true in the hive configuration xml file. To cross-confirm, could you 
explicitly set this to false
(set hive.optimize.bucketmapjoin = false;)
in your hive session and get the query execution plan from the explain command. 
Please find some pointers inline.
1. Should I see sth different in the explain extended output if I set and unset 
the hive.optimize.bucketmapjoin option?
[Bejoy]Yes, you should be seeing different plans for both.
Try EXPLAIN your join query after setting this
set hive.optimize.bucketmapjoin = false;

2. Should I see something different in the output of hive while running the 
query if again I set and unset the hive.optimize.bucketmapjoin?
[Bejoy] No,Hive output should be the same. What ever is the execution plan for 
an join, optimally the end result should be same.

3. Is it possible that even though I set bucketmapjoin to true, Hive will still 
perform a normal map-side join for some reason? How can I check if this has 
actually happened?
[Bejoy] Hive would perform a plain map side join only if the following 
parameter is enabled. (default it is disabled)
set hive.auto.convert.join = true; you need to check this value in your 
configurations.
If it is enabled, then irrespective of the table size hive would always try a map 
join; it would fall back to a normal join only after the map join attempt fails.
AFAIK, if the number of buckets is the same or a multiple between the two tables 
involved in a join, and if the join is on the same columns that are bucketed, 
then with bucketmapjoin enabled it shouldn't execute a plain mapside join but a 
bucketed map side join would be triggered.

Hope it helps!..
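
For instance, a sketch of how to compare the two plans in the same session (table and
column names are placeholders):

SET hive.optimize.bucketmapjoin = true;
EXPLAIN EXTENDED
SELECT a.val, b.val FROM bucketed_a a JOIN bucketed_b b ON (a.key = b.key);

SET hive.optimize.bucketmapjoin = false;
EXPLAIN EXTENDED
SELECT a.val, b.val FROM bucketed_a a JOIN bucketed_b b ON (a.key = b.key);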


Regards
Bejoy K S

-Original Message-
From: Bejoy Ks 
Date: Thu, 19 Jan 2012 09:22:08 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Question on bucketed map join

Hi Avrila
   AFAIK the bucketed map join is not default in hive and it happens only 
when the values is set to true. It could be because the same value is already 
set in the hive configuration xml file. To cross confirm the same could you 
explicitly set this to false 

(set hive.optimize.bucketmapjoin = false;)and get the query execution plan from 
explain command. 


Please some pointers in line

1. Should I see sth different in the explain extended output if I set and unset 
the hive.optimize.bucketmapjoin option?
[Bejoy] you should be seeing the same
Try EXPLAIN your join query after setting this
set hive.optimize.bucketmapjoin = false;


2. Should I see something different in the output of hive while running 
the query if again I set and unset the hive.optimize.bucketmapjoin?
[Bejoy] No,Hive output should be the same. What ever is the execution plan for 
an join, optimally the end result should be same. 


3.
 Is it possible that even though I set bucketmapjoin to true, Hive will 
still perform a normal map-side join for some reason? How can I check if
 this has actually happened?
[Bejoy] Hive would perform a plain map-side join only if the following
parameter is enabled (it is disabled by default):

set hive.auto.convert.join = true; You need to check this value in your
configuration.
If it is enabled, then irrespective of the table size Hive would always try a map
join; it would fall back to a normal join only after the map join attempt fails.
AFAIK, if the number of buckets is the same or a multiple between the two tables
involved in a join, and the join is on the same columns that are bucketed, then
with bucketmapjoin enabled it shouldn't execute a plain map-side join; a bucketed
map-side join would be triggered instead.

Hope it helps!..

Regards
Bejoy.K.S




 From: Avrilia Floratou 
To: user@hive.apache.org 
Sent: Thursday, January 19, 2012 9:23 PM
Subject: Question on bucketed map join
 
Hi,

I have two tables with 8 buckets each on the same key and want to join them.
I ran "explain extended" and get the plan produced by HIVE which shows that a 
map-side join is a possible plan.

I then set in my script the hive.optimize.bucketmapjoin option to true and 
reran the "explain extended" query. I get the exact same plans as output.

I ran the query with and without the bucketmapjoin optimization and saw no 
difference in the running time.

I have the following questions:

1. Should I see something different in the explain extended output if I set and unset
the hive.optimize.bucketmapjoin option?

2. Should I see something different in the output of hive while running the 
query if again I set and unset the hive.optimize.bucketmapjoin?

3. Is it possible that even though I set bucketmapjoin to true, Hive will still
perform a normal map-side join for some reason? How can I check if this has
actually happened?

Re: Insert based on whether string contains

2012-01-04 Thread bejoy_ks
I agree with Matt on that aspect. The solution I proposed was purely based
on the sample data provided, where there were 3-digit comma-separated values.
If there is a chance of 4-digit values as well in event_list, you may need to
revisit the solution.

Regards
Bejoy K S

-Original Message-
From: "Tucker, Matt" 
Date: Wed, 4 Jan 2012 08:56:44 
To: user@hive.apache.org; Bejoy Ks
Reply-To: user@hive.apache.org
Subject: Re: Insert based on whether string contains 

The find_in_set() UDF is a safer choice for doing a search for that value, as 
%239% could also match 2390, which has a different meaning in Omniture logs.
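
As a minimal sketch of that suggestion (reusing the table and columns from the query
quoted below; the regexp_replace call is only an assumption, there to strip the spaces
shown in the sample data in case they are really stored that way):

INSERT OVERWRITE TABLE video_plays_for_sept
SELECT concat(visid_high, visid_low), geo_city, geo_country, geo_region
FROM omniture
WHERE find_in_set('239', regexp_replace(event_list, ' ', '')) > 0;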



On Jan 4, 2012, at 8:46 AM, "Bejoy Ks"  wrote:

Hi Dave

   If I get your requirement correctly, you need to load data into the
video_plays_for_sept table from the omniture table only if omniture.event_list
contains the string 239.

Try the following query, it should work fine.

INSERT OVERWRITE TABLE video_plays_for_sept
SELECT concat(visid_high, visid_low), geo_city, geo_country, geo_region FROM 
omniture WHERE event_list LIKE '%239%';

Hope it helps!..

Regards,
Bejoy.K.S


From: Dave Houston 
To: user@hive.apache.org
Sent: Wednesday, January 4, 2012 6:41 PM
Subject: Insert based on whether string contains

Hi there, I have a string that has '239, 236, 232, 934' (not always in that
order) and want to insert into another table if 239 is in the string.

INSERT OVERWRITE TABLE video_plays_for_sept

SELECT concat(visid_high, visid_low), geo_city, geo_country, geo_region from 
omniture where regexp_extract(event_list, '\d+') = "239";

is what I have at the minute, but it always returns 0 Rows loaded to
video_plays_for_sept.


Many thanks

Dave Houston
r...@crankyadmin.net







Re: Schemas/Databases in Hive

2011-12-22 Thread bejoy_ks
Multiple databases have also proved helpful for me in organizing tables into
corresponding databases when you have quite a large number of tables to manage.
I believe it'd also be helpful in providing access restrictions.
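
As a small illustrative sketch (assuming a hive version with database support; the
database and table names are made up):

CREATE DATABASE marketing;
CREATE TABLE marketing.campaigns (id INT, name STRING);
SHOW TABLES IN marketing;
SELECT * FROM marketing.campaigns LIMIT 10;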

 
Regards
Bejoy K S

-Original Message-
From: bejoy...@yahoo.com
Date: Thu, 22 Dec 2011 17:19:16 
To: 
Reply-To: bejoy...@yahoo.com
Subject: Re: Schemas/Databases in Hive

Ranjith
   Hive does support multiple databases. If you are on one of the more recent
versions of hive, try:
Create database testdb;
Use testdb;

It should give you what you are looking for.

Regards
Bejoy K S

-Original Message-
From: "Raghunath, Ranjith" 
Date: Thu, 22 Dec 2011 17:02:09 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Schemas/Databases in Hive

What is the intent of having tables in different databases or schemas in Hive? 
Thanks

Thank you,
Ranjith




Re: Schemas/Databases in Hive

2011-12-22 Thread bejoy_ks
Ranjith
   Hive does support multiple databases. If you are on one of the more recent
versions of hive, try:
Create database testdb;
Use testdb;

It should give you what you are looking for.

Regards
Bejoy K S

-Original Message-
From: "Raghunath, Ranjith" 
Date: Thu, 22 Dec 2011 17:02:09 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Schemas/Databases in Hive

What is the intent of having tables in different databases or schemas in Hive? 
Thanks

Thank you,
Ranjith



Re: Loading data into hive tables

2011-12-08 Thread bejoy_ks
Adithya
  The answer is yes. SQOOP is the tool you are looking for. It has an
import option to load data from any JDBC-compliant database into hive. It
even creates the hive table for you by referring to the source db table.

Hope it helps!..

Regards
Bejoy K S

-Original Message-
From: Aditya Singh30 
Date: Fri, 9 Dec 2011 09:57:26 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Loading data into hive tables

Hi,
I want to know if there is any way to load data directly from
some other DB, say Oracle/MySQL etc., into hive tables, without getting the
data from the DB into a text/rcfile/sequence file in a specific format and then
loading the data from that file into the hive table.

Regards,
Aditya




Re: Hive query failing on group by

2011-10-19 Thread bejoy_ks
Looks like some data problem. Were you using the GROUP BY query on the same data
set?
But if count(*) also throws an error, then it comes back to square one: an
installation/configuration problem with hive or map reduce.

Regards
Bejoy K S

-Original Message-
From: Mark Kerzner 
Date: Wed, 19 Oct 2011 10:55:34 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: Hive query failing on group by

Bejoy,

I've been using this install of Hive for some time now, and simple queries
and joins work fine. It's the GROUP BY that I have problems with, sometimes
even with COUNT(*).

I am trying to isolate the problem now, and reduce it to the smallest query
possible. I am also trying to find a workaround (I noticed that sometimes
rephrasing queries for Hive helps), since I need this for a project.

Thank you,
Mark

On Wed, Oct 19, 2011 at 10:25 AM,  wrote:

> ** Mark
> To ensure your hive installation is fine run two queries
> SELECT * FROM trans LIMIT 10;
> SELECT * FROM trans WHERE ***;
> You can try this for couple of different tables. If these queries return
> results and work fine as desired then your hive could be working good.
>
> If it works good as the second step issue a simple join between two tables
> on primitive data type columns. If that also looks good then you can kind of
> confirm that the bug is with your hive query.
>
> We can look into that direction then.
>
>
>
> Regards
> Bejoy K S
> --
> *From: * Mark Kerzner 
> *Date: *Wed, 19 Oct 2011 10:02:57 -0500
> *To: *
> *ReplyTo: * user@hive.apache.org
> *Subject: *Re: Hive query failing on group by
>
> Vikas,
>
> I am using Cloudera CDHU1 on Ubuntu. I get the same results on RedHat
> CDHU0.
>
> Mark
>
> On Wed, Oct 19, 2011 at 9:47 AM, Vikas Srivastava <
> vikas.srivast...@one97.net> wrote:
>
>> install hive with RPM, this is corrupted!!
>>
>> On Wed, Oct 19, 2011 at 8:01 PM, Mark Kerzner 
>> wrote:
>>
>>> Here is what my hive logs say
>>>
>>> hive -hiveconf hive.root.logger=DEBUG
>>>
>>> 2011-10-19 09:24:35,148 ERROR DataNucleus.Plugin
>>> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
>>> "org.eclipse.core.resources" but it cannot be resolved.
>>> 2011-10-19 09:24:35,150 ERROR DataNucleus.Plugin
>>> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
>>> "org.eclipse.core.runtime" but it cannot be resolved.
>>> 2011-10-19 09:24:35,150 ERROR DataNucleus.Plugin
>>> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
>>> "org.eclipse.text" but it cannot be resolved.
>>>
>>>
>>> On Wed, Oct 19, 2011 at 9:21 AM,  wrote:
>>>
 ** Hi Mark
 What does your Map reduce job logs say? Try figuring out the error form
 there. From hive CLI you could hardly find out the root cause of your
 errors. From job tracker web UI < http://hostname:50030/jobtracker.jsp>
 you can easily browse to failed tasks and get the actual exception from
 there. If you are not able to figure out from there then please post in
 those logs with your table schema.

 Regards
 Bejoy K S
 --
 *From: * Mark Kerzner 
 *Date: *Wed, 19 Oct 2011 09:06:13 -0500
 *To: *Hive user
 *ReplyTo: * user@hive.apache.org
 *Subject: *Hive query failing on group by

 HI,

 I am trying to figure out what I am doing wrong with this query and the
 unusual error I am getting. Also suspicious is the reduce % going up and
 down.

 select trans.property_id, day(trans.log_timestamp) from trans JOIN opts
 on trans.ext_booking_id["ext_booking_id"] = opts.ext_booking_id group by
 day(trans.log_timestamp), trans.property_id;

 2011-10-19 08:55:19,778 Stage-1 map = 0%,  reduce = 0%
 2011-10-19 08:55:22,786 Stage-1 map = 100%,  reduce = 0%
 2011-10-19 08:55:29,804 Stage-1 map = 100%,  reduce = 33%
 2011-10-19 08:55:32,811 Stage-1 map = 100%,  reduce = 0%
 2011-10-19 08:55:39,829 Stage-1 map = 100%,  reduce = 33%
 2011-10-19 08:55:43,839 Stage-1 map = 100%,  reduce = 0%
 2011-10-19 08:55:50,855 Stage-1 map = 100%,  reduce = 33%
 2011-10-19 08:55:54,864 Stage-1 map = 100%,  reduce = 0%
 2011-10-19 08:56:00,878 Stage-1 map = 100%,  reduce = 33%
 2011-10-19 08:56:04,887 Stage-1 map = 100%,  reduce = 0%
 2011-10-19 08:56:05,891 Stage-1 map = 100%,  reduce = 100%
 Ended Job = job_201110111849_0024 with errors
 FAILED: Execution Error, return code 2 from
 org.apache.hadoop.hive.ql.exec.MapRedTask

 Thank you,
 Mark

>>>
>>>
>>
>>
>> --
>> With Regards
>> Vikas Srivastava
>>
>> DWH & Analytics Team
>> Mob:+91 9560885900
>> One97 | Let's get talking !
>>
>>
>



Re: Hive query failing on group by

2011-10-19 Thread bejoy_ks
Mark
 To ensure your hive installation is fine run two queries
SELECT * FROM trans LIMIT 10;
SELECT * FROM trans WHERE ***;
You can try this for a couple of different tables. If these queries return
results and work as desired, then your hive installation should be fine.

If that works, as a second step issue a simple join between two tables on
primitive data type columns. If that also looks good, then you can be fairly
sure that the problem is with your hive query.

We can look into that direction then.



Regards
Bejoy K S

-Original Message-
From: Mark Kerzner 
Date: Wed, 19 Oct 2011 10:02:57 
To: 
Reply-To: user@hive.apache.org
Subject: Re: Hive query failing on group by

Vikas,

I am using Cloudera CDHU1 on Ubuntu. I get the same results on RedHat CDHU0.

Mark

On Wed, Oct 19, 2011 at 9:47 AM, Vikas Srivastava <
vikas.srivast...@one97.net> wrote:

> install hive with RPM, this is corrupted!!
>
> On Wed, Oct 19, 2011 at 8:01 PM, Mark Kerzner wrote:
>
>> Here is what my hive logs say
>>
>> hive -hiveconf hive.root.logger=DEBUG
>>
>> 2011-10-19 09:24:35,148 ERROR DataNucleus.Plugin
>> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
>> "org.eclipse.core.resources" but it cannot be resolved.
>> 2011-10-19 09:24:35,150 ERROR DataNucleus.Plugin
>> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
>> "org.eclipse.core.runtime" but it cannot be resolved.
>> 2011-10-19 09:24:35,150 ERROR DataNucleus.Plugin
>> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
>> "org.eclipse.text" but it cannot be resolved.
>>
>>
>> On Wed, Oct 19, 2011 at 9:21 AM,  wrote:
>>
>>> ** Hi Mark
>>> What does your Map reduce job logs say? Try figuring out the error form
>>> there. From hive CLI you could hardly find out the root cause of your
>>> errors. From job tracker web UI < http://hostname:50030/jobtracker.jsp>
>>> you can easily browse to failed tasks and get the actual exception from
>>> there. If you are not able to figure out from there then please post in
>>> those logs with your table schema.
>>>
>>> Regards
>>> Bejoy K S
>>> --
>>> *From: * Mark Kerzner 
>>> *Date: *Wed, 19 Oct 2011 09:06:13 -0500
>>> *To: *Hive user
>>> *ReplyTo: * user@hive.apache.org
>>> *Subject: *Hive query failing on group by
>>>
>>> HI,
>>>
>>> I am trying to figure out what I am doing wrong with this query and the
>>> unusual error I am getting. Also suspicious is the reduce % going up and
>>> down.
>>>
>>> select trans.property_id, day(trans.log_timestamp) from trans JOIN opts
>>> on trans.ext_booking_id["ext_booking_id"] = opts.ext_booking_id group by
>>> day(trans.log_timestamp), trans.property_id;
>>>
>>> 2011-10-19 08:55:19,778 Stage-1 map = 0%,  reduce = 0%
>>> 2011-10-19 08:55:22,786 Stage-1 map = 100%,  reduce = 0%
>>> 2011-10-19 08:55:29,804 Stage-1 map = 100%,  reduce = 33%
>>> 2011-10-19 08:55:32,811 Stage-1 map = 100%,  reduce = 0%
>>> 2011-10-19 08:55:39,829 Stage-1 map = 100%,  reduce = 33%
>>> 2011-10-19 08:55:43,839 Stage-1 map = 100%,  reduce = 0%
>>> 2011-10-19 08:55:50,855 Stage-1 map = 100%,  reduce = 33%
>>> 2011-10-19 08:55:54,864 Stage-1 map = 100%,  reduce = 0%
>>> 2011-10-19 08:56:00,878 Stage-1 map = 100%,  reduce = 33%
>>> 2011-10-19 08:56:04,887 Stage-1 map = 100%,  reduce = 0%
>>> 2011-10-19 08:56:05,891 Stage-1 map = 100%,  reduce = 100%
>>> Ended Job = job_201110111849_0024 with errors
>>> FAILED: Execution Error, return code 2 from
>>> org.apache.hadoop.hive.ql.exec.MapRedTask
>>>
>>> Thank you,
>>> Mark
>>>
>>
>>
>
>
> --
> With Regards
> Vikas Srivastava
>
> DWH & Analytics Team
> Mob:+91 9560885900
> One97 | Let's get talking !
>
>



Re: Hive query failing on group by

2011-10-19 Thread bejoy_ks
Hi Mark
 What does your Map reduce job log say? Try figuring out the error from
there. From the hive CLI you can hardly find out the root cause of your errors.
From the job tracker web UI < http://hostname:50030/jobtracker.jsp> you can easily
browse to the failed tasks and get the actual exception from there. If you are not
able to figure it out from there, then please post those logs along with your table
schema.


Regards
Bejoy K S

-Original Message-
From: Mark Kerzner 
Date: Wed, 19 Oct 2011 09:06:13 
To: Hive user
Reply-To: user@hive.apache.org
Subject: Hive query failing on group by

HI,

I am trying to figure out what I am doing wrong with this query and the
unusual error I am getting. Also suspicious is the reduce % going up and
down.

select trans.property_id, day(trans.log_timestamp) from trans JOIN opts on
trans.ext_booking_id["ext_booking_id"] = opts.ext_booking_id group by
day(trans.log_timestamp), trans.property_id;

2011-10-19 08:55:19,778 Stage-1 map = 0%,  reduce = 0%
2011-10-19 08:55:22,786 Stage-1 map = 100%,  reduce = 0%
2011-10-19 08:55:29,804 Stage-1 map = 100%,  reduce = 33%
2011-10-19 08:55:32,811 Stage-1 map = 100%,  reduce = 0%
2011-10-19 08:55:39,829 Stage-1 map = 100%,  reduce = 33%
2011-10-19 08:55:43,839 Stage-1 map = 100%,  reduce = 0%
2011-10-19 08:55:50,855 Stage-1 map = 100%,  reduce = 33%
2011-10-19 08:55:54,864 Stage-1 map = 100%,  reduce = 0%
2011-10-19 08:56:00,878 Stage-1 map = 100%,  reduce = 33%
2011-10-19 08:56:04,887 Stage-1 map = 100%,  reduce = 0%
2011-10-19 08:56:05,891 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201110111849_0024 with errors
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.MapRedTask

Thank you,
Mark



Re: upgrading hadoop package

2011-09-01 Thread bejoy_ks
Hi Li
  AFAIK 0.21 is not really a stable version of hadoop. So if this upgrade
is on a production cluster it'd be better to go with 0.20.203.
Regards
Bejoy K S

-Original Message-
From: Shouguo Li 
Date: Thu, 1 Sep 2011 11:41:46 
To: 
Reply-To: user@hive.apache.org
Subject: upgrading hadoop package

hey guys,

I'm planning to upgrade my hadoop cluster from 0.20.1 to 0.21 to take
advantage of the new bz2 splitting feature. I found a simple upgrade guide,
http://wiki.apache.org/hadoop/Hadoop_Upgrade
but I can't find anything related to hive. Do we need to do anything
for hive? Is the migration transparent to hive?
Thanks!



Re: Re:Re: Re: RE: Why a sql only use one map task?

2011-08-25 Thread bejoy_ks
Hi Daniel
 In the hadoop ecosystem the number of map tasks is actually decided
by the job, basically based on the number of input splits. Setting mapred.map.tasks
wouldn't guarantee that only that many map tasks are triggered. What worked here
for you is that you specified that a map task should process a minimum data
volume by setting a value for mapred.min.split.size.
 So in your case there were really 9 input splits, but when you imposed a
constraint on the minimum data that a map task should handle, the number of map
tasks came down to 3.
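
A small consolidated sketch of the knobs discussed in this thread, as they might appear
in a hive script (the split sizes are only illustrative values, in bytes; the last set
command keeps CombineHiveInputFormat from merging splits, as mentioned later in the thread):

set mapred.min.split.size=134217728;
set mapred.max.split.size=268435456;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select count(*) from sales;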
Regards
Bejoy K S

-Original Message-
From: "Daniel,Wu" 
Date: Thu, 25 Aug 2011 20:02:43 
To: 
Reply-To: user@hive.apache.org
Subject: Re:Re:Re: Re: RE: Why a sql only use one map task?

after I set
set mapred.min.split.size=2;

Then it will kick off 3 map tasks (the file I have is 500M).  So looks like we 
need to set mapred.min.split.size instead of mapred.map.tasks to control how 
many maps to kick off.


At 2011-08-25 19:38:30,"Daniel,Wu"  wrote:

It works, after I set as you said, but looks like I can't control the map task, 
it always use 9 maps, even if I set
set mapred.map.tasks=2;


Kind   | % Complete | Num Tasks | Pending | Running | Complete | Killed | Failed/Killed Task Attempts
map    | 100.00%    | 9         | 0       | 0       | 9        | 0      | 0 / 0
reduce | 100.00%    | 1         | 0       | 0       | 1        | 0      | 0 / 0



At 2011-08-25 06:35:38,"Ashutosh Chauhan"  wrote:
This may be because CombineHiveInputFormat is combining your splits in one map 
task. If you don't want that to happen, do:
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat


2011/8/24 Daniel,Wu

I pasted the information below; the map capacity is 6. And no matter how I
set mapred.map.tasks, such as to 3, it doesn't work; it always uses 1 map task
(please see the completed job information).



Cluster Summary (Heap Size is 16.81 MB/966.69 MB)
[Flattened jobtracker Cluster Summary table; the relevant figures are 3 nodes and a Map Task Capacity of 6.]


Completed Jobs
(Jobid | Priority | User | Name | Map % | Map Total | Maps Completed | Reduce % | Reduce Total | Reduces Completed | Job Scheduling Information | Diagnostic Info)
job_201108242119_0001 | NORMAL | oracle | select count(*) from test(Stage-1) | 100.00% | 0 | 0 | 100.00% | 1 | 1 | NA | NA
job_201108242119_0002 | NORMAL | oracle | select count(*) from test(Stage-1) | 100.00% | 1 | 1 | 100.00% | 1 | 1 | NA | NA
job_201108242119_0003 | NORMAL | oracle | select count(*) from test(Stage-1) | 100.00% | 1 | 1 | 100.00% | 1 | 1 | NA | NA
job_201108242119_0004 | NORMAL | oracle | select period_key,count(*) from...period_key(Stage-1) | 100.00% | 1 | 1 | 100.00% | 3 | 3 | NA | NA
job_201108242119_0005 | NORMAL | oracle | select period_key,count(*) from...period_key(Stage-1) | 100.00% | 1 | 1 | 100.00% | 3 | 3 | NA | NA
job_201108242119_0006 | NORMAL | oracle | select period_key,count(*) from...period_key(Stage-1) | 100.00% | 1 | 1 | 100.00% | 3 | 3 | NA | NA



At 2011-08-24 18:19:38,wd  wrote:
>What about your total Map Task Capacity?
>you may check it from http://your_jobtracker:50030/jobtracker.jsp

>
>2011/8/24 Daniel,Wu :
>> I checked my setting, all are with the default value.So per the book of
>> "Hadoop the definitive guide", the split size should be 64M. And the file
>> size is about 500M, so that's about 8 splits. And from the map job
>> information (after the map job is done), I can see it gets 8 split from one
>> node. But anyhow it starts only one map task.
>>
>>
>>
>> At 2011-08-24 02:28:18,"Aggarwal, Vaibhav"  wrote:
>>
>> If you actually have splittable files you can set the following setting to
>> create more splits:
>>
>>
>>
>> mapred.max.split.size appropriately.
>>
>>
>>
>> Thanks
>>
>> Vaibhav
>>
>>
>>
>> From: Daniel,Wu [mailto:hadoop...@163.com]
>> Sent: Tuesday, August 23, 2011 6:51 AM
>> To: hive
>> Subject: Why a sql only use one map task?
>>
>>
>>
>>   I run the following simple sql
>> select count(*) from sales;
>> And the job information shows it only uses one map task.
>>
>> The underlying hadoop has 3 data/data nodes. So I expect hive should kick
>> off 3 map tasks, one on each task nodes. What can make hive only run one map
>> task? Do I need to set something to kick off multiple map task?  in my
>> config, I didn't change hive config.
>>
>>
>>
>>











Re: Hive crashing after an upgrade - issue with existing larger tables

2011-08-18 Thread bejoy_ks
A small correction to my previous post. The CDH version is CDH u1 not u0
Sorry for the confusion

Regards
Bejoy K S

-Original Message-
From: Bejoy Ks 
Date: Thu, 18 Aug 2011 05:51:58 
To: hive user group
Reply-To: user@hive.apache.org
Subject: Hive crashing after an upgrade - issue with existing larger tables

Hi Experts

        I was working on hive with larger volume data with hive 0.7. Recently
my hive installation was upgraded to 0.7.1. After the upgrade I'm having a lot
of issues with queries that were already working fine with larger data. Queries
that took seconds to return results are now taking hours, and for most
larger tables even the map reduce jobs are not getting triggered. Queries like
SELECT * and DESCRIBE are working fine since they don't involve any map reduce
jobs. For the jobs that didn't even get triggered I got the following error
from the job tracker:

Job initialization failed: java.io.IOException: Split metadata size exceeded 
1000. 
Aborting job job_201106061630_6993 at 
org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
 
at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:807) 
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:701) 
at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4013) 
at 
org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) 
at java.lang.Thread.run(Thread.java:619) 


Looks like some metadata issue. My cluster is on CDH3-u0. Has anyone faced
similar issues before? Please share your thoughts on what could be the probable
cause of the error.

Thank You



Re: how to load data to partitioned table

2011-08-14 Thread bejoy_ks
Ya, I very much agree with you on those lines. Using the basic approach would
literally run into memory issues with large datasets. I had some of those
resolved by using the DISTRIBUTE BY clause and so on. In short, a little
reworking of your hive queries could help you out in some cases.
Regards
Bejoy K S

-Original Message-
From: hadoopman 
Date: Sun, 14 Aug 2011 08:57:12 
To: 
Reply-To: user@hive.apache.org
Subject: Re: how to load data to partitioned table

Something else I've noticed when loading LOTS of historical data: if
you can, try to load, say, a month of data at a time, and only that
month.  I've been able to load several years of data (depending on the
data) in a single load; however, there have been times when loading a
large dataset that I would run into memory issues during the reduce
phase (usually during shuffle/sort): everything from out-of-memory to
stack-overflow messages (I've compiled a list of the more fun ones).

Then I noticed that only loading data from say a single month loaded 
quickly and without the memory headaches during the reduce.

Something to keep in mind and it works great!



On 08/12/2011 07:58 AM, bejoy...@yahoo.com wrote:
> Hi Daniel
> Just having a look at your requirement , to load data into a partition 
> based hive table from any input file the most hassle free approach 
> would be.
> 1. Load the data into a non partitioned table that shares similar 
> structure as the target table.
> 2. Populate the target table with the data from non partitioned one 
> using hive dynamic partition
> approach.
> With Dynamic partitions you don't need to manually identify the data 
> partitions and distribute data accordingly.
>
> A similar implementation is described in the blog post
> www.kickstarthadoop.blogspot.com/2011/06/how-to-speed-up-your-hive-queries-in.html
>
> Hope it helps
>
> Regards
> Bejoy K S
>
> 
> *From: * Vikas Srivastava 
> *Date: *Fri, 12 Aug 2011 17:31:28 +0530
> *To: *
> *ReplyTo: * user@hive.apache.org
> *Subject: *Re: how to load data to partitioned table
>
> Hey ,
>
> Simpley you have run query like this
>
> FROM sales_temp INSERT OVERWRITE TABLE sales partition(period_key) 
> SELECT *
>
>
> Regards
> Vikas Srivastava
>
>
> 2011/8/12 Daniel,Wu 
>
>   suppose the table is partitioned by period_key, and the csv file
> also has a column named as period_key. The csv file contains
> multiple days of data, how can we load it in the the table?
>
> I think of an workaround by first load the data into a
> non-partition table, and then insert the data from non-partition
> table to the partition table.
>
> hive> INSERT OVERWRITE TABLE sales SELECT * FROM sales_temp;
> FAILED: Error in semantic analysis: need to specify partition
> columns because the destination table is partitioned.
>
>
> However it doesn't work also. please help.
>
>
>
>
>
> -- 
> With Regards
> Vikas Srivastava
>
> DWH & Analytics Team
> Mob:+91 9560885900
> One97 | Let's get talking !
>




Re: how to load data to partitioned table

2011-08-12 Thread bejoy_ks
Hi Daniel
  Just having a look at your requirement, to load data into a partitioned
hive table from any input file the most hassle-free approach would be:
1. Load the data into a non-partitioned table that shares a similar structure
with the target table.
2. Populate the target table with the data from the non-partitioned one using
the hive dynamic partition approach.
With dynamic partitions you don't need to manually identify the data partitions
and distribute the data accordingly.
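
A minimal sketch of the two steps, using the sales_temp and sales tables from your
example (col1 and col2 stand in for the real data columns; the dynamic partition
column has to come last in the SELECT):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

FROM sales_temp
INSERT OVERWRITE TABLE sales PARTITION (period_key)
SELECT col1, col2, period_key;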

A similar implementation is described in the blog post
www.kickstarthadoop.blogspot.com/2011/06/how-to-speed-up-your-hive-queries-in.html

Hope it helps

Regards
Bejoy K S

-Original Message-
From: Vikas Srivastava 
Date: Fri, 12 Aug 2011 17:31:28 
To: 
Reply-To: user@hive.apache.org
Subject: Re: how to load data to partitioned table

Hey,

Simply, you have to run a query like this:

FROM sales_temp INSERT OVERWRITE TABLE sales partition(period_key) SELECT *


Regards
Vikas Srivastava


2011/8/12 Daniel,Wu 

>   suppose the table is partitioned by period_key, and the csv file also has
> a column named as period_key. The csv file contains multiple days of data,
> how can we load it in the the table?
>
> I think of an workaround by first load the data into a non-partition table,
> and then insert the data from non-partition table to the partition table.
>
> hive> INSERT OVERWRITE TABLE sales SELECT * FROM sales_temp;
> FAILED: Error in semantic analysis: need to specify partition columns
> because the destination table is partitioned.
>
>
> However it doesn't work also. please help.
>
>
>


-- 
With Regards
Vikas Srivastava

DWH & Analytics Team
Mob:+91 9560885900
One97 | Let's get talking !



Re: why need to copy when run a sql with a single map

2011-08-10 Thread bejoy_ks
Hi
  Hive queries are compiled into hadoop map reduce jobs. In a map reduce job,
between the map and reduce tasks there are two phases, the copy phase and the
sort phase, together known as the shuffle and sort phase. So the copy task
indicated in the hive job here should be the copy phase of map reduce: it copies
the map output from the map task nodes to the corresponding reduce task nodes.

Regards
Bejoy K S

-Original Message-
From: "Daniel,Wu" 
Date: Wed, 10 Aug 2011 20:07:48 
To: hive
Reply-To: user@hive.apache.org
Subject: why need to copy when run a sql with a single map

I run a single query like

select retailer_key,count(*) from records group by retailer_key;

it uses a single map as shown below, since the file is already on HDFS, so I 
think hadoop/hive doesn't need to copy anything.


Kind   | % Complete | Num Tasks | Pending | Running | Complete | Killed | Failed/Killed Task Attempts
map    | 100.00%    | 1         | 0       | 0       | 1        | 0      | 0 / 0
reduce | 100.00%    | 1         | 0       | 0       | 1        | 0      | 0 / 0

but the final chart in the job report shows "copy" takes about 33% of the
total time, and the rest is "sort" and "reduce".  So why should it copy here,
or does copy mean something else?
 oracle@oracle-MS-7623:~/test$ hadoop fs -lsr /

drwxr-xr-x   - oracle supergroup  0 2011-08-10 19:46 /user
drwxr-xr-x   - oracle supergroup  0 2011-08-10 19:46 /user/hive
drwxr-xr-x   - oracle supergroup  0 2011-08-10 19:59 
/user/hive/warehouse
drwxr-xr-x   - oracle supergroup  0 2011-08-10 19:59 
/user/hive/warehouse/records
-rw-r--r--   1 oracle supergroup   41600256 2011-08-10 19:59 
/user/hive/warehouse/records/test.txt





Re: Hive or pig for sequential iterations like those using foreach

2011-08-08 Thread bejoy_ks
Thanks Amareshwari, the article gave me many valuable hints for making my
choice. But out of curiosity, does hive support stage-by-stage iterative
processing? If so, how?

Thank You
Regards
Bejoy K S

-Original Message-
From: Amareshwari Sri Ramadasu 
Date: Mon, 8 Aug 2011 17:14:21 
To: user@hive.apache.org; 
bejoy...@yahoo.com
Reply-To: user@hive.apache.org
Subject: Re: Hive or pig for sequential iterations like those using foreach

You can have a look at typical use cases of Pig and Hive here 
http://developer.yahoo.com/blogs/hadoop/posts/2010/08/pig_and_hive_at_yahoo/

Thanks
Amareshwari

On 8/8/11 5:10 PM, "bejoy...@yahoo.com"  wrote:

Hi
   I've been successful using hive for a past few projects. Now for a 
particular use case I'm bit confused what to choose, Hive or Pig. My project 
involves a step by step sequential work flow. In every step I retrieve some 
values based on some query, use these values as input to new queries 
iterative(similar to foreach implementation in Pig) and so on. Is hive a good 
choice here when I'm having 11 sequence of operation as described?  The second 
confusion for me is, does hive support 'foreach' equivalent functionality?

Please advise.

I'm from JAVA background, not much into  db development so not sure of any such 
concepts in SQL.

Thanks

Regards
Bejoy K S





Hive or pig for sequential iterations like those using foreach

2011-08-08 Thread bejoy_ks
Hi 
   I've been successful using hive for the past few projects. Now for a
particular use case I'm a bit confused about what to choose, Hive or Pig. My
project involves a step-by-step sequential work flow. In every step I retrieve
some values based on some query, use these values as input to new queries
iteratively (similar to the foreach implementation in Pig), and so on. Is hive a
good choice here when I have a sequence of 11 operations as described?  The
second confusion for me is, does hive support 'foreach'-equivalent functionality?

Please advise.

I'm from a Java background, not much into db development, so I'm not sure of any
such concepts in SQL.

Thanks 

Regards
Bejoy K S



Re: NPE with hive.cli.print.header=true;

2011-08-01 Thread bejoy_ks
Hi Ayon
AFAIK hive is supposed to behave that way. If you set
hive.cli.print.header=true to enable column headers, then some commands like
'desc' are not expected to work. I'm not sure whether a patch has come out
recently for this.
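
A possible workaround sketch until this is resolved, toggling the flag off around the
metadata commands (my_table is a hypothetical table name):

set hive.cli.print.header=false;
desc my_table;
set hive.cli.print.header=true;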

Regards
Bejoy K S

-Original Message-
From: Ayon Sinha 
Date: Mon, 1 Aug 2011 17:29:17 
To: Hive Mailinglist
Reply-To: user@hive.apache.org
Subject: NPE with hive.cli.print.header=true;

With 
set hive.cli.print.header=true;


I get NPE's for "desc" as well as "use"

Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:241)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:456)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

Is there a patch for this?
 
-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.



Re: Partition by existing field?

2011-07-08 Thread bejoy_ks
Hi Travis
 From my understanding of your requirement, Dynamic Partitions in hive
are the most suitable solution.

I have written a blog post on such requirements; please refer to
http://kickstarthadoop.blogspot.com/2011/06/how-to-speed-up-your-hive-queries-in.html
for an understanding of the implementation. You can refer to the hive wiki as
well.
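
As a rough sketch of how that would look here, assuming dt and hour are declared as
the partition columns of tealeaf_event and that the other columns are listed explicitly
before them (col1 and col2 stand in for the real data columns):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE tealeaf_event PARTITION (dt, hour)
SELECT evt.col1, evt.col2, evt.datestring, evt.hour
FROM staging_event evt;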

Please revert for any clarification
Regards
Bejoy K S

-Original Message-
From: "Travis Powell" 
Date: Fri, 8 Jul 2011 13:11:58 
To: 
Reply-To: user@hive.apache.org
Subject: Partition by existing field?

Can I partition by an existing field?

 

I have a 10 GB file with a date field and an hour of day field. Can I
load this file into a table, then insert-overwrite into another
partitioned table that uses those fields as a partition? Would something
like the following work?

 

INSERT OVERWRITE TABLE tealeaf_event
PARTITION(dt=evt.datestring,hour=evt.hour) SELECT * FROM staging_event
evt;

 

Thanks!

Travis




Re: Hive create table

2011-05-25 Thread bejoy_ks
Hi Jinhang
   I don't think hive supports multi-character delimiters. The hassle-free
option here would be to preprocess the data using mapreduce to replace the
multi-character delimiter with another permissible one that suits your data.
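
For example, if the preprocessing step rewrites the data with a tab as the
single-character delimiter, the target table could then be declared along these lines
(the table and column names are made up):

CREATE TABLE my_data (col1 INT, col2 INT, col3 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;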
Regards
Bejoy K S

-Original Message-
From: jinhang du 
Date: Wed, 25 May 2011 19:56:16 
To: 
Reply-To: user@hive.apache.org
Subject: Hive create table

Hi all,

I want to customize the delimiter of the table in a row.
For example, my data format is '124', and how could I create a table (int,
int, int)?

Thanks.

-- 
dujinhang



Re: Hive map join - process a little larger tables withmoderatenumber of rows

2011-04-01 Thread bejoy_ks
Thanks for your reply, Viral. However, in later versions of hive you don't have
to tell hive anything (i.e., which is the smaller table). At runtime hive itself
identifies the smaller table and runs the local map task on it, irrespective of
whether it comes on the left or right side of the join. There is a Facebook post
on such join optimizations within hive; you can get a better picture from it.
Regards
Bejoy K S

-Original Message-
From: Viral Bajaria 
Date: Fri, 1 Apr 2011 01:25:41 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: Hive map join - process a little larger tables with
 moderatenumber of rows

Bejoy,

We still use older version of Hive (0.5). In that version the join order
used to matter. You needed to keep the largest table as the rightmost in
your JOIN sequence to make sure that it is streamed and thus avoid the OOM
exceptions which are caused by mappers which load the entire table in memory
and run out of JVM -Xmx parameter.

If you cannot do that, then you can use the STREAMTABLE hint as follows:
SELECT /*+ STREAMTABLE(t1) */ * FROM t1 join t2 on t1.col1 = t2.col1 <.>

Thanks,
Viral

On Thu, Mar 31, 2011 at 10:15 PM,  wrote:

> Thanks Yongqiang for your reply. I'm running a hive script which has nearly
> 10 joins within. From those joins all map joins(9 of them involves one small
> table) involving smaller tables are running fine. Just 1 join is on two
> larger tables and this map join fails, however since the back up task(common
> join) is executed successfully the whole hive job runs to completion
> successfully.
>  In brief my hive job is running successfully now, but I just want to
> get the failed map join as well running instead of the common join being
> executed. I'm curious to see what would be the performance improvement out
> there with this difference in execution.
>   To get a map join executed on larger tables do I have to for memory
> parameters with hadoop?
> Since my entire task is already running to completion and I want get just a
> map join working, shouldn't altering some hive map join parameters do my
> job?
> Please advise
>
>
> Regards
> Bejoy K S
>
> -Original Message-
> From: yongqiang he 
> Date: Thu, 31 Mar 2011 16:25:03
> To: 
> Reply-To: user@hive.apache.org
> Subject: Re: Hive map join - process a little larger tables with moderate
>  number of rows
>
> You possibly got a OOM error when processing the small tables. OOM is
> a fatal error that can not be controlled by the hive configs. So can
> you try to increase your memory setting?
>
> thanks
> yongqiang
> On Thu, Mar 31, 2011 at 7:25 AM, Bejoy Ks  wrote:
> > Hi Experts
> > I'm currently working with hive 0.7 mostly with JOINS. In all
> > permissible cases i'm using map joins by setting the
> > hive.auto.convert.join=true  parameter. Usage of local map joins have
> made a
> > considerable performance improvement in hive queries.I have used this
> local
> > map join only on the default set of hive configuration parameters now i'd
> > try to dig more deeper into this. Want to try out this local map join on
> > little bigger tables with more no of rows. Given below is a failure log
> of
> > one of my local map tasks and in turn executing its back up common join
> task
> >
> > 2011-03-31 09:56:54 Starting to launch local task to process map
> > join;  maximum memory = 932118528
> > 2011-03-31 09:56:57 Processing rows:20  Hashtable size:
> > 19  Memory usage:   115481024   rate:   0.124
> > 2011-03-31 09:57:00 Processing rows:30  Hashtable size:
> > 29  Memory usage:   169344064   rate:   0.182
> > 2011-03-31 09:57:03 Processing rows:40  Hashtable size:
> > 39  Memory usage:   232132792   rate:   0.249
> > 2011-03-31 09:57:06 Processing rows:50  Hashtable size:
> > 49  Memory usage:   282338544   rate:   0.303
> > 2011-03-31 09:57:10 Processing rows:60  Hashtable size:
> > 59  Memory usage:   336738640   rate:   0.361
> > 2011-03-31 09:57:14 Processing rows:70  Hashtable size:
> > 69  Memory usage:   391117888   rate:   0.42
> > 2011-03-31 09:57:22 Processing rows:80  Hashtable size:
> > 79  Memory usage:   453906496   rate:   0.487
> > 2011-03-31 09:57:27 Processing rows:90  Hashtable size:
> > 89  Memory usage:   508306552   rate:   0.545
> > 2011-03-31 09:57:34 Processing rows:100 Hashtable size:
> > 99  Memory usage:   562706496   rate:   0.604
> > FAILED: Execution Error, return code 2 from
> > org.apache.hadoop.hive.ql.exec.MapredLocalTask
> > ATTEMPT: Execute BackupTask: org.apache.hadoop.hive.ql.exec.MapRedTask
> > Launching Job 4 out of 6
> >
> >
> > Here i"d like to make this local map task running, for the same i tried
> > setting the following hive parameters as
> > hive -f  HiveJob.txt -hiveconf hive.mapjoin.maxsize=100 -hiveconf

Re: Hive map join - process a little larger tables withmoderatenumber of rows

2011-04-01 Thread bejoy_ks
Thanks Yongqiang. It worked for me and I was able to evaluate the performance.
It proved to be expensive :)
Regards
Bejoy K S

-Original Message-
From: yongqiang he 
Date: Thu, 31 Mar 2011 22:27:26 
To: ; 
Reply-To: user@hive.apache.org
Subject: Re: Hive map join - process a little larger tables with
 moderatenumber of rows

Can you try this one, "hive.mapred.local.mem" (in MB)? It controls
the heap size of the join's local child process.
You can also try to increase HADOOP_HEAPSIZE for your hive client.

But all of this depends on how big your small file is.
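
A small sketch of the first knob, as it might appear in a hive session (the value, in
MB, is only illustrative and covers the local child process mentioned above):

set hive.mapred.local.mem=1024;

HADOOP_HEAPSIZE, on the other hand, is set in the environment of the hive client (for
example in hadoop-env.sh) rather than through a set command.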

thanks
yongqiang
On Thu, Mar 31, 2011 at 10:15 PM,   wrote:
> Thanks Yongqiang for your reply. I'm running a hive script which has nearly 
> 10 joins within. From those joins all map joins(9 of them involves one small 
> table) involving smaller tables are running fine. Just 1 join is on two 
> larger tables and this map join fails, however since the back up task(common 
> join) is executed successfully the whole hive job runs to completion 
> successfully.
>      In brief my hive job is running successfully now, but I just want to get 
> the failed map join as well running instead of the common join being 
> executed. I'm curious to see what would be the performance improvement out 
> there with this difference in execution.
>       To get a map join executed on larger tables do I have to for memory 
> parameters with hadoop?
> Since my entire task is already running to completion and I want get just a 
> map join working, shouldn't altering some hive map join parameters do my job?
> Please advise
>
>
> Regards
> Bejoy K S
>
> -Original Message-
> From: yongqiang he 
> Date: Thu, 31 Mar 2011 16:25:03
> To: 
> Reply-To: user@hive.apache.org
> Subject: Re: Hive map join - process a little larger tables with moderate
>  number of rows
>
> You possibly got a OOM error when processing the small tables. OOM is
> a fatal error that can not be controlled by the hive configs. So can
> you try to increase your memory setting?
>
> thanks
> yongqiang
> On Thu, Mar 31, 2011 at 7:25 AM, Bejoy Ks  wrote:
>> Hi Experts
>>     I'm currently working with hive 0.7 mostly with JOINS. In all
>> permissible cases i'm using map joins by setting the
>> hive.auto.convert.join=true  parameter. Usage of local map joins have made a
>> considerable performance improvement in hive queries.I have used this local
>> map join only on the default set of hive configuration parameters now i'd
>> try to dig more deeper into this. Want to try out this local map join on
>> little bigger tables with more no of rows. Given below is a failure log of
>> one of my local map tasks and in turn executing its back up common join task
>>
>> 2011-03-31 09:56:54 Starting to launch local task to process map
>> join;  maximum memory = 932118528
>> 2011-03-31 09:56:57 Processing rows:    20  Hashtable size:
>> 19  Memory usage:   115481024   rate:   0.124
>> 2011-03-31 09:57:00 Processing rows:    30  Hashtable size:
>> 29  Memory usage:   169344064   rate:   0.182
>> 2011-03-31 09:57:03 Processing rows:    40  Hashtable size:
>> 39  Memory usage:   232132792   rate:   0.249
>> 2011-03-31 09:57:06 Processing rows:    50  Hashtable size:
>> 49  Memory usage:   282338544   rate:   0.303
>> 2011-03-31 09:57:10 Processing rows:    60  Hashtable size:
>> 59  Memory usage:   336738640   rate:   0.361
>> 2011-03-31 09:57:14 Processing rows:    70  Hashtable size:
>> 69  Memory usage:   391117888   rate:   0.42
>> 2011-03-31 09:57:22 Processing rows:    80  Hashtable size:
>> 79  Memory usage:   453906496   rate:   0.487
>> 2011-03-31 09:57:27 Processing rows:    90  Hashtable size:
>> 89  Memory usage:   508306552   rate:   0.545
>> 2011-03-31 09:57:34 Processing rows:    100 Hashtable size:
>> 99  Memory usage:   562706496   rate:   0.604
>> FAILED: Execution Error, return code 2 from
>> org.apache.hadoop.hive.ql.exec.MapredLocalTask
>> ATTEMPT: Execute BackupTask: org.apache.hadoop.hive.ql.exec.MapRedTask
>> Launching Job 4 out of 6
>>
>>
>> Here i"d like to make this local map task running, for the same i tried
>> setting the following hive parameters as
>> hive -f  HiveJob.txt -hiveconf hive.mapjoin.maxsize=100 -hiveconf
>> hive.mapjoin.smalltable.filesize=4000 -hiveconf
>> hive.auto.convert.join=true
>> Butting setting the two config parameters doesn't make my local map task
>> proceed beyond this stage.  I didn't try out
>> overriding the hive.mapjoin.localtask.max.memory.usage=0.90 because from my
>> task log shows that the memory usage rate is just 0.604, so i assume setting
>> the same with a larger value wont cater to a solution in my case.Could some
>> one please guide me what are the actual parameters and the values I should
>> set to get things rolling.
>>
>> Thank You
>>
>> Regard

Re: Hive map join - process a little larger tables with moderatenumber of rows

2011-03-31 Thread bejoy_ks
Thanks Yongqiang for your reply. I'm running a hive script which has nearly 10
joins within it. Of those, all the map joins involving smaller tables (9 of them
involve one small table) are running fine. Just 1 join is on two larger tables,
and this map join fails; however, since the backup task (common join) is
executed successfully, the whole hive job runs to completion.
  In brief, my hive job is running successfully now, but I just want to get
the failed map join running as well, instead of the common join being executed.
I'm curious to see what the performance improvement would be with this
difference in execution.
   To get a map join executed on larger tables, do I have to tune memory
parameters in hadoop?
Since my entire task is already running to completion and I want to get just the
map join working, shouldn't altering some hive map join parameters do the job?
Please advise


Regards
Bejoy K S

-Original Message-
From: yongqiang he 
Date: Thu, 31 Mar 2011 16:25:03 
To: 
Reply-To: user@hive.apache.org
Subject: Re: Hive map join - process a little larger tables with moderate
 number of rows

You possibly got a OOM error when processing the small tables. OOM is
a fatal error that can not be controlled by the hive configs. So can
you try to increase your memory setting?

thanks
yongqiang
On Thu, Mar 31, 2011 at 7:25 AM, Bejoy Ks  wrote:
> Hi Experts
>     I'm currently working with hive 0.7 mostly with JOINS. In all
> permissible cases i'm using map joins by setting the
> hive.auto.convert.join=true  parameter. Usage of local map joins have made a
> considerable performance improvement in hive queries.I have used this local
> map join only on the default set of hive configuration parameters now i'd
> try to dig more deeper into this. Want to try out this local map join on
> little bigger tables with more no of rows. Given below is a failure log of
> one of my local map tasks and in turn executing its back up common join task
>
> 2011-03-31 09:56:54 Starting to launch local task to process map
> join;  maximum memory = 932118528
> 2011-03-31 09:56:57 Processing rows:    20  Hashtable size:
> 19  Memory usage:   115481024   rate:   0.124
> 2011-03-31 09:57:00 Processing rows:    30  Hashtable size:
> 29  Memory usage:   169344064   rate:   0.182
> 2011-03-31 09:57:03 Processing rows:    40  Hashtable size:
> 39  Memory usage:   232132792   rate:   0.249
> 2011-03-31 09:57:06 Processing rows:    50  Hashtable size:
> 49  Memory usage:   282338544   rate:   0.303
> 2011-03-31 09:57:10 Processing rows:    60  Hashtable size:
> 59  Memory usage:   336738640   rate:   0.361
> 2011-03-31 09:57:14 Processing rows:    70  Hashtable size:
> 69  Memory usage:   391117888   rate:   0.42
> 2011-03-31 09:57:22 Processing rows:    80  Hashtable size:
> 79  Memory usage:   453906496   rate:   0.487
> 2011-03-31 09:57:27 Processing rows:    90  Hashtable size:
> 89  Memory usage:   508306552   rate:   0.545
> 2011-03-31 09:57:34 Processing rows:    100 Hashtable size:
> 99  Memory usage:   562706496   rate:   0.604
> FAILED: Execution Error, return code 2 from
> org.apache.hadoop.hive.ql.exec.MapredLocalTask
> ATTEMPT: Execute BackupTask: org.apache.hadoop.hive.ql.exec.MapRedTask
> Launching Job 4 out of 6
>
>
> Here i"d like to make this local map task running, for the same i tried
> setting the following hive parameters as
> hive -f  HiveJob.txt -hiveconf hive.mapjoin.maxsize=100 -hiveconf
> hive.mapjoin.smalltable.filesize=4000 -hiveconf
> hive.auto.convert.join=true
> Butting setting the two config parameters doesn't make my local map task
> proceed beyond this stage.  I didn't try out
> overriding the hive.mapjoin.localtask.max.memory.usage=0.90 because from my
> task log shows that the memory usage rate is just 0.604, so i assume setting
> the same with a larger value wont cater to a solution in my case.Could some
> one please guide me what are the actual parameters and the values I should
> set to get things rolling.
>
> Thank You
>
> Regards
> Bejoy.K.S
>
>


Re: Hadoop error 2 while joining two large tables

2011-03-17 Thread bejoy_ks
Try out CDH3b4; it has hive 0.7 and the latest of the other hadoop tools. When
you work with open source it is definitely good practice to upgrade to the
latest versions. With newer versions bugs are fewer, performance is better, and
you get more functionality. Your query looks fine; an upgrade of hive could sort
things out.
Regards
Bejoy K S

-Original Message-
From: Edward Capriolo 
Date: Thu, 17 Mar 2011 08:51:05 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Hadoop error 2 while joining two large tables

I am pretty sure the cloudera distro has an upgrade path to a more recent hive.

On Thursday, March 17, 2011, hadoop n00b  wrote:
> Hello All,
>
> Thanks a lot for your response. To clarify a few points -
>
> I am on CDH2 with Hive 0.4 (I think). We cannot move to a higher version of 
> Hive as we have to use Cloudera distro only.
>
> All records in the smaller table have at least one record in the larger table 
> (of course a few exceptions could be there but only a few).
>
> The join is using ON clause. The query is something like -
>
> select ...
> from
> (
>   (select ... from smaller_table)
>   join
>   (select from larger_table)
>   on (smaller_table.col = larger_table.col)
> )
>
> I will try out setting mapred.child.java.opts -Xmx to a higher value and let 
> you know.
>
> Is there a pattern or rule of thumb to follow on when to add more nodes?
>
> Thanks again!
>
> On Thu, Mar 17, 2011 at 1:08 AM, Steven Wong  wrote:
>
>
>
> In addition, put the smaller table on the left-hand side of a JOIN:
>
> SELECT ... FROM small_table JOIN large_table ON ...
>
>
>
>
> From: Bejoy Ks [mailto:bejoy...@yahoo.com]
> Sent: Wednesday, March 16, 2011 11:43 AM
>
> To: user@hive.apache.org
> Subject: Re: Hadoop error 2 while joining two large tables
>
>
>
>
>
>
> Hey hadoop n00b
>     I second Mark's thought. But definitely you can try out re framing your 
> query to get things rolling. I'm not sure on your hive Query.But still, from 
> my experience with joins on huge tables (record counts in the range of 
> hundreds of millions) you should give join conditions with JOIN ON clause 
> rather than specifying all conditions in WHERE.
>
> Say if you have a query this way
> SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b WHERE
> a.Column4=b.Column1 AND a.Column2=b.Column4 AND a.Column3 > b.Column2;
>
> You can definitely re frame this query as
> SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b
> ON (a.Column4=b.Column1 AND a.Column2=b.Column4)  WHERE a.Column3 > b.Column2;
>
> From my understanding Hive supports equijoins so you can't have the 
> inequality conditions there within JOIN ON, inequality should come to WHERE. 
> This approach has worked for me when I encountered a similar situation as 
> yours some time ago. Try this out,hope it helps.
>
> Regards
> Bejoy.K.S
>
>
>
>
>
>
>
>
> From: "Sunderlin, Mark" 
> To: "user@hive.apache.org" 
> Sent: Wed, March 16, 2011 11:22:09 PM
> Subject: RE: Hadoop error 2 while joining two large tables
>
>
>
>
> hadoop n00b asks, “Is adding more nodes the solution to such problem?”
>
> Whatever else answers you get, you should append “ … and add more nodes.” 
> More nodes is never a bad thing ;-)
>
>
> ---
> Mark E. Sunderlin
> Solutions Architect |AOL Data Warehouse
>
> P: 703-256-6935 | C: 540-327-6222
>
> AIM: MESunderlin
> 22000 AOL Way