Re: how to let hive support lzo

2013-07-22 Thread bejoy_ks

Hi,

Along with the mapred.compress* properties try to set
hive.exec.compress.output to true.
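
For illustration, a minimal sketch of the settings being discussed (the codec class
names assume the hadoop-lzo build used elsewhere in this thread; adjust to your
installation):

  -- session settings sketch, not a confirmed fix
  SET hive.exec.compress.output=true;
  SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
  SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;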

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: ch huang justlo...@gmail.com
Date: Mon, 22 Jul 2013 13:41:01 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: how to let hive support lzo

# hbase org.apache.hadoop.hbase.util.CompressionTest
hdfs://CH22:9000/alex/my.txt lzo
13/07/22 13:27:58 WARN conf.Configuration: hadoop.native.lib is deprecated.
Instead, use io.native.lib.available
13/07/22 13:27:59 INFO util.ChecksumType: Checksum using
org.apache.hadoop.util.PureJavaCrc32
13/07/22 13:27:59 INFO util.ChecksumType: Checksum can use
org.apache.hadoop.util.PureJavaCrc32C
13/07/22 13:27:59 ERROR metrics.SchemaMetrics: Inconsistent configuration.
Previous configuration for using table name in metrics: true, new
configuration: false
13/07/22 13:27:59 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/07/22 13:27:59 INFO lzo.LzoCodec: Successfully loaded & initialized
native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
13/07/22 13:27:59 INFO compress.CodecPool: Got brand-new compressor
[.lzo_deflate]
13/07/22 13:28:00 INFO compress.CodecPool: Got brand-new decompressor
[.lzo_deflate]
SUCCESS




# hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.15.jar
com.hadoop.compression.lzo.LzoIndexer /alex
13/07/22 09:39:04 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/07/22 09:39:04 INFO lzo.LzoCodec: Successfully loaded & initialized
native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
13/07/22 09:39:04 INFO lzo.LzoIndexer: LZO Indexing directory /alex...
13/07/22 09:39:04 INFO lzo.LzoIndexer:   LZO Indexing directory
hdfs://CH22:9000/alex/alex_t...
13/07/22 09:39:04 INFO lzo.LzoIndexer:   [INDEX] LZO Indexing file
hdfs://CH22:9000/alex/sqoop-1.99.2-bin-hadoop200.tar.gz.lzo, size 0.02 GB...
13/07/22 09:39:05 WARN conf.Configuration: hadoop.native.lib is deprecated.
Instead, use io.native.lib.available
13/07/22 09:39:06 INFO lzo.LzoIndexer:   Completed LZO Indexing in 1.16
seconds (13.99 MB/s).  Index size is 0.52 KB.

13/07/22 09:39:06 INFO lzo.LzoIndexer:   [INDEX] LZO Indexing file
hdfs://CH22:9000/alex/test1.lzo, size 0.00 GB...
13/07/22 09:39:06 INFO lzo.LzoIndexer:   Completed LZO Indexing in 0.08
seconds (0.00 MB/s).  Index size is 0.01 KB.


On Mon, Jul 22, 2013 at 1:37 PM, ch huang justlo...@gmail.com wrote:

 hi, all:
  I have already installed and tested LZO in Hadoop and HBase, both successfully, but
 when I try it in Hive it fails. How can I make Hive recognize LZO?


 hive> set mapred.map.output.compression.codec;

 mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
 hive> set
 mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
 hive> select count(*) from test;
 Total MapReduce jobs = 1
 Launching Job 1 out of 1
 Number of reduce tasks determined at compile time: 1
 In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=number
 In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=number
 In order to set a constant number of reducers:
   set mapred.reduce.tasks=number
 Starting Job = job_1374463239553_0003, Tracking URL =
 http://CH22:8088/proxy/application_1374463239553_0003/
 Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1374463239553_0003
 Hadoop job information for Stage-1: number of mappers: 1; number of
 reducers: 1
 2013-07-22 13:33:27,243 Stage-1 map = 0%,  reduce = 0%
 2013-07-22 13:33:45,403 Stage-1 map = 100%,  reduce = 0%
 Ended Job = job_1374463239553_0003 with errors
 Error during job, obtaining debugging information...
 Job Tracking URL: 
 http://CH22:8088/proxy/application_1374463239553_0003/
 Examining task ID: task_1374463239553_0003_m_00 (and more) from job
 job_1374463239553_0003
 Task with the most failures(4):
 -
 Task ID:
   task_1374463239553_0003_m_00
 URL:

 http://CH22:8088/taskdetails.jsp?jobid=job_1374463239553_0003&tipid=task_1374463239553_0003_m_00
 -
 Diagnostic Messages for this Task:
 Error: java.lang.RuntimeException: native-lzo library not available
 at
 com.hadoop.compression.lzo.LzoCodec.getCompressorType(LzoCodec.java:155)
 at
 org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:104)
 at
 org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:118)
 at org.apache.hadoop.mapred.IFile$Writer.init(IFile.java:115)
 at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1580)
 at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1457)
 at 

Re: Hive CLI

2013-07-09 Thread bejoy_ks
Hi Rahul,

The same shortcuts Ctrl+A and Ctrl+E work in the Hive shell for me (Hive 0.9).


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: rahul kavale kavale.ra...@gmail.com
Date: Tue, 9 Jul 2013 11:00:49 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Hive CLI

Hey there,
I have been using Hive (0.7) for a while now via the CLI and bash scripts.
But it's a pain to move the cursor in the CLI, i.e. once you enter a very long
query you can't go to the start of the query (like you do with
Ctrl+A/Ctrl+E in a terminal). Does anyone know how to do it?

Thanks & Regards,
Rahul



Re: Strange error in hive

2013-07-08 Thread bejoy_ks
Hi Jerome


Can you send the error log of the MapReduce task that failed? That should have 
some pointers which can help you troubleshoot the issue.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Jérôme Verdier verdier.jerom...@gmail.com
Date: Mon, 8 Jul 2013 11:25:34 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Strange error in hive

Hi everybody,

I faced a strange error in hive this morning.

The error message is this one :

FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.MapRedTask

after a quick search on Google, it appears that this is a Hive bug :

https://issues.apache.org/jira/browse/HIVE-4650

Is there a way to get past this error?

Thanks.

NB : my hive script is in the attachment.


-- 
*Jérôme VERDIER*
06.72.19.17.31
verdier.jerom...@gmail.com



Re: integration issue about hive and hbase

2013-07-08 Thread bejoy_ks
Hi

Can you try including the zookeeper quorum and port in your hive configuration 
as shown below

hive --auxpath .../hbase-handler.jar,.../hbase.jar,.../zookeeper.jar,.../guava.jar -hiveconf hbase.zookeeper.quorum=<zk server names separated by comma> -hiveconf hbase.zookeeper.property.clientPort=<your custom port>

Substitute the above command with actual values.
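
As an illustration only (the host names and port below are placeholders, and whether a
runtime SET is sufficient depends on your setup), the same properties can also be tried
from within the Hive CLI:

  SET hbase.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com;
  SET hbase.zookeeper.property.clientPort=2181;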

Also ensure that the zookeeper and hbase jars specified above are the ones used in your
hbase cluster, to avoid any version mismatches.
Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: ch huang justlo...@gmail.com
Date: Mon, 8 Jul 2013 16:40:59 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: integration issue about hive and hbase

I replaced the zookeeper jar; the error is different now:

hive> CREATE TABLE hbase_table_1(key int, value string)
 STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
 WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
 TBLPROPERTIES ("hbase.table.name" = "xyz");
FAILED: Error in metadata:
MetaException(message:org.apache.hadoop.hbase.ZooKeeperConnectionException:
HBase is able to connect to ZooKeeper but the connection closes
immediately. This could be a sign that the server has too many connections
(30 is the default). Consider inspecting your ZK server logs for that error
and then make sure you are reusing HBaseConfiguration as often as you can.
See HTable's javadoc for more information.
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.init(ZooKeeperWatcher.java:160)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:1265)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:526)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.init(HConnectionManager.java:516)
at
org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:173)
at
org.apache.hadoop.hbase.client.HBaseAdmin.init(HBaseAdmin.java:93)
at
org.apache.hadoop.hive.hbase.HBaseStorageHandler.getHBaseAdmin(HBaseStorageHandler.java:74)
at
org.apache.hadoop.hive.hbase.HBaseStorageHandler.preCreateTable(HBaseStorageHandler.java:158)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:344)
at
org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:470)
at
org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3176)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:213)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:131)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase
at
org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at
org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:843)
at
org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:930)
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.init(ZooKeeperWatcher.java:138)
... 24 more
)
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask


On Mon, Jul 8, 2013 at 2:52 PM, Cheng Su scarcer...@gmail.com wrote:

  Did your hbase cluster start up?

 The error message is more like that something wrong with the classpath.
 So maybe you'd better also check that.


 On Mon, Jul 8, 2013 at 1:54 PM, ch huang justlo...@gmail.com wrote:

 I get an error when trying to create a table on HBase using Hive. Can anyone help?

 hive> CREATE TABLE hive_hbasetable_demo(key int,value string)
  STORED BY 'ora.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
  TBLPROPERTIES ("hbase.table.name" = 

Re: Need help in Hive

2013-07-08 Thread bejoy_ks
Hi Maheedhar

As I understand it, you have a column with data in the format MM:SS in your input
data set.

AFAIK this format is not the standard java.sql.Timestamp format, and it
doesn't even have a date part. Hence you may not be able to use the Timestamp
data type here.

You can define it as a string and then develop your custom UDFs for any further 
processing.
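
A minimal sketch of that approach, with illustrative table and column names (they are
not from the original mail):

  -- store the value as a STRING and split it when needed
  CREATE TABLE activity (active_time STRING);
  SELECT CAST(split(active_time, ':')[0] AS INT) AS minutes,
         CAST(split(active_time, ':')[1] AS INT) AS seconds
  FROM activity;
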
Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Matouk IFTISSEN matouk.iftis...@ysance.com
Date: Mon, 8 Jul 2013 09:47:11 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Need help in Hive

Hello,
Try this approach in your Hive query:
1- transform your data (integer type) into a Unix timestamp,
then do this:
2- from_unixtime(your_date_timestamp, 'mm:ss') AS time

Hope this helps.



2013/7/8 Maheedhar Reddy maheedhar...@gmail.com

 Hi All,

 I have Hive 0.8.0 version installed in my single node Apache Hadoop
 cluster.

 I have a time column which is in the format *MM:SS* (Minutes:Seconds). I
 tried the date functions to get the value in MM:SS format, but it's not
 working out.

 Below is my column for your reference.

 *Active Time*
 *12:01*
 0:20
 2:18

 in the first record 12:01, 12 is the number of minutes and 01 is the
 seconds.

 So when I'm creating a table in Hive, I have to give a data type
 for this column Active Time.
 I have tried various date-type columns but none of them worked out
 for me. Please guide me.

 What function should I use, to get the time in *MM:SS* format?


 You only live once, but if you do it right, once is enough.


 Cheers!!

 Maheedhar Reddy K V


 http://about.me/maheedhar.kv/#





Re: When to use bucketed tables with/instead of partitioned tables

2013-06-17 Thread bejoy_ks
Hi Stephen 

In addition to join optimization, bucketing helps a lot with sampling as well. It
lets you choose the sample space (i.e. n buckets out of m).
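
A small sketch of what that looks like (the table name and bucket count are illustrative):

  -- bucketed table and a sample of 1 bucket out of 32
  CREATE TABLE users_bucketed (userid BIGINT, name STRING)
  CLUSTERED BY (userid) INTO 32 BUCKETS;
  SELECT * FROM users_bucketed TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid);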

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Stephen Boesch java...@gmail.com
Date: Sun, 16 Jun 2013 11:20:49 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: When to use bucketed tables with/instead of partitioned tables

I am accustomed to using partitioned tables to obtain separate directories
for data files in each partition.

When looking at the documentation for bucketed tables it seems they are
typically used in conjunction with distribute by/sort by and an appropriate
partitioning key - and thus provide ability to do map side joins.

An explanation of when to use bucketed tables by themselves (in lieu of
partitioned tables) as well as in conjunction with partitioned tables
would be appreciated.

thanks!

stephenb



Re: How to delete Specific date data using hive QL?

2013-06-04 Thread bejoy_ks
Adding my two cents: if you have an unpartitioned table and would like to partition it on
some specific columns of the source table, use a dynamic partition insert.
That would get the source data into separate partitions on a partitioned target
table.
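
A hedged sketch of a dynamic partition insert (table and column names are placeholders):

  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;
  INSERT OVERWRITE TABLE events_partitioned PARTITION (event_date)
  SELECT col1, col2, event_date FROM events_unpartitioned;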

http://kickstarthadoop.blogspot.com/2011/06/how-to-speed-up-your-hive-queries-in.html

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Hamza Asad hamza.asa...@gmail.com
Date: Tue, 4 Jun 2013 12:52:49 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: How to delete Specific date data using hive QL?

Thank you so much Nitin for your help.. :)


On Tue, Jun 4, 2013 at 12:18 PM, Nitin Pawar nitinpawar...@gmail.comwrote:

 1- Does partitioning improve performance?
 --Only if you make use of partitions in your queries (mostly in where
 clause to limit data to your query for a specific value of partitioned
 column)

 2- Do i have to create partition table new or i can create partition on
 existing table by renaming that date column and add partition column
 event_date (the actual column name) ?
 you can not create partitions on already existing data unless the data is
 in partitioned directories on hdfs.
 I would recommend create a new table with partitioned columns.
 load data from old table into partitioned table
 dump old table

 3- can i import data directly into partition table using sqoop command?
 you can import data directly into a partition.

 for exported data, you don't have to worry. it remains as it is


 On Tue, Jun 4, 2013 at 12:41 PM, Hamza Asad hamza.asa...@gmail.comwrote:

  No, I don't want to change my queries. I want my queries to work on the same
  table, and partitioning should not change its schema.
  And by schema I mean the schema on mysql (exported data).

 Few more things
 1- Does partitioning improve performance?
 2- Do i have to create partition table new or i can create partition on
 existing table by renaming that date column and add partition column
 event_date (the actual column name) ?
 3- can i import data directly into partition table using sqoop command?




 On Tue, Jun 4, 2013 at 11:40 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 partitioning of data in hive is more for the reasons on how you layout
 data in a well defined manner so that when you access your data , you
 request only for specific data by specifying the partition columns in where
 clause.

 to answer your question,
 do you have to change your queries? out of the box the queries should
 work as it is unless and until you are changing the table schema by
 removing/adding new columns.
 does the format change when you export data? if your select statement is
 not changing it will not change
 will table schema change? do you mean schema on hive or mysql ?


 On Tue, Jun 4, 2013 at 11:37 AM, Hamza Asad hamza.asa...@gmail.comwrote:

  That's far better :) ..
  Please tell me a few more things. Do I have to change my query if I
  create the table with a partition on date? Would the rest of the columns stay
  the same as they are? Also, if I export that partitioned table to mysql, would the schema of
  that table be the same as it was before partitioning?


 On Tue, Jun 4, 2013 at 12:09 AM, Stephen Sprague sprag...@gmail.comwrote:

 there is no delete semantic.

 you either partition on the data you want to drop and use drop
 partition (or drop table for the whole shebang) or you can do as Nitin
 suggests by selecting the inverse of the data you want to delete and store
 it back into the table itself.  Not ideal but maybe it could work for your
 situation.
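
  For illustration, rough sketches of both approaches (table name and the date value
  are placeholders):

    -- if the table is partitioned on event_date
    ALTER TABLE events DROP PARTITION (event_date='2013-05-01');
    -- if not partitioned: keep everything except the unwanted date
    INSERT OVERWRITE TABLE events
    SELECT * FROM events WHERE event_date <> '2013-05-01';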

 Now here's another idea.  This was just _recently_ discussed on this
 group as coincidence would have it.  if you were to have scanned just a
 little of the groups messages you would have seen that and could then have
 added to the discussion! :)


 On Mon, Jun 3, 2013 at 2:19 AM, Hamza Asad hamza.asa...@gmail.comwrote:

  Thanks for your response, Nitin. Does anybody else have a better solution?


 On Mon, Jun 3, 2013 at 1:27 PM, Nitin Pawar 
 nitinpawar...@gmail.comwrote:

  Hive does not give you record-level deletion as of now.

  So unless you have partitioned the table, the other option is to overwrite the
  table with the data you want to keep.
  Please wait for others to suggest more options; this one is just
  mine and can be costly too.


 On Mon, Jun 3, 2013 at 12:36 PM, Hamza Asad 
 hamza.asa...@gmail.comwrote:

 no, its not partitioned by date.


 On Mon, Jun 3, 2013 at 11:19 AM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 how is the data laid out?
 is it partitioned data by the date?


 On Mon, Jun 3, 2013 at 11:20 AM, Hamza Asad 
 hamza.asa...@gmail.com wrote:

 Dear all,
 How can i remove data of specific dates from HDFS
 using hive query language?

 --
 *Muhammad Hamza Asad*




 --
 Nitin Pawar




 --
 *Muhammad Hamza Asad*




 --
 Nitin Pawar




 --
 *Muhammad Hamza Asad*





 --
 *Muhammad Hamza Asad*




 --
 Nitin Pawar




 --
 *Muhammad Hamza Asad*




 --
 Nitin 

Re: how does hive find where is MR job tracker

2013-05-28 Thread bejoy_ks
Hive gets the JobTracker from the mapred-site.xml specified within your 
$HADOOP_HOME/conf.

Does the $HADOOP_HOME/conf/mapred-site.xml on the node that runs Hive have the
correct value for the jobtracker?
If not, changing it to the right one might resolve your issue.
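
If editing mapred-site.xml is not possible, a per-session override may be worth trying;
the host below is only a placeholder:

  SET mapred.job.tracker=jobtracker.example.com:8021;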

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Frank Luo j...@merkleinc.com
Date: Tue, 28 May 2013 16:49:01 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: how does hive find where is MR job tracker

I have a cloudera cluster, version 4.2.0.

In the hive configuration, I have MapReduce Service set to mapreduce1, 
which is my MR service.

However, without setting mapred.job.tracker, whenever I run hive command, it 
always sends the job to a wrong job tracker. Here is the error:


java.net.ConnectException: Call From hqhd01ed01.pclc0.merkle.local/10.129.2.52 
to hqhd01ed01.pclc0.merkle.local:8021 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

And the Cloudera Manager doesn't allow me to manually set mapred.job.tracker. 
So my question is how to make Hive point to the right job tracker without 
setting mapred.job.tracker every time.

PS. Not sure it matters, but I did move the job tracker from machine A to 
machine B.

Thx!



Re: Hive on Oracle

2013-05-17 Thread bejoy_ks
Hi Raj

Which jar you need depends on what version of Oracle you are using. The jar version
corresponding to each Oracle release is listed in the Oracle documentation
online.

JDBC Jars should be available from the oracle website for free download.


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Raj Hadoop hadoop...@yahoo.com
Date: Fri, 17 May 2013 20:43:46 
To: bejoy...@yahoo.com; user@hive.apache.org; User u...@hadoop.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Hive on Oracle


Thanks for the reply.

Can you specify which jar file needs to be used? Where can I get
the jar file? Does Oracle provide one for free? Let me know please.

Thanks,
Raj






 From: bejoy...@yahoo.com bejoy...@yahoo.com
To: user@hive.apache.org; Raj Hadoop hadoop...@yahoo.com; User 
u...@hadoop.apache.org 
Sent: Friday, May 17, 2013 11:42 PM
Subject: Re: Hive on Oracle
 


Hi

The procedure is same as setting up mysql metastore. You need to use the jdbc 
driver/jar corresponding to the oracle version/release you are intending to use.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos


From:  Raj Hadoop hadoop...@yahoo.com 
Date: Fri, 17 May 2013 17:10:07 -0700 (PDT)
To: Hive user@hive.apache.org; User u...@hadoop.apache.org
ReplyTo:  user@hive.apache.org 
Subject: Hive on Oracle

Hi,

I am planning to install Hive and want to set up the metastore on Oracle. What is
the procedure? Which (JDBC) driver do I need to use?


Thanks,
Raj


Re: Getting Slow Query Performance!

2013-03-12 Thread bejoy_ks
Hi

Since you are on a pseudo distributed/ single node environment the hadoop 
mapreduce parallelism is limited.

You might be having just a few map slots and map tasks might be in queue 
waiting for others to complete. In a larger cluster your job should be faster.

As a side note, certain SQL queries that utilize indexing will be faster in an RDBMS
than in Hive.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Gobinda Paul gobi...@live.com
Date: Tue, 12 Mar 2013 15:09:31 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Getting Slow Query Performance!






I used Sqoop to import 30 GB of data (two tables: employee, approx. 21 GB, and
salary, approx. 9 GB) into Hadoop (single node) via Hive.
I ran a sample query like: SELECT
EMPLOYEE.ID, EMPLOYEE.NAME, EMPLOYEE.DEPT, SALARY.AMOUNT FROM EMPLOYEE JOIN SALARY
WHERE EMPLOYEE.ID=SALARY.EMPLOYEE_ID AND SALARY.AMOUNT > 90;
In Hive it takes 15 min (approx.) whereas MySQL takes 4.5 min (approx.) to
execute that query.
CPU: Pentium(R) Dual-Core CPU E5700 @ 3.00GHz, RAM: 2GB, HDD: 500GB

Here IS My hive-site.xml conf.

<?xml version="1.0"?><?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <property>
    <name>hive.hwi.listen.host</name>
    <value>0.0.0.0</value>
    <description>This is the host address the Hive Web Interface will listen on</description>
  </property>
  <property>
    <name>hive.hwi.listen.port</name>
    <value></value>
    <description>This is the port the Hive Web Interface will listen on</description>
  </property>
  <property>
    <name>hive.hwi.war.file</name>
    <value>/lib/hive-hwi-0.9.0.war</value>
    <description>This is the WAR file with the jsp content for Hive Web Interface</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
    <description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.</description>
  </property>
  <property>
    <name>hive.exec.reducers.bytes.per.reducer</name>
    <value>10</value>
    <description>size per reducer. The default is 1G, i.e if the input size is 10G, it will use 10 reducers.</description>
  </property>
  <property>
    <name>hive.exec.reducers.max</name>
    <value>999</value>
    <description>max number of reducers will be used. If the one specified in the configuration parameter mapred.reduce.tasks is negative, hive will use this one as the max number of reducers when automatically determine number of reducers.</description>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive-${user.name}</value>
    <description>Scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>
</configuration>

Any IDEA ?? 
  




Re: hive issue with sub-directories

2013-03-11 Thread bejoy_ks
Hi Suresh

AFAIK, as of now a partition cannot contain subdirectories; it can contain only
files.

You may have to move the subdirectories out of the parent dir 'a' and create separate
partitions for those.
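
A sketch of the resulting layout, following the example paths in this thread (the second
location is hypothetical, after moving /test/a/b out to /test/b):

  ALTER TABLE mytable ADD PARTITION (pt=1) LOCATION '/test/a/';
  ALTER TABLE mytable ADD PARTITION (pt=2) LOCATION '/test/b/';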

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Suresh Krishnappa suresh.krishna...@gmail.com
Date: Mon, 11 Mar 2013 10:58:05 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: hive issue with sub-directories

Hi Mark,
I am using external table in HIVE.

This is how I am adding the partition

 alter table mytable add partition (pt=1) location '/test/a/';

I am able to run HIVE queries only if '/test/a/b' folder is deleted.

How can I retain this folder structure and still issue queries?

Thanks
Suresh

On Sun, Mar 10, 2013 at 12:48 AM, Mark Grover
grover.markgro...@gmail.comwrote:

 Suresh,
  By default, the partition column name has to appear in the HDFS
  directory structure.
 
  e.g.
  /user/hive/warehouse/<table name>/<partition col name>=<partition col value>/data1.txt
  /user/hive/warehouse/<table name>/<partition col name>=<partition col value>/data2.txt


 On Thu, Mar 7, 2013 at 7:20 AM, Suresh Krishnappa
 suresh.krishna...@gmail.com wrote:
  Hi All,
  I have the following directory structure in hdfs
 
  /test/a/
  /test/a/1.avro
  /test/a/2.avro
  /test/a/b/
  /test/a/b/3.avro
 
  I created an external HIVE table using Avro Serde and added /test/a as a
  partition to this table.
 
  I am not able to run a select query. Always getting the error 'not a
 file'
  on '/test/a/b'
 
  Is this by design, a bug or am I missing some configuration?
  I am using HIVE 0.10
 
  Thanks
  Suresh
 




Re: java.lang.NoClassDefFoundError: com/jayway/jsonpath/PathUtil

2013-03-10 Thread bejoy_ks
Hi Sai

Local mode is just for trials; for any pre-prod/production environment you need
MR jobs.

Under the hood, Hive stores data in HDFS (mostly), and we definitely use
Hadoop/Hive for larger data volumes, so MR should be in there to process them.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Ramki Palle ramki.pa...@gmail.com
Date: Sun, 10 Mar 2013 06:58:57 
To: user@hive.apache.org; Sai Sai saigr...@yahoo.in
Reply-To: user@hive.apache.org
Subject: Re: java.lang.NoClassDefFoundError: com/jayway/jsonpath/PathUtil

Well, you get the results faster.

Please check this:

https://cwiki.apache.org/Hive/gettingstarted.html#GettingStarted-Runtimeconfiguration

Under section   Hive, Map-Reduce and Local-Mode, it says

This can be very useful to run queries over small data sets - in such cases
local mode execution is usually significantly faster than submitting jobs
to a large cluster.

-Ramki.






On Sun, Mar 10, 2013 at 5:26 AM, Sai Sai saigr...@yahoo.in wrote:

  Ramki/John
  Many thanks, that really helped. I have run the add jars in the new
  session and it appears to be running. However I was wondering about
  bypassing MR: why would we do it and what is the use of it? Will appreciate
  any input.
 Thanks
 Sai


   --
 *From:* Ramki Palle ramki.pa...@gmail.com

 *To:* user@hive.apache.org; Sai Sai saigr...@yahoo.in
 *Sent:* Sunday, 10 March 2013 4:22 AM
 *Subject:* Re: java.lang.NoClassDefFoundError:
 com/jayway/jsonpath/PathUtil

 When you execute the following query,

 hive select * from twitter limit 5;

 Hive runs it in local mode and not use MapReduce.

 For the query,

 hive select tweet_id from twitter limit 5;

 I think you need to add JSON jars to overcome this error. You might have
 added these in a previous session. If you want these jars available for all
 sessions, insert the add jar statements to your $HOME/.hiverc file.
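
  Illustrative contents for such a $HOME/.hiverc (the jar name and path are placeholders
  for the JSON-related jars mentioned in this thread):

    ADD JAR /path/to/json-path.jar;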


 To bypass MapReduce

 set hive.exec.mode.local.auto = true;

 to suggest Hive to use local mode to execute the query. If it still uses
 MR, try

 set hive.fetch.task.conversion = more;.


 -Ramki.



 On Sun, Mar 10, 2013 at 12:19 AM, Sai Sai saigr...@yahoo.in wrote:

 Just wondering if anyone has any suggestions:

 This executes successfully:

 hive select * from twitter limit 5;

 This does not work:

 hive select tweet_id from twitter limit 5; // I have given the exception
 info below:

 Here is the output of this:

 hive select * from twitter limit 5;
 OK

 tweet_idcreated_attextuser_iduser_screen_nameuser_lang
 122106088022745088Fri Oct 07 00:28:54 + 2011wkwkw -_- ayo saja
 mba RT @yullyunet: Sepupuuu, kita lanjalan yok.. Kita karokoe-an.. Ajak mas
 galih jg kalo dia mau.. @Dindnf: doremifas124735434Dindnfen
 122106088018558976Fri Oct 07 00:28:54 + 2011@egg486 특별히
 준비했습니다!252828803CocaCola_Koreako
 122106088026939392Fri Oct 07 00:28:54 + 2011My offer of free
 gobbies for all if @amityaffliction play Blair snitch project still
 stands.168590073SarahYoungBlooden
 122106088035328001Fri Oct 07 00:28:54 + 2011the girl nxt to me
 in the lib got her headphones in dancing and singing loud af like she the
 only one here haha267296295MONEYyDREAMS_en
 122106088005971968Fri Oct 07 00:28:54 + 2011@KUnYoong_B2UTY
 Bị lsao đấy269182160b2st_b2utyhpen
 Time taken: 0.154 seconds

 This does not work:

 hive select tweet_id from twitter limit 5;


 Total MapReduce jobs = 1
 Launching Job 1 out of 1
 Number of reduce tasks is set to 0 since there's no reduce operator
 Starting Job = job_201303050432_0094, Tracking URL =
 http://ubuntu:50030/jobdetails.jsp?jobid=job_201303050432_0094
 Kill Command = /home/satish/work/hadoop-1.0.4/libexec/../bin/hadoop job
 -kill job_201303050432_0094
 Hadoop job information for Stage-1: number of mappers: 1; number of
 reducers: 0
 2013-03-10 00:14:44,509 Stage-1 map = 0%,  reduce = 0%
 2013-03-10 00:15:14,613 Stage-1 map = 100%,  reduce = 100%
 Ended Job = job_201303050432_0094 with errors
 Error during job, obtaining debugging information...
 Job Tracking URL:
 http://ubuntu:50030/jobdetails.jsp?jobid=job_201303050432_0094
 Examining task ID: task_201303050432_0094_m_02 (and more) from job
 job_201303050432_0094

 Task with the most failures(4):
 -
 Task ID:
   task_201303050432_0094_m_00

 URL:

 http://ubuntu:50030/taskdetails.jsp?jobid=job_201303050432_0094&tipid=task_201303050432_0094_m_00
 -
 Diagnostic Messages for this Task:
 java.lang.RuntimeException: Error in configuring object
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
 at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
 at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
 at 

Re: Accessing sub column in hive

2013-03-08 Thread bejoy_ks
Hi Sai


You can do it as
Select address.country from employees;
 

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Bennie Schut bsc...@ebuddy.com
Date: Fri, 8 Mar 2013 09:09:49 
To: user@hive.apache.org; 'Sai Sai' saigr...@yahoo.in
Reply-To: user@hive.apache.org
Subject: RE: Accessing sub column in hive

Perhaps worth posting the error. Some might know what the error means.

Also a bit unrelated to hive but please do yourself a favor and don't use float 
to store monetary values like salary. You will get rounding issues at some 
point in time when you do arithmetic on them. Considering you are using hadoop 
you probably have a lot of data so adding it all up will get you there really 
really fast. 
http://stackoverflow.com/questions/3730019/why-not-use-double-or-float-to-represent-currency


From: Sai Sai [mailto:saigr...@yahoo.in]
Sent: Thursday, March 07, 2013 12:54 PM
To: user@hive.apache.org
Subject: Re: Accessing sub column in hive

I have a table created like this successfully:

CREATE TABLE IF NOT EXISTS employees (name STRING, salary FLOAT, subordinates
ARRAY<STRING>, deductions MAP<STRING,FLOAT>, address STRUCT<street:STRING,
city:STRING, state:STRING, zip:INT, country:STRING>)

I would like to access/display country column from my address struct.
I have tried this:

select address[country] from employees;

I get an error.

Please help.

Thanks
Sai



Re: Finding maximum across a row

2013-03-01 Thread bejoy_ks
Hi Sachin

You can get the detailed steps from the Hive wiki itself:

https://cwiki.apache.org/Hive/hiveplugins.html
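
As an aside, a hedged alternative that avoids writing a UDF, using a nested CASE
expression (the column and table names follow the example later in this thread):

  SELECT CASE WHEN colA >= colB AND colA >= colC THEN colA
              WHEN colB >= colC THEN colB
              ELSE colC END AS row_max
  FROM my_table;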

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Sachin Sudarshana sachin.sudarsh...@gmail.com
Date: Fri, 1 Mar 2013 22:37:54 
To: user@hive.apache.org; bejoy...@yahoo.com
Reply-To: user@hive.apache.org
Subject: Re: Finding maximum across a row

Hi Bejoy,

I am new to UDFs in Hive. Could you send me any links/tutorials where I
can learn about writing a UDF?

Thanks!

On Fri, Mar 1, 2013 at 10:22 PM, bejoy...@yahoo.com wrote:

 **
 Hi Sachin

 AFAIK There isn't one at the moment. But you can easily achieve this using
 a custom UDF.
 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 *From: * Sachin Sudarshana sachin.sudarsh...@gmail.com
 *Date: *Fri, 1 Mar 2013 22:16:37 +0530
 *To: *user@hive.apache.org
 *ReplyTo: * user@hive.apache.org
 *Subject: *Finding maximum across a row

 Hi,

 Is there any function/method to find the maximum across a row in hive?

 Suppose i have a table like this:

 ColA   ColB   ColC
 2  5  7
 3  2  1

 I want the function to return

 7
 1


 Its urgently required. Any help would be greatly appreciated!



 --
 Thanks and Regards,
 Sachin Sudarshana




-- 
Thanks and Regards,
Sachin Sudarshana



Re: Hive queries

2013-02-25 Thread bejoy_ks
Hi Cyril

I believe you are using the Derby metastore, so it should be an issue
with the hive configs.

Derby tries to create a metastore in the current dir from which you start
hive. The tables exported by Sqoop would be inside HIVE_HOME, and hence
you are not able to see the tables when you start the hive CLI from other
locations.

To have a universal metastore db, configure a specific dir in
javax.jdo.option.ConnectionURL in hive-site.xml. In your connection URL, configure
the db name as databaseName=/home/hive/metastore_db

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Cyril Bogus cyrilbo...@gmail.com
Date: Mon, 25 Feb 2013 10:34:29 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Hive queries

I do not get any errors.
It is only when I run hive and try to query the tables I imported. Let's
say I want to only get numeric tuples for a given table. I cannot find the
table (show tables; is empty) unless I go in the hive home folder and run
hive again. I would expect the state of hive to be the same everywhere I
call it.
But so far it is not the case.


On Mon, Feb 25, 2013 at 10:22 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 any errors you see ?


 On Mon, Feb 25, 2013 at 8:48 PM, Cyril Bogus cyrilbo...@gmail.com wrote:

 Hi everyone,

 My setup is Hadoop 1.0.4, Hive 0.9.0, Sqoop 1.4.2-hadoop 1.0.0
 Mahout 0.7

 I have imported tables from a remote database directly into Hive using
 Sqoop.

 Somehow when I try to run Sqoop from Hadoop, the content

 Hive is giving me trouble in bookkeeping of where the imported tables are
 located. I have a Single Node setup.

 Thank you for any answer and you can ask question if I was not specific
 enough about my issue.

 Cyril




 --
 Nitin Pawar




Re: Security for Hive

2013-02-23 Thread bejoy_ks
Hi Austin

AFAIK at the moment you can control permissions gracefully only at the data level,
not at the metadata level, i.e. you can play with the HDFS permissions.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Austin Chungath austi...@gmail.com
Date: Fri, 22 Feb 2013 23:11:51 
To: bejoy...@yahoo.com; user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: RE: Security for Hive

  So that means any user can revoke or give permissions to any user for any
table in the metastore?

Sent from my Phone, please ignore typos
 --
From: bejoy...@yahoo.com
Sent: 22-02-2013 11:30 PM
To: user@hive.apache.org
Subject: Re: Security for Hive

Hi Sachin

Currently there is no such admin user concept in hive.
Regards
Bejoy KS

Sent from remote device, Please excuse typos
--
*From: * Sachin Sudarshana sachin.sudarsh...@gmail.com
*Date: *Fri, 22 Feb 2013 16:40:49 +0530
*To: *user@hive.apache.org
*ReplyTo: * user@hive.apache.org
*Subject: *Re: Security for Hive

Hi,
I have read about roles, user privileges, group privileges etc.
But these roles can be created by any user for any database/table. I would
like to know if there is a specific 'administrator' for hive who can log on
with his credentials and is the only one entitled to create roles, grant
privileges etc.

Thank you.

On Fri, Feb 22, 2013 at 4:19 PM, Jagat Singh jagatsi...@gmail.com wrote:

 You might want to read this

 https://cwiki.apache.org/Hive/languagemanual-auth.html




 On Fri, Feb 22, 2013 at 9:44 PM, Sachin Sudarshana 
 sachin.sudarsh...@gmail.com wrote:

 Hi,

 I have just started learning about hive.
 I have configured Hive to use mysql as the metastore instead of derby.
 If I wish to use GRANT and REVOKE commands, i can use it with any user. A
 user can issue GRANT or REVOKE commands to any other users' table since
 both the users' tables are present in the same warehouse.

 Isn't there a concept of superuser/admin in hive who alone has the
 authority to issue these commands ?

 Any answer is greatly appreciated!

 --
 Thanks and Regards,
 Sachin Sudarshana





-- 
Thanks and Regards,
Sachin Sudarshana



Re: Security for Hive

2013-02-22 Thread bejoy_ks
Hi Sachin

Currently there is no such admin user concept in hive.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Sachin Sudarshana sachin.sudarsh...@gmail.com
Date: Fri, 22 Feb 2013 16:40:49 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Security for Hive

Hi,
I have read about roles, user privileges, group privileges etc.
But these roles can be created by any user for any database/table. I would
like to know if there is a specific 'administrator' for hive who can log on
with his credentials and is the only one entitled to create roles, grant
privileges etc.

Thank you.

On Fri, Feb 22, 2013 at 4:19 PM, Jagat Singh jagatsi...@gmail.com wrote:

 You might want to read this

 https://cwiki.apache.org/Hive/languagemanual-auth.html




 On Fri, Feb 22, 2013 at 9:44 PM, Sachin Sudarshana 
 sachin.sudarsh...@gmail.com wrote:

 Hi,

 I have just started learning about hive.
 I have configured Hive to use mysql as the metastore instead of derby.
 If I wish to use GRANT and REVOKE commands, i can use it with any user. A
 user can issue GRANT or REVOKE commands to any other users' table since
 both the users' tables are present in the same warehouse.

 Isn't there a concept of superuser/admin in hive who alone has the
 authority to issue these commands ?

 Any answer is greatly appreciated!

 --
 Thanks and Regards,
 Sachin Sudarshana





-- 
Thanks and Regards,
Sachin Sudarshana



Re: Running Hive on multi node

2013-02-21 Thread bejoy_ks
Hi

Hive uses the hadoop installation specified in HADOOP_HOME. If your hadoop home 
is configured for fully distributed operation it'll utilize the cluster itself.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Hamza Asad hamza.asa...@gmail.com
Date: Thu, 21 Feb 2013 14:26:40 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Running Hive on multi node

Does Hive automatically run on multiple nodes since I configured Hadoop on multiple
nodes, or do I have to configure it explicitly?

-- 
*Muhammad Hamza Asad*



Re: Adding comment to a table for columns

2013-02-21 Thread bejoy_ks
Hi Gupta

Try out

DESCRIBE EXTENDED FORMATTED <table-name>

I vaguely recall an operation like this.
Please check the Hive wiki for the exact syntax.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Chunky Gupta chunky.gu...@vizury.com
Date: Thu, 21 Feb 2013 17:15:37 
To: user@hive.apache.org; bejoy...@yahoo.com; 
snehalata_bhas...@syntelinc.com
Reply-To: user@hive.apache.org
Subject: Re: Adding comment to a table for columns

Hi Bejoy, Bhaskar

I tried using FORMATTED, but it does not give me the comments which I put in
while creating the table. Its output is like this:

col_name    data_type    comment
c           string       from deserializer
time        string       from deserializer

Thanks,
Chunky.

On Thu, Feb 21, 2013 at 4:50 PM, bejoy...@yahoo.com wrote:

 **
 Hi Gupta

  You can get the describe output in a formatted way using

  DESCRIBE FORMATTED <table name>;
 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 *From: * Chunky Gupta chunky.gu...@vizury.com
 *Date: *Thu, 21 Feb 2013 16:46:30 +0530
 *To: *user@hive.apache.org
 *ReplyTo: * user@hive.apache.org
 *Subject: *Adding comment to a table for columns

 Hi,

 I am using this syntax to add comments for all columns :-

 CREATE EXTERNAL TABLE test ( c STRING COMMENT 'Common  class', time STRING
 COMMENT 'Common  time', url STRING COMMENT 'Site URL' ) PARTITIONED BY (dt
 STRING ) LOCATION 's3://BucketName/'

 Output of Describe Extended table is like :- (Output is just an example
 copied from internet)

 hive DESCRIBE EXTENDED table_name;

 Detailed Table Information Table(tableName:table_name,
 dbName:benchmarking, owner:root, createTime:1309480053, lastAccessTime:0,
 retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:session_key,
 type:string, comment:null), FieldSchema(name:remote_address, type:string,
 comment:null), FieldSchema(name:canister_lssn, type:string, comment:null),
 FieldSchema(name:canister_session_id, type:bigint, comment:null),
 FieldSchema(name:tltsid, type:string, comment:null),
 FieldSchema(name:tltuid, type:string, comment:null),
 FieldSchema(name:tltvid, type:string, comment:null),
 FieldSchema(name:canister_server, type:string, comment:null),
 FieldSchema(name:session_timestamp, type:string, comment:null),
 FieldSchema(name:session_duration, type:string, comment:null),
 FieldSchema(name:hit_count, type:bigint, comment:null),
 FieldSchema(name:http_user_agent, type:string, comment:null),
 FieldSchema(name:extractid, type:bigint, comment:null),
 FieldSchema(name:site_link, type:string, comment:null),
 FieldSchema(name:dt, type:string, comment:null), FieldSchema(name:hour,
 type:int, comment:null)],
 location:hdfs://hadoop2/user/hive/warehouse/benchmarking.db/table_name,
 inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
 outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
 compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
 serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe)

  Is there any way of getting these detailed comments and column names in a
  readable format, just like the output of DESCRIBE table_name?


 Thanks,

 Chunky.




Re: bucketing on a column with millions of unique IDs

2013-02-20 Thread bejoy_ks
Hi Li

The major consideration is the size of each bucket. One
bucket corresponds to a file in HDFS, and you should ensure that every bucket is
at least a block in size, or in the worst case that at least the majority of the
buckets are.

So you should derive the bucket count from the data size rather than from the number of
rows/records.
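
Purely illustrative arithmetic (the sizes are assumptions, not from this thread): if the
table were about 64 GB with a 128 MB HDFS block size, 64*1024/128 = 512 files, so
something on the order of 512 buckets would keep each bucket near a block:

  CREATE TABLE user_activity (userid BIGINT, metric DOUBLE)
  CLUSTERED BY (userid) INTO 512 BUCKETS;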

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Echo Li echo...@gmail.com
Date: Wed, 20 Feb 2013 16:19:43 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: bucketing on a column with millions of unique IDs

Hi guys,

I plan to bucket a table by userid as I'm going to do intense calculations
using group by userid. There are about 110 million rows, with 7 million
unique userids, so my question is: what is a good number of buckets for this
scenario, and how do I determine the number of buckets?

Any input is apprecaited :)

Echo



Re: Map join optimization issue

2013-02-15 Thread bejoy_ks
Hi 

In later versions of hive you actually don't need a map join hint in your
query. Just the following suffices:

Set hive.auto.convert.join=true

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Mayuresh Kunjir mayuresh.kun...@gmail.com
Date: Fri, 15 Feb 2013 10:37:52 
To: useruser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Map join optimization issue

Thanks Aniket. I actually had not specified the map-join hint though. Sorry
for providing the wrong information earlier. I had only
set hive.auto.convert.join=true before firing my join query.

~Mayuresh



On Thu, Feb 14, 2013 at 10:44 PM, Aniket Mokashi aniket...@gmail.comwrote:

 I think hive.mapjoin.smalltable.filesize parameter will be disregarded in
 that case.


 On Thu, Feb 14, 2013 at 7:25 AM, Mayuresh Kunjir 
 mayuresh.kun...@gmail.com wrote:

 Yes, the hint was specified.
 On Feb 14, 2013 3:11 AM, Aniket Mokashi aniket...@gmail.com wrote:

 have you specified map-join hint in your query?


 On Thu, Feb 7, 2013 at 11:39 AM, Mayuresh Kunjir 
 mayuresh.kun...@gmail.com wrote:


 Hello all,


  I am trying to join two tables, the smaller being of size 4GB. When I
  set the hive.mapjoin.smalltable.filesize parameter above 500MB, Hive tries to
  perform a local task to read the smaller file. This of course fails since
  the file size is greater, and the backup common join is then run. What I do
  not understand is why Hive attempted a map join when the small file size was
  greater than the smalltable.filesize parameter.


 ~Mayuresh




 --
 ...:::Aniket:::... Quetzalco@tl




 --
 ...:::Aniket:::... Quetzalco@tl




Re: CREATE EXTERNAL TABLE Fails on Some Directories

2013-02-15 Thread bejoy_ks

Hi Joseph

There are differences in the following ls commands

cloudera@localhost data]$ hdfs dfs -ls /715

This would list out all the contents in /715 in hdfs, if it is a dir

Found 1 items
-rw-r--r--   1 cloudera supergroup    7853975 2013-02-14 17:03 /715

The output clearly shows it is a file, since 'd' is missing as the first character.

[cloudera@localhost data]$ hdfs dfs -ls 715

This lists the dir 715 under your user's HDFS home dir. If your user is cloudera,
your home dir is usually /user/cloudera/, so in effect the dir
listed is /user/cloudera/715


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Joseph D Antoni jdant...@yahoo.com
Date: Fri, 15 Feb 2013 08:55:50 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: CREATE EXTERNAL TABLE Fails on Some Directories

Not sure--I just truncated the file list from the ls--that was the first file 
(just obfuscated the name)

The command I used to create the directories was:

hdfs dfs -mkdir 715 
then 
hdfs dfs -put myfile.csv 715

[cloudera@localhost data]$ hdfs dfs -ls /715
Found 1 items
-rw-r--r--   1 cloudera supergroup    7853975 2013-02-14 17:03 /715
[cloudera@localhost data]$ hdfs dfs -ls 715
Found 13 items
-rw-r--r--   1 cloudera cloudera    7853975 2013-02-15 00:41 715/40-file.csv

Thanks






 From: Dean Wampler dean.wamp...@thinkbiganalytics.com
To: user@hive.apache.org; Joseph D Antoni jdant...@yahoo.com 
Sent: Friday, February 15, 2013 11:50 AM
Subject: Re: CREATE EXTERNAL TABLE Fails on Some Directories
 

Something's odd about this output; why is there no / in front of 715? I always 
get the full path when I run a -ls command. I would expect either:

/715/file.csv
or
/user/me/715/file.csv

Or is that what you meant by (didn't leave rest of ls results)?

dean


On Fri, Feb 15, 2013 at 10:45 AM, Joseph D Antoni jdant...@yahoo.com wrote:

[cloudera@localhost data]$ hdfs dfs -ls 715
Found 13 items
-rw-r--r--   1 cloudera cloudera    7853975 2013-02-15 00:41 715/file.csv 
(didn't leave rest of ls results)


Thanks on the directory--wasn't clear on that..

Joey









 From: Dean Wampler dean.wamp...@thinkbiganalytics.com
To: user@hive.apache.org; Joseph D Antoni jdant...@yahoo.com 
Sent: Friday, February 15, 2013 11:37 AM
Subject: Re: CREATE EXTERNAL TABLE Fails on Some Directories
 


You confirmed that 715 is an actual directory? It didn't become a file by 
accident?


By the way, you don't need to include the file name in the LOCATION. It will 
read all the files in the directory.
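
The corrected form implied by that advice would be something like the following (a sketch
based on the script quoted later in this thread):

  create external table 715_table_name
  (col1 string,
  col2 string)
  row format
  delimited fields terminated by ','
  lines terminated by '\n'
  stored as textfile
  location '/715/';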


dean


On Fri, Feb 15, 2013 at 10:29 AM, Joseph D Antoni jdant...@yahoo.com wrote:

I'm trying to create a series of external tables for a time series of data 
(using the prebuilt Cloudera VM).


The directory structure in HDFS is as such:


/711
/712
/713
/714
/715
/716
/717


Each directory contains the same set of files, from a different day. They 
were all put into HDFS using the following script:


for i in *;do hdfs dfs -put $i in $dir;done


They all show up with the same ownership/perms in HDFS.


Going into Hive to build the tables, I built a set of scripts to do the 
loads--then did a sed (changing 711 to 712,713, etc) to a file for each day. 
All of my loads work, EXCEPT for 715 and 716. 


Script is as follows:


create external table 715_table_name
(col1 string,
col2 string)
row format
delimited fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/715/file.csv';


This is failing with:


Error in Metadata MetaException(message:Got except: 
org.apache.hadoop.fs.FileAlreadyExistsException Parent Path is not a 
directory: /715 715...


Like I mentioned it works for all of the other directories, except 715 and 
716. Thoughts on troubleshooting path?


Thanks


Joey D'Antoni



-- 
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330






-- 
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330


Re: LOAD HDFS into Hive

2013-01-25 Thread bejoy_ks
Hi Venkataraman

You can just create an external table and give its LOCATION as the HDFS dir
where the data resides.

No need to perform an explicit LOAD operation here.
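
A sketch using the DDL and path from this thread, with the LOCATION pointing at the
existing directory so no LOAD (and no move of the data) is needed:

  CREATE EXTERNAL TABLE Tweets (FromUserId STRING, Text STRING,
    FromUserIdString STRING, FromUser STRING, Geo STRING, Id BIGINT,
    IsoLangCode STRING, ToUserId INT, ToUserIdString STRING, CreatedAt STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
  LOCATION '/twitter_sample';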


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: venkatramanan venkatraman...@smartek21.com
Date: Fri, 25 Jan 2013 18:30:29 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: LOAD HDFS into Hive

Hi,

I need to load the hdfs data into the Hive table.

For example,

I have the twitter data and it is updated daily using the streaming
API. These twitter responses are stored in an HDFS path named like
('TwitterData'). After that I try to load the data into Hive using
the 'LOAD DATA' statement. My problem is that the HDFS data is gone once I load
the data. Is there any way to load the data without losing the HDFS data?

To Create the Table using the below stmt;

CREATE EXTERNAL TABLE Tweets (FromUserId String, Text string, 
FromUserIdString String, FromUser String, Geo String, Id BIGINT, 
IsoLangCode string, ToUserId INT, ToUserIdString string, CreatedAt 
string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED 
BY '\n';

To LOAD the data using the below stmt;

LOAD DATA INPATH '/twitter_sample' INTO TABLE tweets;

thanks in advance

Thanks,
Venkat



Re: An explanation of LEFT OUTER JOIN and NULL values

2013-01-24 Thread bejoy_ks
Hi David

An explain extended would give you the exact pointer.

From my understanding, this is how it could work.

You have two tables, so two different sets of map tasks would be processing
them. Based on the join keys, the combination of corresponding columns is
chosen as the key from mapper1 and mapper2. So if the combination of columns has
the same value, those records from the two sets of mappers go into the same
reducer.

On the reducer, if there is a corresponding value for a key from table 1 in
table 2/mapper 2, that value is populated. If there is no value from mapper 2,
those columns from table 2 are made null.

If there is a key-value only from table 2/mapper 2 and no corresponding value
from mapper 1, that value is just discarded.


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: David Morel dmore...@gmail.com
Date: Thu, 24 Jan 2013 18:03:40 
To: user@hive.apache.orguser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: An explanation of LEFT OUTER JOIN and NULL values

Hi!

After hitting the curse of the last reducer many times on LEFT OUTER
JOIN queries, and trying to think about it, I came to the conclusion
there's something I am missing regarding how keys are handled in mapred
jobs.

The problem shows when I have table A containing billions of rows with
distinctive keys, that I need to join to table B that has a much lower
number of rows.

I need to keep all the A rows, populated with NULL values from the B
side, so that's what a LEFT OUTER is for.

Now, when transforming that into a mapred job, my -naive- understanding
would be that for every key on the A table, a missing key on the B table
would be generated with a NULL value. If that were the case, I fail to
understand why all NULL valued B keys would end up on the same reducer,
since the key defines which reducer is used, not the value.

So, obviously, this is not how it works.

So my question is: how is this construct handled?

Thanks a lot!

D.Morel


Re: An explanation of LEFT OUTER JOIN and NULL values

2013-01-24 Thread bejoy_ks
Hi David,

The default partitioner used in map reduce is the hash partitioner. So based on 
your keys they are sent to a particular reducer.

Maybe in your current data set, the keys that have no values in the other table are all 
falling into the same hash bucket and hence are being processed by the same reducer.

If you are noticing a skew on a particular reducer, sometimes a simple 
workaround like explicitly increasing the number of reducers might help you get past 
the hurdle.

Also please ensure you have enabled skew join optimization.
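
For illustration, a minimal sketch of the relevant session settings; the numeric 
values are only illustrative and should be tuned to your data:

set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;   -- rows per key beyond which a key is treated as skewed
set mapred.reduce.tasks=50;     -- explicitly raise the reducer count if one reducer is overloaded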
 
Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: David Morel dmore...@gmail.com
Date: Thu, 24 Jan 2013 18:39:56 
To: user@hive.apache.org; bejoy...@yahoo.com
Reply-To: user@hive.apache.org
Subject: Re: An explanation of LEFT OUTER JOIN and NULL values

On 24 Jan 2013, at 18:16, bejoy...@yahoo.com wrote:

 Hi David

 An explain extended would give you the exact pointer.

 From my understanding, this is how it could work.

 You have two tables then two different map reduce job would be
 processing those. Based on the join keys, combination of corresponding
 columns would be chosen as key from mapper1 and mapper2. So if the
 combination of columns having the same value those records from two
 set of mappers would go into the same reducer.

 On the reducer if there is a corresponding value for a key from table
 1 to  table 2/mapper 2 that value would be populated. If no val for
 mapper 2 then those columns from table 2 are made null.

 If there is a key-value just from table 2/mapper 2 and no
 corresponding value from mapper 1. That value is just discarded.

Hi Bejoy,

Thanks! So schematically, something like this, right?

mapper1 (bigger table):
K1-A, V1A
K2-A, V2A
K3-A, V3A

mapper2 (joined, smaller table):
K1-B, V1B

reducer1:
K1-A, V1A 
K1-B, V1B

returns:
K1, V1A, V1B etc

reducer2:
K2-A, V2A
*no* K2-B, V so: K2-B, NULL is created, same for next row.
K3-A, V3A

returns:
K2, V2A, NULL etc
K3, V3A, NULL etc

I still don't understand why my reducer2 (and only this one, which
apparently gets all the keys for which we don't have a row on table B)
would become overloaded. Am I completely misunderstanding the whole
thing?

David

 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos

 -Original Message-
 From: David Morel dmore...@gmail.com
 Date: Thu, 24 Jan 2013 18:03:40
 To: user@hive.apache.orguser@hive.apache.org
 Reply-To: user@hive.apache.org
 Subject: An explanation of LEFT OUTER JOIN and NULL values

 Hi!

 After hitting the curse of the last reducer many times on LEFT OUTER
 JOIN queries, and trying to think about it, I came to the conclusion
 there's something I am missing regarding how keys are handled in
 mapred jobs.

 The problem shows when I have table A containing billions of rows with
 distinctive keys, that I need to join to table B that has a much lower
 number of rows.

 I need to keep all the A rows, populated with NULL values from the B
 side, so that's what a LEFT OUTER is for.

 Now, when transforming that into a mapred job, my -naive-
 understanding would be that for every key on the A table, a missing
 key on the B table would be generated with a NULL value. If that were
 the case, I fail to understand why all NULL valued B keys would end up
 on the same reducer, since the key defines which reducer is used, not
 the value.

 So, obviously, this is not how it works.

 So my question is: how is this construct handled?

 Thanks a lot!

 D.Morel



Re: Mapping HBase table in Hive

2013-01-13 Thread bejoy_ks
Hi Ibrahim.

 SQOOP is used to import data from rdbms to hbase in your case. 

Please get the schema from hbase for your corresponding table and post it here.

We can point out how your mapping could be.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Ibrahim Yakti iya...@souq.com
Date: Sun, 13 Jan 2013 11:22:51 
To: useruser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Mapping HBase table in Hive

Thanks Bejoy,

what do you mean by:

 If you need to map a full CF to a hive column, the data type of the hive
 column should be a Map.


suppose I used sqoop to move data from mysql to hbase and used id as a
column family, all the other columns will be QF then, right?

The integration document is not clear, I think it needs more clarification
or maybe I am still missing something.

--
Ibrahim


On Tue, Jan 8, 2013 at 9:35 PM, bejoy...@yahoo.com wrote:

 data type of



Re: Map Reduce Local Task

2013-01-08 Thread bejoy_ks
Hi Santhosh

As long as the smaller table's size is in the range of a few MBs, it is a good 
candidate for a map join.

If the smaller table is still larger than that, you can take a look at bucketed 
map joins (a sketch follows below).
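
For illustration, a minimal sketch with made-up table names; both tables must be 
bucketed on the join key, and the bucket counts must be equal or multiples of each other:

set hive.enforce.bucketing=true;   -- while populating the bucketed tables
CREATE TABLE big_tbl   (id INT, val STRING) CLUSTERED BY (id) INTO 32 BUCKETS;
CREATE TABLE small_tbl (id INT, val STRING) CLUSTERED BY (id) INTO 8 BUCKETS;

set hive.optimize.bucketmapjoin=true;
SELECT /*+ MAPJOIN(small_tbl) */ b.id, b.val, s.val
FROM big_tbl b JOIN small_tbl s ON (b.id = s.id);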

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Santosh Achhra santoshach...@gmail.com
Date: Wed, 9 Jan 2013 00:11:37 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Map Reduce Local Task

Thank you Dean,

One of our tables is very small, it has only 16,000 rows, and the other big table
has 45 million plus records. Won't doing a local task help in this case?

Good wishes,always !
Santosh


On Tue, Jan 8, 2013 at 11:59 PM, Dean Wampler 
dean.wamp...@thinkbiganalytics.com wrote:

 more aggressive about trying to convert a join to a local task, where it
 bypasses the job tracker. When you're experimenting with queries on a small
 data set, it can make things much faster, but won't be useful for large
 data sets where you need the cluster.




Re: Mapping HBase table in Hive

2013-01-08 Thread bejoy_ks
Hi Ibrahim

The hive hbase integration totally depends on the hbase table schema and not 
the schema of the source table in mysql.

You need to provide the column family qualifier mapping in there.

Get the hbase table's schema from hbase shell.

suppose you have the schema as
Id
CF1.qualifier1
CF1.qualifier2
CF1.qualifier3

You need to match each of these ColumnFamily:Qualifier to corresponding columns 
in hive. 

So in hbase.columns.mapping you need to provide these CF:QL in order.

If you need to map a full CF to a hive column, the data type of the hive column 
should be a Map.
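
For illustration, a minimal sketch for the generic schema above; the hive table and 
column names are made up:

CREATE EXTERNAL TABLE hive_tbl (row_key STRING, col1 STRING, col2 STRING, col3 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,CF1:qualifier1,CF1:qualifier2,CF1:qualifier3")
TBLPROPERTIES ("hbase.table.name" = "hbase_tbl");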

You can get detailed hbase to hive integration document from hive wiki .


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Ibrahim Yakti iya...@souq.com
Date: Tue, 8 Jan 2013 15:45:32 
To: useruser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Mapping HBase table in Hive

Hello,

suppose I have the following table (orders) in MySQL:

*** 1. row ***
  Field: id
   Type: int(10) unsigned
   Null: NO
Key: PRI
Default: NULL
  Extra: auto_increment
*** 2. row ***
  Field: value
   Type: int(10) unsigned
   Null: NO
Key:
Default: NULL
  Extra:
*** 3. row ***
  Field: date_lastchange
   Type: timestamp
   Null: NO
Key:
Default: CURRENT_TIMESTAMP
  Extra: on update CURRENT_TIMESTAMP
*** 4. row ***
  Field: date_inserted
   Type: timestamp
   Null: NO
Key:
Default: -00-00 00:00:00

I imported it into HBase with column family id

I want to create an external table in Hive to query the HBase table, I am
not able to get the mapping parameters (*hbase.columns.mapping*), it is
confusing, if anybody can explain it to me please. I used the following
query:

CREATE EXTERNAL TABLE hbase_orders(id bigint, value bigint, date_lastchange
string, date_inserted string) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES
("hbase.columns.mapping" = "? ? ? ? ? ?") TBLPROPERTIES ("hbase.table.name"
= "orders");

Is there any way to build the Hive tables automatically or I should go with
the same process with each table?


Thanks in advance.

--
Ibrahim



Re: View with map join fails

2013-01-08 Thread bejoy_ks
Looks like there is a bug with mapjoin + view. Please check the Hive JIRA to see if 
there is an issue open against this; else file a new JIRA.

From my understanding, when you enable map join, the hive parser creates backup 
jobs. These backup jobs are executed only if the map join fails. In normal 
cases, when the map join succeeds, these jobs are filtered out and not executed.

'1116112419, job is filtered out (removed at runtime).'


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Santosh Achhra santoshach...@gmail.com
Date: Tue, 8 Jan 2013 17:11:18 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: View with map join fails

Hello,

I have created a view  as shown below.

*CREATE VIEW V1 AS*
*select /*+ MAPJOIN(t1) ,MAPJOIN(t2)  */ t1.f1, t1.f2, t1.f3, t1.f4, t2.f1,
t2.f2, t2.f3 from TABLE1 t1 join TABLE t2 on ( t1.f2= t2.f2 and t1.f3 =
t2.f3 and t1.f4 = t2.f4 ) group by t1.f1, t1.f2, t1.f3, t1.f4, t2.f1,
t2.f2, t2.f3*

The view gets created successfully, however when I execute the below mentioned SQL or
any SQL on the view I get a NullPointerException error

*hive select count (*) from V1;*
*FAILED: NullPointerException null*
*hive*

Is there anything wrong with the view creation ?

Next I created view without MAPJOIN hints

*CREATE VIEW V1 AS*
*select  t1.f1, t1.f2, t1.f3, t1.f4, t2.f1, t2.f2, t2.f3 from TABLE1 t1
join TABLE t2 on ( t1.f2= t2.f2 and t1.f3 = t2.f3 and t1.f4 = t2.f4 ) group
by t1.f1, t1.f2, t1.f3, t1.f4, t2.f1, t2.f2, t2.f3*

Before executing the select SQL I execute *set hive.auto.convert.join=true;*

I am getting the below mentioned warnings
java.lang.InstantiationException:
org.apache.hadoop.hive.ql.parse.ASTNodeOrigin
Continuing ...
java.lang.RuntimeException: failed to evaluate: unbound=Class.new();
Continuing ...


And I see from the log that a total of 5 mapreduce jobs are
started; however when I don't set auto.convert.join to true, I see only 3
mapreduce jobs getting invoked.
*Total MapReduce jobs = 5*
*Ended Job = 1116112419, job is filtered out (removed at runtime).*
*Ended Job = -33256989, job is filtered out (removed at runtime).*
*WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
files.*


Good wishes,always !
Santosh



Re: External table with partitions

2013-01-06 Thread bejoy_ks
Hi Oded

If you have created the directories manually, they become visible to the 
hive table only if the partitions/sub dirs are added to the metadata using
'ALTER TABLE ... ADD PARTITION'. 
Partitions are not picked up implicitly by the hive table even if you have a 
proper sub dir structure.

Similarly, if you don't need a particular partition on your table permanently, 
you can always delete it using the alter table command.

If you are intending to use a particular partition alone in your query, there is no need 
to alter the partitions. Just append a where clause to the query that restricts it 
to the required partitions (a sketch follows below).
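
For illustration, a minimal sketch with a made-up table name, using the dd=... 
directory layout from the question below:

ALTER TABLE my_ext_table ADD PARTITION (dd='2012-12-31') LOCATION '/data/dd=2012-12-31';
SELECT * FROM my_ext_table WHERE dd='2012-12-31';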

Hope this helps.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Oded Poncz o...@ubimo.com
Date: Sun, 6 Jan 2013 16:07:26 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: External table with partitions

Is it possible to instruct hive to get only specific files from a
partitioned external table?
For example I have the following directory structure

data/dd=2012-12-31/a1.txt
data/dd=2012-12-31/a2.txt
data/dd=2012-12-31/a3.txt
data/dd=2012-12-31/a4.txt

data/dd=2012-12-31/b1.txt
data/dd=2012-12-31/b2.txt
data/dd=2012-12-31/b2.txt

Is it possible to add 2012-12-31 as a partition and tell hive to load only
the a* files to the table?
Thanks,



Re: External table with partitions

2013-01-06 Thread bejoy_ks
Sorry, I didn't understand your query on first look through.

Like Jagat said, you may need to go with a temp table for this.

Do a hadoop fs -cp ../../a.* destn dir

Create an external table with location as 'destn dir'.

CREATE EXTERNAL TABLE <tmp table name> LIKE <src table name> LOCATION '<destn dir>';

NB: I just gave the syntax from memory. please check the syntax in hive user 
guide.
Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: bejoy...@yahoo.com
Date: Sun, 6 Jan 2013 14:39:45 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: External table with partitions

Hi Oded

If you have created the directories manually that would come visible to the 
hive table only if the partitions/ sub dirs are added to the meta data using
'ALTER TABLE ... ADD PARTITION' . 
Partitions are not retrieved implicitly into hive tabe even if you have a 
proper sub dir structure.

Similarly if you don't need a particular partition on your table permanently 
you can always delete them using the alter table command.

If you are intending to use a particular partition alone in your query no need 
to alter the partitions. Just append a where clause to the query that has scope 
only on the required partitions.

Hope this helps.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Oded Poncz o...@ubimo.com
Date: Sun, 6 Jan 2013 16:07:26 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: External table with partitions

Is it possible to instruct hive to get only specific files from a
partitioned external table?
For example I have the following directory structure

data/dd=2012-12-31/a1.txt
data/dd=2012-12-31/a2.txt
data/dd=2012-12-31/a3.txt
data/dd=2012-12-31/a4.txt

data/dd=2012-12-31/b1.txt
data/dd=2012-12-31/b2.txt
data/dd=2012-12-31/b2.txt

Is it possible to add 2012-12-31 as a partition and tell hive to load only
the a* files to the table?
Thanks,



Re: Map side join

2012-12-13 Thread bejoy_ks
Hi Souvik

Are your input files compressed using some non-splittable compression codec?

Do you have enough free slots while this job is running?

Make sure that the job is not running locally.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Souvik Banerjee souvikbaner...@gmail.com
Date: Wed, 12 Dec 2012 14:27:27 
To: user@hive.apache.org; bejoy...@yahoo.com
Reply-To: user@hive.apache.org
Subject: Re: Map side join

Hi Bejoy,

Yes I ran the pi example. It was fine.
Regarding the HIVE job, what I found is that it took 4 hrs for the first map
job to get completed.
Those map tasks were doing their job and only reported status after
completion. It is indeed taking too long to finish. I could find nothing
relevant in the logs.

Thanks and regards,
Souvik.

On Wed, Dec 12, 2012 at 8:04 AM, bejoy...@yahoo.com wrote:

 **
 Hi Souvik

 Apart from hive jobs is the normal mapreduce jobs like the wordcount
 running fine on your cluster?

 If it is working, for the hive jobs are you seeing anything skeptical in
 task, Tasktracker or jobtracker logs?


 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 *From: * Souvik Banerjee souvikbaner...@gmail.com
 *Date: *Tue, 11 Dec 2012 17:12:20 -0600
 *To: *user@hive.apache.org; bejoy...@yahoo.com
 *ReplyTo: * user@hive.apache.org
 *Subject: *Re: Map side join

 Hello Everybody,

 Need help in for on HIVE join. As we were talking about the Map side join
 I tried that.
 I set the flag set hive.auto.convert.join=true;

 I saw Hive converts the same to map join while launching the job. But the
 problem is that none of the map job progresses in my case. I made the
 dataset smaller. Now it's only 512 MB cross 25 MB. I was expecting it to be
 done very quickly.
 No luck with any change of settings.
 Failing to progress with the default setting changes these settings.
 set hive.mapred.local.mem=1024; // Initially it was 216 I guess
 set hive.join.cache.size=10; // Initialliu it was 25000

 Also on Hadoop side I made this changes

 mapred.child.java.opts -Xmx1073741824

 But I don't see any progress. After more than 40 minutes of run I am at 0%
 map completion state.
 Can you please throw some light on this?

 Thanks a lot once again.

 Regards,
 Souvik.



 On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee 
 souvikbaner...@gmail.comwrote:

 Hi Bejoy,

 That's wonderful. Thanks for your reply.
 What I was wondering if HIVE can do map side join with more than one
 condition on JOIN clause.
 I'll simply try it out and post the result.

 Thanks once again.

 Regards,
 Souvik.

  On Fri, Dec 7, 2012 at 2:10 PM, bejoy...@yahoo.com wrote:

 **
 Hi Souvik

 In earlier versions of hive you had to give the map join hint. But in
 later versions just set hive.auto.convert.join = true;
 Hive automatically selects the smaller table. It is better to give the
 smaller table as the first one in join.

 You can use a map join if you are joining a small table with a large
 one, in terms of data size. By small, better to have the smaller table size
 in range of MBs.
 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 *From: *Souvik Banerjee souvikbaner...@gmail.com
 *Date: *Fri, 7 Dec 2012 13:58:25 -0600
 *To: *user@hive.apache.org
 *ReplyTo: *user@hive.apache.org
 *Subject: *Map side join

 Hello everybody,

 I have got a question. I didn't came across any post which says
 somethign about this.
 I have got two tables. Lets say A and B.
 I want to join A  B in HIVE. I am currently using HIVE 0.9 version.
 The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
 B.id2) AND (A.id3 = B.id3)

 Can I ask HIVE to use map side join in this scenario? Should I give a
 hint to HIVE by saying /*+mapjoin(B)*/

 Get back to me if you want any more information in this regard.

 Thanks and regards,
 Souvik.







Re: Map side join

2012-12-13 Thread bejoy_ks
Hi Souvik

To have the new HDFS block size take effect on the already existing files, you 
need to re-copy them into HDFS.

To play with the number of mappers you can set a lower value, like 64 MB, for the min 
and max split size:

mapred.min.split.size and mapred.max.split.size (a sketch follows below)
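
For illustration, a minimal sketch; 64 MB expressed in bytes, and the value is only 
illustrative:

set mapred.min.split.size=67108864;
set mapred.max.split.size=67108864;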

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Souvik Banerjee souvikbaner...@gmail.com
Date: Thu, 13 Dec 2012 12:00:16 
To: user@hive.apache.org; bejoy...@yahoo.com
Subject: Re: Map side join

Hi Bejoy,

The input files are non-compressed text files.
There are enough free slots in the cluster.

Can you please let me know how I can increase the number of mappers?
I tried reducing the HDFS block size to 32 MB from 128 MB. I was expecting
to get more mappers, but it still launches the same number of mappers as it
did while the HDFS block size was 128 MB. I have enough map slots
available, but I am not able to utilize them.


Thanks and regards,
Souvik.


On Thu, Dec 13, 2012 at 11:12 AM, bejoy...@yahoo.com wrote:

 **
 Hi Souvik

 Is your input files compressed using some non splittable compression codec?

 Do you have enough free slots while this job is running?

 Make sure that the job is not running locally.

 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 *From: * Souvik Banerjee souvikbaner...@gmail.com
 *Date: *Wed, 12 Dec 2012 14:27:27 -0600
 *To: *user@hive.apache.org; bejoy...@yahoo.com
 *ReplyTo: * user@hive.apache.org
 *Subject: *Re: Map side join

 Hi Bejoy,

 Yes I ran the pi example. It was fine.
 Regarding the HIVE Job what I found is that it took 4 hrs for the first
 map job to get completed.
 Those map tasks were doing their job and only reported status after
 completion. It is indeed taking too long time to finish. Nothing I could
 find relevant in the logs.

 Thanks and regards,
 Souvik.

 On Wed, Dec 12, 2012 at 8:04 AM, bejoy...@yahoo.com wrote:

 **
 Hi Souvik

 Apart from hive jobs is the normal mapreduce jobs like the wordcount
 running fine on your cluster?

 If it is working, for the hive jobs are you seeing anything skeptical in
 task, Tasktracker or jobtracker logs?


 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 *From: * Souvik Banerjee souvikbaner...@gmail.com
 *Date: *Tue, 11 Dec 2012 17:12:20 -0600
 *To: *user@hive.apache.org; bejoy...@yahoo.com
 *ReplyTo: * user@hive.apache.org
 *Subject: *Re: Map side join

 Hello Everybody,

 Need help in for on HIVE join. As we were talking about the Map side join
 I tried that.
 I set the flag set hive.auto.convert.join=true;

 I saw Hive converts the same to map join while launching the job. But the
 problem is that none of the map job progresses in my case. I made the
 dataset smaller. Now it's only 512 MB cross 25 MB. I was expecting it to be
 done very quickly.
 No luck with any change of settings.
 Failing to progress with the default setting changes these settings.
 set hive.mapred.local.mem=1024; // Initially it was 216 I guess
 set hive.join.cache.size=10; // Initialliu it was 25000

 Also on Hadoop side I made this changes

 mapred.child.java.opts -Xmx1073741824

 But I don't see any progress. After more than 40 minutes of run I am at
 0% map completion state.
 Can you please throw some light on this?

 Thanks a lot once again.

 Regards,
 Souvik.



 On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee souvikbaner...@gmail.com
  wrote:

 Hi Bejoy,

 That's wonderful. Thanks for your reply.
 What I was wondering if HIVE can do map side join with more than one
 condition on JOIN clause.
 I'll simply try it out and post the result.

 Thanks once again.

 Regards,
 Souvik.

  On Fri, Dec 7, 2012 at 2:10 PM, bejoy...@yahoo.com wrote:

 **
 Hi Souvik

 In earlier versions of hive you had to give the map join hint. But in
 later versions just set hive.auto.convert.join = true;
 Hive automatically selects the smaller table. It is better to give the
 smaller table as the first one in join.

 You can use a map join if you are joining a small table with a large
 one, in terms of data size. By small, better to have the smaller table size
 in range of MBs.
 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 *From: *Souvik Banerjee souvikbaner...@gmail.com
 *Date: *Fri, 7 Dec 2012 13:58:25 -0600
 *To: *user@hive.apache.org
 *ReplyTo: *user@hive.apache.org
 *Subject: *Map side join

 Hello everybody,

 I have got a question. I didn't came across any post which says
 somethign about this.
 I have got two tables. Lets say A and B.
 I want to join A  B in HIVE. I am currently using HIVE 0.9 version.
 The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
 B.id2) AND (A.id3 = B.id3)

 Can I ask HIVE to use map side join in this scenario? Should I give a
 hint to HIVE by saying /*+mapjoin(B)*/

 Get back to me if you want any more information in this regard.

 Thanks and 

Re: Map side join

2012-12-12 Thread bejoy_ks
Hi Souvik

Apart from hive jobs is the normal mapreduce jobs like the wordcount running 
fine on your cluster?

If it is working, for the hive jobs are you seeing anything skeptical in task, 
Tasktracker or jobtracker logs?


Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Souvik Banerjee souvikbaner...@gmail.com
Date: Tue, 11 Dec 2012 17:12:20 
To: user@hive.apache.org; bejoy...@yahoo.com
Reply-To: user@hive.apache.org
Subject: Re: Map side join

Hello Everybody,

Need help on a HIVE join. As we were talking about the map side join, I
tried that.
I set the flag set hive.auto.convert.join=true;

I saw Hive converts the same to a map join while launching the job. But the
problem is that none of the map tasks progresses in my case. I made the
dataset smaller. Now it's only 512 MB cross 25 MB. I was expecting it to be
done very quickly.
No luck with any change of settings.
Failing to progress with the default settings, I changed these settings.
set hive.mapred.local.mem=1024; // Initially it was 216 I guess
set hive.join.cache.size=10; // Initially it was 25000

Also on Hadoop side I made this changes

mapred.child.java.opts -Xmx1073741824

But I don't see any progress. After more than 40 minutes of run I am at 0%
map completion state.
Can you please throw some light on this?

Thanks a lot once again.

Regards,
Souvik.



On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee souvikbaner...@gmail.comwrote:

 Hi Bejoy,

 That's wonderful. Thanks for your reply.
 What I was wondering if HIVE can do map side join with more than one
 condition on JOIN clause.
 I'll simply try it out and post the result.

 Thanks once again.

 Regards,
 Souvik.

  On Fri, Dec 7, 2012 at 2:10 PM, bejoy...@yahoo.com wrote:

 **
 Hi Souvik

 In earlier versions of hive you had to give the map join hint. But in
 later versions just set hive.auto.convert.join = true;
 Hive automatically selects the smaller table. It is better to give the
 smaller table as the first one in join.

 You can use a map join if you are joining a small table with a large one,
 in terms of data size. By small, better to have the smaller table size in
 range of MBs.
 Regards
 Bejoy KS

 Sent from remote device, Please excuse typos
 --
 *From: *Souvik Banerjee souvikbaner...@gmail.com
 *Date: *Fri, 7 Dec 2012 13:58:25 -0600
 *To: *user@hive.apache.org
 *ReplyTo: *user@hive.apache.org
 *Subject: *Map side join

 Hello everybody,

 I have got a question. I didn't came across any post which says somethign
 about this.
 I have got two tables. Lets say A and B.
 I want to join A  B in HIVE. I am currently using HIVE 0.9 version.
 The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
 B.id2) AND (A.id3 = B.id3)

 Can I ask HIVE to use map side join in this scenario? Should I give a
 hint to HIVE by saying /*+mapjoin(B)*/

 Get back to me if you want any more information in this regard.

 Thanks and regards,
 Souvik.






Re: Map side join

2012-12-07 Thread bejoy_ks
Hi Souvik

In earlier versions of hive you had to give the map join hint. But in later 
versions just set hive.auto.convert.join = true;
Hive automatically selects the smaller table. It is better to give the smaller 
table as the first  one in join.

You can use a map join if you are joining a small table with a large one, in 
terms of data size. By small, better to have the smaller table size in range of 
MBs.
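
For illustration, a minimal sketch using the join from the question below; with 
hive.auto.convert.join set, the explicit hint is optional in later versions:

set hive.auto.convert.join=true;
SELECT /*+ MAPJOIN(B) */ A.id1, A.id2, A.id3
FROM A JOIN B
ON (A.id1 = B.id1 AND A.id2 = B.id2 AND A.id3 = B.id3);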

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Souvik Banerjee souvikbaner...@gmail.com
Date: Fri, 7 Dec 2012 13:58:25 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Map side join

Hello everybody,

I have got a question. I didn't come across any post which says something
about this.
I have got two tables. Let's say A and B.
I want to join A and B in HIVE. I am currently using HIVE 0.9 version.
The join would be on few columns. like on (A.id1 = B.id1) AND (A.id2 =
B.id2) AND (A.id3 = B.id3)

Can I ask HIVE to use map side join in this scenario? Should I give a hint
to HIVE by saying /*+mapjoin(B)*/

Get back to me if you want any more information in this regard.

Thanks and regards,
Souvik.



Re: Doubt in INSERT query in Hive?

2012-02-15 Thread bejoy_ks
Hi Bhavesh
   INSERT INTO is supported in hive 0.8 . An upgrade would get you things 
rolling. 
LOAD DATA inefficient? What was the performance overhead you were facing here?

Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Bhavesh Shah bhavesh25s...@gmail.com
Date: Wed, 15 Feb 2012 14:33:29 
To: user@hive.apache.org; d...@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Doubt in INSERT query in Hive?

Hello,
Whenever we want to insert into table we use:
INSERT OVERWRITE TABLE TBL_NAME
(SELECT )
Due to this, table gets overwrites everytime.

I don't want to overwrite table, I want append it everytime.
I thought about LOAD TABLE , but writing the file may take more time and I
don't think so that it will efficient.

Does Hive Support INSERT INTO TABLE TAB_NAME?
(I am using hive-0.7.1)
Is there any patch for it? (But I don't know How to apply patch ?)

Pls suggest me as soon as possible.
Thanks.



-- 
Regards,
Bhavesh Shah



Re: Doubt in INSERT query in Hive?

2012-02-15 Thread bejoy_ks
Bhavesh
   In this case, if you are not using INSERT INTO, you may need some tmp 
table: write the query output to that, then load the data from there into your target 
table's data dir. 
You are not writing that to any file while doing the LOAD DATA operation. 
Rather you are just moving the files (in hdfs) from the source location to the 
table's data dir (where the previous data files are present). An hdfs move 
operation is just a metadata operation happening at the file system level. 

 Go with INSERT INTO as it is the cleaner way from an hql perspective (a sketch follows below).
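
For illustration, a minimal sketch with made-up table names; INSERT INTO requires 
hive 0.8 or later:

INSERT INTO TABLE target_tbl
SELECT col1, col2 FROM source_tbl;
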
Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Bhavesh Shah bhavesh25s...@gmail.com
Date: Wed, 15 Feb 2012 15:03:07 
To: user@hive.apache.org; bejoy...@yahoo.com
Reply-To: user@hive.apache.org
Subject: Re: Doubt in INSERT query in Hive?

Hi Bejoy K S,
Thanks for your reply.
The overhead is, in the select query I have about 85 columns. Writing this
to a file and loading it again may take some time.
For that reason I am thinking that it will be inefficient.



-- 
Regards,
Bhavesh Shah


On Wed, Feb 15, 2012 at 2:51 PM, bejoy...@yahoo.com wrote:

 **
 Hi Bhavesh
 INSERT INTO is supported in hive 0.8 . An upgrade would get you things
 rolling.
 LOAD DATA inefficient? What was the performance overhead you were facing
 here?
 Regards
 Bejoy K S

 From handheld, Please excuse typos.
 --
 *From: * Bhavesh Shah bhavesh25s...@gmail.com
 *Date: *Wed, 15 Feb 2012 14:33:29 +0530
 *To: *user@hive.apache.org; d...@hive.apache.org
 *ReplyTo: * user@hive.apache.org
 *Subject: *Doubt in INSERT query in Hive?

 Hello,
 Whenever we want to insert into table we use:
 INSERT OVERWRITE TABLE TBL_NAME
 (SELECT )
 Due to this, table gets overwrites everytime.

 I don't want to overwrite table, I want append it everytime.
 I thought about LOAD TABLE , but writing the file may take more time and I
 don't think so that it will efficient.

 Does Hive Support INSERT INTO TABLE TAB_NAME?
 (I am using hive-0.7.1)
 Is there any patch for it? (But I don't know How to apply patch ?)

 Pls suggest me as soon as possible.
 Thanks.



 --
 Regards,
 Bhavesh Shah





Re: parallel inserts ?

2012-02-15 Thread bejoy_ks
Hi John
   Yes, insert is parallel by default in hive. HiveQL gets transformed into 
mapreduce jobs and hence it is definitely parallel. The only case it is not 
parallel is when you have just 1 reducer. It just reads and processes the input 
files in parallel, using map reduce tasks, from the source table's data dir and 
writes the desired output files to the destination table's dir.  
 
Hive is just an abstraction over map reduce and can't be compared 
against a db in terms of features. Almost every data processing operation is 
just some map reduce jobs. 
Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: John B johnb4...@gmail.com
Date: Wed, 15 Feb 2012 10:59:09 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: parallel inserts ?

Other sql datbases typically can parallelize selects but are unable to
automatically parallelize inserts.

With the most recent stable hiveql will the following statement have
the --insert-- automatically parallelized ?

 INSERT OVERWRITE TABLE pv_gender
 SELECT pv_users.gender
 FROM pv_users


I understand there is now 'insert into ..select from' syntax. Is the
insert part of that statement automatically parallelized ?

What is the highest insert speed anybody has seen - and I am not
talking about imports I mean inserts from one table to another ?



Re: external partitioned table

2012-02-08 Thread bejoy_ks
Hi Koert
As you are creating dirs/sub dirs using mapreduce jobs outside of hive, hive 
is unaware of these sub dirs. There is no other way in such cases other than an 
add partition DDL to register the dir as a hive partition. 
If you are using oozie or shell to trigger your jobs, you can accomplish it as
- use java to come up with the correct add partition statement(s) and write those 
statement(s) into a file 
- execute the file using hive -f fileName (a sketch follows below)
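
For illustration, a minimal sketch of such a generated file (say add_partitions.hql; 
the file name, partition values and paths are made up, while the table and partition 
column come from the question below), which would then be run with hive -f add_partitions.hql:

ALTER TABLE sometable ADD PARTITION (partitionid='value1') LOCATION '/data/partitionid=value1';
ALTER TABLE sometable ADD PARTITION (partitionid='value2') LOCATION '/data/partitionid=value2';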

Hope it helps!..


Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Koert Kuipers ko...@tresata.com
Date: Wed, 8 Feb 2012 11:04:18 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: external partitioned table

hello all,

we have an external partitioned table in hive.

we add to this table by having map-reduce jobs (so not from hive) create
new subdirectories with the right format (partitionid=partitionvalue).

however hive doesn't pick them up automatically. we have to go into hive
shell and run alter table sometable add partition
(partitionid=partitionvalue). to make matter worse hive doesnt really lend
itself to running such an add-partition-operation from java (or for that
matter: hive doesn't lend itself to any easy programmatic manipulations...
grrr. but i will stop now before i go on a a rant).

any suggestions how to approach this? thanks!

best, koert



Re: Error when Creating an UDF

2012-02-06 Thread bejoy_ks
Hi
One of your jars is not available, and maybe that one has the required UDF or 
related methods.

Hive was not able to locate your first jar

'/scripts/hiveMd5.jar does not exist'

Just fix this with the correct location. Everything should work fine.
 
Regards
Bejoy K S

From handheld, Please excuse typos.

-Original Message-
From: Jean-Charles Thomas jctho...@autoscout24.com
Date: Mon, 6 Feb 2012 16:51:58 
To: user@hive.apache.orguser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Error when Creating an UDF

Hi everybody,

i am trying to create an UDF follwing the example in the Hive Wiki.
Everything is fine but the CREATE statement (see below) where an error occurs:

hive add jar /scripts/hiveMd5.jar;
/scripts/hiveMd5.jar does not exist
hive add jar /scripts/hive/udf/Md5.jar;
Added /scripts/hive/udf/Md5.jar to class path
Added resource: /scripts/hive/udf/Md5.jar
hive CREATE TEMPORARY FUNCTION mymd5 AS 'com.autoscout24.hive.udf.Md5';
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.FunctionTask
hive

in the Hive log, there is no much more:

2012-02-06 16:16:36,096 ERROR ql.Driver (SessionState.java:printError(343)) - 
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.FunctionTask

Any Help is welcome,

Thanks a lot for hlep,

Jean-Charles





Re: Important Question

2012-01-25 Thread bejoy_ks
Real time? Definitely not Hive. Go in for HBase, but don't expect HBase to be as 
flexible as an RDBMS. You need to choose your row key and column families 
wisely as per your requirements.
For data mining and analytics you can mount a Hive table over the corresponding 
HBase table and play on with SQL-like queries.



Regards
Bejoy K S

-Original Message-
From: Dalia Sobhy dalia.mohso...@hotmail.com
Date: Wed, 25 Jan 2012 17:01:08 
To: u...@hbase.apache.org; user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Important Question


Dear all,
I am developing an API for medical use i.e Hospital admissions and all about 
patients, thus transactions and queries and realtime data is important here...
Therefore both real-time and analytical processing is a must..
Therefore which best suits my application Hbase or Hive or another method ??
Please reply quickly bec this is critical thxxx a million ;)
  


Re: Question on bucketed map join

2012-01-19 Thread bejoy_ks
Corrected a few typos in previous mail

Hi Avrila
   AFAIK the bucketed map join is not default in hive and it happens only 
when the configuration parameter hive.optimize.bucketmapjoin  is set to true. 
You may be getting the same execution plan because hive.optimize.bucketmapjoin  
is set to true  in the hive configuration xml file. To cross confirm the same 
could you explicitly set this to false
(set hive.optimize.bucketmapjoin = false;
) in your hive session and get the query execution plan from explain command. 
Please find some pointers in line
1. Should I see sth different in the explain extended output if I set and unset 
the hive.optimize.bucketmapjoin option?
[Bejoy]Yes, you should be seeing different plans for both.
Try EXPLAIN your join query after setting this
set hive.optimize.bucketmapjoin = false;

2. Should I see something different in the output of hive while running the 
query if again I set and unset the hive.optimize.bucketmapjoin?
[Bejoy] No, Hive output should be the same. Whatever the execution plan for 
a join, the end result should be the same.

3. Is it possible that even though I set bucketmapjoin to true, Hive will still 
perform a normal map-side join for some reason? How can I check if this has 
actually happened?
[Bejoy] Hive would perform a plain map side join only if the following 
parameter is enabled. (default it is disabled)
set hive.auto.convert.join = true; you need to check this value in your 
configurations.
If it is enabled irrespective of the table size hive would always try a map 
join, it would come to a normal join only after the map join attempt fails.
AFAIK, if the number of buckets are same or multiples between the two tables 
involved in a join and if the join is on the same columns that are bucketed, 
with bucketmapjoin enabled it shouldn't execute a plain mapside join but a 
bucketed map side join would be triggered.

Hope it helps!..


Regards
Bejoy K S

-Original Message-
From: Bejoy Ks bejoy...@yahoo.com
Date: Thu, 19 Jan 2012 09:22:08 
To: user@hive.apache.orguser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Question on bucketed map join

Hi Avrila
   AFAIK the bucketed map join is not default in hive and it happens only 
when the values is set to true. It could be because the same value is already 
set in the hive configuration xml file. To cross confirm the same could you 
explicitly set this to false 

(set hive.optimize.bucketmapjoin = false;)and get the query execution plan from 
explain command. 


Please some pointers in line

1. Should I see sth different in the explain extended output if I set and unset 
the hive.optimize.bucketmapjoin option?
[Bejoy] you should be seeing the same
Try EXPLAIN your join query after setting this
set hive.optimize.bucketmapjoin = false;


2. Should I see something different in the output of hive while running 
the query if again I set and unset the hive.optimize.bucketmapjoin?
[Bejoy] No,Hive output should be the same. What ever is the execution plan for 
an join, optimally the end result should be same. 


3.
 Is it possible that even though I set bucketmapjoin to true, Hive will 
still perform a normal map-side join for some reason? How can I check if
 this has actually happened?
[Bejoy] Hive would perform a plain map side join only if the following 
parameter is enabled. (default it is disabled)

set hive.auto.convert.join = true; you need to check this value in your 
configurations.
If it is enabled irrespective of the table size hive would always try a map 
join, it would come to a normal join only after the map join attempt fails.
AFAIK, if the number of buckets are same or multiples between the two tables 
involved in a join and if the join is on the same columns that are bucketed, 
with bucketmapjoin enabled it shouldn't execute a plain mapside join a bucketed 
map side join would be triggered.

Hope it helps!..

Regards
Bejoy.K.S




 From: Avrilia Floratou flora...@cs.wisc.edu
To: user@hive.apache.org 
Sent: Thursday, January 19, 2012 9:23 PM
Subject: Question on bucketed map join
 
Hi,

I have two tables with 8 buckets each on the same key and want to join them.
I ran explain extended and get the plan produced by HIVE which shows that a 
map-side join is a possible plan.

I then set in my script the hive.optimize.bucketmapjoin option to true and 
reran the explain extended query. I get the exact same plans as output.

I ran the query with and without the bucketmapjoin optimization and saw no 
difference in the running time.

I have the following questions:

1. Should I see sth different in the explain extended output if I set and unset 
the hive.optimize.bucketmapjoin option?

2. Should I see something different in the output of hive while running the 
query if again I set and unset the hive.optimize.bucketmapjoin?

3. Is it possible that even though I set bucketmapjoin to true, Hive will still 
perform a normal 

Re: Insert based on whether string contains

2012-01-04 Thread bejoy_ks
I agree with Matt on that aspect. The solution proposed by me was purely based 
on the sample data provided where there were  3 digit comma separated values. 
If there are chances of 4 digit values as well in event_list you may need to 
revisit the solution.

Regards
Bejoy K S

-Original Message-
From: Tucker, Matt matt.tuc...@disney.com
Date: Wed, 4 Jan 2012 08:56:44 
To: user@hive.apache.orguser@hive.apache.org; Bejoy Ksbejoy...@yahoo.com
Reply-To: user@hive.apache.org
Subject: Re: Insert based on whether string contains 

The find_in_set() UDF is a safer choice for doing a search for that value, as 
%239% could also match 2390, which has a different meaning in Omniture logs.
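
For illustration, a minimal sketch assuming event_list looks like the sample 
'239, 236, 232, 934'; the regexp_replace strips the spaces so find_in_set can match 
whole comma-separated values:

INSERT OVERWRITE TABLE video_plays_for_sept
SELECT concat(visid_high, visid_low), geo_city, geo_country, geo_region
FROM omniture
WHERE find_in_set('239', regexp_replace(event_list, ' ', '')) > 0;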



On Jan 4, 2012, at 8:46 AM, Bejoy Ks 
bejoy...@yahoo.commailto:bejoy...@yahoo.com wrote:

Hi Dave

   If I get your requirement correct, you need to load data into 
video_plays_for_sept  table FROM omniture table only if omniture.event_list 
contain the string 239.

Try the following query, it should work fine.

INSERT OVERWRITE TABLE video_plays_for_sept
SELECT concat(visid_high, visid_low), geo_city, geo_country, geo_region FROM 
omniture WHERE event_list LIKE  ‘%239%’;

Hope it helps!..

Regards,
Bejoy.K.S


From: Dave Houston r...@crankyadmin.netmailto:r...@crankyadmin.net
To: user@hive.apache.orgmailto:user@hive.apache.org
Sent: Wednesday, January 4, 2012 6:41 PM
Subject: Insert based on whether string contains

Hi there, I have a string that has '239, 236, 232, 934' (not always in that 
order) and want to insert into another table if 239 is in the string.

INSERT OVERWRITE TABLE video_plays_for_sept

SELECT concat(visid_high, visid_low), geo_city, geo_country, geo_region from 
omniture where regexp_extract(event_list, '\d+') = 239;

That is what I have at the minute, but it always returns 0 Rows loaded to 
video_plays_for_sept


Many thanks

Dave Houston
r...@crankyadmin.netmailto:r...@crankyadmin.net







Re: Schemas/Databases in Hive

2011-12-22 Thread bejoy_ks
Ranjith
   Hive does support multiple databases. If you are on one of the latest 
versions of hive, try:
Create database testdb;
Use testdb;

It should give you what you are looking for.

Regards
Bejoy K S

-Original Message-
From: Raghunath, Ranjith ranjith.raghuna...@usaa.com
Date: Thu, 22 Dec 2011 17:02:09 
To: user@hive.apache.orguser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Schemas/Databases in Hive

What is the intent of having tables in different databases or schemas in Hive? 
Thanks

Thank you,
Ranjith



Re: Schemas/Databases in Hive

2011-12-22 Thread bejoy_ks
Also multiple databases have proved helpful for me in organizing tables into 
corresponding databases when you have quite a large number of tables to manage.
Also I believe it'd be helpful in providing access restrictions.

 
Regards
Bejoy K S

-Original Message-
From: bejoy...@yahoo.com
Date: Thu, 22 Dec 2011 17:19:16 
To: user@hive.apache.org
Reply-To: bejoy...@yahoo.com
Subject: Re: Schemas/Databases in Hive

Ranjith
   Hive do support multiple data bases if you are on some of the latest 
versions of hive try
Create database testdb;
Use testdb;

It should give you what you are looking for.

Regards
Bejoy K S

-Original Message-
From: Raghunath, Ranjith ranjith.raghuna...@usaa.com
Date: Thu, 22 Dec 2011 17:02:09 
To: user@hive.apache.orguser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Schemas/Databases in Hive

What is the intent of having tables in different databases or schemas in Hive? 
Thanks

Thank you,
Ranjith




Re: Loading data into hive tables

2011-12-08 Thread bejoy_ks
Aditya
  The answer is yes. SQOOP is the tool you are looking for. It has an 
import option to load data from any jdbc compliant database into hive. It 
even creates the hive table for you by referring to the source db table.
Hope It helps!..

Regards
Bejoy K S

-Original Message-
From: Aditya Singh30 aditya_sing...@infosys.com
Date: Fri, 9 Dec 2011 09:57:26 
To: user@hive.apache.orguser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Loading data into hive tables

Hi,
I want to know if there is any way to load data directly from 
some other DB, say Oracle/MySQL etc., into hive tables, without getting the 
data from DB into a text/rcfile/sequence file in a specific format and then 
loading the data from that file into hive table.

Regards,
Aditya




Re: Hive query failing on group by

2011-10-19 Thread bejoy_ks
Hi Mark
 What do your map reduce job logs say? Try figuring out the error from 
there. From the hive CLI you could hardly find out the root cause of your errors. 
From the job tracker web UI  http://hostname:50030/jobtracker.jsp you can easily 
browse to the failed tasks and get the actual exception from there. If you are not 
able to figure it out from there, then please post those logs along with your table 
schema.


Regards
Bejoy K S

-Original Message-
From: Mark Kerzner mark.kerz...@shmsoft.com
Date: Wed, 19 Oct 2011 09:06:13 
To: Hive useruser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Hive query failing on group by

HI,

I am trying to figure out what I am doing wrong with this query and the
unusual error I am getting. Also suspicious is the reduce % going up and
down.

select trans.property_id, day(trans.log_timestamp) from trans JOIN opts on
trans.ext_booking_id[ext_booking_id] = opts.ext_booking_id group by
day(trans.log_timestamp), trans.property_id;

2011-10-19 08:55:19,778 Stage-1 map = 0%,  reduce = 0%
2011-10-19 08:55:22,786 Stage-1 map = 100%,  reduce = 0%
2011-10-19 08:55:29,804 Stage-1 map = 100%,  reduce = 33%
2011-10-19 08:55:32,811 Stage-1 map = 100%,  reduce = 0%
2011-10-19 08:55:39,829 Stage-1 map = 100%,  reduce = 33%
2011-10-19 08:55:43,839 Stage-1 map = 100%,  reduce = 0%
2011-10-19 08:55:50,855 Stage-1 map = 100%,  reduce = 33%
2011-10-19 08:55:54,864 Stage-1 map = 100%,  reduce = 0%
2011-10-19 08:56:00,878 Stage-1 map = 100%,  reduce = 33%
2011-10-19 08:56:04,887 Stage-1 map = 100%,  reduce = 0%
2011-10-19 08:56:05,891 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201110111849_0024 with errors
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.MapRedTask

Thank you,
Mark



Re: Hive query failing on group by

2011-10-19 Thread bejoy_ks
Looks like some data problem. Were you using the GROUP BY query on the same data 
set?
But if count(*) also throws an error then it comes back to square one: an 
installation/configuration problem with hive or map reduce.

Regards
Bejoy K S

-Original Message-
From: Mark Kerzner mark.kerz...@shmsoft.com
Date: Wed, 19 Oct 2011 10:55:34 
To: user@hive.apache.org; bejoy...@yahoo.com
Reply-To: user@hive.apache.org
Subject: Re: Hive query failing on group by

Bejoy,

I've been using this install of Hive for some time now, and simple queries
and joins work fine. It's the GROUP BY that I have problems with, sometimes
even with COUNT(*).

I am trying to isolate the problem now, and reduce it to the smallest query
possible. I am also trying to find a workaround (I noticed that sometimes
rephrasing queries for Hive helps), since I need this for a project.

Thank you,
Mark

On Wed, Oct 19, 2011 at 10:25 AM, bejoy...@yahoo.com wrote:

 ** Mark
 To ensure your hive installation is fine run two queries
 SELECT * FROM trans LIMIT 10;
 SELECT * FROM trans WHERE ***;
 You can try this for couple of different tables. If these queries return
 results and work fine as desired then your hive could be working good.

 If it works good as the second step issue a simple join between two tables
 on primitive data type columns. If that also looks good then you can kind of
 confirm that the bug is with your hive query.

 We can look into that direction then.



 Regards
 Bejoy K S
 --
 *From: * Mark Kerzner mark.kerz...@shmsoft.com
 *Date: *Wed, 19 Oct 2011 10:02:57 -0500
 *To: *user@hive.apache.org
 *ReplyTo: * user@hive.apache.org
 *Subject: *Re: Hive query failing on group by

 Vikas,

 I am using Cloudera CDHU1 on Ubuntu. I get the same results on RedHat
 CDHU0.

 Mark

 On Wed, Oct 19, 2011 at 9:47 AM, Vikas Srivastava 
 vikas.srivast...@one97.net wrote:

 install hive with RPM this is correpted!!

 On Wed, Oct 19, 2011 at 8:01 PM, Mark Kerzner 
 mark.kerz...@shmsoft.comwrote:

 Here is what my hive logs say

 hive -hiveconf hive.root.logger=DEBUG

 2011-10-19 09:24:35,148 ERROR DataNucleus.Plugin
 (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires
 org.eclipse.core.resources but it cannot be resolved.
 2011-10-19 09:24:35,150 ERROR DataNucleus.Plugin
 (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires
 org.eclipse.core.runtime but it cannot be resolved.
 2011-10-19 09:24:35,150 ERROR DataNucleus.Plugin
 (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires
 org.eclipse.text but it cannot be resolved.


 On Wed, Oct 19, 2011 at 9:21 AM, bejoy...@yahoo.com wrote:

 ** Hi Mark
 What does your Map reduce job logs say? Try figuring out the error form
 there. From hive CLI you could hardly find out the root cause of your
 errors. From job tracker web UI  http://hostname:50030/jobtracker.jsp
 you can easily browse to failed tasks and get the actual exception from
 there. If you are not able to figure out from there then please post in
 those logs with your table schema.

 Regards
 Bejoy K S
 --
 *From: * Mark Kerzner mark.kerz...@shmsoft.com
 *Date: *Wed, 19 Oct 2011 09:06:13 -0500
 *To: *Hive useruser@hive.apache.org
 *ReplyTo: * user@hive.apache.org
 *Subject: *Hive query failing on group by

 HI,

 I am trying to figure out what I am doing wrong with this query and the
 unusual error I am getting. Also suspicious is the reduce % going up and
 down.

 select trans.property_id, day(trans.log_timestamp) from trans JOIN opts
 on trans.ext_booking_id[ext_booking_id] = opts.ext_booking_id group by
 day(trans.log_timestamp), trans.property_id;

 2011-10-19 08:55:19,778 Stage-1 map = 0%,  reduce = 0%
 2011-10-19 08:55:22,786 Stage-1 map = 100%,  reduce = 0%
 2011-10-19 08:55:29,804 Stage-1 map = 100%,  reduce = 33%
 2011-10-19 08:55:32,811 Stage-1 map = 100%,  reduce = 0%
 2011-10-19 08:55:39,829 Stage-1 map = 100%,  reduce = 33%
 2011-10-19 08:55:43,839 Stage-1 map = 100%,  reduce = 0%
 2011-10-19 08:55:50,855 Stage-1 map = 100%,  reduce = 33%
 2011-10-19 08:55:54,864 Stage-1 map = 100%,  reduce = 0%
 2011-10-19 08:56:00,878 Stage-1 map = 100%,  reduce = 33%
 2011-10-19 08:56:04,887 Stage-1 map = 100%,  reduce = 0%
 2011-10-19 08:56:05,891 Stage-1 map = 100%,  reduce = 100%
 Ended Job = job_201110111849_0024 with errors
 FAILED: Execution Error, return code 2 from
 org.apache.hadoop.hive.ql.exec.MapRedTask

 Thank you,
 Mark





 --
 With Regards
 Vikas Srivastava

 DWH  Analytics Team
 Mob:+91 9560885900
 One97 | Let's get talking !






Re: upgrading hadoop package

2011-09-01 Thread bejoy_ks
Hi Li
  AFAIK 0.21 is not really a stable version of hadoop. So if this upgrade 
is on a production cluster, it'd be better to go with 0.20.203.
Regards
Bejoy K S

-Original Message-
From: Shouguo Li the1plum...@gmail.com
Date: Thu, 1 Sep 2011 11:41:46 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: upgrading hadoop package

hey guys,

i'm planning to upgrade my hadoop cluster from 0.20.1 to 0.21 to take
advantage of new bz2 splitting feature. i found a simple upgrade guide,
http://wiki.apache.org/hadoop/Hadoop_Upgrade
but i can't find anything that's related to hive. do we need to do anything
for hive? is the migration transparent to hive?
thx!



Re: Re:Re: Re: RE: Why a sql only use one map task?

2011-08-25 Thread bejoy_ks
Hi Daniel
 In the hadoop ecosystem the number of map tasks is actually decided 
by the job, basically based on the number of input splits. Setting mapred.map.tasks 
wouldn't assure that only that many map tasks are triggered. What 
worked out here for you is that you were specifying that a map task should 
process a minimum data volume by setting a value for mapred.min.split.size.
 So in your case there were really 9 input splits, but when you imposed a 
constraint on the minimum data that a map task should handle, the map tasks came 
down to 3. 
Regards
Bejoy K S

-Original Message-
From: Daniel,Wu hadoop...@163.com
Date: Thu, 25 Aug 2011 20:02:43 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re:Re:Re: Re: RE: Why a sql only use one map task?

after I set
set mapred.min.split.size=2;

Then it will kick off 3 map tasks (the file I have is 500M).  So looks like we 
need to set mapred.min.split.size instead of mapred.map.tasks to control how 
many maps to kick off.


At 2011-08-25 19:38:30,Daniel,Wu hadoop...@163.com wrote:

It works, after I set as you said, but looks like I can't control the map task, 
it always use 9 maps, even if I set
set mapred.map.tasks=2;


Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      100.00%      9           0         0         9          0        0 / 0
reduce   100.00%      1           0         0         1          0        0 / 0



At 2011-08-25 06:35:38,Ashutosh Chauhan hashut...@apache.org wrote:
This may be because CombineHiveInputFormat is combining your splits in one map 
task. If you don't want that to happen, do:
hive set hive.input.format=org.apache.hadoop.hive.ql.io.HiveI nputFormat


2011/8/24 Daniel,Wuhadoop...@163.com

I pasted the inform I pasted blow, the map capacity is 6. And no matter how I 
set  mapred.map.tasks, such as 3,  it doesn't work, as it always use 1 map task 
(please see the completed job information).



Cluster Summary (Heap Size is 16.81 MB/966.69 MB)
[JobTracker cluster summary: 3 nodes, 6 total submissions, map task capacity 6, reduce task capacity 6, no map or reduce tasks currently running]

Completed Jobs
Jobid                   Priority  User    Name                                                    Map %     Maps (done/total)   Reduce %   Reduces (done/total)
job_201108242119_0001   NORMAL    oracle  select count(*) from test(Stage-1)                      100.00%   0 / 0               100.00%    1 / 1
job_201108242119_0002   NORMAL    oracle  select count(*) from test(Stage-1)                      100.00%   1 / 1               100.00%    1 / 1
job_201108242119_0003   NORMAL    oracle  select count(*) from test(Stage-1)                      100.00%   1 / 1               100.00%    1 / 1
job_201108242119_0004   NORMAL    oracle  select period_key,count(*) from...period_key(Stage-1)   100.00%   1 / 1               100.00%    3 / 3
job_201108242119_0005   NORMAL    oracle  select period_key,count(*) from...period_key(Stage-1)   100.00%   1 / 1               100.00%    3 / 3
job_201108242119_0006   NORMAL    oracle  select period_key,count(*) from...period_key(Stage-1)   100.00%   1 / 1               100.00%    3 / 3



At 2011-08-24 18:19:38,wd w...@wdicc.com wrote:
What about your total Map Task Capacity?
You can check it at http://your_jobtracker:50030/jobtracker.jsp


2011/8/24 Daniel,Wu hadoop...@163.com:
 I checked my settings, all are at the default values. So per the book
 Hadoop: The Definitive Guide, the split size should be 64M. And the file
 size is about 500M, so that's about 8 splits. And from the map job
 information (after the map job is done), I can see it gets 8 splits from one
 node. But anyhow it starts only one map task.



 At 2011-08-24 02:28:18,Aggarwal, Vaibhav vagg...@amazon.com wrote:

 If you actually have splittable files you can set the following setting to
 create more splits:



 mapred.max.split.size appropriately.
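
 For instance (a sketch with assumed numbers only): with 64 MB blocks, capping
 the split size below the block size produces more, smaller splits and hence
 more map tasks:

 set mapred.max.split.size=33554432;   -- ~32 MB per split, so a ~500 MB input yields roughly 16 map tasks
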



 Thanks

 Vaibhav



 From: Daniel,Wu [mailto:hadoop...@163.com]
 Sent: Tuesday, August 23, 2011 6:51 AM
 To: hive
 Subject: Why a sql only use one map task?



   I run the following simple sql:
 select count(*) from sales;
 And the job information shows it only uses one map task.

 The underlying hadoop cluster has 3 data nodes, so I expect hive to kick off
 3 map tasks, one on each node. What can make hive run only one map task? Do
 I need to set something to kick off multiple map tasks? I didn't change the
 hive config.















Re: Hive crashing after an upgrade - issue with existing larger tables

2011-08-18 Thread bejoy_ks
A small correction to my previous post. The CDH version is CDH u1, not u0.
Sorry for the confusion.

Regards
Bejoy K S

-Original Message-
From: Bejoy Ks bejoy...@yahoo.com
Date: Thu, 18 Aug 2011 05:51:58 
To: hive user groupuser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Hive crashing after an upgrade - issue with existing larger tables

Hi Experts

        I was working with larger volumes of data on hive 0.7. Recently my 
hive installation was upgraded to 0.7.1. After the upgrade I'm having a lot 
of issues with queries that were already working fine on larger data. Queries 
that took seconds to return results are now taking hours, and for most larger 
tables even the map reduce jobs are not getting triggered. Queries like 
select * and describe are working fine since they don't involve any map reduce 
jobs. For the jobs that didn't even get triggered I got the following error 
from the job tracker:

Job initialization failed: java.io.IOException: Split metadata size exceeded 
1000. 
Aborting job job_201106061630_6993 at 
org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
 
at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:807) 
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:701) 
at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4013) 
at 
org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) 
at java.lang.Thread.run(Thread.java:619) 


Looks like some metadata issue. My cluster is on CDH3-u0. Has anyone faced 
similar issues before? Please share your thoughts on what could be the 
probable cause of the error.

Thank You



Re: why need to copy when run a sql with a single map

2011-08-10 Thread bejoy_ks
Hi
  Hive queries are parsed into hadoop map reduce jobs. In a map reduce job, 
between the map and reduce tasks there are two phases, a copy phase and a sort 
phase, together known as the sort and shuffle phase. So the copy task shown 
for the hive job here should be the copy phase of map reduce: it copies the 
map output from the map task nodes to the corresponding reduce task nodes.
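
For illustration, a hedged sketch using the records table from the mail below: 
a query with no aggregation can run as a map-only job and has no copy phase at 
all, while a group by needs reducers and therefore the copy and sort (shuffle) 
step:

select retailer_key from records where retailer_key is not null;   -- map-only: no reducers, no copy or sort
select retailer_key, count(*) from records group by retailer_key;  -- map, then copy + sort (shuffle), then reduce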

Regards
Bejoy K S

-Original Message-
From: Daniel,Wu hadoop...@163.com
Date: Wed, 10 Aug 2011 20:07:48 
To: hiveuser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: why need to copy when run a sql with a single map

I run a single query like

select retailer_key,count(*) from records group by retailer_key;

It uses a single map as shown below. Since the file is already on HDFS, I 
think hadoop/hive doesn't need to copy anything.


Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      100.00%      1           0         0         1          0        0 / 0
reduce   100.00%      1           0         0         1          0        0 / 0

But the final chart in the job report shows that copy takes about 33% of the 
total time, and the rest is sort and reduce. So why is there a copy here, or 
does copy mean something else?
 oracle@oracle-MS-7623:~/test$ hadoop fs -lsr /

drwxr-xr-x   - oracle supergroup  0 2011-08-10 19:46 /user
drwxr-xr-x   - oracle supergroup  0 2011-08-10 19:46 /user/hive
drwxr-xr-x   - oracle supergroup  0 2011-08-10 19:59 
/user/hive/warehouse
drwxr-xr-x   - oracle supergroup  0 2011-08-10 19:59 
/user/hive/warehouse/records
-rw-r--r--   1 oracle supergroup   41600256 2011-08-10 19:59 
/user/hive/warehouse/records/test.txt





Hive or pig for sequential iterations like those using foreach

2011-08-08 Thread bejoy_ks
Hi 
   I've been successful using hive for the past few projects. Now for a 
particular use case I'm a bit confused about what to choose, Hive or Pig. My 
project involves a step by step sequential work flow. In every step I retrieve 
some values based on some query, use these values as input to new queries 
iteratively (similar to the foreach implementation in Pig), and so on. Is hive 
a good choice here when I have a sequence of 11 operations as described? The 
second confusion for me is, does hive support 'foreach' equivalent functionality?

Please advise. 

I'm from a JAVA background, not much into db development, so I'm not sure of 
any such concepts in SQL.

Thanks 

Regards
Bejoy K S



Re: Hive or pig for sequential iterations like those using foreach

2011-08-08 Thread bejoy_ks
Thanks Amareshwari, the article gave me some valuable hints for making my 
choice. But out of curiosity, does hive support stage by stage iterative 
processing? If so, how?

Thank You
Regards
Bejoy K S

-Original Message-
From: Amareshwari Sri Ramadasu amar...@yahoo-inc.com
Date: Mon, 8 Aug 2011 17:14:21 
To: user@hive.apache.orguser@hive.apache.org; 
bejoy...@yahoo.combejoy...@yahoo.com
Reply-To: user@hive.apache.org
Subject: Re: Hive or pig for sequential iterations like those using foreach

You can have a look at typical use cases of Pig and Hive here 
http://developer.yahoo.com/blogs/hadoop/posts/2010/08/pig_and_hive_at_yahoo/

Thanks
Amareshwari

On 8/8/11 5:10 PM, bejoy...@yahoo.com bejoy...@yahoo.com wrote:

Hi
   I've been successful using hive for the past few projects. Now for a 
particular use case I'm a bit confused about what to choose, Hive or Pig. My 
project involves a step by step sequential work flow. In every step I retrieve 
some values based on some query, use these values as input to new queries 
iteratively (similar to the foreach implementation in Pig), and so on. Is hive 
a good choice here when I have a sequence of 11 operations as described? The 
second confusion for me is, does hive support 'foreach' equivalent functionality?

Please advise.

I'm from a JAVA background, not much into db development, so I'm not sure of 
any such concepts in SQL.

Thanks

Regards
Bejoy K S





Re: NPE with hive.cli.print.header=true;

2011-08-01 Thread bejoy_ks
Hi Ayon
AFAIK hive is supposed to behave so. If you set hive.cli.print.header=true to 
enable column headers, then some commands like 'desc' are not expected to 
work. Not sure whether a patch for this has come out recently.
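
A possible workaround sketch (assuming the NPE is limited to commands such as 
desc and use that don't return a regular result set): switch the property off 
just around those commands:

set hive.cli.print.header=false;
desc my_table;                      -- my_table is only a placeholder name
set hive.cli.print.header=true;
select * from my_table limit 5;     -- column headers are still printed for query output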

Regards
Bejoy K S

-Original Message-
From: Ayon Sinha ayonsi...@yahoo.com
Date: Mon, 1 Aug 2011 17:29:17 
To: Hive Mailinglistuser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: NPE with hive.cli.print.header=true;

With 
set hive.cli.print.header=true;


I get NPE's for desc as well as use
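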

Exception in thread main java.lang.NullPointerException
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:241)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:456)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

Is there a patch for this?
 
-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.



Re: Partition by existing field?

2011-07-08 Thread bejoy_ks
Hi Travis
 From my understanding of your requirement, Dynamic Partitions in hive 
are the most suitable solution.

I have written a blog post on such requirements; please refer to
http://kickstarthadoop.blogspot.com/2011/06/how-to-speed-up-your-hive-queries-in.html
for an understanding of the implementation. You can refer to the hive wiki as 
well.
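
For instance, a rough sketch of a dynamic partition insert for the 
staging_event and tealeaf_event tables from the original mail below (the data 
column names are placeholders; the partition columns must be the last columns 
of the SELECT, and nonstrict mode is needed because no static partition value 
is given):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE tealeaf_event PARTITION (dt, hour)
SELECT evt.col1, evt.col2, evt.datestring, evt.hour   -- col1, col2 stand in for the real data columns
FROM staging_event evt;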

Please revert for any clarification
Regards
Bejoy K S

-Original Message-
From: Travis Powell tpow...@tealeaf.com
Date: Fri, 8 Jul 2011 13:11:58 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Partition by existing field?

Can I partition by an existing field?

 

I have a 10 GB file with a date field and an hour of day field. Can I
load this file into a table, then insert-overwrite into another
partitioned table that uses those fields as a partition? Would something
like the following work?

 

INSERT OVERWRITE TABLE tealeaf_event
PARTITION(dt=evt.datestring,hour=evt.hour) SELECT * FROM staging_event
evt;

 

Thanks!

Travis




Re: Hive create table

2011-05-25 Thread bejoy_ks
Hi Jinhang
   I don't think hive supports multi character delimiters. The hassle free 
option here would be to preprocess the data, for example with a mapreduce job, 
to replace the multi character delimiter with a permissible single character 
one that suits your data.
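
An alternative sketch (hedged, not tested here): the RegexSerDe shipped in 
hive-contrib can parse a multi character delimiter directly, at the cost of 
declaring every column as string and casting afterwards. The '|||' delimiter 
and the jar path below are assumptions for illustration only:

add jar /usr/lib/hive/lib/hive-contrib.jar;   -- exact jar name/path varies by installation
CREATE TABLE three_ints (a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.*)\\|\\|\\|(.*)\\|\\|\\|(.*)");
SELECT CAST(a AS INT), CAST(b AS INT), CAST(c AS INT) FROM three_ints;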
Regards
Bejoy K S

-Original Message-
From: jinhang du dujinh...@gmail.com
Date: Wed, 25 May 2011 19:56:16 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Hive create table

Hi all,

I want to customize the field delimiter within a row of the table.
My data format is '124' (the values are separated by a multi character
delimiter), and how could I create a table (int, int, int)?

Thanks.

-- 
dujinhang



Re: Hadoop error 2 while joining two large tables

2011-03-17 Thread bejoy_ks
Try out CDH3b4; it has hive 0.7 and the latest of the other hadoop tools. When 
you work with open source it is definitely a good practice to upgrade to the 
latest versions: with newer versions bugs are fewer, performance is better and 
you get more functionality. Your query looks fine; an upgrade of hive could 
sort things out. 
Regards
Bejoy K S

-Original Message-
From: Edward Capriolo edlinuxg...@gmail.com
Date: Thu, 17 Mar 2011 08:51:05 
To: user@hive.apache.orguser@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Hadoop error 2 while joining two large tables

I am pretty sure the cloudera distro has an upgrade path to a more recent hive.

On Thursday, March 17, 2011, hadoop n00b new2h...@gmail.com wrote:
 Hello All,

 Thanks a lot for your response. To clarify a few points -

 I am on CDH2 with Hive 0.4 (I think). We cannot move to a higher version of 
 Hive as we have to use Cloudera distro only.

 All records in the smaller table have at least one record in the larger table 
 (of course a few exceptions could be there but only a few).

 The join is using the ON clause. The query is something like -

 select ...
 from
 (
   (select ... from smaller_table)
   join
   (select from larger_table)
   on (smaller_table.col = larger_table.col)
 )

 I will try out setting mapred.child.java.opts -Xmx to a higher value and let 
 you know.
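
 (For reference, a hedged example of the usual way to do that from the hive
 CLI; the 1 GB heap below is only an assumption and has to fit into the memory
 available per task slot:

 set mapred.child.java.opts=-Xmx1024m;

 The property applies to the child JVMs that the map and reduce tasks run in.)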

 Is there a pattern or rule of thumb to follow on when to add more nodes?

 Thanks again!

 On Thu, Mar 17, 2011 at 1:08 AM, Steven Wong sw...@netflix.com wrote:



 In addition, put the smaller table on the left-hand side of a JOIN:

 SELECT ... FROM small_table JOIN large_table ON ...




 From: Bejoy Ks [mailto:bejoy...@yahoo.com]
 Sent: Wednesday, March 16, 2011 11:43 AM

 To: user@hive.apache.org
 Subject: Re: Hadoop error 2 while joining two large tables






 Hey hadoop n00b
     I second Mark's thought. But you can definitely try re-framing your query 
 to get things rolling. I'm not sure about your hive query, but still, from my 
 experience with joins on huge tables (record counts in the range of hundreds 
 of millions) you should give the join conditions in the JOIN ON clause rather 
 than specifying all conditions in WHERE.

 Say you have a query like this:
 SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b WHERE
 a.Column4=b.Column1 AND a.Column2=b.Column4 AND a.Column3 > b.Column2;

 You can definitely re-frame this query as
 SELECT a.Column1,a.Column2,b.Column1 FROM Table1 a JOIN Table2 b
 ON (a.Column4=b.Column1 AND a.Column2=b.Column4) WHERE a.Column3 > b.Column2;

 From my understanding Hive supports only equi-joins, so you can't have the 
 inequality condition within JOIN ON; the inequality has to go into WHERE. 
 This approach worked for me when I encountered a similar situation some time 
 ago. Try this out, hope it helps.

 Regards
 Bejoy.K.S








 From: Sunderlin, Mark mark.sunder...@teamaol.com
 To: user@hive.apache.org user@hive.apache.org
 Sent: Wed, March 16, 2011 11:22:09 PM
 Subject: RE: Hadoop error 2 while joining two large tables




 hadoop n00b asks, “Is adding more nodes the solution to such problem?”

 Whatever else answers you get, you should append “ … and add more nodes.” 
 More nodes is never a bad thing ;-)


 ---
 Mark E. Sunderlin
 Solutions Architect |AOL Data Warehouse

 P: 703-256-6935 | C: 540-327-6222

 AIM: MESunderlin
 22000 AOL Way