Re: How to find hive version using hive editor in hue ?

2016-02-18 Thread Bennie Schut

Not directly but indirectly doing:

set system:sun.java.command;

That will likely give you the jar name, which includes the version.

On 18/02/16 08:12, Abhishek Dubey wrote:


Thanks in advance..

*Warm Regards,*
*Abhishek Dubey*





Re: Strict mode and joins

2015-10-19 Thread Bennie Schut

Hi Edward,

That's possibly due to using unix_timestamp (although the error message 
seems misleading if that proves true). It's technically correct that it 
shouldn't be flagged as deterministic, because every time you call it 
you'll get a different answer as time progresses. In reality, though, I 
just want it called once, during planning, and if I flag it as 
deterministic that's exactly what happens. So you can do this:


@UDFType(deterministic = true)
public class UnixTimeStamp extends GenericUDFUnixTimeStamp {
// Making the udf deterministic, which is kind of cheating, but it makes
// partition pruning work.

}

And then register the udf like you normally would.
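
For completeness, a minimal registration sketch (the jar path, function name 
and package here are placeholders, not something from this thread):

ADD JAR /path/to/your-udfs.jar;
CREATE TEMPORARY FUNCTION my_unix_timestamp AS 'com.example.udf.UnixTimeStamp';
-- then call my_unix_timestamp() wherever the query would otherwise call unix_timestamp()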

If that's not helping, some creative sub-querying might help, like
FROM (select * from entry_hourly_v3 where dt=2015101517 ) 
entry_hourly_v3 INNER JOIN article_meta ON
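
Spelled out a bit further against the full query quoted below (just a sketch, untested):

INSERT OVERWRITE TABLE vertical_stats_recent PARTITION (dt=2015101517)
SELECT ...
FROM (SELECT * FROM entry_hourly_v3 WHERE dt=2015101517) entry_hourly_v3
INNER JOIN article_meta ON entry_hourly_v3.entry_id = article_meta.entry_id
INNER JOIN channel_meta ON channel_meta.section_name = article_meta.channel
WHERE article_meta.dt=2015101517
  AND channel_meta.hitdate=20151015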


Bennie.

On 15/10/15 23:06, Edward Capriolo wrote:

So I have strict mode on and I like to keep it that way.

I am trying to do this query.

INSERT OVERWRITE TABLE vertical_stats_recent PARTITION (dt=2015101517)
SELECT ...
FROM entry_hourly_v3 INNER JOIN article_meta ON
entry_hourly_v3.entry_id = article_meta.entry_id
INNER JOIN channel_meta ON
channel_meta.section_name = article_meta.channel

WHERE entry_hourly_v3.dt=2015101517
AND article_meta.dt=2015101517
AND channel_meta.hitdate=20151015
AND article_meta.publish_timestamp > ((unix_timestamp() * 1000) - 
(1000 * 60 * 60 * 24 * 2))

GROUP

entry_hourly_v3, channel_meta and article_meta are partitioned tables.

*Your query has the following error(s):*

Error while compiling statement: FAILED: SemanticException [Error 
10041]: No partition predicate found for Alias "entry_hourly_v3" Table 
"entry_hourly_v3"


I also tried putting views on the table and I had no luck.

Is there any way I can do this query without turning strict mode off?






Re: HiveServer2 OOM

2015-10-12 Thread Bennie Schut
In my experience, having looked at way too many heap dumps from 
hiveserver2, it always ends up being a seriously over-partitioned table 
and a user who decided to do a full table scan, basically requesting all 
partitions. This is often by accident: for example, when using 
unix_timestamp to convert dates you don't realize it's not flagged as 
deterministic, and as a consequence you accidentally do a full table scan. 
Sometimes not so accidental.
If the 650M is the file on disk you might just be looking at compression 
at work. Our hprof files are often significantly smaller than the memory 
they actually occupy.
If you are hit by this, consider using strict mode. It's annoying but it 
also makes these problems more visible.
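
For reference, strict mode is a single setting; it can go in hive-site.xml 
or be set per session:

set hive.mapred.mode=strict;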



On 10/10/15 16:09, Sanjeev Verma wrote:

Even having enough heap size my hiveserver2 going outofmemory, I enable
heap dump on error which producing 650MB of heap although I have
hiveserver2 configured with 8GB Heap.

here is the stacktrace of the thread which went in to OOM,could anybody let
me know why it throwing OOM

"pool-2-thread-4" prio=5 tid=40 RUNNABLE
  at java.lang.OutOfMemoryError.<init>(OutOfMemoryError.java:48)
  at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
 Local Variable: byte[]#1567
 Local Variable: java.lang.StringCoding$StringDecoder#1
  at java.lang.StringCoding.decode(StringCoding.java:193)
  at java.lang.String.<init>(String.java:416)
  at java.lang.String.<init>(String.java:481)
  at
org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:355)
  at
org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol.java:347)
  at
org.apache.hadoop.hive.metastore.api.FieldSchema$FieldSchemaStandardScheme.read(FieldSchema.java:490)
  at
org.apache.hadoop.hive.metastore.api.FieldSchema$FieldSchemaStandardScheme.read(FieldSchema.java:476)
  at
org.apache.hadoop.hive.metastore.api.FieldSchema.read(FieldSchema.java:410)
  at
org.apache.hadoop.hive.metastore.api.StorageDescriptor$StorageDescriptorStandardScheme.read(StorageDescriptor.java:1309)
 Local Variable:
org.apache.hadoop.hive.metastore.api.StorageDescriptor#8459
 Local Variable: org.apache.hadoop.hive.metastore.api.FieldSchema#276777
  at
org.apache.hadoop.hive.metastore.api.StorageDescriptor$StorageDescriptorStandardScheme.read(StorageDescriptor.java:1288)
  at
org.apache.hadoop.hive.metastore.api.StorageDescriptor.read(StorageDescriptor.java:1150)
  at
org.apache.hadoop.hive.metastore.api.Partition$PartitionStandardScheme.read(Partition.java:994)
  at
org.apache.hadoop.hive.metastore.api.Partition$PartitionStandardScheme.read(Partition.java:929)
  at org.apache.hadoop.hive.metastore.api.Partition.read(Partition.java:821)
  at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme.read(ThriftHiveMetastore.java:56468)
 Local Variable: org.apache.hadoop.hive.metastore.api.Partition#8450
  at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme.read(ThriftHiveMetastore.java:56447)
 Local Variable:
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme#1
  at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result.read(ThriftHiveMetastore.java:56381)
 Local Variable: org.apache.thrift.protocol.TBinaryProtocol#10
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
 Local Variable: java.lang.String#802229
 Local Variable: org.apache.thrift.protocol.TMessage#2
  at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partitions(ThriftHiveMetastore.java:1751)
 Local Variable:
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result#1
  at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions(ThriftHiveMetastore.java:1736)
 Local Variable:
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client#8
  at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitions(HiveMetaStoreClient.java:880)
  at sun.reflect.GeneratedMethodAccessor36.invoke()
 Local Variable: sun.reflect.GeneratedMethodAccessor36#1
 Local Variable: org.apache.hadoop.hive.metastore.HiveMetaStoreClient#8
 Local Variable: java.lang.Short#129
  at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
 Local Variable:
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient#8
 Local Variable: java.lang.reflect.Method#397
 Local Variable: java.lang.Object[]#24405
  at com.sun.proxy.$Proxy10.listPartitions()
 Local Variable: com.sun.proxy.$Proxy10#8
 Local Variable: java.lang.String#674524
  at
org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785)
  at
org.

Re: PL/SQL to HiveQL translation

2013-07-25 Thread Bennie Schut

Hi Jerome,

Yes, it looks like you could stop using GET_SEMAINE and directly join 
"calendrier_hebdo" with "calendrier", for example. For "FCALC_IDJOUR" you 
will have to write a udf, so I hope you have some java skills :)
The "calendrier" tables suggest you have a star schema with a calendar 
table. If on Oracle you partitioned on a date and used a subquery to get 
the dates you want from the fact table, you can expect this to be a 
problem in Hive. Partition pruning happens during planning, so Hive will 
not know which partitions to prune and will thus run over all the data in 
the fact table and filter after it's done, making partitioning useless. 
There are ways of working around this; most people seem to settle on a 
"deterministic" udf which produces the dates, which causes the udf to be 
run during planning, and partition pruning magically works again.
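
As a rough illustration of the first point, a sketch of joining the two 
calendar tables directly instead of calling GET_SEMAINE (table and column 
names are taken from your script, the placeholders are mine, untested):

SELECT cal.dt_jour_deb
FROM ods.calendrier_hebdo cal
JOIN calendrier c ON (c.co_an_semaine = cal.co_an_semaine)
WHERE cal.co_societe = ${in_co_societe}
  AND c.id_jour = fcalc_idjour(...)  -- fcalc_idjour being the udf you would still have to write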

Hope this helps.

Bennie.

Op 25-7-2013 09:50, Jérôme Verdier schreef:

Hi,

I need some help to translate a PL/SQL script in HiveQL.

Problem : my PL/SQL script is calling two functions.

you can see the script below :

SELECT
  in_co_societe as co_societe,
  'SEMAINE' as co_type_periode,
  a.type_entite as type_entite,
  a.code_entite as code_entite,
  a.type_rgrp_produits  as type_rgrp_produits,
  a.co_rgrp_produitsas co_rgrp_produits,
  SUM(a.MT_CA_NET_TTC)  as MT_CA_NET_TTC,
  SUM(a.MT_OBJ_CA_NET_TTC)  as MT_OBJ_CA_NET_TTC,
  SUM(a.NB_CLIENTS) as NB_CLIENTS,
  SUM(a.MT_CA_NET_TTC_COMP) as MT_CA_NET_TTC_COMP,
  SUM(a.MT_OBJ_CA_NET_TTC_COMP) as MT_OBJ_CA_NET_TTC_COMP,
  SUM(a.NB_CLIENTS_COMP)as NB_CLIENTS_COMP
from
  kpi.thm_ca_rgrp_produits_jour/*@o_bi.match.eu 
*/ a

WHERE
a.co_societe = in_co_societe
AND a.dt_jour between
  (
SELECT
  cal.dt_jour_deb
FROM ods.calendrier_hebdo cal
WHERE cal.co_societe = in_co_societe
AND cal.co_an_semaine = ods.package_date.get_semaine(
  ods.package_date.fcalc_idjour(
CASE
  WHEN TO_CHAR(D_Dernier_Jour,'') = 
TO_CHAR(D_Dernier_Jour-364,'') THEN

NEXT_DAY(D_Dernier_Jour-364,1)-7
  ELSE
D_Dernier_Jour-364
END
  )
)
  )
  AND D_Dernier_Jour-364
-- On ne calcule rien si la semaine est compl¿¿te
AND (
  TO_CHAR(D_Dernier_Jour,'DDMM') <> '3112'
  AND TO_CHAR(D_Dernier_Jour,'D') <> '7'
)
GROUP BY
  a.type_entite,
  a.code_entite,
  a.type_rgrp_produits,
  a.co_rgrp_produits;

The function ods.package_date.get_semaine is :

FUNCTION GET_SEMAINE
   (ID_DEB  IN NUMBER)
  RETURN NUMBER
  IS
SEMAINE  NUMBER(10);
  BEGIN
SELECT CO_AN_SEMAINE
INTO   SEMAINE
FROM   CALENDRIER
WHERE  ID_JOUR = ID_DEB;

RETURN (SEMAINE);
  EXCEPTION
WHEN NO_DATA_FOUND THEN
  RETURN (0);
WHEN OTHERS THEN
  RETURN (0);
  END;

The function ods.package_date.fcalc_idjour is below :

FUNCTION FCALC_IDJOUR
   (DATE_REFERENCE  IN DATE)
  RETURN NUMBER
  IS
NM_ANNEENUMBER := TO_NUMBER(TO_CHAR(DATE_REFERENCE,''));
NM_MOIS NUMBER := 
TO_NUMBER(SUBSTR(TO_CHAR(DATE_REFERENCE,'MM'),5,2));
NM_JOUR NUMBER := 
TO_NUMBER(SUBSTR(TO_CHAR(DATE_REFERENCE,'MMDD'),7,2));

IDJOUR_CALCULE  NUMBER := 0;
  BEGIN
IF NM_ANNEE < 1998
OR DATE_REFERENCE IS NULL THEN
  IDJOUR_CALCULE := 0;
ELSE
  IDJOUR_CALCULE := ((NM_ANNEE - 1998) * 600) + ((NM_MOIS - 01) * 
50) + NM_JOUR;

END IF;

RETURN IDJOUR_CALCULE;
DBMS_OUTPUT.PUT_LINE(IDJOUR_CALCULE);
  END FCALC_IDJOUR;

Is it possible to translate this in one  HiveQL script ?




Re: which approach is better

2013-07-18 Thread Bennie Schut
The best way to restore is from a backup. We use distcp to keep this 
scalable: http://hadoop.apache.org/docs/r1.2.0/distcp2.html
The data we feed to hdfs also gets pushed to this backup, and the 
metadatabase from hive gets pushed there too. This combination works 
well for us (we had to use it once).
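
For reference, a distcp run is basically a one-liner (paths and hostnames 
here are placeholders):

hadoop distcp hdfs://active-namenode:8020/user/hive/warehouse hdfs://backup-namenode:8020/backup/hive/warehouse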
Even if a namenode could never crash and all software worked fine 100% 
of the time there is always the one crazy user/admin who will find a way 
to wipe all data.

To me backups are not optional.

Op 17-7-2013 20:17, Hamza Asad schreef:
I use the data to generate reports on a daily basis and do a couple of 
analyses; it's insert once and read many, daily. But my main purpose is 
to secure my data and easily recover it even if my hadoop (datanode) OR 
HDFS crashes. Up till now I've been using the approach in which data is 
retrieved directly from HDFS; a few days back my hadoop crashed, and when 
I repaired it I was unable to recover my old data which resided on HDFS. 
So please let me know, do I have to make an architectural change, OR is 
there any way to recover data which resides in crashed HDFS?



On Wed, Jul 17, 2013 at 11:00 PM, Nitin Pawar > wrote:


what's the purpose of data storage?
what's the read and write throughput you expect?
how will you access the data when reading?
what are your SLAs on both read and write?

there will be more questions others will ask so be ready for that :)



On Wed, Jul 17, 2013 at 11:10 PM, Hamza Asad
mailto:hamza.asa...@gmail.com>> wrote:

Please let me know which approach is better: either I save my
data directly to HDFS and run hive (shark) queries over it, OR
store my data in HBASE and then query it, as I want to
ensure efficient data retrieval, and that the data remains safe and
can easily be recovered if hadoop crashes.

-- 
*/Muhammad Hamza Asad/*





-- 
Nitin Pawar





--
*/Muhammad Hamza Asad/*




Re: Moving hive from one server to another

2013-07-03 Thread Bennie Schut

Unfortunately the ip is stored with each partition in the metadatabase.
I once did an update on the metadata for our server to replace all old 
ip's with new ip's. It's not pretty but it actually works.
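
Very roughly, a sketch of such an update, assuming a MySQL metastore (this 
is not the exact statement from back then; table and column names differ 
between Hive versions, so check your own schema and back up the metastore 
first):

UPDATE SDS SET LOCATION = REPLACE(LOCATION, 'hdfs://OLD_IP:PORT', 'hdfs://NEW_IP:PORT');
UPDATE DBS SET DB_LOCATION_URI = REPLACE(DB_LOCATION_URI, 'hdfs://OLD_IP:PORT', 'hdfs://NEW_IP:PORT');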


Op 28-6-2013 06:29, Manickam P schreef:

Hi,

What are the steps one should follow to move hive from one server to 
another along with hadoop?
I've moved my hadoop master node from one server to another and then 
moved my hive also. I started all my hadoop nodes successfully but 
getting error while executing hive queries. It shows the below error 
and shows my old master node ip address.


*java.net.ConnectException: Call to
192.168.99.33/192.168.99.33:5 failed on connection exception:
java.net.ConnectException: Connection refused*
*
Job Submission failed with exception
'java.net.ConnectException(Call to
192.000.00.33/192.000.00.33:5 failed on connection exception:
java.net.ConnectException: Connection refused)'
java.lang.IllegalArgumentException: Can not create a Path from an
empty string

*

I checked my hive-site.xml file i have given the correct new ip 
address. Anyone pls tell me where would be the mistake here. 
I don't have any clue.



Thanks,
Manickam P




RE: Hive Problems Reading Avro+Snappy Data

2013-04-08 Thread Bennie Schut
Just so you know there is still at least one bug using avro+compression like 
snappy:
https://issues.apache.org/jira/browse/HIVE-3308
There's a simple one line patch but unfortunately it's not committed yet.


From: Thomas, Matthew [mailto:mtho...@verisign.com]
Sent: Monday, April 08, 2013 1:59 PM
To: user@hive.apache.org
Subject: Re: Hive Problems Reading Avro+Snappy Data

Thanks Chuck.

I think the problem is the job configuration on the query.  I logged back into 
the system this morning and started a new Hive client shell and issued a series 
of more complex queries against the Avro+Snappy table and they all worked fine. 
 So I started trying to recall what could have been different between my new 
Hive client shell and the old one returning NULLs.  I am able to reproduce the 
NULLs being returned by setting "SET hive.exec.compress.output=true;".  A brand 
new Hive client has that set to false and all the queries come back normal, but 
the second I set it to true the NULLs return.

Best,

Matt

From: , Chuck 
mailto:chuck.conn...@nuance.com>>
Reply-To: "user@hive.apache.org" 
mailto:user@hive.apache.org>>
Date: Sunday, April 7, 2013 7:32 PM
To: "user@hive.apache.org" 
mailto:user@hive.apache.org>>
Subject: RE: Hive Problems Reading Avro+Snappy Data

When you do SELECT *, Hive does not run a real MapReduce job, so it is not a 
good test. Something is wrong with SerDe or InputFormat.

Chuck


From: Thomas, Matthew [mailto:mtho...@verisign.com]
Sent: Sunday, April 07, 2013 5:41 PM
To: user@hive.apache.org
Subject: Hive Problems Reading Avro+Snappy Data

Hive users,

I am having problems performing "complex" queries on Avro+Snappy data.  If I do 
a "SELECT * FROM Blah LIMIT 50", I see the data coming back as it should be.  
But if I perform any kind of more complex query such as "SELECT count(*) FROM 
Blah" I am receive several rows of NULL values.  My workflow of how I created 
the table is described below along with some of the setup.

- I am running CDH4.2 with Avro 1.7.3

hive> select * From mthomas_testavro limit 1;
OK
Field1        Field2
03-19-2013    a
03-19-2013    b
03-19-2013    c
03-19-2013    c
Time taken: 0.103 seconds

hive> select count(*) From mthomas_testavro;
...
Total MapReduce CPU Time Spent: 6 seconds 420 msec
OK
NULL
NULL
NULL
NULL
Time taken: 17.634 seconds
...


CREATE EXTERNAL TABLE mthomas_testavro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/tmp/testavro/'
TBLPROPERTIES (
'avro.schema.literal'='{
"namespace": "hello.world",
"name": "some_schema",
"type": "record",
"fields": [
{ "name":"field1","type":"string"},
{ "name":"field2","type":"string"}
]
}')
;

SET avro.output.codec=snappy;
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET 
mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT OVERWRITE TABLE mthomas_testavro SELECT * FROM 
identical_table_inGzip_format;

If I cat the output file in the external table, I see 
"Objavro.codec^Lsnappyavro.schema?{"type"..." at the beginning followed by the 
rest of the schema and binary data.  So I am assuming the snappy compression 
worked.  Furthermore, I also tried to query this table via Impala and both 
queries worked just fine.

Maybe it is related to https://issues.apache.org/jira/browse/HIVE-3308  ???

Any ideas?

Thanks.

Matt
"This message (including any attachments) is intended only for the use of the 
individual or entity to which it is addressed, and may contain information that 
is non-public, proprietary, privileged, confidential and exempt from disclosure 
under applicable law or may be constituted as attorney work product. If you are 
not the intended recipient, you are hereby notified that any use, 
dissemination, distribution, or copying of this communication is strictly 
prohibited. If you have received this message in error, notify sender 
immediately and delete this message immediately."


hive & starschemas.

2013-04-02 Thread Bennie Schut
Hi all,

I've been using hive with snappy and avro combined for a little while now 
compared to our older star schema setup with hive and wanted to share this 
experience with other hive users:
http://tech.ebuddy.com/2013/03/28/from-star-schema-to-complete-denormalization/
I realize there is more to be said about star schemas than simply looking at 
speed, but for us speed was the main motivation in the past.

Bennie.


RE: Getting Slow Query Performance!

2013-03-12 Thread Bennie Schut
Well, it's probably worth knowing that 30G is really hitting rock bottom when 
you talk about big data. Hadoop is linearly scalable, so going to 3 or 4 
similar machines could probably get you below the mysql time, but it's hardly 
a fair comparison.
For setting it up I would suggest reading the hadoop docs: 
http://hadoop.apache.org/docs/current/
These hardware specs give you an idea why it's an unusual case: 
http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/

To give you some hints: each node needs to be configured on how many resources 
it's allowed to take. This is a balance between several parameters: 
mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum 
and mapred.child.java.opts.
There are tons more configuration options but this is where you start. Different 
hardware and different jobs require different configurations, so try it out.
Since you are extremely tight on RAM you probably want to reduce memory usage 
on most processes like the namenode/jobtracker/hive, and on each node drop the 
memory requirements for the tasktracker/datanode.
Also, don't put your nodes on 100Mb links; they are almost always too slow.
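
Purely as an illustration, a mapred-site.xml fragment touching those three 
parameters (the values are invented for a very small node, not a 
recommendation):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>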

Bennie.

From: Gobinda Paul [mailto:gobi...@live.com]
Sent: Tuesday, March 12, 2013 11:01 AM
To: user@hive.apache.org
Subject: RE: Getting Slow Query Performance!


Thanks for your reply. I am new to hadoop and hive. My goal is to process big 
data using hadoop; this is my university project (Data Mining) and I need to 
show that hadoop is better than mysql in the case of big data (30-100GB+) 
processing, which I know hadoop does. To do so, can you please suggest 
how many nodes are required to show the performance, and what type of 
configuration is required for each node?


From: bsc...@ebuddy.com
To: user@hive.apache.org
CC: gobi...@live.com
Date: Tue, 12 Mar 2013 10:40:33 +0100
Subject: RE: Getting Slow Query Performance!
Generally a single hadoop machine will perform worse than a single mysql 
machine. People normally use hadoop when they have so much data it won't really 
fit on a single machine and would require specialized hardware (stuff like 
SANs) to run.
30GB of data really isn't that much, and 2GB of RAM is really not what hadoop is 
designed to work on. It really likes to have lots of memory.
I also don't see the hadoop configuration files, so perhaps you only have 1 
mapper and 1 reducer. But this is not a typical use-case, so I doubt you'll see 
snappy performance after tweaking the configs.




RE: Getting Slow Query Performance!

2013-03-12 Thread Bennie Schut
Generally a single hadoop machine will perform worse than a single mysql 
machine. People normally use hadoop when they have so much data it won't really 
fit on a single machine and would require specialized hardware (stuff like 
SANs) to run.
30GB of data really isn't that much, and 2GB of RAM is really not what hadoop is 
designed to work on. It really likes to have lots of memory.
I also don't see the hadoop configuration files, so perhaps you only have 1 
mapper and 1 reducer. But this is not a typical use-case, so I doubt you'll see 
snappy performance after tweaking the configs.

From: Gobinda Paul [mailto:gobi...@live.com]
Sent: Tuesday, March 12, 2013 10:10 AM
To: user@hive.apache.org
Subject: Getting Slow Query Performance!



I used sqoop to import 30GB of data (two tables: employee (approx. 21 GB) and 
salary (approx. 9 GB)) into hadoop (single node) via hive.

I run a sample query like SELECT 
EMPLOYEE.ID,EMPLOYEE.NAME,EMPLOYEE.DEPT,SALARY.AMOUNT FROM EMPLOYEE JOIN SALARY 
WHERE EMPLOYEE.ID=SALARY.EMPLOYEE_ID AND SALARY.AMOUNT>90;

In Hive it takes 15 min (approx.) whereas mySQL takes 4.5 min (approx.) to 
execute that query.

CPU: Pentium(R) Dual-Core  CPU  E5700  @ 3.00GHz
RAM:  2GB
HDD: 500GB


Here IS My hive-site.xml conf.






  
<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>

  <property>
    <name>hive.hwi.listen.host</name>
    <value>0.0.0.0</value>
    <description>This is the host address the Hive Web Interface will listen on</description>
  </property>

  <property>
    <name>hive.hwi.listen.port</name>
    <value></value>
    <description>This is the port the Hive Web Interface will listen on</description>
  </property>

  <property>
    <name>hive.hwi.war.file</name>
    <value>/lib/hive-hwi-0.9.0.war</value>
    <description>This is the WAR file with the jsp content for Hive Web Interface</description>
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
    <description>The default number of reduce tasks per job. Typically set
    to a prime close to the number of available hosts. Ignored when
    mapred.job.tracker is "local". Hadoop set this to 1 by default,
    whereas hive uses -1 as its default value.
    By setting this property to -1, Hive will automatically figure out
    what should be the number of reducers.</description>
  </property>

  <property>
    <name>hive.exec.reducers.bytes.per.reducer</name>
    <value>10</value>
    <description>size per reducer. The default is 1G, i.e if the input size is 10G, it will use 10 reducers.</description>
  </property>

  <property>
    <name>hive.exec.reducers.max</name>
    <value>999</value>
    <description>max number of reducers will be used. If the one
    specified in the configuration parameter mapred.reduce.tasks is
    negative, hive will use this one as the max number of reducers when
    automatically determining the number of reducers.</description>
  </property>

  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive-${user.name}</value>
    <description>Scratch space for Hive jobs</description>
  </property>

  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>

</configuration>




Any IDEA ??


RE: Accessing sub column in hive

2013-03-08 Thread Bennie Schut
Perhaps worth posting the error. Some might know what the error means.

Also, a bit unrelated to hive, but please do yourself a favor and don't use float 
to store monetary values like salary. You will get rounding issues at some 
point when you do arithmetic on them. Considering you are using hadoop, 
you probably have a lot of data, so adding it all up will get you there really 
fast. 
http://stackoverflow.com/questions/3730019/why-not-use-double-or-float-to-represent-currency
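
On the struct question itself: struct fields in HiveQL are addressed with dot 
syntax rather than brackets, so (assuming the struct really has a country 
field) something like this should do it:

select address.country from employees;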


From: Sai Sai [mailto:saigr...@yahoo.in]
Sent: Thursday, March 07, 2013 12:54 PM
To: user@hive.apache.org
Subject: Re: Accessing sub column in hive

I have a table created like this successfully:

CREATE TABLE IF NOT EXISTS employees (name STRING,salary FLOAT,subordinates 
ARRAY,deductions   MAP,address STRUCT)

I would like to access/display country column from my address struct.
I have tried this:

select address["country"] from employees;

I get an error.

Please help.

Thanks
Sai


RE: Re:RE: Problem with Hive JDBC server

2013-02-07 Thread Bennie Schut
What jdbc driver are you using? Also compiled from trunk? I ask because I 
remember a jira a while back where the jdbc driver didn’t let the server know 
the connection should be closed ().
If that’s the case updating the jdbc driver could work. However that might be a 
bit of a long shot.

From: Gabor Makrai [mailto:makrai.l...@gmail.com]
Sent: Wednesday, February 06, 2013 12:45 PM
To: 王锋; Bennie Schut
Cc: user@hive.apache.org
Subject: Re: Re:RE: Problem with Hive JDBC server

Hi guys,

Bad news for me. I checked out and compiled the Hive trunk and got the same 
problem.
I attached to output of command lsof before and after my test program with 100 
"SHOW TABLES" iterations. Is there any explanation why my JDBC server process 
doesn't release those files?

Thanks,
Gabor

On Tue, Feb 5, 2013 at 6:20 AM, 王锋 
mailto:wfeng1...@163.com>> wrote:


I got it. pls see  https://issues.apache.org/jira/browse/THRIFT-1205

I upgrade the thrift to libthrift-0.9.0.

thanks



At 2013-02-05 13:06:05,"王锋" mailto:wfeng1...@163.com>> wrote:

When I was using hiveserver ,the exception was thrown:

2060198 Hive history 
file=/tmp/hdfs/hive_job_log_hdfs_201302010032_1918750748.txt
2060199 Exception in thread "pool-1-thread-95" java.lang.OutOfMemoryError: Java 
heap space
2060200 at 
org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:353)
2060201 at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:215)
2060202 at 
org.apache.hadoop.hive.service.ThriftHive$Processor.process(ThriftHive.java:730)
2060203 at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
2060204 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
2060205 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
2060206 at java.lang.Thread.run(Thread.java:722)

I using Hive-0.7.1-cdh3u1 with thrift-0.5.0.jar and thrift-fb303-0.5.0.jar.
how can it be fixed? how about hive-0.7-1 using thrift -0.9.0?  thanks.



At 2013-02-04 19:19:16,"Bennie Schut" 
mailto:bsc...@ebuddy.com>> wrote:
Looking at the versions you might be hitting 
https://issues.apache.org/jira/browse/HIVE-3481 which is fixed in 0.10

On my dev machine the test runs with success :Running time: 298.952409914
This includes this patch so it’s worth looking at.

From: Gabor Makrai [mailto:makrai.l...@gmail.com<mailto:makrai.l...@gmail.com>]
Sent: Monday, February 04, 2013 11:58 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Re: Problem with Hive JDBC server

Yes, of course! I attached the code!

On Mon, Feb 4, 2013 at 11:57 AM, Gabor Makrai 
mailto:makrai.l...@gmail.com>> wrote:
Yes, of course! :) I attached the code!

On Mon, Feb 4, 2013 at 11:53 AM, Bennie Schut 
mailto:bsc...@ebuddy.com>> wrote:
Since it’s small can you post the code?

From: Gabor Makrai [mailto:makrai.l...@gmail.com<mailto:makrai.l...@gmail.com>]
Sent: Monday, February 04, 2013 11:45 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Problem with Hive JDBC server

Hi guys,

I'm writing to you because I experienced a very strange problem which probably 
affects all Hive distributions.
I made a small "only main function" Java program where I'm only connecting to 
my Hive JDBC server, getting the list of the database tables (LIST TABLES), and 
closing the ResultSet, the Statement and the Connection, and doing this 1000 
times. The problem is that the running Hive JDBC server does not release files 
and with time it will throw an exception, because it will get a "Too many open 
files" IOException from the JVM.

I tested with Hive 0.9, 0.8.1, and the patched Hive 0.9 installed in CDH4.1.1.

If it is a known issue, then could you tell me the solution for it? If it is 
not, then I can create a new ticket in Jira, and with a little help, I probably 
can fix the problem and contribute the solution for it.

Thanks,
Gabor







RE: Hive JDBC driver query statement timeout.

2013-02-06 Thread Bennie Schut
Normally that would be stmt.setQueryTimeout; however, that call isn't implemented 
yet. So to answer: no, there isn't.

  public void setQueryTimeout(int seconds) throws SQLException {
throw new SQLException("Method not supported");
  }

You might find a parameter called "hive.stats.jdbc.timeout" but don't be 
fooled, that's only for the "stats" package, not for the client side jdbc driver.
Strangely enough, I just realized I never actually missed it. Normally when 
there are problems with a query, either Hive or hadoop throws an exception and 
that will end up on the client. But I can understand how others could have more 
time sensitive queries/results and would like to see this option.

From: Kugathasan Abimaran [mailto:abimar...@hsenidmobile.com]
Sent: Wednesday, February 06, 2013 4:37 AM
To: user@hive.apache.org
Subject: Hive JDBC driver query statement timeout.

Hi,

Is there a way to set the hive statement query timeout in hive jdbc driver?

--
Thanks,
With Regards,

Abimaran


Out Of Memory on localmode.

2013-02-05 Thread Bennie Schut
Hi,

Just in case anyone else ever runs into this.
Lately our cluster kept on killing itself with an OOM message in the kernel 
log. It took me a while to realize why this happened since no single process 
was causing it.
I traced it back to a few queries running concurrently on really small 
datasets. This caused all of these queries to run in localmode. Then I realized 
there isn't a limit to how many queries can run in localmode, and since they use 
the normal hadoop memory settings it's pretty easy to hit OOM on a machine this 
way.
I'm not sure about the long term solution (some kind of limit on the number of 
localmode processes), but for now I'll probably disable localmode on these 
queries.
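
For reference, the knob I mean is the automatic local mode switch, which can be 
turned off per session (a sketch; check the property name against your Hive 
version):

set hive.exec.mode.local.auto=false;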

Bennie.


RE: Problem with Hive JDBC server

2013-02-04 Thread Bennie Schut
Looking at the versions you might be hitting 
https://issues.apache.org/jira/browse/HIVE-3481 which is fixed in 0.10

On my dev machine the test runs with success :Running time: 298.952409914
This includes this patch so it's worth looking at.

From: Gabor Makrai [mailto:makrai.l...@gmail.com]
Sent: Monday, February 04, 2013 11:58 AM
To: user@hive.apache.org
Subject: Re: Problem with Hive JDBC server

Yes, of course! I attached the code!

On Mon, Feb 4, 2013 at 11:57 AM, Gabor Makrai 
mailto:makrai.l...@gmail.com>> wrote:
Yes, of course! :) I attached the code!

On Mon, Feb 4, 2013 at 11:53 AM, Bennie Schut 
mailto:bsc...@ebuddy.com>> wrote:
Since it's small can you post the code?

From: Gabor Makrai [mailto:makrai.l...@gmail.com<mailto:makrai.l...@gmail.com>]
Sent: Monday, February 04, 2013 11:45 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Problem with Hive JDBC server

Hi guys,

I'm writing to you because I experienced a very strange problem which probably 
affects all Hive distributions.
I made a small "only main function" Java program where I'm only connecting to 
my Hive JDBC server, getting the list of the database tables (LIST TABLES), and 
closing the ResultSet, the Statement and the Connection, and doing this 1000 
times. The problem is that the running Hive JDBC server does not release files 
and with time it will throw an exception, because it will get a "Too many open 
files" IOException from the JVM.

I tested with Hive 0.9, 0.8.1, and the patched Hive 0.9 installed in CDH4.1.1.

If it is a known issue, then could you tell me the solution for it? If it is 
not, then I can create a new ticket in Jira, and with a little help, I probably 
can fix the problem and contribute the solution for it.

Thanks,
Gabor




RE: Problem with Hive JDBC server

2013-02-04 Thread Bennie Schut
Since it's small can you post the code?

From: Gabor Makrai [mailto:makrai.l...@gmail.com]
Sent: Monday, February 04, 2013 11:45 AM
To: user@hive.apache.org
Subject: Problem with Hive JDBC server

Hi guys,

I'm writing to you because I experienced a very strange problem which probably 
affects all Hive distributions.
I made a small "only main function" Java program where I'm only connecting to 
my Hive JDBC server, getting the list of the database tables (LIST TABLES), and 
closing the ResultSet, the Statement and the Connection, and doing this 1000 
times. The problem is that the running Hive JDBC server does not release files 
and with time it will throw an exception, because it will get a "Too many open 
files" IOException from the JVM.

I tested with Hive 0.9, 0.8.1, and the patched Hive 0.9 installed in CDH4.1.1.

If it is a known issue, then could you tell me the solution for it? If it is 
not, then I can create a new ticket in Jira, and with a little help, I probably 
can fix the problem and contribute the solution for it.

Thanks,
Gabor


RE: Loading a Hive table simultaneously from 2 different sources

2013-01-24 Thread Bennie Schut
The benefit of using the partitioned approach is really nicely described in the 
O'Reilly book "Programming Hive" (thanks for writing it, Edward).
For me, the ability to drop a single partition if there's any doubt about the 
quality of the data from just one job is a big benefit.

From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
Sent: Thursday, January 24, 2013 3:52 PM
To: user@hive.apache.org
Subject: Re: Loading a Hive table simultaneously from 2 different sources

Partition the table and load the data into different partitions. That or build 
the data outside he table and then use scripting to move the data in using LOAD 
DATA INPATH or copying.
On Thu, Jan 24, 2013 at 9:44 AM, Krishnan K 
mailto:kkrishna...@gmail.com>> wrote:
Hi All,

Could you please let me know what would happen if we try to load a table from 2 
different sources at the same time ?

I had tried this earlier and got an error for one load job, while the other 
job loaded the data successfully into the table.

I guess it was because of the lock acquired on the table by the first load process.

Is there any way to handle this?

Please give your insights.

Regards,
Krishnan





RE: Effecient partitions usage in join

2012-11-23 Thread Bennie Schut
Well this is the udf:

package com.ebuddy.dwhhive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.Text;

import java.text.SimpleDateFormat;
import java.util.Calendar;

@Description(
name = "currentisodate",
value = "currentisodate() - Get the current date. Incorrectly made"
+ " deterministic to get partition pruning to work."
)

@UDFType(deterministic = true)
public class CurrentIsoDate extends UDF {

public static Text evaluate() {
String pattern = "-MM-dd";
SimpleDateFormat timeFormat = new SimpleDateFormat(pattern);
return new Text(timeFormat.format(Calendar.getInstance().getTime()));
}
}


And this is how we use it to query the last 30days:
ADD jar /opt/hive/udf/udf-2.1.2-jar-with-dependencies.jar;
CREATE TEMPORARY FUNCTION currentisodate AS 
'com.ebuddy.dwhhive.udf.CurrentIsoDate';
select count(*) from test where record_date_iso >= DATE_SUB(currentisodate(), 
30);

I’ve always had a preference for iso dates since they sort nicely: 2012-11-23 
but you can obviously pick your own pattern.


From: Dima Datsenko [mailto:di...@microsoft.com]
Sent: Thursday, November 22, 2012 4:07 PM
To: Bennie Schut; user@hive.apache.org
Subject: RE: Effecient partitions usage in join

Hi Benny,

The udf solution sounds like a plan. Much better than generating hive query 
with hardcoded partition out of table B. Can you please provide a sample of 
what you’re doing there?

Thanks,
Dima

From: Bennie Schut [mailto:bsc...@ebuddy.com]
Sent: יום ה 22 נובמבר 2012 16:28
To: user@hive.apache.org<mailto:user@hive.apache.org>
Cc: Dima Datsenko
Subject: RE: Effecient partitions usage in join

Unfortunately at the moment partition pruning is a bit limited in hive. When 
hive creates the query plan it decides what partitions to use. So if you put 
hardcoded list of partition_id items in the where clause it will know what to 
do. In the case of a join (or a subquery) it would have to run the query before 
it can know what it can prune.  There are obvious solutions to this but they 
are simply not implemented at the moment.
Generally speaking people try to work around this by not normalizing the data. 
So if you plan on doing a clean star schema with a calendar table then do 
yourself a favor and put the actual date in the fact table and not a 
meaningless key.
It’s also good to realize you can (in some special cases) work around it by 
using udf’s. I’ve used it once by creating a udf which produced the current 
date which I flagged as deterministic (ugly I know). This causes the planner to 
run the udf during planning and use the result as if it’s a constant and thus 
partition pruning works again. It’s currently the only way I know to select x 
days of data with partition pruning working.


From: Dima Datsenko [mailto:di...@microsoft.com]
Sent: Thursday, November 22, 2012 2:56 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Effecient partitions usage in join

Hi Guys,

I wonder if you could help me.

I have a huge Hive table partitioned by some field. It has thousands of 
partitions.
Now I have another small table containing tens of partitions id. I’d like to 
get the data only from those partitions.

However when I run
Select * from A join B on (A.partition_id = B.partition_id),
It reads all data from A, then from B and on reduce stage performs join.

I tried /*+ MAPJOIN*/ it ran faster sparing reduce operation, but still read 
the whole A table.

Is there a more efficient way to perform the query w/o reading the whole A 
content?


Thanks
Dima


RE: Effecient partitions usage in join

2012-11-22 Thread Bennie Schut
Unfortunately at the moment partition pruning is a bit limited in hive. When 
hive creates the query plan it decides what partitions to use. So if you put 
hardcoded list of partition_id items in the where clause it will know what to 
do. In the case of a join (or a subquery) it would have to run the query before 
it can know what it can prune.  There are obvious solutions to this but they 
are simply not implemented at the moment.
Generally speaking people try to work around this by not normalizing the data. 
So if you plan on doing a clean star schema with a calendar table then do 
yourself a favor and put the actual date in the fact table and not a 
meaningless key.
It's also good to realize you can (in some special cases) work around it by 
using udf's. I've used it once by creating a udf which produced the current 
date which I flagged as deterministic (ugly I know). This causes the planner to 
run the udf during planning and use the result as if it's a constant and thus 
partition pruning works again. It's currently the only way I know to select x 
days of data with partition pruning working.


From: Dima Datsenko [mailto:di...@microsoft.com]
Sent: Thursday, November 22, 2012 2:56 PM
To: user@hive.apache.org
Subject: Effecient partitions usage in join

Hi Guys,

I wonder if you could help me.

I have a huge Hive table partitioned by some field. It has thousands of 
partitions.
Now I have another small table containing tens of partitions id. I'd like to 
get the data only from those partitions.

However when I run
Select * from A join B on (A.partition_id = B.partition_id),
It reads all data from A, then from B and on reduce stage performs join.

I tried /*+ MAPJOIN*/ it ran faster sparing reduce operation, but still read 
the whole A table.

Is there a more efficient way to perform the query w/o reading the whole A 
content?


Thanks
Dima


RE: Show job progress when using JDBC to run HIVE query

2012-09-17 Thread Bennie Schut
The jdbc driver uses thrift so if thrift can't then jdbc can't.

This can be surprisingly difficult to do. Hive can split a query into x hadoop 
jobs and some will run in parallel and some will run in sequence.
I've used oracle in the past (10 and 11) and I could also never find out how 
long a large job would take, which leads me to suspect it's not a trivial thing 
to do.


-Original Message-
From: MiaoMiao [mailto:liy...@gmail.com] 
Sent: Monday, September 17, 2012 6:17 AM
To: user@hive.apache.org
Subject: Re: Show job progress when using JDBC to run HIVE query


Not familiar with JDBC, but it seems thrift can't.

On Sat, Sep 15, 2012 at 3:17 AM, Haijia Zhou  wrote:
> Hi, All
>  I have am writing a Hive client to run a Hive query using Hive JDBC driver.
>  Since the data amount is huge I really would like to see the progress 
> when the query is running.
>  Is there anyway I can get the job progress?
> Thanks
> Haijia


RE: Loading data into data_dim table

2012-07-25 Thread Bennie Schut
Hi Prabhu,

Be careful when going in the direction of calendar dimensions. While strictly 
speaking this is a cleaner dwh design, you will for sure run into issues you 
might not expect. Consider that this is roughly what you would want to do 
to query a day:

select count(*)
from fact f
  join dim_date d on (d.date_id = f.date_id)
where ddate = '2020-12-22'

That won't trigger partition pruning and the query will walk over all records 
in the fact table (I doubt that's what you would want). Pruning happens during 
the creation of the query plan, and at that time it doesn't know how many 
records the dim_date table will return, so it can't do any partition pruning 
for you. If you want partitioning to work in this case you would have to do:

select count(*)
from fact f
where f.dateid =7662

Which kind of defeats the purpose of the dim_date table :( At this point in 
time I would simply put the date in the fact table and use functions to get 
things like month. It's annoying but it works, so:

select count(*)
from fact f
where date = '2020-12-22'
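
And for the "functions to get things like month" part, a small sketch, assuming 
the fact table stores the date as a string like above:

select count(*)
from fact f
where year(f.date) = 2020
  and month(f.date) = 12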

Bennie.


From: prabhu k [mailto:prabhu.h...@gmail.com]
Sent: Wednesday, July 25, 2012 1:59 PM
To: user@hive.apache.org; bejoy...@yahoo.com
Subject: Re: Loading data into data_dim table

Thanks for your help :)

the data has been loaded fine now:

select * from dim_date;

76622020-12-22 00:00:00.000 20204   12  3   52  13  
4   357 83  22  3   DecemberDec Tuesday Tue
76632020-12-23 00:00:00.000 20204   12  3   52  13  
4   358 84  23  4   DecemberDec Wednesday   
Wed
76642020-12-24 00:00:00.000 20204   12  3   52  13  
4   359 85  24  5   DecemberDec Thursday
Thu
76652020-12-25 00:00:00.000 20204   12  3   52  13  
4   360 86  25  6   DecemberDec Friday  Fri
76662020-12-26 00:00:00.000 20204   12  3   52  13  
4   361 87  26  7   DecemberDec Saturday
Sat
76672020-12-27 00:00:00.000 20204   12  3   53  14  
5   362 88  27  1   DecemberDec Sunday  Sun
76682020-12-28 00:00:00.000 20204   12  3   53  14  
5   363 89  28  2   DecemberDec Monday  Mon
76692020-12-29 00:00:00.000 20204   12  3   53  14  
5   364 90  29  3   DecemberDec Tuesday Tue
76702020-12-30 00:00:00.000 20204   12  3   53  14  
5   365 91  30  4   DecemberDec Wednesday   
Wed
76712020-12-31 00:00:00.000 20204   12  3   53  14  
5   366 92  31  5   DecemberDec Thursday
Thu
Time taken: 0.401 seconds
Thanks,
Prabhu.
On Wed, Jul 25, 2012 at 5:20 PM, Bejoy KS 
mailto:bejoy...@yahoo.com>> wrote:
Hi Prabhu

Your data is tab delimited use /t as the delimiter while creating table.

fields terminated by '/t'

Not sure this is the right / or not. If this doesn't work try the other one.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

From: prabhu k mailto:prabhu.h...@gmail.com>>
Date: Wed, 25 Jul 2012 17:10:09 +0530
To: mailto:user@hive.apache.org>>
ReplyTo: user@hive.apache.org
Subject: Re: Loading data into data_dim table

Thanks for the reply.

I have tried with delimited fields terminated by '|' and with delimited fields 
terminated by ',', and when selecting from the table I'm getting null both times.

when i see the HDFS file looks like below.
bin/hadoop fs -cat /user/hive/warehoure/time.txt

7666 2020-12-26 00:00:00.00020204   12  3   52 13   4  361  
87 26   7   DecemberDec SaturdaySat 20201226
2020/12/26  Dec 26 2020 2020-12-26
7667 2020-12-27 00:00:00.00020204   12  3   53 14   5  362  
88 27   1   DecemberDec Sunday  Sun 20201227
2020/12/27 Dec 27 2020  2020-12-27
7668 2020-12-28 00:00:00.00020204   12  3   53 14   5  363  
89 28   2   DecemberDec Monday  Mon 20201228
2020/12/28 Dec 28 2020  2020-12-28
7669 2020-12-29 00:00:00.00020204   12  3   53 14   5  364  
90 29   3   DecemberDec Tuesday Tue 20201229
2020/12/29 Dec 29 2020  2020-12-29
7670 2020-12-30 00:00:00.00020204   12  3   53 14   5  365  
91 30   4   DecemberDec Wednesday   Wed 20201230
2020/12/30  Dec 30 2020 2020-12-30
7671 2020-12-31 00:00:00.00020204   12  3   53 14   5  366  
92 31   5   DecemberDec ThursdayThu 20201231
2020/12/31  

Re: hive runs slowly

2011-10-24 Thread Bennie Schut

"inner join" is simply translated to "join" they are the same thing
(HIVE-2191)
I'm guessing he means removing the join from the where part of the query
and using the "select a,b from a join b on (a.id=b.id)" syntax.

On 10/22/2011 05:05 AM, john smith wrote:
You mean select a,b from a inner join b on (a.id =b.id 
) ? or Does those brackets make some difference? Because 
the inner keyword is no where mentioned in the language manual 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins


Any hints?




On Fri, Oct 21, 2011 at 8:47 PM, Edward Capriolo 
mailto:edlinuxg...@gmail.com>> wrote:




On Fri, Oct 21, 2011 at 10:21 AM, john smith
mailto:js1987.sm...@gmail.com>> wrote:

Hi Edward,

Thanks for replying. I have been using the query

"select a,b from a,b where a.id =b.id
 ".  According to my knowledge of Hive, it reads
data of both A and B and emits key/value pairs as map outputs and then performs cartesian joins
on the reduce side for the same join keys.

Is this the cartesian join you are referring to? or Is it the
cartesian product of the total table (as in sql) ? or Am I
missing something?

Can you please throw some light on the functionality of
mapred.mode=strict ?

Thanks,
jS

On Fri, Oct 21, 2011 at 7:29 PM, Edward Capriolo
mailto:edlinuxg...@gmail.com>> wrote:



On Fri, Oct 21, 2011 at 9:22 AM, john smith
mailto:js1987.sm...@gmail.com>>
wrote:

Hi list,

I am also facing the same problem. My reducers hang at
this position and it takes hours to complete a single
reduce task. Can any hive guru help us out with this
issue.

Thanks,
jS

2011/10/21 bangbig mailto:lizhongliangg...@163.com>>

HI all,

HIVE runs too slowly when it is doing such things (see the log below). What's 
the problem? Is it because I'm joining two large tables?

It runs pretty fast at first; when the job is about 95% finished, it begins to 
slow down.

--

INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 
forwarding 104400 rows
2011-10-21 16:55:57,427 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104500 rows
2011-10-21 16:55:57,545 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104600 rows
2011-10-21 16:55:57,686 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104700 rows
2011-10-21 16:55:57,806 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104800 rows
2011-10-21 16:55:57,926 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104900 rows
2011-10-21 16:55:58,045 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000 rows
2011-10-21 16:55:58,164 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105100 rows
2011-10-21 16:55:58,284 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105200 rows
2011-10-21 16:55:58,405 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105300 rows
2011-10-21 16:55:58,525 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105400 rows
2011-10-21 16:55:58,644 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105500 rows
2011-10-21 16:55:58,764 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105600 rows
2011-10-21 16:55:58,883 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105700 rows
2011-10-21 16:55:59,003 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105800 rows
2011-10-21 16:55:59,122 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105900 rows
2011-10-21 16:55:59,242 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000 rows
2011-10-21 16:55:59,361 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106100 rows
2011-10-21 16:55:59,482 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106200 rows
2011-10-21 16:55:59,601 INFO 
org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106300 rows





It is hard to say without seeing the query, the table
definition, and the explain. Please send the query.
Although I have a theory:

This query is not good:
 

Re: Organizing a Hive Meetup for Hadoop World NYC

2011-10-12 Thread Bennie Schut

I'll be at hadoop world. Is the hive meetup still happening?

On 08/29/2011 10:03 PM, Carl Steinbach wrote:

Hi Ed,

This is a one-time event targeted at Hadoop World attendees, though
others are welcome to attend as well.

Thanks.

Carl

On Mon, Aug 29, 2011 at 12:09 PM, Edward Capriolo 
mailto:edlinuxg...@gmail.com>> wrote:


Carl,

Do you mean a one time Hive meetup or do you mean a recurring one?

I ask because the hadoop-nyc meetup is slowing down alot.
http://www.meetup.com/Hadoop-NYC/. So supporting a hadoop and
specific hive meetup seem difficult.

Edward

On Mon, Aug 29, 2011 at 2:36 PM, Carl Steinbach mailto:c...@cloudera.com>> wrote:

Dear Hive users,

Hadoop World 2011 (http://hadoopworld.com/) will be held
November 8th
and 9th in NYC. This year we're also planning to organize a
Hive Meetup.
These events are a good place for users to interact with each
other
and with the Hive development team.

In order to help with organization, I set up a form with a few
questions about what kind of meetup the community wants, and which
evening is best:


https://docs.google.com/spreadsheet/viewform?formkey=dENBelpZaDc3X1gxbmpFem01MzJPT0E6MQ

Please fill this out, and feel free to contact me directly if
you have
any questions.

Thanks!

Carl







Re: hive zookeeper locks.

2011-09-08 Thread Bennie Schut
Somewhere lower in my config file I had set an incorrect LockManager, so now 
it works :)
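
For anyone hitting the same thing: the property involved is hive.lock.manager; 
a sketch of a ZooKeeper-based value (double check the class name against your 
Hive version):

<property>
  <name>hive.lock.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager</value>
</property>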


On 09/07/2011 04:02 PM, Bennie Schut wrote:

I've been trying to play with locks in hive using zookeeper but can't
find documentation on how to configure it. I now have:

<property>
  <name>hive.supports.concurrency</name>
  <value>true</value>
</property>

<property>
  <name>hive.zookeeper.quorum</name>
  <value>localhost</value>
</property>


But I keep getting errors like this:

11/09/07 15:47:57 ERROR exec.DDLTask: FAILED: Error in metadata: show
Locks LockManager not specified
org.apache.hadoop.hive.ql.metadata.HiveException: show Locks LockManager
not specified
  at org.apache.hadoop.hive.ql.exec.DDLTask.showLocks(DDLTask.java:1791)
  at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:306)
  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
  at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
  at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1060)
  at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:897)
  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:745)
  at
org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:116)
  at
org.apache.hadoop.hive.service.ThriftHive$Processor$execute.process(ThriftHive.java:699)
  at
org.apache.hadoop.hive.service.ThriftHive$Processor.process(ThriftHive.java:677)
  at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
  at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:619)

Any idea what I'm doing wrong?





hive zookeeper locks.

2011-09-07 Thread Bennie Schut
I've been trying to play with locks in hive using zookeeper but can't 
find documentation on how to configure it. I now have:


<property>
  <name>hive.supports.concurrency</name>
  <value>true</value>
</property>

<property>
  <name>hive.zookeeper.quorum</name>
  <value>localhost</value>
</property>


But I keep getting errors like this:

11/09/07 15:47:57 ERROR exec.DDLTask: FAILED: Error in metadata: show 
Locks LockManager not specified
org.apache.hadoop.hive.ql.metadata.HiveException: show Locks LockManager 
not specified

at org.apache.hadoop.hive.ql.exec.DDLTask.showLocks(DDLTask.java:1791)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:306)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)

at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1060)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:897)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:745)
at 
org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:116)
at 
org.apache.hadoop.hive.service.ThriftHive$Processor$execute.process(ThriftHive.java:699)
at 
org.apache.hadoop.hive.service.ThriftHive$Processor.process(ThriftHive.java:677)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:619)

Any idea what I'm doing wrong?



Re: Trouble creating indexes with psql metastore

2011-06-23 Thread Bennie Schut

I have a similar problem with a trunk build and a mysql metastore.
Doing: alter table IDXS modify column DEFERRED_REBUILD boolean not null;
Doesn't seem to fix it. Perhaps because mysql converts the boolean into 
a "tinyint(1)"?


Is there an easy way to make it fail with an error instead of getting an OK?


On 06/22/2011 09:04 PM, Esteban Gutierrez wrote:



Hi Clint,

Indeed this is a bug, "DEFERRED_REBUILD" should be boolean and not 
bit(1) in "IDXS".


Regards,
Esteban.

--
Support Engineer, Cloudera.




On Wed, Jun 22, 2011 at 11:25 AM, Clint Green > wrote:


Dear Hive User List,

I am trying to build indexes on a hive 0.7.1 environment using
postgresql as the metastore, but it is failing silently.

The following command doesn’t generate any errors:

hive> CREATE TABLE t (i INT);

OK

Time taken: 0.299 seconds

hive> CREATE INDEX i ON TABLE t (i) with ‘COMPACT’ WITH DEFERRED
REBUILD;

OK

Time taken: 0.287 seconds

A directory is created for the index in  “/usr/hive/warehouse/”,
but the index can’t be found:

hive> show tables;

OK

t

Time taken: 0.163 seconds

hive> show indexes on t;

OK

Time taken: 0.303 seconds

There are no errors in the hive.log file, and I am running the
0.7.1 release.

Thank you,

Clint

--

The information contained in this message may be privileged and/or
confidential and protected from disclosure. If the reader of this
message is not the intended recipient or an employee or agent
responsible for delivering this message to the intended recipient,
you are hereby notified that any dissemination, distribution or
copying of this communication is strictly prohibited. If you have
received this communication in error, please notify the sender
immediately by replying to this message and deleting the material
from any computer.






Re: Hive connecting to squirrel on windows

2011-05-17 Thread Bennie Schut
If its 0.7 and "IOException: The system cannot find the path specified" 
then you ran into HIVE-2054. It seems Carl backported it to 0.7.1 so try 
that.

If it's something else please post the error.

On 05/17/2011 04:56 AM, Raghunath, Ranjith wrote:


I have followed the document outlining how to perform the connection 
listed in http://wiki.apache.org/hadoop/Hive/HiveJDBCInterface. 
However, I keep getting a error when trying to connect. I would 
appreciate any input on this.


Thank you,

Ranjith





Re: hive hbase handler metadata NullPointerException

2011-03-29 Thread Bennie Schut

In case anyone else runs into this.

I ended up deleting the hbase dir on hdfs, installing hbase-0.90.1 and 
copying the hbase configs into the hive-site.xml.
Rebuilt hive by setting hbase.version to 0.90.1 in 
"ivy/libraries.properties".
Then the cli started working but the service still wasn't working, so I 
suddenly realized jobs started with the service probably didn't have 
access to the jars, so I ran:

add jar /opt/hive/lib/hive-hbase-handler-0.8.0-SNAPSHOT.jar
add jar /opt/hive/lib/hbase-0.90.1-SNAPSHOT.jar
add jar /opt/hive/lib/zookeeper-3.3.1.jar

And then it all started working. This wasn't really evident from the 
documentation but in hindsight makes sense. This took a lot more time to 
figure out than I'm willing to admit ;-)


Bennie.

On 03/14/2011 12:05 AM, amit jaiswal wrote:

Hi,

I am also facing the same issue (hive-0.7, hbase-0.90.1, hadoop-0.20.2).

Any help?

-amit

--------
*From:* Bennie Schut 
*To:* "user@hive.apache.org" 
*Sent:* Wed, 9 March, 2011 4:39:49 AM
*Subject:* hive hbase handler metadata NullPointerException

Hi All,

I was trying out hbase 0.89.20100924 with hive trunk with hadoop 0.20.2

When I'm running a simple insert I get this:
java.lang.RuntimeException: Error in configuring object
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
... 10 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 13 more
Caused by: java.lang.RuntimeException: Map operator initialization failed
at 
org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:121)
... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:335)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:62)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at 
org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)
... 18 more
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:280)
... 30 more
insert overwrite table hbase_table_1 select cldr_id, iso_date from 
calendar;


I could create the table just fine, like:
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.

Re: Hive & MS SQL Server

2011-03-24 Thread Bennie Schut
Interesting. Sounds like a valid reason. I haven't used any version after 2k 
myself; hopefully changing the default schema works.

On 24 Mar 2011, at 16:33, "shared mailinglists" <shared.mailingli...@gmail.com> wrote:

Thanks Bennie, hopefully they will.

We're a small Java development team within a predominantly MS development house. 
We're hopefully introducing new ideas but the normal company politics dictate 
that we should use SQL Server. That way maintenance, backup, recovery etc. 
can be handed over to the internal MS db team while freeing us guys to 
concentrate on better things like Hadoop & Hive :-) I assumed that with the DB just 
being a metadata store the database wouldn't be an issue, but we're 
struggling a bit :-(


On 24 March 2011 15:23, Bennie Schut <bsc...@ebuddy.com> wrote:
Sorry to become a bit off-topic, but how do you get into a situation where 
sqlserver 2005 becomes a requirement for a hive internal meta store?

I doubt many of the developers of hive will have access to this database so I 
don't expect a lot of response on this. But hopefully someone can prove me 
wrong :)

Bennie.



On 03/24/2011 04:01 PM, shared mailinglists wrote:

Hi Hive users :-)

Does anybody have experience of using Hive with MS SQL Server 2005? I’m 
currently stumped with the following issue 
https://issues.apache.org/jira/browse/HIVE-1391 where Hive (or DataNucleus?) 
confuses the COLUMNS table it requires internally with that of the default SQL 
Server sys.COLUMNS or information_schema.COLUMNS View and therefore does not 
automatically create the required metadata table when running the Hive CLI.


Has anybody managed to get Hive to work with SQLServer 2005 or know how I can 
configure Hive to use a different table name to COLUMNS ? Unfortunately we have 
to use SQL Server and do not have the option to use Derby or MySQL etc.

Many thanks,


Andy.





Re: Hive & MS SQL Server

2011-03-24 Thread Bennie Schut
Sorry to become a bit off-topic, but how do you get into a situation where 
sqlserver 2005 becomes a requirement for a hive internal meta store?


I doubt many of the developers of hive will have access to this database 
so I don't expect a lot of response on this. But hopefully someone can 
prove me wrong :)


Bennie.


On 03/24/2011 04:01 PM, shared mailinglists wrote:


Hi Hive users :-)

Does anybody have experience of using Hive with MS SQL Server 2005? 
I’m currently stumped with the following issue 
https://issues.apache.org/jira/browse/HIVE-1391 where Hive (or 
DataNucleus?) confuses the COLUMNS table it requires internally with 
that of the default SQL Server sys.COLUMNS or 
information_schema.COLUMNS View and therefore does not automatically 
create the required metadata table when running the Hive CLI.



Has anybody managed to get Hive to work with SQLServer 2005 or know 
how I can configure Hive to use a different table name to COLUMNS ? 
Unfortunately we have to use SQL Server and do not have the option to 
use Derby or MySQL etc.


Many thanks,


Andy.





IOException on hadoop 0.20.2 with trunk.

2011-03-24 Thread Bennie Schut
So far I'm not able to reproduce this on our dev environment (only when 
going live), but when trying trunk I get errors like the attached.
I'm considering making a jira out of it but I'm not sure what query is 
causing this.





java.io.IOException: Call to batiatus-int.ebuddy.com/10.10.0.5:9000 failed on 
local exception: java.nio.channels.ClosedChannelException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
at org.apache.hadoop.ipc.Client.call(Client.java:743)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy7.complete(Unknown Source)
at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy7.complete(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3264)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3188)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:57)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:209)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
at 
org.apache.hadoop.mapred.JobClient.copyRemoteFiles(JobClient.java:524)
at 
org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at 
org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:404)
at 
org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:123)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:47)
Caused by: java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:113)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:156)
at 
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at 
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at 
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
Job Submission failed with exception 'java.io.IOException(Call to 
batiatus-int.ebuddy.com/10.10.0.5:9000 failed on local exception: 
java.nio.channels.ClosedChannelException)'Continuing ...

java.io.IOException: Call to batiatus-int.ebuddy.com/10.10.0.5:9000 failed on 
local exception: java.nio.channels.ClosedByInterruptException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
at org.apache.hadoop.ipc.Client.call(Client.java:743)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy4.delete(Unknown Source)
at sun.reflect.GeneratedMethodAccessor198.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy4.delete(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:582)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:227)
at 
org.apache.hadoop.hive.ql.exec.Utilities.clearMapRedWork(Utilities.java:174)
at 
org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:428)
at 
org.apache.hadoop.hive.ql.exec.M

hive hbase handler metadata NullPointerException

2011-03-09 Thread Bennie Schut

Hi All,

I was trying out hbase 0.89.20100924 with hive trunk with hadoop 0.20.2

When I'm running a simple insert I get this:

java.lang.RuntimeException: Error in configuring object
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
... 10 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 13 more
Caused by: java.lang.RuntimeException: Map operator initialization failed
at 
org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:121)
... 18 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:335)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:62)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at 
org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)
... 18 more
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:280)
... 30 more

insert overwrite table hbase_table_1 select cldr_id, iso_date from calendar;

I could create the table just fine, like:
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");

I've set the properties hbase.master, hbase.zookeeper.quorum, 
hbase.zookeeper.property.clientPort; should that be enough?


Thanks
Bennie.


Re: Trouble using mysql metastore

2011-03-03 Thread Bennie Schut
Yeah we have it in the lib folder of hive 
"mysql-connector-java-5.1.6.jar" but I also find the name mysql.jar a 
bit suspicious.
Just download from http://www.mysql.com/downloads/connector/j/ and move 
it somewhere on the classpath


On 03/02/2011 08:42 PM, Viral Bajaria wrote:

This definitely looks like a CLASSPATH error.

Where did you get the mysql.jar from ? Can you open it up and make 
sure that it includes the com.mysql.jdbc.Driver namespace ?


I am guessing the mysql.jar is not the one that you need. you can 
download a new one from the mysql website.


To be clear, I don't even have a mysql jar in my /lib folder under 
hive. I only have it under my hadoop /lib folder and the name of the 
file is mysql-connector-java-5.0.8-bin.jar


-Viral

On Wed, Mar 2, 2011 at 10:14 AM, Ajo Fod <ajo@gmail.com> wrote:


Hi Bennie,

Thanks for the response !

I had CLASSPATH set to include
/usr/share/java/mysql.jar
... in addition, I just copied the mysql.jar to the lib directory
of hive.

I still get the same bug.

Any other ideas?

Thanks,
-Ajo




On Wed, Mar 2, 2011 at 7:01 AM, Bennie Schut <bsc...@ebuddy.com> wrote:

Usually this is caused by not having the mysql jdbc driver on
the classpath (it's not included in hive by default).
Just put the mysql jdbc driver in the hive folder under "lib/"

On 03/02/2011 03:15 PM, Ajo Fod wrote:

I've checked the mysql connection with a separate java file
with the same string.

Also, I've checked the code works by running it against the
original derby metastore.

Thanks,
Ajo.

Some of the variables set:
javax.jdo.option.ConnectionURL =
jdbc:mysql://192.168.1.5/metastore?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName=username
javax.jdo.option.ConnectionPassword=password

Here is the stack trace: ...

org.apache.hadoop.hive.ql.metadata.HiveException:
javax.jdo.JDOFatalInternalException: Error creating
transactional connection factory
NestedThrowables:
java.lang.reflect.InvocationTargetException
at

org.apache.hadoop.hive.ql.metadata.Hive.getTablesByPattern(Hive.java:919)
at

org.apache.hadoop.hive.ql.metadata.Hive.getTablesByPattern(Hive.java:904)
at

org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:7098)
at

org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:6576)
at

org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:238)
at
org.apache.hadoop.hive.ql.Driver.compile(Driver.java:340)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:773)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
at
org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:307)
at
org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:314)
at
org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:487)
at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: javax.jdo.JDOFatalInternalException: Error
creating transactional connection factory
NestedThrowables:
java.lang.reflect.InvocationTargetException
at

org.datanucleus.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:425)
at

org.datanucleus.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:601)
at

org.datanucleus.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:286)
at

org.datanucleus.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:182)
at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

sun.reflect.NativeMethodAccessorImp

Re: Trouble using mysql metastore

2011-03-02 Thread Bennie Schut
Usually this is caused by not having the mysql jdbc driver on the 
classpath (it's not included in hive by default).

Just put the mysql jdbc driver in the hive folder under "lib/"

On 03/02/2011 03:15 PM, Ajo Fod wrote:
I've checked the mysql connection with a separate java file with the 
same string.


Also, I've checked the code works by running it against the original 
derby metastore.


Thanks,
Ajo.

Some of the variables set:
javax.jdo.option.ConnectionURL = 
jdbc:mysql://192.168.1.5/metastore?createDatabaseIfNotExist=true 


javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName=username
javax.jdo.option.ConnectionPassword=password

Here is the stack trace: ...

org.apache.hadoop.hive.ql.metadata.HiveException: 
javax.jdo.JDOFatalInternalException: Error creating transactional 
connection factory

NestedThrowables:
java.lang.reflect.InvocationTargetException
at 
org.apache.hadoop.hive.ql.metadata.Hive.getTablesByPattern(Hive.java:919)
at 
org.apache.hadoop.hive.ql.metadata.Hive.getTablesByPattern(Hive.java:904)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:7098)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:6576)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:238)

at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:340)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:773)
at 
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
at 
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
at 
org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:307)
at 
org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:314)

at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:487)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: javax.jdo.JDOFatalInternalException: Error creating 
transactional connection factory

NestedThrowables:
java.lang.reflect.InvocationTargetException
at 
org.datanucleus.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:425)
at 
org.datanucleus.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:601)
at 
org.datanucleus.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:286)
at 
org.datanucleus.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:182)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1958)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1953)
at 
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1159)
at 
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:803)
at 
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:698)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:234)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:261)
at 
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:196)
at 
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:171)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:352)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.executeWithRetry(HiveMetaStore.java:306)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:449)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:232)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.(HiveMetaStore.java:197)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:108)
at 
org.apache.hadoop.hive.ql.meta

Re: OutOfMemory errors on joining 2 large tables.

2011-02-23 Thread Bennie Schut
We already filter nulls before the tables are filled, so this is probably 
caused by a skew in the keys like Paul was saying. I'm running some 
queries on the keys to see if that's the case.
I do expect there will be large differences in distribution of some of 
the keys.
I'm looking at "set hive.optimize.skewjoin=true" before the query to see 
if that helps. Will try that later.
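For reference, a rough sketch of what checking the keys looks like (table and column names are made up for illustration), plus the skew-join knobs I'm planning to try:

-- look for join keys that are massively over-represented
select join_key, count(*) as cnt
from tablea
group by join_key
order by cnt desc
limit 20;

-- let hive handle heavily skewed keys in a separate pass
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;  -- rows per key before it counts as skewed (this is the default)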


On 02/23/2011 05:25 AM, Mapred Learn wrote:

Oops I meant nulls.

Sent from my iPhone

On Feb 22, 2011, at 8:22 PM, Mapred Learn  wrote:


Check if you can filter non-nulls. That might help.

Sent from my iPhone

On Feb 22, 2011, at 12:46 AM, Bennie Schut  wrote:


I've just set the "hive.exec.reducers.bytes.per.reducer" to as low as 100k 
which caused this job to run with 999 reducers. I still have 5 tasks failing with an 
outofmemory.

We have jvm reuse set to 8 but dropping it to 1 seems to greatly reduce this 
problem:
set mapred.job.reuse.jvm.num.tasks = 1;

It's still puzzling me how it can run out of memory. It seems like some of the 
reducers get a disproportionately large share of the work.


On 02/18/2011 10:53 AM, Bennie Schut wrote:

When we try to join two large tables some of the reducers stop with an
OutOfMemory exception.

Error: java.lang.OutOfMemoryError: Java heap space
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)

at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)

at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)



When looking at garbage collection for these reduce tasks it's
continually doing garbage collections.
Like this:
2011-02-17T14:36:08.295+0100: 1250.547: [Full GC [PSYoungGen:
111055K->53659K(233024K)] [ParOldGen: 698410K->698410K(699072K)]
809466K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1496600
secs] [Times: user=1.08 sys=0.00, real=0.15 secs]
2011-02-17T14:36:08.600+0100: 1250.851: [Full GC [PSYoungGen:
111057K->53660K(233024K)] [ParOldGen: 698410K->698410K(699072K)]
809468K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1360010
secs] [Times: user=1.00 sys=0.01, real=0.13 secs]
2011-02-17T14:36:08.915+0100: 1251.167: [Full GC [PSYoungGen:
111058K->53659K(233024K)] [ParOldGen: 698410K->698410K(699072K)]
809468K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1325960
secs] [Times: user=0.94 sys=0.00, real=0.14 secs]
2011-02-17T14:36:09.205+0100: 1251.457: [Full GC [PSYoungGen:
111055K->53659K(233024K)] [ParOldGen: 698410K->698410K(699072K)]
809466K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1301610
secs] [Times: user=0.99 sys=0.00, real=0.13 secs]


“mapred.child.java.opts” set to “-Xmx1024M -XX:+UseCompressedOops
-XX:+UseParallelOldGC -XX:+UseNUMA -Djava.net.preferIPv4Stack=true
-verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails
-Xloggc:/opt/hadoop/logs/task_@tas...@.gc.log”

I've been reducing this parameter “hive.exec.reducers.bytes.per.reducer”
to as low as 200M but I still get the OutOfMemory errors. I would have
expected this would drop the amount of data sent to the reducers and
thus avoid the OutOfMemory errors.

Any ideas why this happens?

I'm using a trunk build from around 2011-02-03




Re: OutOfMemory errors on joining 2 large tables.

2011-02-22 Thread Bennie Schut
I've just set the "hive.exec.reducers.bytes.per.reducer" to as low as 
100k which caused this job to run with 999 reducers. I still have 5 
tasks failing with an outofmemory.


We have jvm reuse set to 8 but dropping it to 1 seems to greatly reduce 
this problem:

set mapred.job.reuse.jvm.num.tasks = 1;

It's still puzzling me how it can run out of memory. It seems like some 
of the reducers get a disproportionately large share of the work.
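For completeness, a sketch of the combination of session settings described above (values are just the ones I tried, not a recommendation):

set hive.exec.reducers.bytes.per.reducer=100000;  -- ~100k, forces many (999) reducers here
set mapred.job.reuse.jvm.num.tasks=1;             -- no jvm reuse between tasks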



On 02/18/2011 10:53 AM, Bennie Schut wrote:

When we try to join two large tables some of the reducers stop with an
OutOfMemory exception.

Error: java.lang.OutOfMemoryError: Java heap space
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)

at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)

at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)

at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)



When looking at garbage collection for these reduce tasks it's
continually doing garbage collections.
Like this:
2011-02-17T14:36:08.295+0100: 1250.547: [Full GC [PSYoungGen:
111055K->53659K(233024K)] [ParOldGen: 698410K->698410K(699072K)]
809466K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1496600
secs] [Times: user=1.08 sys=0.00, real=0.15 secs]
2011-02-17T14:36:08.600+0100: 1250.851: [Full GC [PSYoungGen:
111057K->53660K(233024K)] [ParOldGen: 698410K->698410K(699072K)]
809468K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1360010
secs] [Times: user=1.00 sys=0.01, real=0.13 secs]
2011-02-17T14:36:08.915+0100: 1251.167: [Full GC [PSYoungGen:
111058K->53659K(233024K)] [ParOldGen: 698410K->698410K(699072K)]
809468K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1325960
secs] [Times: user=0.94 sys=0.00, real=0.14 secs]
2011-02-17T14:36:09.205+0100: 1251.457: [Full GC [PSYoungGen:
111055K->53659K(233024K)] [ParOldGen: 698410K->698410K(699072K)]
809466K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1301610
secs] [Times: user=0.99 sys=0.00, real=0.13 secs]


“mapred.child.java.opts” set to “-Xmx1024M -XX:+UseCompressedOops
-XX:+UseParallelOldGC -XX:+UseNUMA -Djava.net.preferIPv4Stack=true
-verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails
-Xloggc:/opt/hadoop/logs/task_@tas...@.gc.log”

I've been reducing this parameter “hive.exec.reducers.bytes.per.reducer”
to as low as 200M but I still get the OutOfMemory errors. I would have
expected this would drop the amount of data sent to the reducers and
thus avoid the OutOfMemory errors.

Any ideas why this happens?

I'm using a trunk build from around 2011-02-03




OutOfMemory errors on joining 2 large tables.

2011-02-18 Thread Bennie Schut
When we try to join two large tables some of the reducers stop with an 
OutOfMemory exception.


Error: java.lang.OutOfMemoryError: Java heap space
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508) 

at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408) 

at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261) 

at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195) 




When looking at garbage collection for these reduce tasks it's 
continually doing garbage collections.

Like this:
2011-02-17T14:36:08.295+0100: 1250.547: [Full GC [PSYoungGen: 
111055K->53659K(233024K)] [ParOldGen: 698410K->698410K(699072K)] 
809466K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1496600 
secs] [Times: user=1.08 sys=0.00, real=0.15 secs]
2011-02-17T14:36:08.600+0100: 1250.851: [Full GC [PSYoungGen: 
111057K->53660K(233024K)] [ParOldGen: 698410K->698410K(699072K)] 
809468K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1360010 
secs] [Times: user=1.00 sys=0.01, real=0.13 secs]
2011-02-17T14:36:08.915+0100: 1251.167: [Full GC [PSYoungGen: 
111058K->53659K(233024K)] [ParOldGen: 698410K->698410K(699072K)] 
809468K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1325960 
secs] [Times: user=0.94 sys=0.00, real=0.14 secs]
2011-02-17T14:36:09.205+0100: 1251.457: [Full GC [PSYoungGen: 
111055K->53659K(233024K)] [ParOldGen: 698410K->698410K(699072K)] 
809466K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1301610 
secs] [Times: user=0.99 sys=0.00, real=0.13 secs]



“mapred.child.java.opts” set to “-Xmx1024M -XX:+UseCompressedOops 
-XX:+UseParallelOldGC -XX:+UseNUMA -Djava.net.preferIPv4Stack=true 
-verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails 
-Xloggc:/opt/hadoop/logs/task_@tas...@.gc.log”


I've been reducing this parameter “hive.exec.reducers.bytes.per.reducer” 
to as low as 200M but I still get the OutOfMemory errors. I would have 
expected this would drop the amount of data sent to the reducers and 
thus avoid the OutOfMemory errors.


Any ideas why this happens?

I'm using a trunk build from around 2011-02-03


Re: what char represents NULL value in hive?

2011-02-10 Thread Bennie Schut
At least on trunk it seems that on external tables (perhaps also TextFile?) 
this works for integer values but not for string values. For a string it 
will then be returned as an empty string, which you then have to find with 
"where field = ''", but I would prefer to use "where field is null".

Not sure if this should be filed as a bug or a missing feature ;-)
Perhaps there is a relation with HIVE-1791?
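One thing that may help in the meantime, as a sketch (it assumes the table uses the default text format / LazySimpleSerDe, and the table and column names are made up):

-- tell the serde that an empty string in the file means NULL
-- (the default null marker for text tables is '\N')
alter table my_external_table set serdeproperties ('serialization.null.format' = '');

-- after that this should match the rows that have no value in the string column
select * from my_external_table where field is null;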

On 01/21/2011 02:03 PM, Ajo Fod wrote:

For a tab separated file, I think it is the null string ... i.e no
characters. So, for example

12\ta\t\t2
1\tb\ta\t1

reads
12 a   2
1b  a  1

On Fri, Jan 21, 2011 at 1:09 AM, lei liu  wrote:

I generate an HDFS file, then I load the file into a hive table. There are
some columns that don't have a value; I need to set these columns to NULL. I
want to know what char represents a NULL value in hive.


Thanks,

LiuLei







Re: Too many open files

2011-01-07 Thread Bennie Schut
From what I understood it will then be possible to tell hive it's loading a csv 
while you are in fact loading something else (sequence files for instance). I 
don't think that's a big deal.

On 7 Jan 2011, at 11:41, "Terje Marthinussen" <tmarthinus...@gmail.com> wrote:

Seems like this works for me too!

That probably saved me a bunch of hours tracing this down through hive and 
hadoop.

Do you know what the side effect of setting this to false would be?

Thanks!
Terje

On Fri, Jan 7, 2011 at 4:39 PM, Bennie Schut <bsc...@ebuddy.com> wrote:
In the past I ran into a similar problem which was actually caused by a bug in 
hadoop. Someone was nice enough to come up with a workaround for this. Perhaps 
you are running into a similar problem. I also had this problem when calling 
lots of “load file” commands. After adding this to the hive-site.xml we never 
had this problem again:

  
  
<property>
  <name>hive.fileformat.check</name>
  <value>false</value>
</property>


From: Terje Marthinussen [mailto:tmarthinus...@gmail.com]
Sent: Friday, January 07, 2011 4:14 AM
To: user@hive.apache.org
Subject: Re: Too many open files

No, the problem is connections to datanodes on port 50010.

Terje
On Fri, Jan 7, 2011 at 11:46 AM, Shrijeet Paliwal <shrij...@rocketfuel.com> wrote:
You mentioned that you got the code from trunk so fair to assume you
are not hitting https://issues.apache.org/jira/browse/HIVE-1508
Worth checking still. Are all the open files -  hive history files
(they look like hive_job_log*.txt) ? Like Viral suggested you can
check that by monitoring open files.

-Shrijeet

On Thu, Jan 6, 2011 at 6:15 PM, Viral Bajaria <viral.baja...@gmail.com> wrote:
> Hi Terje,
>
> I have asked about this issue in an earlier thread but never got any
> response.
>
> I get this exception when I am using Hive over Thrift and submitting 1000s
> of LOAD FILE commands. If you actively monitor the open file count of the
> user under which I run the hive instance, it keeps on creeping yup for every
> LOAD FILE command sent to it.
>
> I have a temporary fix by increasing the # of open file(s) to 6+ and
> then periodically restarting my thrift server (once every 2 days) to release
> the open file handlers.
>
> I would appreciate some feedback. (trying to find my earlier email)
>
> Thanks,
> Viral
>
> On Thu, Jan 6, 2011 at 4:57 PM, Terje Marthinussen <tmarthinus...@gmail.com>
> wrote:
>>
>> Hi,
>> While loading some 10k+ .gz files through HiveServer with LOAD FILE etc.
>> etc.
>> 11/01/06 22:12:42 INFO exec.CopyTask: Copying data from file:XXX.gz to
>> hdfs://YYY
>> 11/01/06 22:12:42 INFO hdfs.DFSClient: Exception in
>> createBlockOutputStream java.net.SocketException: Too many open files
>> 11/01/06 22:12:42 INFO hdfs.DFSClient: Abandoning block
>> blk_8251287732961496983_1741138
>> 11/01/06 22:12:48 INFO hdfs.DFSClient: Exception in
>> createBlockOutputStream java.net.SocketException: Too many open files
>> 11/01/06 22:12:48 INFO hdfs.DFSClient: Abandoning block
>> blk_-2561354015640936272_1741138
>> 11/01/06 22:12:54 WARN hdfs.DFSClient: DataStreamer Exception:
>> java.io.IOException: Too many open files
>> at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
>> at sun.nio.ch.EPollArrayWrapper.(EPollArrayWrapper.java:69)
>> at sun.nio.ch.EPollSelectorImpl.(EPollSelectorImpl.java:52)
>> at
>> sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>> at
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>> at
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>> at
>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>> at java.io.DataOutputStream.write(DataOutputStream.java:90)
>> at
>> org.apache.had

RE: Too many open files

2011-01-06 Thread Bennie Schut
In the past I ran into a similar problem which was actually caused by a bug in 
hadoop. Someone was nice enough to come up with a workaround for this. Perhaps 
you are running into a similar problem. I also had this problem when calling 
lots of "load file" commands. After adding this to the hive-site.xml we never 
had this problem again:

  
  
<property>
  <name>hive.fileformat.check</name>
  <value>false</value>
</property>


From: Terje Marthinussen [mailto:tmarthinus...@gmail.com]
Sent: Friday, January 07, 2011 4:14 AM
To: user@hive.apache.org
Subject: Re: Too many open files

No, the problem is connections to datanodes on port 50010.

Terje
On Fri, Jan 7, 2011 at 11:46 AM, Shrijeet Paliwal <shrij...@rocketfuel.com> wrote:
You mentioned that you got the code from trunk so fair to assume you
are not hitting https://issues.apache.org/jira/browse/HIVE-1508
Worth checking still. Are all the open files -  hive history files
(they look like hive_job_log*.txt) ? Like Viral suggested you can
check that by monitoring open files.

-Shrijeet

On Thu, Jan 6, 2011 at 6:15 PM, Viral Bajaria <viral.baja...@gmail.com> wrote:
> Hi Terje,
>
> I have asked about this issue in an earlier thread but never got any
> response.
>
> I get this exception when I am using Hive over Thrift and submitting 1000s
> of LOAD FILE commands. If you actively monitor the open file count of the
> user under which I run the hive instance, it keeps on creeping up for every
> LOAD FILE command sent to it.
>
> I have a temporary fix by increasing the # of open file(s) to 6+ and
> then periodically restarting my thrift server (once every 2 days) to release
> the open file handlers.
>
> I would appreciate some feedback. (trying to find my earlier email)
>
> Thanks,
> Viral
>
> On Thu, Jan 6, 2011 at 4:57 PM, Terje Marthinussen <tmarthinus...@gmail.com>
> wrote:
>>
>> Hi,
>> While loading some 10k+ .gz files through HiveServer with LOAD FILE etc.
>> etc.
>> 11/01/06 22:12:42 INFO exec.CopyTask: Copying data from file:XXX.gz to
>> hdfs://YYY
>> 11/01/06 22:12:42 INFO hdfs.DFSClient: Exception in
>> createBlockOutputStream java.net.SocketException: Too many open files
>> 11/01/06 22:12:42 INFO hdfs.DFSClient: Abandoning block
>> blk_8251287732961496983_1741138
>> 11/01/06 22:12:48 INFO hdfs.DFSClient: Exception in
>> createBlockOutputStream java.net.SocketException: Too many open files
>> 11/01/06 22:12:48 INFO hdfs.DFSClient: Abandoning block
>> blk_-2561354015640936272_1741138
>> 11/01/06 22:12:54 WARN hdfs.DFSClient: DataStreamer Exception:
>> java.io.IOException: Too many open files
>> at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
>> at sun.nio.ch.EPollArrayWrapper.(EPollArrayWrapper.java:69)
>> at sun.nio.ch.EPollSelectorImpl.(EPollSelectorImpl.java:52)
>> at
>> sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>> at
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>> at
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>> at
>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>> at java.io.DataOutputStream.write(DataOutputStream.java:90)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2314)
>> 11/01/06 22:12:54 WARN hdfs.DFSClient: Error Recovery for block
>> blk_2907917521214666486_1741138 bad datanode[0] 
>> 172.27.1.34:50010
>> 11/01/06 22:12:54 WARN hdfs.DFSClient: Error Recovery for block
>> blk_2907917521214666486_1741138 in pipeline 
>> 172.27.1.34:50010,
>> 172.27.1.4:50010: bad datanode 
>> 172.27.1.34:50010
>> Exception in thread "DataStreamer for file YYY block blk_29
>> 07917521214666486_1741138" java.lang.NullPointerException
>> at
>> org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:351)
>> at
>> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:313)
>> at
>> org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
>> at org.apache.hadoop.ipc.Client.getConnection(Client.java:860)
>> at org.apache.hadoop.ipc.Client.call(Client.java:720)
>> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>> at $Proxy9.recoverBlock(Unknown Source)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2581)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2102)
>> at
>> org.apache.had

Filtering is supported only on partition keys of type string

2010-11-08 Thread Bennie Schut
Hi,

I just recently updated to trunk, was lagging a few months behind. Now I'm 
getting errors like: "Filtering is supported only on partition keys of type 
string"
It seems some type checking was added in 
org.apache.hadoop.hive.metastore.parser.ExpressionTree.java:161 which makes 
sure partition keys are strings, but my current schema actually contains 
partition keys which are numbers.
I've removed this check and all is running fine again, but was this done for a 
reason? I think I should be able to partition on anything, right?
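For reference, a minimal sketch of the kind of schema this is about (hypothetical names, just to illustrate a non-string partition key):

create table events (msg string)
partitioned by (log_date int);

-- a predicate on the numeric partition key is where the
-- "Filtering is supported only on partition keys of type string" error can show up,
-- depending on how the partition filter gets pushed to the metastore
select count(*) from events where log_date = 20101108;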

Bennie.



RE: NOT IN query

2010-11-04 Thread Bennie Schut
You can use a left outer join which works in all databases.

select a.value
from tablea a
   left outer join tableb b on (b.value = a.value)
where b.value is null;

Databases are generally pretty good at doing joins so this usually performs 
well.
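If you only want each value once (the question asks for the distinct values), the same pattern works with a DISTINCT; a sketch using the same hypothetical table names:

select distinct a.value
from tablea a
   left outer join tableb b on (b.value = a.value)
where b.value is null;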


From: איל (Eyal) [mailto:eya...@gmail.com]
Sent: Wednesday, November 03, 2010 1:14 PM
To: hive-u...@hadoop.apache.org
Subject: NOT IN query

Hi,

I have a table A with some values and another table B with some other values

How do I get all the distinct values from A that are NOT in B

e.g

if table A has values 1,2,3,4,1,2,3,5,6,7  and B has values 2,3,4,5,6

then the result should be 1,7

Thanks
  Eyal