RE: Beeline throws OOM on large input query

2016-09-03 Thread Adam
*Reply to Stephen Sprague*
*1) confirm your beeline java process is indeed running with expanded*
*memory*
I used -XX:+PrintCommandLineFlags, which showed:
-XX:MaxHeapSize=17179869184
confirming the 16g setting.

*2) try the hive-cli (or the python one even) or "beeline -u jdbc:hive2://"*
I was using the Beeline JDBC connection, issuing: !connect jdbc:hive2: 

*3) chop down your 6K points to 3K or something smaller to see just where*
*the breaking point is*
I didn't bother, though it would be good information, since I found a
workaround and troubleshooting Beeline wasn't my primary goal :)

*Reply to Markovitz, Dudu*
The query is basically finding geometry intersections.
If you are familiar with PostGIS, it is a Java version of the PostGIS
function ST_Intersects (http://postgis.net/docs/ST_Intersects.html) wrapped
in a Hive UDF.

We are checking intersection of a table's geometry column with a set of N
geometries (6000+ in this case).

*select ... from table*
*where st_intersects(table.geom, g1) OR st_intersects(table.geom, g2), etc.*

Unfortunately doing it with a table join requires a theta condition, which
Hive doesn't support; something like

*select ... from table inner join reftable on st_intersects(table.geom,
reftable.geom)*

I tried pushing down the predicate, but that required a cross join, which was
not feasible for the huge table sizes.
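
Roughly, the cross-join formulation (a sketch using the same illustrative
names as above, not the exact query) would have been:

  select t.*
  from table t
  cross join reftable r
  where st_intersects(t.geom, r.geom);

and it would still need a distinct (or a semi-join) on top, since a row can
intersect more than one reference geometry.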


Re: Beeline throws OOM on large input query

2016-09-02 Thread Adam
I set the heap size using HADOOP_CLIENT_OPTS all the way to 16g and still
no luck.

I tried to go down the table join route, but the problem is that the
relation is not an equality, so it would be a theta join, which is not
supported in Hive.
Basically what I am doing is a geographic intersection against 6,000 points,
so the where clause has 6,000 points in it (I use a custom UDF for the
intersection).

To avoid the problem I ended up writing another version of the UDF that
reads the point list from an HDFS file.
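
The usage then looks roughly like this (a sketch; the jar path, function and
class names here are made up, not the real ones):

  ADD JAR /path/to/geo-udfs.jar;
  CREATE TEMPORARY FUNCTION st_intersects_any AS 'com.example.udf.StIntersectsAny';

  -- points.txt on HDFS holds the 6,000+ reference geometries
  SELECT *
  FROM mytable t
  WHERE st_intersects_any(t.geom, 'hdfs:///path/to/points.txt');

so the query string itself stays tiny.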

It's a low priority, I'm sure, but I bet there are some inefficiencies in the
query-string handling that could be fixed.  When I traced the code, it was
doing all kinds of StringBuffer and String += style concatenation.

Regards,


Beeline throws OOM on large input query

2016-09-01 Thread Adam
Hive Version: 2.1.0
I have a very large, multi-line input query (8,668,519 chars) and I have
gone up to 16g heap and still get the same OOM.


Error: Error running query: java.lang.OutOfMemoryError: Java heap space (state=,code=0)
org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:264)
        at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:250)
        at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:309)
        at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:250)
        at org.apache.hive.beeline.Commands.executeInternal(Commands.java:977)
        at org.apache.hive.beeline.Commands.execute(Commands.java:1148)
        at org.apache.hive.beeline.Commands.sql(Commands.java:1063)
        at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1134)
        at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:965)
        at org.apache.hive.beeline.BeeLine.executeFile(BeeLine.java:940)
        at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:867)
        at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:499)
        at org.apache.hive.beeline.BeeLine.main(BeeLine.java:482)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:218)
        at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:269)
        at org.apache.hive.service.cli.operation.Operation.run(Operation.java:324)
        at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:460)
        at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:447)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
        at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
        at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
        at com.sun.proxy.$Proxy33.executeStatementAsync(Unknown Source)
        at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:294)
        at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:497)
        at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
        at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuffer.append(StringBuffer.java:270)
        at java.io.StringWriter.write(StringWriter.java:112)
        at java.io.PrintWriter.write(PrintWriter.java:456)
        at java.io.PrintWriter.write(PrintWriter.java:473)
        at 

Re: hbase-1.1.1 & hive-1.0.1

2016-03-19 Thread Adam Hunt
Version information

Hive 1.x will remain compatible with HBase 0.98.x and lower versions. Hive
2.x will be compatible with HBase 1.x and higher. (See HIVE-10990 for
details.) Consumers wanting to work with HBase 1.x using Hive 1.x will need
to compile Hive 1.x stream code themselves.


https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

On Wed, Mar 16, 2016 at 6:23 AM, songj songj  wrote:

> hi all:
> I use hbase-1.1.1 & hive-1.0.1 ,but I can not access hbase from hive
>
> are these two app versions not compatible?
>
> Diagnostic Messages for this Task:
> Error: java.lang.RuntimeException: java.lang.NoSuchMethodError:
> org.apache.hadoop.hbase.client.Put.setDurability(Lorg/apache/hadoop/hbase/client/Durability;)V
> at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:185)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.NoSuchMethodError:
> org.apache.hadoop.hbase.client.Put.setDurability(Lorg/apache/hadoop/hbase/client/Durability;)V
> at
> org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat$MyRecordWriter.write(HiveHBaseTableOutputFormat.java:142)
> at
> org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat$MyRecordWriter.write(HiveHBaseTableOutputFormat.java:117)
> at
> org.apache.hadoop.hive.ql.io.HivePassThroughRecordWriter.write(HivePassThroughRecordWriter.java:40)
>


Re: NPE when reading Parquet using Hive on Tez

2016-02-02 Thread Adam Hunt
Hi Gopal,

With the release of 0.8.2, I thought I would give Tez another shot.
Unfortunately, I got the same NPE. I dug a little deeper and it appears
that the configuration property "columns.types", which is used in
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(),
is not being set. When I manually set that property in hive, your example
works fine.

hive> create temporary table x (x int) stored as parquet;
hive> insert into x values(1),(2);
hive> set columns.types=int;
hive> select count(*) from x where x.x > 1;
OK
1

I also saw that the configuration parameter parquet.column.index.access is
checked in that same function. Setting that property to "true" also fixes
my issue.

hive> create temporary table x (x int) stored as parquet;
hive> insert into x values(1),(2);
hive> set parquet.column.index.access=true;
hive> select count(*) from x where x.x > 1;
OK
1

Thanks for your help.

Best,
Adam



On Tue, Jan 5, 2016 at 9:10 AM, Adam Hunt <adamph...@gmail.com> wrote:

> Hi Gopal,
>
> Spark does offer dynamic allocation, but it doesn't always work as
> advertised. My experience with Tez has been more in line with my
> expectations. I'll bring up my issues with Spark on that list.
>
> I tried your example and got the same NPE. It might be a mapr-hive issue.
> Thanks for your help.
>
> Adam
>
> On Mon, Jan 4, 2016 at 12:58 PM, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>>
>> > select count(*) from alexa_parquet;
>>
>> > Caused by: java.lang.NullPointerException
>> >at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.tokenize(TypeInfoUtils.java:274)
>> >at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.<init>(TypeInfoUtils.java:293)
>> >at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:764)
>> >at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getColumnTypes(DataWritableReadSupport.java:76)
>> >at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:220)
>> >at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:256)
>>
>> This might be an NPE triggered off by a specific case of the type parser.
>>
>> I tested it out on my current build with simple types and it looks like
>> the issue needs more detail on the column types for a repro.
>>
>> hive> create temporary table x (x int) stored as parquet;
>> hive> insert into x values(1),(2);
>> hive> select count(*) from x where x.x > 1;
>> Status: DAG finished successfully in 0.18 seconds
>> OK
>> 1
>> Time taken: 0.792 seconds, Fetched: 1 row(s)
>> hive>
>>
>> Do you have INT96 in the schema?
>>
>> > I'm currently evaluating Hive on Tez as an alternative to keeping the
>> >SparkSQL thrift sever running all the time locking up resources.
>>
>> Tez has a tunable value in tez.am.session.min.held-containers (i.e
>> something small like 10).
>>
>> And HiveServer2 can be made to work similarly, because Spark's
>> HiveThriftServer2.scala is a wrapper around Hive's ThriftBinaryCLIService.
>>
>>
>>
>>
>>
>>
>> Cheers,
>> Gopal
>>
>>
>>
>


Re: NPE when reading Parquet using Hive on Tez

2016-01-05 Thread Adam Hunt
Hi Gopal,

Spark does offer dynamic allocation, but it doesn't always work as
advertised. My experience with Tez has been more in line with my
expectations. I'll bring up my issues with Spark on that list.

I tried your example and got the same NPE. It might be a mapr-hive issue.
Thanks for your help.

Adam

On Mon, Jan 4, 2016 at 12:58 PM, Gopal Vijayaraghavan <gop...@apache.org>
wrote:

>
> > select count(*) from alexa_parquet;
>
> > Caused by: java.lang.NullPointerException
> >at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.tokenize(TypeInfoUtils.java:274)
> >at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.<init>(TypeInfoUtils.java:293)
> >at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:764)
> >at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getColumnTypes(DataWritableReadSupport.java:76)
> >at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:220)
> >at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:256)
>
> This might be an NPE triggered off by a specific case of the type parser.
>
> I tested it out on my current build with simple types and it looks like
> the issue needs more detail on the column types for a repro.
>
> hive> create temporary table x (x int) stored as parquet;
> hive> insert into x values(1),(2);
> hive> select count(*) from x where x.x > 1;
> Status: DAG finished successfully in 0.18 seconds
> OK
> 1
> Time taken: 0.792 seconds, Fetched: 1 row(s)
> hive>
>
> Do you have INT96 in the schema?
>
> > I'm currently evaluating Hive on Tez as an alternative to keeping the
> >SparkSQL thrift sever running all the time locking up resources.
>
> Tez has a tunable value in tez.am.session.min.held-containers (i.e
> something small like 10).
>
> And HiveServer2 can be made to work similarly, because Spark's
> HiveThriftServer2.scala is a wrapper around Hive's ThriftBinaryCLIService.
>
>
>
>
>
>
> Cheers,
> Gopal
>
>
>


NPE when reading Parquet using Hive on Tez

2016-01-04 Thread Adam Hunt
Hi,

When I perform any operation on a data set stored in Parquet format using
Hive on Tez, I get an NPE (see bottom for stack trace). The same operation
works fine on tables stored as text, Avro, ORC and Sequence files. The same
query on the parquet tables also works fine if I use Hive on MR.

I'm running MapR 5.0.0  with Hive 1.2.0-mapr-1510, Hadoop 2.7.0-mapr-1506
and Tez 0.7.0 compiled from source.

I'm currently evaluating Hive on Tez as an alternative to keeping the
SparkSQL thrift server running all the time locking up resources.
Unfortunately, this is a blocker since most of our data is stored in
Parquet files.

Thanks,
Adam

select count(*) from alexa_parquet;
or
create table kmeans_results_100_orc stored as orc as select * from
kmeans_results_100;

], TaskAttempt 3 failed, info=[Error: Failure while running
task:java.lang.RuntimeException: java.lang.RuntimeException:
java.io.IOException: java.lang.NullPointerException
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:337)
at
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
at
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException:
java.lang.NullPointerException
at
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:192)
at
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:131)
at
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:97)
at
org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
at
org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
at
org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:614)
at
org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:593)
at
org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:141)
at
org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:370)
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:127)
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
... 14 more
Caused by: java.io.IOException: java.lang.NullPointerException
at
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at
org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:252)
at
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:189)
... 25 more
Caused by: java.lang.NullPointerException
at
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.tokenize(TypeInfoUtils.java:274)
at
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.<init>(TypeInfoUtils.java:293)
at
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:764)
at
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getColumnTypes(DataWritableReadSupport.java:76)
at
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:220)
at
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:256)
at
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:99

RE: How to create new table like existing with an extra column in single query.

2015-08-10 Thread LaStrange, Adam
How about:

create table XXX like YYY;
alter table XXX add columns (new_column int);
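
If it really has to be a single statement and copying the data is acceptable,
a CTAS with a NULL literal might also work (untested sketch; the column name
and type are just examples):

  create table XXX as
  select y.*, cast(null as int) as new_column
  from YYY y;

Note that CTAS copies the data and cannot create a partitioned table, so the
two-statement version above is usually the safer bet.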

From: venkatesh b [mailto:venkateshmailingl...@gmail.com]
Sent: Monday, August 10, 2015 9:28 AM
To: Wangwenli
Cc: user
Subject: Re: How to create new table like existing with an extra column in 
single query.

Thanks Wenli for quick reply.

Yes, we have a requirement exactly the same which you mentioned.

I have tried the query which you provided but got an error saying:
"Invalid column reference for newColumn, possible columns are ..."



Thanks
Venkatesh

On Monday, August 10, 2015, Wangwenli wangwe...@huawei.com wrote:
something like this?
create table newTable as select oldTable.*, newCol1,newCol2 from oldTable;


wenli
Regards

From: venkatesh b venkateshmailingl...@gmail.com
Date: 2015-08-10 20:54
To: user@hive.apache.org
Subject: How to create new table like existing with an extra column in single 
query.
Hello,

In Hive we have a need to create a new table like an old table but with an 
extra column, at run time. This must be done in a single query; no second 
query with an ALTER TABLE statement.

Query like:

Create table new_table like old_table , new_column datatype;

Please help me out; I have searched many forums but not found a solution.
Please inform me if any other information is needed.

Thanks & Regards
Venkatesh





Re: limit clause + fetch optimization

2015-07-22 Thread Adam Silberstein
Thanks Gopal.  I filed an issue to cover JDBC+setMaxRows:
https://issues.apache.org/jira/browse/HIVE-11342

For your first offer of testing a patch, unfortunately we tend to run our
production software on customers' Hadoop clusters, so we can't easily patch
their Hive instances.  But I'll still take you up on that if I find some
time to try it.

Thanks,
Adam

On Tue, Jul 21, 2015 at 11:14 PM, Gopal Vijayaraghavan gop...@apache.org
wrote:


  Just want to make sure I understand the behavior once that bug is
 fixed...a 'select *' with no limit will run without a M/R job and instead
 stream.  Is that correct?

 Yes, that's the intended behaviour. I can help you get a fix in, if you
 have some time to test out my WIP patches.

  That may incidently solve another bug I'm seeing: when you use JDBC
 templates to set the limit (setMaxRows in Spring in my setup), it does
 not avoid the M/R job (and no limit clause appears in the hive-server2
 log).  Instead, the M/R job gets launched...I'm
  not sure if the jdbc framework subsequently would apply a limit, once
 the job finishes.  I haven't spotted this issue in JIRA, I'd be happy to
 file it if that's useful to you.

 File a JIRA, would be very useful for me.

 There's a lot of low-hanging fruit in the JDBC + Prepared Statement
 codepath, so going over the issues & filing your findings would help me
 pick up and knock them off one by one when I'm back.

 Prasanth¹s github has some automated benchmarking tools for JDBC, which I
 use heavily - https://github.com/prasanthj/jmeter-hiveserver2/tree/llap


 There are some known issues which have a 2-3x perf degradation for the
 simple query patterns you're running, like -
 https://issues.apache.org/jira/browse/HIVE-10982

 Cheers,
 Gopal





limit clause + fetch optimization

2015-07-21 Thread Adam Silberstein
Hi,
I've been experimenting with 'select *' and 'select * limit X' in beeline
and watching the hive-server2 log to understand when a M/R job is triggered
and when not.  It seems like whenever I set a limit, the job is avoided,
but with no limit, it is run.

I found this param:
hive.limit.optimize.fetch.max

That defaults to 50,000 and as I understand it, whenever I set limit to
above that number, a job should be triggered.  But I can set limit to
something very high (e.g. 10M) and no job runs.
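
For reference, the related fetch-task settings can be inspected from beeline
like this (a sketch; the parameter names are the Hive 1.x ones as I
understand them, worth double-checking on CDH 5.4):

  -- print the current values (a bare "set <name>;" just displays the value)
  set hive.fetch.task.conversion;
  set hive.fetch.task.conversion.threshold;
  set hive.limit.optimize.fetch.max;
  -- e.g. allowing more simple queries to run as a plain fetch:
  -- set hive.fetch.task.conversion=more;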

If anyone has some insight into how this param is used or expected behavior
of the fetch optimization, would appreciate it.

This is on Hive 1.1 inside CDH5.4.

Thanks,
Adam


Re: limit clause + fetch optimization

2015-07-21 Thread Adam Silberstein
Thanks for the quick answer Gopal, and also for the details on that param.
I indeed use JDBC in production, so will stay away from it.

Just want to make sure I understand the behavior once that bug is fixed...a
'select *' with no limit will run without a M/R job and instead stream.  Is
that correct?

That may incidently solve another bug I'm seeing: when you use JDBC
templates to set the limit (setMaxRows in Spring in my setup), it does not
avoid the M/R job (and no limit clause appears in the hive-server2 log).
Instead, the M/R job gets launched...I'm not sure if the jdbc framework
subsequently would apply a limit, once the job finishes.  I haven't spotted
this issue in JIRA, I'd be happy to file it if that's useful to you.

Thanks!
Adam

On Tue, Jul 21, 2015 at 7:20 PM, Gopal Vijayaraghavan gop...@apache.org
wrote:


  I've been experimenting with 'select *' and 'select * limit X' in
 beeline and watching the hive-server2 log to understand when a M/R job is
 triggered and when not.  It seems like whenever I set a limit, the job is
 avoided, but with no limit, it is run.

 https://issues.apache.org/jira/browse/HIVE-10156


 It's sitting on my back-burner (I know the fix, but I'm working on the
 LLAP branch).

  hive.limit.optimize.fetch.max
 
  That defaults to 50,000 and as I understand it, whenever I set limit to
 above that number, a job should be triggered.  But I can set limit to
 something very high (e.g. 10M) and no job runs.

 That config belongs to a different optimization - the global limit case,
 which works as follows.

 Run the query with a 50k-row sample of the input; then, if it doesn't produce
 enough rows, re-run the query with the full input data-set.

 You will notice errors on your JDBC connections with that optimization
 turned on (like HIVE-9382) and will get the following log line "Retry
 query with a different approach..." in the HS2 logs.

 So I suggest not turning on the Global Limit optimization, if you're on
 JDBC/ODBC.

 Cheers,
 Gopal






DDL stmt for showing views only

2015-06-23 Thread Adam Silberstein
Hey All,
Is there a DDL command like 'SHOW TABLES' that returns only the tables or
only the views, so they can be differentiated?  Apologies if this has been
answered; I have seen the question asked but no great answers.

BTW, know how to get this with per-table DDL calls, but hoping to get the
whole list with one hive call.
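
One workaround that might work (untested; it bypasses Hive entirely and
assumes a standard MySQL/Postgres-backed metastore with the usual schema) is
to query the metastore database directly:

  -- run against the metastore database itself, not through Hive
  SELECT t.TBL_NAME
  FROM TBLS t
  JOIN DBS d ON t.DB_ID = d.DB_ID
  WHERE d.NAME = 'default'
    AND t.TBL_TYPE = 'VIRTUAL_VIEW';
  -- regular tables show up as MANAGED_TABLE or EXTERNAL_TABLE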

Thanks,
Adam


java.lang.RuntimeException: Unknown type BIGINT

2014-10-02 Thread Adam Kawa
Hi!

When using Hive 0.14 (trunk from April 2014), I can't run queries on tables
with BIGINT partition column. I get:


*FAILED: RuntimeException Unknown type BIGINT.*

It runs nicely in Hive 0.12.
It also runs nicely if my table has a bigint column (un-capitalized, not
BIGINT).

The problem is that I have many tables (and partitions) with BIGINT
(capitalized). I would like to avoid altering them to change the type
from BIGINT
to bigint.

Is there any patch (or ticket) available that handles case insensitivity?

Cheers!
Adam


HiveServer2 http mode?

2014-04-10 Thread Adam Faris
The Setting Up HiveServer2 wiki page mentions that HiveServer2 provides an 
"HTTP mode" in 0.13.  Is "HTTP mode" going to be a REST API, or is it 
encapsulating Thrift/JDBC connections inside HTTP traffic? 

- Thanks, Adam

Re: org.apache.hadoop.hive.metastore.HiveMetaStoreClient with webhcat REST

2014-03-17 Thread Adam Silberstein
Hi,
Didn't get any answers on this, trying one more time.

Thanks,
Adam

On Mar 14, 2014, at 9:50 AM, Adam Silberstein a...@trifacta.com wrote:

 Hi,
 I'm testing out the REST interface to webhcat and stuck doing basic DDL 
 operations.
 
 Background on installation:
 --I installed packages with apt-get
 --I'm on CDH4.5
 
 I successfully have created and loaded tables with the hcat command line.  E.g.: 
 hcat -e 'create table testtable (a string, b string, c int);'
 I did some loading by hand and via Pig.  So I think the HCatalog service is 
 running correctly.
 
 Here's what I am seeing via curl:
 $ curl 
 http://localhost:50111/templeton/v1/ddl/database/testtable?user.name=myname
 
 {errorDetail:\norg.apache.hadoop.hive.ql.metadata.HiveException: 
 org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
 Unable to instantiate 
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient\n\tat 
 org.apache.hadoop.hive.ql.exec.DDLTask.descDatabase(DDLTask.java:2647)\n\tat 
 org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:244)\n\tat 
 org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)\n\tat 
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:66)\n\tat
  org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1383)\n\tat 
 org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1169)\n\tat 
 org.apache.hadoop.hive.ql.Driver.run(Driver.java:982)\n\tat 
 org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)\n\tat 
 org.apache.hcatalog.cli.HCatDriver.run(HCatDriver.java:43)\n\tat 
 org.apache.hcatalog.cli.HCatCli.processCmd(HCatCli.java:251)\n\tat 
 org.apache.hcatalog.cli.HCatCli.processLine(HCatCli.java:205)\n\tat 
 org.apache.hcatalog.cli.HCatCli.main(HCatCli.java:164)\n\tat 
 sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)\n\tat
  
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat
  java.lang.reflect.Method.invoke(Method.java:622)\n\tat 
 org.apache.hadoop.util.RunJar.main(RunJar.java:208)\n,error:FAILED: Error 
 in metadata: org.apache.hadoop.hive.ql.metadata.HiveException: 
 java.lang.RuntimeException: Unable to instantiate 
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient,errorCode:500}
 
 I found a bunch of help online so I tried adding some things in 
 webhcat-site.xml
 -templeton.libjars: I added paths to a bunch of libraries, including 
 '/usr/lib/hive/lib/hive-metastore-0.10.0-cdh4.5.0.jar' which has the missing 
 HiveMetaStoreClient class.  Then restarted webhcat, can't tell if it picked 
 up this property.
 -Less promising, I uploaded the hive tar.gz file to HDFS and updated their 
 paths in templeton.hive.archive and templeton.hive.path.  Skeptical the 
 server would be looking in HDFS for libraries, and this didn't help either.
 -There is older material from ~2011 as well, but ignoring that.
 
 If you have any suggestions please share.  Thanks in advance!
 
 -Adam
 
 
 
 
 




org.apache.hadoop.hive.metastore.HiveMetaStoreClient with webhcat REST

2014-03-14 Thread Adam Silberstein
Hi,
I'm testing out the REST interface to webhcat and stuck doing basic DDL 
operations.

Background on installation:
--I installed packages with apt-get
--I'm on CDH4.5

I successfully have created and loaded tables with the hcat command line.  
E.g.: hcat -e 'create table testtable (a string, b string, c int);'
I did some loading by hand and via Pig.  So I think the HCatalog service is 
running correctly.

Here's what I am seeing via curl:
$ curl 
http://localhost:50111/templeton/v1/ddl/database/testtable?user.name=myname

{errorDetail:\norg.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient\n\tat 
org.apache.hadoop.hive.ql.exec.DDLTask.descDatabase(DDLTask.java:2647)\n\tat 
org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:244)\n\tat 
org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)\n\tat 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:66)\n\tat
 org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1383)\n\tat 
org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1169)\n\tat 
org.apache.hadoop.hive.ql.Driver.run(Driver.java:982)\n\tat 
org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)\n\tat 
org.apache.hcatalog.cli.HCatDriver.run(HCatDriver.java:43)\n\tat 
org.apache.hcatalog.cli.HCatCli.processCmd(HCatCli.java:251)\n\tat 
org.apache.hcatalog.cli.HCatCli.processLine(HCatCli.java:205)\n\tat 
org.apache.hcatalog.cli.HCatCli.main(HCatCli.java:164)\n\tat 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)\n\tat
 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat
 java.lang.reflect.Method.invoke(Method.java:622)\n\tat 
org.apache.hadoop.util.RunJar.main(RunJar.java:208)\n,error:FAILED: Error 
in metadata: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient,errorCode:500}

I found a bunch of help online so I tried adding some things in webhcat-site.xml
-templeton.libjars: I added paths to a bunch of libraries, including 
'/usr/lib/hive/lib/hive-metastore-0.10.0-cdh4.5.0.jar' which has the missing 
HiveMetaStoreClient class.  Then restarted webhcat, can't tell if it picked up 
this property.
-Less promising, I uploaded the hive tar.gz file to HDFS and updated their 
paths in templeton.hive.archive and templeton.hive.path.  Skeptical the server 
would be looking in HDFS for libraries, and this didn't help either.
-There is older material from ~2011 as well, but ignoring that.

If you have any suggestions please share.  Thanks in advance!

-Adam



Re: Versioninfo and platformName issue.

2013-12-10 Thread Adam Kawa
Hi,

Do you have the Hadoop libs properly installed? Does the "$ hadoop version"
command run successfully? If so, then it sounds like some classpath issue...


2013/12/10 Manish Bhoge manishbh...@rocketmail.com

 Sent from Rocket Mail via Android

  --
 * From: * Manish Bhoge manishbh...@rocketmail.com;
 * To: * u...@hadoop.apache.org u...@hadoop.apache.org;
 * Cc: * user@hive.apache.org user@hive.apache.org;
 * Bcc: * p...@pig.apache.org p...@pig.apache.org;
 * Sent: * Mon, Dec 9, 2013 6:02:24 PM


 My Hadoop cluster (Psuedo distributed) is up and running but when i run
 hive shell it is giving me below error
 hive
 manish@manish-bhoge:/usr/lib/hadoop$ hive
 Error: Could not find or load main class
 org.apache.hadoop.util.PlatformName
 Error: Could not find or load main class org.apache.hadoop.util.VersionInfo
 Unable to determine Hadoop version information.
 'hadoop version' returned:
 Error: Could not find or load main class
 org.apache.hadoop.util.PlatformName
 Error: Could not find or load main class org.apache.hadoop.util.VersionInfo

 Can anyone please help me to fix the VersionInfo and PlatformName errors?
 The PlatformName error appears when I run the Pig shell:

 pig
 Error: Could not find or load main class
 org.apache.hadoop.util.PlatformName

 Suggest the fix for both Hive and PIG error.

 Thank You,
 Manish.



Re: hive.query.string not reflecting the current query

2013-12-03 Thread Adam Kawa
Hmmm?

Maybe it is related to the fact that a query like:
 select * from mytable limit 100;
does not start any MapReduce job. It starts a read operation from
HDFS (plus a call to the MetaStore to find out what the schema is and how
to parse the data using the InputFormat and SerDe).

For example, if you run a query that has the same functionality (i.e. to
show all content of the table by specifying all columns in the SELECT)
 select column1, column2, ... columnN from mytable limit 100;
then a map-only job will be started and maybe (?) hive.query.string will
contain this query.


2013/12/3 Petter von Dolwitz (Hem) petter.von.dolw...@gmail.com

 Hi,

 I use hive 0.11 with a five machine cluster. I am reading the property
 hive.query.string from a custom RecordReader (used for reading external
 tables).

 If I first invoke a query like

 select * from mytable where mycolumn='myvalue';

 I get the correct query string in this property.

 If I then invoke

 select * from mytable limit 100;

 the property hive.query.string still contains the first query. Seems like
 hive uses local mode for the second query. Don't know if it is related.

 Anybody knows why the query string is not updated in the second case?

 Thanks,
 Petter



Re: How to specify Hive auxiliary jar in HDFS, not local file system

2013-12-02 Thread Adam Kawa
You can use the ADD JAR command inside a Hive script, plus a parameter in the
Oozie workflow definition. An example is here:
http://blog.cloudera.com/blog/2013/01/how-to-schedule-recurring-hadoop-jobs-with-apache-oozie/
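
A minimal sketch of what that looks like inside the script itself (assuming
the Hive version in use can fetch hdfs:// resources for ADD JAR; if not, a
local path would be needed):

  -- at the top of the Hive script (e.g. hivefiscalyearqueries.sql)
  ADD JAR hdfs:///hdfspath/to/truven-hive-serdes-1.0.jar;
  -- the SerDe class should then be available to the queries that follow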


2013/12/2 mpeters...@gmail.com

 Is it possible to specify a Hive auxiliary jar (like a SerDe) that is in
 HDFS rather than the local fileystem?

 I am using a CsvSerDe I wrote, and when I specify it via
 hive.aux.jars.path with a local file system path it works:

 hive -hiveconf hive.aux.jars.path=file:///path/to/truven-hive-serdes-1.0.jar
 -hiveconf hive.auto.convert.join.noconditionaltask.size=2500 -f
 hivefiscalyearqueries.sql

 But when I put that jar in HDFS and point to it there, it fails:

 hive -hiveconf 
 hive.aux.jars.path=hdfs:///hdfspath/to/truven-hive-serdes-1.0.jar
 -hiveconf hive.auto.convert.join.noconditionaltask.size=2500 -f
 hivefiscalyearqueries.sql

 with the error message:

 java.lang.ClassNotFoundException: com.truven.hiveserde.csv.CsvSerDe
 Continuing ...
 2013-12-02 03:48:25 Starting to launch local task to process map join;
  maximum memory = 1065484288
 org.apache.hadoop.hive.ql.metadata.HiveException: Failed with exception
 nulljava.lang.NullPointerException
 at
 org.apache.hadoop.hive.ql.exec.FetchOperator.getRowInspectorFromTable(FetchOperator.java:230)
 at
 org.apache.hadoop.hive.ql.exec.FetchOperator.getOutputObjectInspector(FetchOperator.java:595)
 at
 org.apache.hadoop.hive.ql.exec.MapredLocalTask.initializeOperators(MapredLocalTask.java:406)
 at
 org.apache.hadoop.hive.ql.exec.MapredLocalTask.executeFromChildJVM(MapredLocalTask.java:290)
 at
 org.apache.hadoop.hive.ql.exec.ExecDriver.main(ExecDriver.java:682)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:160)

 at
 org.apache.hadoop.hive.ql.exec.FetchOperator.getOutputObjectInspector(FetchOperator.java:631)
 at
 org.apache.hadoop.hive.ql.exec.MapredLocalTask.initializeOperators(MapredLocalTask.java:406)
 at
 org.apache.hadoop.hive.ql.exec.MapredLocalTask.executeFromChildJVM(MapredLocalTask.java:290)
 at
 org.apache.hadoop.hive.ql.exec.ExecDriver.main(ExecDriver.java:682)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
 Execution failed with exit status: 2
 Obtaining error information

 Task failed!



 I will need to run this from Oozie eventually, so I'd like to know how get
 Hive to use a jar in HDFS, rather than have to distribute the file to the
 local file system of all datanodes.

 Thank you,
 Michael



Issue with multi insert

2013-02-07 Thread Thomas Adam
Hi,

I am having issues executing the following multi-insert query:

FROM
  ${tmp_users_table} u
  JOIN
  ${user_evens_table} ue
  ON (
u.id = ue.user
  )
INSERT OVERWRITE TABLE ${dau_table} PARTITION (dt='${date}')
SELECT
  u.country,
  u.platform,
  u.gender,
  COUNT(DISTINCT(u.id))
WHERE
  ue.dt = '${date}'
GROUP BY
  u.country,
  u.platform,
  u.gender
INSERT OVERWRITE TABLE ${wau_table} PARTITION (dt='${date}')
SELECT
  u.country,
  u.platform,
  u.gender,
  COUNT(DISTINCT(u.id))
WHERE
  ue.dt BETWEEN date_sub('${date}', 7) AND '${date}'
GROUP BY
  u.country,
  u.platform,
  u.gender
INSERT OVERWRITE TABLE ${mau_table} PARTITION (dt='${date}')
SELECT
  u.country,
  u.platform,
  u.gender,
  COUNT(DISTINCT(u.id))
WHERE
  ue.dt BETWEEN date_sub('${date}', 30) AND '${date}'
GROUP BY
  u.country,
  u.platform,
  u.gender;

I got the error: FAILED: SemanticException [Error 10025]: Line 15:6
Expression not in GROUP BY key 'dt'

If I remove the second and third insert, the query works.
Can anyone explain to me why it's not working? Why do I need to add the dt
field to GROUP BY?
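
A possible rewrite (untested sketch) that avoids referencing ue.dt in the
per-insert WHERE clauses is to filter once on the widest date range in a
subquery and move the narrower date conditions into the aggregates:

FROM (
  SELECT u.country, u.platform, u.gender, u.id, ue.dt
  FROM ${tmp_users_table} u
  JOIN ${user_evens_table} ue ON (u.id = ue.user)
  WHERE ue.dt BETWEEN date_sub('${date}', 30) AND '${date}'
) j
INSERT OVERWRITE TABLE ${dau_table} PARTITION (dt='${date}')
SELECT country, platform, gender,
  -- NULLs are ignored by COUNT(DISTINCT ...), so only that day's ids count
  COUNT(DISTINCT IF(dt = '${date}', id, NULL))
GROUP BY country, platform, gender
INSERT OVERWRITE TABLE ${wau_table} PARTITION (dt='${date}')
SELECT country, platform, gender,
  COUNT(DISTINCT IF(dt >= date_sub('${date}', 7), id, NULL))
GROUP BY country, platform, gender
INSERT OVERWRITE TABLE ${mau_table} PARTITION (dt='${date}')
SELECT country, platform, gender,
  COUNT(DISTINCT id)
GROUP BY country, platform, gender;

One difference from the original: groups with zero matches in a narrower
window come out as a 0 count instead of being absent.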

Thanks & regards,
Thomas


Hive inconsistently interpreting 'where' and 'group by'

2012-05-30 Thread Adam Laiacano
Hi all,

I have an activity log stored in an external Hive table, LZO-compressed, and 
partitioned by 'dt', which is the date that the data was recorded. Because of 
time zones and when we dump the data into HDFS, there are about 22 hours of one 
day and 2 of the following day in each partition. In the example below, 95% of 
The first column is a timestamp (ts, bigint). I want to count the number of 
actions on each day within a single partition with this query:

SELECT TO_DATE(FROM_UNIXTIME(ts)) AS day, count(*) FROM
activity_log
WHERE
dt = "2012-05-29"
GROUP BY TO_DATE(FROM_UNIXTIME(ts))


The correct results are (I confirmed these by decompressing the files locally 
and counting with a simple python script):

NULL 201
2012-05-29 80677204
2012-05-30 3826101

The NULL is caused by a relatively tiny number of corrupt rows. I sometimes get 
this with the exact same query:

NULL 201
2012-05-29 84503305

It seems to convert all of the non-null timestamps to 2012-05-29. I'd say that 
it returns the wrong value 1 out of every 5 times.

I've tried other things like 

   SELECT COUNT(*) FROM activity_log WHERE dt="2012-05-29" AND ts < 1338354000 
AND ts > 1338267600) 

but it seems to have the same problem with interpreting the ts value 
consistently.

Any tips or explanations would be greatly appreciated. I'm not sure if the 
corrupt rows have something to do with it, but I still don't get why that only 
happens sometimes. 

Thanks,
Adam



Hive question, summing second-level domain names

2011-05-23 Thread Adam Phelps
(As an FYI I'm relatively new to Hive and have no previous SQL 
experience, so have been struggling a bit with the Language manual which 
seems to assume previous SQL experience)


Suppose I have a table, within which there is a column which contains 
domain names (e.g. hadoop.apache.org).  I want to perform a count 
of all second-level domains, i.e. hadoop.apache.org and hive.apache.org 
would count in the same bucket.


Now I could count things for a particular second-level domain like this:

SELECT
  year, month, day, hour, COUNT(1) as count
FROM
  domainlog
WHERE
  year = 2011 AND
  month = 05 AND
  day = 15 AND
  (
    domain RLIKE ".*[.]apache[.]org"
  )
GROUP BY
  year, month, day, hour

however I'm not seeing a way to sum up all second-level domains rather 
than a particular one.  I basically want to group everything using a 
regular expression along the lines of .*[.][^.]*[.][^.]* and then 
output lines with a count for the common portion.  Any pointers in the 
correct direction would be welcome.
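
One direction that might work (untested; the pattern is only a rough sketch
and ignores multi-part public suffixes like .co.uk) is to group on a
regexp_extract of the last two labels:

SELECT
  year, month, day, hour,
  regexp_extract(domain, '([^.]+[.][^.]+)$', 1) AS sld,
  COUNT(1) as count
FROM
  domainlog
WHERE
  year = 2011 AND
  month = 05 AND
  day = 15
GROUP BY
  year, month, day, hour,
  regexp_extract(domain, '([^.]+[.][^.]+)$', 1)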


Thanks
- Adam