CDH 5.7 Hive crash when performing a join on a parquet external table

2016-05-04 Thread Nicholas Hakobian
Has anyone else encountered issues when using partitioned Parquet
external tables in Hive on CDH 5.7 (Hive is running in MapReduce mode)?
When I perform a simple query such as (I've removed any names/fields that I
am not allowed to publicly share):

select * from user_event left join names on names.id = user_event.feature.id

I get an error like:

Error: java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
processing row
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime
Error while processing row
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected
exception: Illegal Capacity: -1
at
org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:318)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at
org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
at
org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
... 9 more
Caused by: java.lang.IllegalArgumentException: Illegal Capacity: -1
at java.util.ArrayList.<init>(ArrayList.java:156)
at
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:339)
at
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:366)
at
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:366)
at org.apache.hadoop.hive.ql.exec.JoinUtil.computeValues(JoinUtil.java:193)
at
org.apache.hadoop.hive.ql.exec.CommonJoinOperator.getFilteredValue(CommonJoinOperator.java:408)
at
org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:302)
... 13 more

This suggests that there is an issue within the mapjoin operator (the names
table is very small). This appears to only occur when directly joining the
Parquet-backed table to the names table (which is stored in ORC). I've
played around with various file formats, and the format of names does not
seem to change the result, but if I first convert a sample of events to
SequenceFile, the issue does not occur.
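
For reference, a rough sketch of that SequenceFile workaround (the table and
column names match my simplified example above, so treat this as illustrative
rather than our exact job):

-- convert a small sample of events to SequenceFile, then join against it;
-- feature is assumed to be a struct column with an id field
create table tmp_event_seq stored as sequencefile
as select * from user_event limit 200;

select * from tmp_event_seq left join names on names.id = tmp_event_seq.feature.id;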

I'm thinking that Hive is having an issue mapping the internal Parquet
column types to Hive data types, since I have to explicitly state the Hive
column layout. This schema and an identical query work correctly in CDH
5.4.7 (both CDH releases ship a modified version of Hive 1.1.0).

As a test, I tried copying a subset of rows from user_event to a Hive
managed Parquet table:

create table tmp_event stored as parquet tblproperties
("parquet.compression"="SNAPPY") as select * from user_event limit 200;

Oddly, this crashed as well:

Diagnostic Messages for this Task:
Error: java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
processing row
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime
Error while processing row
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)
... 8 more
Caused by: parquet.io.ParquetEncodingException: empty fields are illegal,
the field should be ommited completely instead
at
parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:271)
at
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$ListDataWriter.write(DataWritableWriter.java:271)
at
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$GroupDataWriter.write(DataWritableWriter.java:199)
at
org.apache

Re: Hive parquet on EMR

2016-04-01 Thread Nicholas Hakobian
Make sure your column names in the struct exactly match the case in the
table create statement. We just decided to make everything lowercase, but
occasionally someone forgets and makes one of the characters uppercase, and
Hive fails.

There was a fix for this in Hive, but it only fixed querying with mixed
case in top-level column names, not columns nested in structs.
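
For illustration (the table and field names below are made up, not from this
thread), keeping everything lowercase in the DDL looks like:

-- hypothetical example: keep nested struct field names lowercase; a
-- mixed-case field such as userId inside the struct is what trips Hive up
CREATE EXTERNAL TABLE events (
  id STRING,
  payload STRUCT<user_id: STRING, hit_time: BIGINT>
)
STORED AS PARQUET
LOCATION '/data/events';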

Hope this helps,

Nick

Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com

On Fri, Apr 1, 2016 at 12:09 PM, Jose Rozanec  wrote:

> Hello,
>
> We have a Hive (v 1.0.0) cluster at EMR and data stored in Parquet files.
> When querying data, it fails to return a result, causing an NPE. We think
> the error may be related to Hive deserialization, since we can query the
> data without problems when using other technologies (e.g. Presto).
>
> Here we provide the stacktrace of the exception we get. We checked the Hive
> Parquet support and release notes, and it seems map and struct types are
> supported for this version (1.0.0). Our Parquet schema involves the
> following types: string, boolean, map, struct.
>
> Did anyone have a similar experience? Is there a workaround for this?
>
> Thank you in advance,
>
>
>


Re: a newline in column data ruin Hive

2016-02-23 Thread Nicholas Hakobian
We just had this problem recently with our data. There are actually two
things you have to worry about: the reader (which the suggestion above
seems to solve) and the intermediate stages (if using MR). We didn't
have the issue with the reader since we use Parquet and Avro to store
our data, but we had issues with the intermediate stages. The default
intermediate output format for many distros is TextFile, so any
serialization will re-introduce the newline issue. If you set this option:

set hive.query.result.fileformat=SequenceFile;

It'll use sequence files for the intermediate output, which use a
different row delimiter than \n. The wiki used to specifically say this,
but it looks like the page has been changed since the default was
switched to SequenceFile in 2.1.

Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com




On Tue, Feb 23, 2016 at 3:04 PM, Rajit Saha  wrote:
> Hi Mahender,
>
> You can try ESCAPED BY '\\'
>
> Like a sample below
>
> CREATE EXTERNAL TABLE test
> (
> a1 int,
> b1 string
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\\'
> STORED AS TEXTFILE
> LOCATION '';
>
> Thanks
> Rajit
>
>
> From: mahender bigdata 
> Reply-To: "user@hive.apache.org" 
> Date: Tuesday, February 23, 2016 at 2:51 PM
> To: "user@hive.apache.org" 
> Subject: a newline in column data ruin Hive
>
> Hi,
>
> We are facing an issue while loading/reading data from a file which has line
> delimiter characters like \n as part of the column data. When we try to query
> the Hive table, data with \n gets split up into multiple rows. Is there a
> way to tell Hive to skip escape characters like \n (row delimiter or field
> delimiter) within column data? We saw the LINE TERMINATOR property in the
> create table syntax, but currently it accepts only \n. Is there a way to have
> a custom line terminator property?
>
>
> Thanks in advance
>
>
> 


Re: NPE from simple nested ANSI Join

2016-02-04 Thread Nicholas Hakobian
I'm only aware of this:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
but it's unclear whether it supports your syntax or not.

Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com

On Thu, Feb 4, 2016 at 12:57 PM, Dave Nicodemus
 wrote:
> Thanks Nick,
>
> I did a few experiments and found that the version of the query below does
> work, so I'm not sure about your theory. Do you know if there is a document
> that spells out the exact accepted syntax?
>
> SELECT COUNT(*)
> FROM (nation n INNER JOIN customer c ON n.n_nationkey = c.c_nationkey) INNER
> JOIN orders o ON c.c_custkey = o.o_custkey;
>
>
>
>
> On Thu, Feb 4, 2016 at 3:45 PM, Nicholas Hakobian
>  wrote:
>>
>> I don't believe Hive supports that join format. It's expecting either a
>> table name or a subquery. If it's a subquery, it usually requires it to
>> have a table name alias so it can be referenced in an outer statement.
>>
>> -Nick
>>
>> Nicholas Szandor Hakobian
>> Data Scientist
>> Rally Health
>> nicholas.hakob...@rallyhealth.com
>>
>> On Thu, Feb 4, 2016 at 11:28 AM, Dave Nicodemus
>>  wrote:
>> > Using Hive 1.2.1.2.3, connecting using JDBC, issuing the following
>> > query:
>> >
>> > SELECT COUNT(*)
>> > FROM nation n
>> > INNER JOIN (customer c
>> >  INNER JOIN orders o ON c.c_custkey =
>> > o.o_custkey)
>> >  ON n.n_nationkey = c.c_nationkey;
>> >
>> > Generates the NPE and stack below. Fields are integer data type
>> >
>> > Does anyone know if this is a known issue  and whether it's fixed
>> > someplace
>> > ?
>> >
>> > Thanks,
>> > Dave
>> >
>> > Stack
>> > 
>> > Caused by: java.lang.NullPointerException: Remote
>> > java.lang.NullPointerException: null
>> >
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPresent(SemanticAnalyzer.java:2046)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.parseJoinCondPopulateAlias(SemanticAnalyzer.java:2109)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.parseJoinCondPopulateAlias(SemanticAnalyzer.java:2185)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.parseJoinCondition(SemanticAnalyzer.java:2445)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.parseJoinCondition(SemanticAnalyzer.java:2386)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genJoinTree(SemanticAnalyzer.java:8192)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genJoinTree(SemanticAnalyzer.java:8131)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9709)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9636)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:10109)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:329)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10120)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:211)
>> > at
>> >
>> > org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
>> > at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:454)
>> > at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:314)
>> > at
>> > org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1164)
>> > at
>> > org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1158)
>> > at
>> >
>> > org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:110)
>> >
>> >
>> >
>> >
>
>


Re: NPE from simple nested ANSI Join

2016-02-04 Thread Nicholas Hakobian
I don't believe Hive supports that join format. It's expecting either a
table name or a subquery. If it's a subquery, it usually requires it to
have a table name alias so it can be referenced in an outer statement.
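
For example, a minimal sketch of that rewrite using the tables from your
query (untested, so verify the plan it produces):

-- the nested join rewritten as an aliased subquery, per the suggestion above
SELECT COUNT(*)
FROM nation n
INNER JOIN (
  SELECT c.c_nationkey, c.c_custkey
  FROM customer c
  INNER JOIN orders o ON c.c_custkey = o.o_custkey
) co ON n.n_nationkey = co.c_nationkey;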

-Nick

Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com

On Thu, Feb 4, 2016 at 11:28 AM, Dave Nicodemus
 wrote:
> Using Hive 1.2.1.2.3, connecting using JDBC, issuing the following query:
>
> SELECT COUNT(*)
> FROM nation n
> INNER JOIN (customer c
>  INNER JOIN orders o ON c.c_custkey = o.o_custkey)
>  ON n.n_nationkey = c.c_nationkey;
>
> Generates the NPE and stack below. Fields are integer data type
>
> Does anyone know if this is a known issue  and whether it's fixed someplace
> ?
>
> Thanks,
> Dave
>
> Stack
> 
> Caused by: java.lang.NullPointerException: Remote
> java.lang.NullPointerException: null
>
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.isPresent(SemanticAnalyzer.java:2046)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.parseJoinCondPopulateAlias(SemanticAnalyzer.java:2109)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.parseJoinCondPopulateAlias(SemanticAnalyzer.java:2185)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.parseJoinCondition(SemanticAnalyzer.java:2445)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.parseJoinCondition(SemanticAnalyzer.java:2386)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genJoinTree(SemanticAnalyzer.java:8192)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genJoinTree(SemanticAnalyzer.java:8131)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9709)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9636)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:10109)
> at
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:329)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10120)
> at
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:211)
> at
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:227)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:454)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:314)
> at
> org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1164)
> at
> org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1158)
> at
> org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:110)
>
>
>
>


Re: "Create external table" nulling data from source table

2016-01-28 Thread Nicholas Hakobian
Do you have any fields with embedded newline characters? If so,
certain Hive output formats will parse the newline character as the
end of a row, and when importing, chances are the missing fields (now
part of the next row) will be padded with nulls. This happens in Hive
as well if you are using a TextFile intermediate format (SequenceFile
does not have this problem).
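
One quick way to check whether this is what's happening (a sketch only; the
table and column names below are placeholders for your own):

-- hypothetical diagnostic: count rows in the ORC source table whose string
-- column contains an embedded newline
SELECT count(*)
FROM source_table
WHERE instr(some_string_col, '\n') > 0;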

-Nick

Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com



On Thu, Jan 28, 2016 at 12:05 PM, Riesland, Zack
 wrote:
> First time posting to this list. Please forgive me if I break etiquette. I’m
> looking for some help with getting data from Hive to HBase.
>
>
>
> I’m using HDP 2.2.8.
>
>
>
> I have a compressed (zlib), orc-based hive table with 12 columns and
> billions of rows.
>
> In order to get the data into hbase, I have to create a copy of the table as
> an "external" table, backed by CSV files (unless someone knows a better
> way).
>
>
>
> Then, I use the CsvBulkLoad mapreduce job to create hfiles from the csv
> files backing the external table.
>
> I’ve been doing this for almost a year, and MOST of the data ends up
> correct, but if I export a large amount of data, I end up with nulls where I
> shouldn't.
>
>
>
> If I run the exact same query on the source table (compressed orc) and
> destination table (external text) I get null values in the results of the
> latter, but not the former.
>
>
>
> However, if I only copy a small subset of the data to the text-based table,
> all the data is correct.
>
>
>
> I also noticed that if I use an uncompressed source table, and then copy to
> an external text-based table, it happens much more often.
>
>
>
> So, my (not-very-educated) guess is that this has to do with ORC files.
>
>
>
> I know that there are alternatives to ORC, but Hortonworks strongly
> encourages us to use ORC for everything. I’m not even sure whether Parquet
> works with HDP.
>
>
>
> Anyways, Is this a known bug?
>
> Any ideas on how I can get around it without chopping up my data into
> multiple tables?


Re: trying to figure out number of MR jobs from explain output

2015-12-11 Thread Nicholas Hakobian
You can't find out definitively, because it is going to depend on the
nature of the data being processed, especially when it comes to
mapjoins. If the output of one stage is small enough to be map-joined,
parts of a stage can be skipped, as the whole dataset is on every node.

I'm sure there are other conditions as well, but that is the general idea.
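
If you want the plan to be less data-dependent for counting purposes, one
thing you could try (a hedged suggestion, and it comes with a performance
cost) is turning off automatic mapjoin conversion before running EXPLAIN and
the query itself:

-- with automatic mapjoin conversion disabled, the conditional stages in the
-- plan are no longer chosen at runtime based on table sizes
set hive.auto.convert.join=false;

After that, re-run your EXPLAIN; the conditional branches may collapse into a
fixed set of MR stages.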

-Nick

Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com



On Fri, Dec 11, 2015 at 2:00 PM, Ophir Etzion  wrote:
> Hi,
>
> I've been trying to figure out how to know the number of MR jobs that will
> be run for a Hive query using the EXPLAIN output.
>
> I haven't found a consistent method of knowing that.
>
> for example (in one of my queries, ctas query):
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
>   Stage-4
>   Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
>   Stage-8 depends on stages: Stage-0
>   Stage-2 depends on stages: Stage-8
>   Stage-3
>   Stage-5
>   Stage-6 depends on stages: Stage-5
>
> Stage-1, Stage-3, Stage-5 are listed as map reduce steps.
>
> eventually 2 MR jobs ran.
>
> in other cases only 1 job runs.
>
> I couldn't find a consistent rule on how to figure this out.
>
> can anyone help??
>
> Thank you!!
>
> below is full output
>
> explain CREATE TABLE beekeeper_results.test3 ROW FORMAT SERDE
> "com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde" WITH
> SERDEPROPERTIES ('escape.delim'='\\', 'mapkey.delim'='\;',
> 'colelction.delim'='|') AS SELECT * FROM beekeeper_results.test2;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
>   Stage-4
>   Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
>   Stage-8 depends on stages: Stage-0
>   Stage-2 depends on stages: Stage-8
>   Stage-3
>   Stage-5
>   Stage-6 depends on stages: Stage-5
>
> STAGE PLANS:
>   Stage: Stage-1
> Map Reduce
>   Map Operator Tree:
>   TableScan
> alias: test2
> Statistics: Num rows: 112 Data size: 11690 Basic stats: COMPLETE
> Column stats: NONE
> Select Operator
>   expressions: blasttag (type: string), actioncounts (type:
> array>), detailedclicks (type:
> array>), countsbyclient
> (type: array>),
> totalactioncounts (type: array>),
> actionsbydate (type:
> array>)
>   outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
>   Statistics: Num rows: 112 Data size: 11690 Basic stats:
> COMPLETE Column stats: NONE
>   File Output Operator
> compressed: false
> Statistics: Num rows: 112 Data size: 11690 Basic stats:
> COMPLETE Column stats: NONE
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde:
> com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
> name: beekeeper_results.test3
>
>   Stage: Stage-7
> Conditional Operator
>
>   Stage: Stage-4
> Move Operator
>   files:
>   hdfs directory: true
>   destination:
> hdfs://hadoop-alidoro-nn-vip/user/hive/warehouse/.hive-staging_hive_2015-12-11_21-52-35_063_8498858370292854265-1/-ext-10001
>
>   Stage: Stage-0
> Move Operator
>   files:
>   hdfs directory: true
>   destination: ***
>
>   Stage: Stage-8
>   Create Table Operator:
> Create Table
>   columns: blasttag string, actioncounts
> array>, detailedclicks
> array>, countsbyclient
> array>, totalactioncounts
> array>, actionsbydate
> array>
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format:
> org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
>   serde name:
> com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
>   serde properties:
> colelction.delim |
> escape.delim \
> mapkey.delim ;
>   name: beekeeper_results.test3
>
>   Stage: Stage-2
> Stats-Aggr Operator
>
>   Stage: Stage-3
> Map Reduce
>   Map Operator Tree:
>   TableScan
> File Output Operator
>   compressed: false
>   table:
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   serde:
> com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
>   name: beekeeper_results.test3
>
>   Stage: Stage-5
> Map Reduce
>   Map Operator Tree:
>   TableScan
> File Output Operator
>   compressed: false
>   table:
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format:
> 

Re: Help With Hive Windowing

2015-12-10 Thread Nicholas Hakobian
Last_value and lag are not aggregate functions. They are best thought
of as streaming windowing functions. The last_value function picks the
last value over your window and applies it to every row in your select
statement. Similarly, lag over your window will lag that column by the
specified number of rows.

So what you are going to need to do in this case is something like the
following (there are other ways of course, such as ROW_NUMBER(), shown in a
sketch after the query, or other windowing functions and clever WHERE clauses):

SELECT
url,
MAX(STRUCT(hit_time, prev_hit_time)).col1 max_hit_time,
MAX(STRUCT(hit_time, prev_hit_time)).col2 prev_hit_time
FROM (
SELECT
url,
hit_time,
LAG(hit_time, 1, 0) OVER (PARTITION BY url ORDER BY hit_time)
prev_hit_time
FROM prod_user.web_hits
) a
GROUP BY url
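
The ROW_NUMBER() variant would look roughly like this (a sketch against the
same table; untested, so treat it as a starting point):

-- keep only the most recent hit per url, carrying the lagged previous hit
SELECT url, hit_time AS last_seen_dt, prev_hit_time AS prev_seen_dt
FROM (
  SELECT
    url,
    hit_time,
    LAG(hit_time, 1, 0) OVER (PARTITION BY url ORDER BY hit_time) AS prev_hit_time,
    ROW_NUMBER() OVER (PARTITION BY url ORDER BY hit_time DESC) AS rn
  FROM prod_user.web_hits
) a
WHERE rn = 1;
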
Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com
M: 510-295-7113



On Thu, Dec 10, 2015 at 9:21 AM, Justin Workman
 wrote:
> I am attempting to run a simple query over a table with website click data
> using the last_value and lag functions to get the last two times we saw URLs
> being hit by specific bots. The table I am querying has the following schema:
>
> Table Schema
> URL
> HIT_TIME
> USER_AGENT
> MONTH
> DAY
>
> Created as an unpartitioned ORC table with the last month's worth of click
> data. Based on the query below, I would expect to see one record for each
> URL with the last HIT_TIME and the previous HIT_TIME. Instead I am getting
> several records for each URL; each record does appear to be returning the
> last HIT_TIME and the previous HIT_TIME. How can I limit this to only return
> one record per URL with the true last HIT_TIME and previous HIT_TIME?
>
> Query
> select
>   url,
>last_value(hit_time) over w as last_seen_dt,
>lag(hit_time,1,0) over w as prev_seen_dt
> from prod_user.web_hits
> window w as (partition by url order by hit_time)
>
> Any advice would be greatly appreciated. I am sure I just don't understand
> the windowing syntax correctly. I've tried wrapping this in a sub-select and
> grouping by url and selecting distinct url. I've also played with the
> ordering in the window specification.
>
> Versions
> Hive-0.11.XXX from CDH 5.3.3.
>
> Thanks!
> Justin


Re: Re: Hiveserver2 client stdout

2015-10-19 Thread Nicholas Hakobian
If you want to retrieve the STDOUT logs from the HiveServer2 Thrift server,
you can see how beeline does it here:

https://github.com/apache/hive/blob/master/beeline/src/java/org/apache/hive/beeline/Commands.java#L958-L963

I used this as an example of how to pull the query logs for a recent
project.

Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Mon, Oct 19, 2015 at 10:30 PM, Jerrick Hoang 
wrote:

> Right, with beeline CLI, I can see logs just fine. But with the java JDBC
> client code (
> https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients) I
> can't see stdout logs
>
> On Mon, Oct 19, 2015 at 10:26 PM, r7raul1...@163.com 
> wrote:
>
>> You can check more option from
>> https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients .
>>
>>
>> *--showWarnings=*[true/false]
>>
>> Display warnings that are reported on the connection after issuing any
>> HiveQL commands. Default is false.
>>
>> Usage: beeline --showWarnings=true
>> *--showNestedErrs=*[true/false]
>>
>> Display nested errors. Default is false.
>>
>> Usage: beeline --showNestedErrs=true
>>
>> --
>> r7raul1...@163.com
>>
>>
>> *From:* Jerrick Hoang 
>> *Date:* 2015-10-20 11:55
>> *To:* user 
>> *Subject:* Re: Hiveserver2 client stdout
>> That would just increase the log level on the server side right? I want
>> the stdout log sent to the jdbc client.
>>
>> On Mon, Oct 19, 2015 at 6:49 PM, r7raul1...@163.com 
>> wrote:
>>
>>> start hiveserver2 with  --hiveconf hive.root.logger=DEBUG,console.
>>> Then check hiveserver2 log.
>>>
>>> --
>>> r7raul1...@163.com
>>>
>>>
>>> *From:* Jerrick Hoang 
>>> *Date:* 2015-10-20 09:03
>>> *To:* user 
>>> *Subject:* Hiveserver2 client stdout
>>> Hi all,
>>>
>>>
>>> I have a HiveServer2 instance running on a remote server. I'm able to
>>> connect to it via the beeline CLI and everything works fine, but when I try
>>> to query via a JDBC client, the stdout is not there. For example, if the
>>> server side throws an exception, beeline can show the exception, but all the
>>> JDBC client can do is give me `Error while processing statement: FAILED:
>>> Execution Error, return code 2 from
>>> org.apache.hadoop.hive.ql.exec.mr.MapRedTask`, which is not very helpful. Is
>>> there a way to capture all stdout/stderr using JDBC? I figure if beeline
>>> is able to do it then I should be able to write a JDBC client to do it too.
>>>
>>> THanks,
>>> J
>>>
>>>
>>
>


Re: Hive 1.2.1 installation troubleshooting - No known driver to handle "jdbc://hive2://:10000"

2015-10-08 Thread Nicholas Hakobian
The format of the jdbc connection string should be something like:

beeline -u jdbc:hive2://localhost:10000

Since you're connecting to localhost you can also try the embedded
connection mode by starting beeline like:

beeline -u jdbc:hive2://

The connection string format is documented here:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-ConnectionURLs


Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com
M: 510-295-7113


On Thu, Oct 8, 2015 at 7:13 AM, Timothy Garza <
timothy.ga...@collinsongroup.com> wrote:

> I should mention that all mentioned .jar files are located in the
> directories specified with the following privs: -rw-r--r--  1 hadoop
> hadoop
>
>
>
>
>
> I’ve installed Hive 1.2.1 on Amazon Linux AMI release 2015.03, master-node
> of Hadoop cluster.
>
>
>
> I can successfully access the Beeline client but when I try to connect to
> Hive-Server2… beeline
>
> !connect jdbc://hive2:// :10000
> org.apache.hive.jdbc.HiveDriver
>
>
>
> I get the following error:
>
> No known driver to handle "jdbc://hive2://:10000"
>
>
>
>
>
> …so instead I try to add the local hive-jdbc driver supplied in the Apache
> tarball:
>
> beeline
>
> !addlocaldriverjar /home/hadoop/hive/lib/hive-jdbc-1.2.1.jar
>
>
>
> Error:
>
> The jar file in the path /home/hadoop/hive/lib/hive-jdbc-1.2.1.jar can't
> be found!
>
>
>
>
>
> …so I tried the same thing with an older version of the hive-jdbc driver:
>
> beeline
>
> !addlocaldriverjar !addlocaldriverjar
> home/hadoop/hive/lib/hive-jdbc-0.13.1-amzn-3.jar
>
>
>
> Same error:
>
> The jar file in the path home/hadoop/hive/lib/hive-jdbc-0.13.1-amzn-3.jar
> can't be found!
>
>
>
> …so I tried bypassing this step as a beeline switch instead:
>
> beeline -u jdbc://hive2://localhost:10000 -d
> org.apache.hive.jdbc.HiveDriver
>
>
>
> Output:
>
> Connecting to jdbc://hive2://localhost:10000
>
> java.lang.NullPointerException
>
>
>
> I can find nothing related to previous troubleshooting regarding Hive
> installation in either the *Google* or *Apache* archives.
>
>
>
> Can anyone shed any light on why I might be having such undocumented
> problems out of the box or advice on further research forums I can’t read
> through?  Many thanks in advance!
>
>