Re: Duplicate rows when using group by in subquery

2013-09-19 Thread Mikael Öhman
Hello again.

I have now checked out latest code from trunk and built as per instructions.

However, this query:

select a.Symbol, count(*) 
from (select Symbol, catid from cat group by Symbol, catid) a 
group by a.Symbol;

still returns an incorrect number of rows for table:

create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE int, 
INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int, ORDID 
string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string, SRCREPID int, 
TIMESTAMP timestamp) PARTITIONED BY (SYMBOL string, REPID int) row format 
delimited fields terminated by ',' stored as ORC;
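
For reference, assuming CATID is never NULL, the nested GROUP BY is equivalent
to a single count(DISTINCT ...), which can be used to cross-check the expected
row counts:

-- cross-check sketch: the subquery emits one row per distinct (Symbol, catid)
-- pair, so count(*) per Symbol should equal count(DISTINCT catid) per Symbol
select Symbol, count(distinct catid)
from cat
group by Symbol;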


Here is the result of EXPLAIN:


hive> EXPLAIN select a.Symbol, count(*) 
     from (select Symbol, catid from cat group by Symbol, catid) a 
     group by a.Symbol;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF 
(TOK_TABNAME cat))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) 
(TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL Symbol)) (TOK_SELEXPR 
(TOK_TABLE_OR_COL catid))) (TOK_GROUPBY (TOK_TABLE_OR_COL Symbol) 
(TOK_TABLE_OR_COL catid a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) Symbol)) 
(TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) 
Symbol

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
  Alias -> Map Operator Tree:
    a:cat 
  TableScan
    alias: cat
    Select Operator
  expressions:
    expr: symbol
    type: string
    expr: catid
    type: bigint
  outputColumnNames: symbol, catid
  Group By Operator
    bucketGroup: false
    keys:
  expr: symbol
  type: string
  expr: catid
  type: bigint
    mode: hash
    outputColumnNames: _col0, _col1
    Reduce Output Operator
  key expressions:
    expr: _col0
    type: string
    expr: _col1
    type: bigint
  sort order: ++
  Map-reduce partition columns:
    expr: _col0
    type: string
    expr: _col1
    type: bigint
  tag: -1
  Reduce Operator Tree:
    Group By Operator
  bucketGroup: false
  keys:
    expr: KEY._col0
    type: string
    expr: KEY._col1
    type: bigint
  mode: mergepartial
  outputColumnNames: _col0, _col1
  Select Operator
    expressions:
  expr: _col0
  type: string
    outputColumnNames: _col0
    Group By Operator
  aggregations:
    expr: count()
  bucketGroup: false
  keys:
    expr: _col0
    type: string
  mode: complete
  outputColumnNames: _col0, _col1
  Select Operator
    expressions:
  expr: _col0
  type: string
  expr: _col1
  type: bigint
    outputColumnNames: _col0, _col1
    File Output Operator
  compressed: false
  GlobalTableId: 0
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
  limit: -1


Using set hive.optimize.reducededuplication=false;
I get 2 mapreduce jobs and the correct number of rows (24).

Can I verify somehow, perhaps by looking at the source code, that I indeed have 
the correct version? Or is there a command I can execute from the Hive CLI that 
shows the version? I just built from source this morning, so it seems strange 
that the bug would still persist :(.




 From: Yin Huai huaiyin@gmail.com
To: user@hive.apache.org; Mikael Öhman mikael_u...@yahoo.se 
Sent: Tuesday, 17 September 2013 15:30
Subject: Re: Duplicate rows when using group by in subquery
 


Hello Mikael,

ReduceSinkDeduplication automatically kicked in because it is enabled by 
default. The original plan tries to shuffle your data twice. Then, 
ReduceSinkDeduplication finds that the original plan can be optimized to 
shuffle your data once. But, when picking the partitioning columns, this 
optimizer picked the wrong columns because of the bug. 

Also, you can try your query with and without ReduceSinkDeduplication (use set 
hive.optimize.reducededuplication=false; to turn this 
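
A quick way to see the effect is to compare the EXPLAIN output with the
optimization on and off; with it disabled, the plan should show two map-reduce
stages (two shuffles) instead of one. A sketch of that comparison:

-- plan with ReduceSinkDeduplication enabled (the default): one map-reduce
-- stage, but partitioned on the wrong columns because of the bug above
EXPLAIN
select a.Symbol, count(*)
from (select Symbol, catid from cat group by Symbol, catid) a
group by a.Symbol;

-- disable the optimization and compare: the plan should now contain two
-- map-reduce stages and the query should return the correct counts
set hive.optimize.reducededuplication=false;
EXPLAIN
select a.Symbol, count(*)
from (select Symbol, catid from cat group by Symbol, catid) a
group by a.Symbol;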

Re: User accounts to execute hive queries

2013-09-19 Thread Rudra Tripathy
Thanks Nitin for the help, I would try.

Thanks and Regards,
Rudra

On Wed, Sep 18, 2013 at 5:14 PM, Thejas Nair the...@hortonworks.com wrote:

 You might find my slides on this topic useful -
 http://www.slideshare.net/thejasmn/hive-authorization-models

 Also linked from last slide  -

 https://cwiki.apache.org/confluence/display/HCATALOG/Storage+Based+Authorization

 On Tue, Sep 17, 2013 at 11:46 PM, Nitin Pawar nitinpawar...@gmail.com
 wrote:
  The link I gave in the previous mail explains how you can do user-level
  authorization in Hive.
 
 
 
  On Mon, Sep 16, 2013 at 7:57 PM, shouvanik.hal...@accenture.com wrote:
 
  Hi Nitin,
 
 
 
  I want it secured.
 
 
 
  Yes, I would like to give specific access to specific users. E.g.
 “select
  * from” access to some and “add/modify/delete” options to some
 
 
 
 
 
  “What kind of security do you have on hdfs? “
 
  I could not follow this question
 
 
 
  Thanks,
 
  Shouvanik
 
  From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
  Sent: Monday, September 16, 2013 6:50 PM
  To: Haldar, Shouvanik
  Cc: user@hive.apache.org
  Subject: Re: User accounts to execute hive queries
 
 
 
  You will need to tell us a few more things.
 
  Do you want it secured?
 
  Do you distinguish users in different categories on what one particular
  user can do or not?
 
  What kind of security do you have on hdfs?
 
 
 
 
 
  It is definitely possible for users to run queries under their own usernames,
  but then you have to take a few measures as well:
 
  which user can do what action, which user can access what location on
 hdfs, etc.
 
 
 
  For user management on hive side you can read at
  https://cwiki.apache.org/Hive/languagemanual-authorization.html
 
 
 
  if you do not want to go through the secure way,
 
  then add all the users to one group and then grant permissions to that
  group on your warehouse directory.
 
 
 
  The other way, if the table data is not shared: create an individual
  directory for each user on HDFS and give only that user access to that
  directory.
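
  For the SQL-style grants on the languagemanual-authorization page mentioned
  above, a minimal sketch could look like this (the table and user names are
  hypothetical, and the exact behaviour depends on how authorization is
  configured on your cluster):

  set hive.security.authorization.enabled=true;

  -- read-only access for one (hypothetical) user
  GRANT SELECT ON TABLE sales TO USER analyst1;

  -- full access for another
  GRANT ALL ON TABLE sales TO USER etl_user;

  -- verify what was granted
  SHOW GRANT USER analyst1 ON TABLE sales;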
 
 
  
 
  --
  Nitin Pawar




Operators && and || do not work

2013-09-19 Thread amareshwari sriramdasu
Hello,

Though the documentation
https://cwiki.apache.org/Hive/languagemanual-udf.html says they are the same as
AND and OR, they do not even get parsed. Users get a parse error when they are
used. Was that intentional or is it a regression?

hive> select key from src where key=a || key =b;
FAILED: Parse Error: line 1:33 cannot recognize input near '|' 'key' '=' in
expression specification

hive> select key from src where key=a && key =b;
FAILED: Parse Error: line 1:33 cannot recognize input near '&' 'key' '=' in
expression specification

Thanks
Amareshwari
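
For anyone hitting this, the documented keyword forms do parse; only the
C-style symbols fail. Keyword equivalents of the queries above (assuming 'a'
and 'b' were meant as string literals):

-- these parse fine and behave as AND / OR are documented to behave
select key from src where key='a' OR key='b';
select key from src where key='a' AND key='b';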


Hive 0.11.0 | Issue with ORC Tables

2013-09-19 Thread Savant, Keshav
Hi All,

We have set up Apache Hive 0.11.0 services on a Hadoop cluster (Apache Hadoop 
0.20.203.0). Hive shows the expected results when tables are stored as 
TextFile.
However, Hive 0.11.0's new ORC (Optimized Row Columnar) format throws an 
exception when we run select queries on tables stored as ORC.
Stacktrace of the exception :

2013-09-19 20:33:38,095 ERROR CliDriver (SessionState.java:printError(386)) - 
Failed with exception 
java.io.IOException:com.google.protobuf.InvalidProtocolBufferException: While 
parsing a protocol message, the input ended unexpectedly in the middle of a 
field.  This could mean either than the input has been truncated or that an 
embedded message misreported its own length.
java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: While 
parsing a protocol message, the input ended unexpectedly in the middle of a 
field.  This could mean either than the input has been truncated or that an 
embedded message misreported its own length.
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:544)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:488)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:136)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1412)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:271)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:756)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: com.google.protobuf.InvalidProtocolBufferException: While parsing a 
protocol message, the input ended unexpectedly in the middle of a field.  This 
could mean either than the input has been truncated or that an embedded message 
misreported its own length.
at 
com.google.protobuf.InvalidProtocolBufferException.truncatedMessage(InvalidProtocolBufferException.java:49)
at 
com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:754)
at 
com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:294)
at 
com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:484)
at 
com.google.protobuf.GeneratedMessage$Builder.parseUnknownField(GeneratedMessage.java:438)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$Builder.mergeFrom(OrcProto.java:10129)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$Builder.mergeFrom(OrcProto.java:9993)
at 
com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:300)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.parseFrom(OrcProto.java:9970)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:193)
at 
org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:56)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:168)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:432)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:508)


We did following steps that leads to above exception:

* SET mapred.output.compression.codec= 
org.apache.hadoop.io.compress.SnappyCodec;

* CREATE TABLE person(id INT, name STRING) ROW FORMAT DELIMITED FIELDS 
TERMINATED BY ' ' STORED AS ORC tblproperties ("orc.compress"="Snappy");

* LOAD DATA LOCAL INPATH 'test.txt' INTO TABLE person;

* Executing  : SELECT * FROM person;
Results :

Failed with exception 
java.io.IOException:com.google.protobuf.InvalidProtocolBufferException: While 
parsing a protocol message, the input ended unexpectedly in the middle of a 
field.  This could mean either than the input has been truncated or that an 
embedded message misreported its own length.



Also, we included the codec property in core-site.xml in our Hadoop cluster 
along with the other configuration settings:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>



Following are the new jars with their placements



1.   Placed a new jar at $HIVE_HOME/lib/config-1.0.0.jar

2.   Placed a new jar for metastore connection 
$HIVE_HOME/lib/mysql-connector-java-5.1.17-bin.jar

3.   Moved jackson-core-asl-1.8.8.jar from $HIVE_HOME/lib to 
$HADOOP_HOME/lib

4.  

Re: Hive 0.11.0 | Issue with ORC Tables

2013-09-19 Thread Nitin Pawar
How did you create test.txt as an ORC file?



On Thu, Sep 19, 2013 at 5:34 PM, Savant, Keshav 
keshav.c.sav...@fisglobal.com wrote:

  Hi All,

  We did following steps that leads to above exception:

  * SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

  * CREATE TABLE person(id INT, name STRING) ROW FORMAT DELIMITED FIELDS
  TERMINATED BY ' ' STORED AS ORC tblproperties ("orc.compress"="Snappy");

  * LOAD DATA LOCAL INPATH 'test.txt' INTO TABLE person;

  * Executing: SELECT * FROM person;

  Results: Failed with exception
  java.io.IOException:com.google.protobuf.InvalidProtocolBufferException: While
  parsing a protocol message, the input ended unexpectedly in the middle of a
  field.

Re: Operators && and || do not work

2013-09-19 Thread Ashutosh Chauhan
I have not tested it on historical versions, so don't know on which
versions it used to work (if ever), but possibly antlr upgrade [1] may have
impacted this.

[1] : https://issues.apache.org/jira/browse/HIVE-2439

Ashutosh


On Thu, Sep 19, 2013 at 4:52 AM, amareshwari sriramdasu 
amareshw...@gmail.com wrote:

 Hello,

 Though the documentation
 https://cwiki.apache.org/Hive/languagemanual-udf.html says they are the same
 as AND and OR, they do not even get parsed. Users get a parse error when they
 are used. Was that intentional or is it a regression?

 hive> select key from src where key=a || key =b;
 FAILED: Parse Error: line 1:33 cannot recognize input near '|' 'key' '=' in
 expression specification

 hive> select key from src where key=a && key =b;
 FAILED: Parse Error: line 1:33 cannot recognize input near '&' 'key' '=' in
 expression specification

 Thanks
 Amareshwari



Re: Hive 0.11.0 | Issue with ORC Tables

2013-09-19 Thread Owen O'Malley
On Thu, Sep 19, 2013 at 5:04 AM, Savant, Keshav 
keshav.c.sav...@fisglobal.com wrote:

  We did following steps that leads to above exception:

  * SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

  * CREATE TABLE person(id INT, name STRING) ROW FORMAT DELIMITED FIELDS
  TERMINATED BY ' ' STORED AS ORC tblproperties ("orc.compress"="Snappy");
The problem is that load data doesn't convert the file into ORC format.
You need to use the following commands:

CREATE TABLE person_staging (id INT, name STRING);

LOAD DATA LOCAL INPATH 'test.txt' INTO TABLE person_staging;

SELECT * FROM person_staging;

INSERT OVERWRITE TABLE person select * from person_staging;

SELECT * FROM person;

Sorry for the bad error message. I improved the ORC reader to explicitly
check that the file is actually an ORC file in
https://issues.apache.org/jira/browse/HIVE-4724 .
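
If the staging-table detour feels heavy, the same conversion can also be done
in one step with CREATE TABLE ... AS SELECT; a sketch, where the target table
name is just an example:

-- build the ORC table directly from the staging data in a single statement
CREATE TABLE person_orc
  STORED AS ORC
  TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT id, name FROM person_staging;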

 


Re: Export/Import Table in Hive NPE

2013-09-19 Thread Brad Ruderman
Hi All-
I have opened up a ticket for this issue:
https://issues.apache.org/jira/browse/HIVE-5318

Can anyone repro this to confirm it's a bug with Hive and not a configuration
issue within my instance?

Thanks,
Brad


On Tue, Sep 17, 2013 at 2:22 PM, Brad Ruderman bruder...@radiumone.comwrote:

 Hi All-
 I am trying to export a table in Hive 0.9, then import it into Hive 0.10
 staging, essentially moving data from a production instance to staging.

 I used the EXPORT TABLE command, however when I try to import the table
 back into staging I receive the following (pulled from the hive.log file).
 Could anyone help point me to what I should be looking at that could cause
 the problem? This is a Hive-managed table in the source instance, where it
 was originally moved into Hive by Sqoop.

 Thanks,
 Brad



 2013-09-17 14:10:27,482 INFO  parse.ParseDriver
 (ParseDriver.java:parse(433)) - Parsing command: IMPORT FROM
 'hdfs://user/hdfs/test_table'
 2013-09-17 14:10:27,482 INFO  parse.ParseDriver
 (ParseDriver.java:parse(450)) - Parse Completed
 2013-09-17 14:10:27,486 ERROR ql.Driver
 (SessionState.java:printError(427)) - FAILED: SemanticException Exception
 while processing
 org.apache.hadoop.hive.ql.parse.SemanticException: Exception while
 processing
 at
 org.apache.hadoop.hive.ql.parse.ImportSemanticAnalyzer.analyzeInternal(ImportSemanticAnalyzer.java:277)
  at
 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:459)
  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:349)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:938)
  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)
 at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
 Caused by: java.lang.IllegalArgumentException:
 java.net.UnknownHostException: user
 at
 org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
  at
 org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:164)
 at
 org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)
 at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:448)
 at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:410)
  at
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:128)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2308)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:87)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2342)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2324)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:351)
  at
 org.apache.hadoop.hive.ql.parse.ImportSemanticAnalyzer.analyzeInternal(ImportSemanticAnalyzer.java:87)
 ... 15 more
 Caused by: java.net.UnknownHostException: user
 ... 27 more

 2013-09-17 14:10:27,487 INFO  ql.Driver (PerfLogger.java:PerfLogEnd(115))
 - /PERFLOG method=compile start=1379452227481 end=1379452227487 duration=6
 2013-09-17 14:10:27,487 INFO  ql.Driver (PerfLogger.java:PerfLogBegin(88))
 - PERFLOG method=releaseLocks
 2013-09-17 14:10:27,487 INFO  ql.Driver (PerfLogger.java:PerfLogEnd(115))
 - /PERFLOG method=releaseLocks start=1379452227487 end=1379452227487
 duration=0
 2013-09-17 14:10:27,487 INFO  ql.Driver (PerfLogger.java:PerfLogBegin(88))
 - PERFLOG method=releaseLocks
 2013-09-17 14:10:27,487 INFO  ql.Driver (PerfLogger.java:PerfLogEnd(115))
 - /PERFLOG method=releaseLocks start=1379452227487 end=1379452227487
 duration=0
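
 The java.net.UnknownHostException: user in the trace suggests that in the URI
 'hdfs://user/hdfs/test_table' the segment user is being parsed as the NameNode
 host. A hedged sketch of the same export/import with an unambiguous path (the
 NameNode address below is only a placeholder):

 -- on the source instance: export to a path on its default filesystem
 EXPORT TABLE test_table TO '/user/hdfs/test_table_export';

 -- on the target instance, after copying the export directory over:
 IMPORT FROM '/user/hdfs/test_table_export';
 -- or, with an explicit authority:
 -- IMPORT FROM 'hdfs://namenode-host:8020/user/hdfs/test_table_export';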



Re: Duplicate rows when using group by in subquery

2013-09-19 Thread Yin Huai
Maybe you were still using the CLI, which was pointing to the Hive 0.11 libs.
After you build trunk (https://github.com/apache/hive.git), you need to use
trunk-dir/build/dist as your Hive home and trunk-dir/build/dist/bin/hive to
launch the Hive CLI. You can find the Hive 0.13 libs in trunk-dir/build/dist/lib.

By the way, trunk seems to have an issue today. You can try the Hive 0.12 branch.


On Thu, Sep 19, 2013 at 4:26 AM, Mikael Öhman mikael_u...@yahoo.se wrote:

 Hello again.

 I have now checked out latest code from trunk and built as per
 instructions.


Re: Operators && and || do not work

2013-09-19 Thread Thiruvel Thirumoolan
Hi Amareshwari/Ashutosh,

Ashutosh is probably right; I doubt this ever worked. I couldn't find a
clientpositive test case which uses && or ||.

I also modified a unit test case in Hive 0.9 to use && instead of AND, and
that failed with the same error Amareshwari saw. Hive 0.9 does not have
HIVE-2439.

-Thiruvel

On 9/19/13 7:21 AM, Ashutosh Chauhan hashut...@apache.org wrote:

I have not tested it on historical versions, so don't know on which
versions it used to work (if ever), but possibly antlr upgrade [1] may
have
impacted this.

[1] : https://issues.apache.org/jira/browse/HIVE-2439

Ashutosh





Re: Operators && and || do not work

2013-09-19 Thread amareshwari sriramdasu
Yes, it should not be because of HIVE-2439. Even in Hive 0.7 it is not
working; I am not sure it worked in any version. Will create a JIRA to track it.

Thanks
Amareshwari


On Fri, Sep 20, 2013 at 6:03 AM, Thiruvel Thirumoolan 
thiru...@yahoo-inc.com wrote:

 Hi Amareshwari/Ashutosh,

  Ashutosh is probably right; I doubt this ever worked. I couldn't find a
  clientpositive test case which uses && or ||.

  I also modified a unit test case in Hive 0.9 to use && instead of AND, and
  that failed with the same error Amareshwari saw. Hive 0.9 does not have
  HIVE-2439.

 -Thiruvel





Re: De-serializing Thrift Optional fields

2013-09-19 Thread Kanwaljit Singh

 Hi,

  We are creating a table by deserializing a Thrift file. We end up with an
  extra Hive column named *optionals* of type *struct*.

  This breaks SELECT *! How can we prevent it?
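
  One possible workaround sketch, until the serde behaviour itself is
  addressed: hide the generated column behind a view, or always list the real
  columns explicitly. The table and column names below are made up for
  illustration:

  -- expose only the real Thrift fields; "events", "id" and "payload" are hypothetical
  CREATE VIEW events_v AS
  SELECT id, payload FROM events;

  -- SELECT * on the view only returns the listed columns, which may avoid the issue
  SELECT * FROM events_v;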