Re: collecting lists in non-natural order / SORT BY on columns which are not part of the final result set

2015-06-13 Thread Robin Verlangen
Hi Michael,

You can try using a WITH statement (pseudo-SQL):

WITH input AS (SELECT colA, colB FROM table ORDER BY colA ASC)
SELECT colB FROM input
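
Applied to the foobar schema from the original mail, a fuller sketch would
distribute and sort before aggregating (caveat: COLLECT_LIST picking up values
in the reducer's input order is an implementation detail of Hive's execution,
not a documented guarantee):

SET hive.map.aggr = false;
SELECT
  actor,
  COLLECT_LIST(action) AS actions
FROM (
  -- order the rows per actor before they reach the aggregation
  SELECT actor, ts, action
  FROM foobar
  DISTRIBUTE BY actor
  SORT BY actor, ts
) sorted
GROUP BY actor;

On the sample rows this should yield 1 [foo, bar] and 2 [foo].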

Best regards,

Robin Verlangen
*Chief Data Architect*

W http://www.robinverlangen.nl
E ro...@us2.nl





collecting lists in non-natural order / SORT BY on columns which are not part of the final result set

2015-06-13 Thread Michael Häusler
Hi there,

Imagine you have a table of time-series transactions, e.g.:
CREATE TABLE
foobar (actor BIGINT, ts BIGINT, action STRING);

containing the rows:
1   2000   bar
1   1000   foo
2   1500   foo

An interesting query would be to get a denormalized view on all actions of an 
actor sorted by timestamp:
1   [foo, bar]
2   [foo]

This would require collecting the actions in a list in non-natural order. A 
naive query would look like this:
SET hive.map.aggr = false;
SELECT
actor,
COLLECT_LIST(action) AS actions
FROM
foobar
GROUP BY
actor
SORT BY
actor, ts;

Unfortunately, as the column "ts" is not in the final result set, this produces 
a SemanticException [Error 10004]: Line 9:11 Invalid table alias or column 
reference 'ts'.


I do understand that it is difficult to allow a global ORDER BY on a column
that is not in the result set.
But the SORT BY only needs to ensure order within the reduce group. Is there a 
way to get this behaviour in Hive?

Best regards
Michael


Re: Hive double issues while moving around RC files between clusters

2015-06-13 Thread Robin Verlangen
One thing I found in the change logs was
https://issues.apache.org/jira/browse/HIVE-7041, which sounds like it might
be related. I don't use any byte data types in the structure, though, so
that would be hard to verify.
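
In case the two clusters ended up with different columnar SerDes, a quick
check of what the new table actually uses (table name hypothetical):

DESCRIBE FORMATTED foobar_rc;

And since the stack trace runs through the lazybinary classes, it may be
worth pointing the table at the lazy-binary variant explicitly:

ALTER TABLE foobar_rc SET SERDE
  'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe';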

Best regards,

Robin Verlangen
*Chief Data Architect*

W http://www.robinverlangen.nl
E ro...@us2.nl





Hive double issues while moving around RC files between clusters

2015-06-13 Thread Robin Verlangen
Hi there,

I was copying RC files from a CDH Hadoop 2.0 cluster to a new HDP Hadoop 2.6
cluster.

After creating a new table stored as RCFile, with LOCATION pointing at the
copied files, I can query all columns except the DOUBLE ones.
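
For reference, the DDL was along these lines (the column names here are
placeholders, not the real schema):

CREATE EXTERNAL TABLE foobar_rc (
  id BIGINT,
  amount DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE
LOCATION '/data/migrated/rcfiles';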

I tried querying with Hive (via Tez and MR), Beeline, and Presto; none of
them works.

The error from Hive is:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.ArrayIndexOutOfBoundsException: 20221
    at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryUtils.byteArrayToLong(LazyBinaryUtils.java:84)
    at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryDouble.init(LazyBinaryDouble.java:43)
    at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:111)
    at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:172)
    at org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:67)
    at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldData(UnionStructObjectInspector.java:140)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:353)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:197)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:183)

The error from Presto is less verbose, but also gives a lead:

Query 20150613_114049_00297_468ni failed: Double should be 8 bytes

Both point at something around the doubles that seems to be causing the issue.

As for the table settings, both SerDes are
'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe', which should be
correct. The Hive versions on the old (0.8) and new (0.14) clusters differ
quite a bit, but the RC files themselves are valid (checksums match); only
the doubles are "stuck".

I tried
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-DecimalTypeIncompatibilitiesbetweenHive0.12.0and0.13.0
but that doesn't seem to help either.

Any idea on how I can resolve this?

Thanks in advance!

Best regards,

Robin Verlangen
*Chief Data Architect*

W http://www.robinverlangen.nl
E ro...@us2.nl




Re: nested join issue

2015-06-13 Thread Gautam
To clarify: HIVE-8435 introduced the regression, and turning that feature off
fixes the issue. So we still need to fix that optimization so it does not
produce this incorrect result.
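
For reference, the workaround is a session-level flag, and the related
optimizer flags Gopal mentions below can be probed the same way (flag names
as in the Hive 1.x HiveConf; verify against your version):

-- workaround for the incorrect result introduced by HIVE-8435
SET hive.optimize.remove.identity.project=false;
-- related optimizations worth toggling while narrowing this down:
SET hive.optimize.constant.propagation=false;
SET hive.optimize.null.scan=false;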

On Fri, Jun 12, 2015 at 11:31 PM, Gautam  wrote:

> Found that turning off hive.optimize.remove.identity.project (ref:
> HIVE-8435) fixes the issue.
>
> This gives us a workaround, but we don't know the performance impact yet.
>
> Thanks!
> -Gautam.
>
>
> On Fri, Jun 12, 2015 at 6:02 PM, Gautam  wrote:
>
>> Done. https://issues.apache.org/jira/browse/HIVE-10996
>>
>> On Fri, Jun 12, 2015 at 1:47 PM, Gopal Vijayaraghavan 
>> wrote:
>>
>>> Hi
>>>
>>> > Thanks for investigating. Trying to locate the patch that fixes this
>>> > between 1.1 and 2.0.0-SNAPSHOT. Any leads on which JIRA this fix was part
>>> > of? Or what part of the code the patch is likely to be in?
>>>
>>> git bisect is usually the only way to identify these things.
>>>
>>> But before you hunt through the patches, I suggest trying combinations of
>>> the constant propagation, null-scan, and identity projection remover
>>> optimizations to see if there's a workaround in there.
>>>
>>> An EXPLAIN of the query added to a new JIRA would be good, to continue
>>> the analysis.
>>>
>>> Cheers,
>>> Gopal



-- 
"If you really want something in this life, you have to work for it. Now,
quiet! They're about to announce the lottery numbers..."