[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group

2016-03-09 Thread Siddharth Seth (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188204#comment-15188204 ]

Siddharth Seth commented on HIVE-13250:
---

cc [~ashutoshc], [~prasanth_j] - this was initially filed for Orc split 
elimination and partition pruning. [~ashutoshc] mentioned that CBO may be 
affected as well.

> Compute predicate conversions on the client, instead of per row group
> -
>
> Key: HIVE-13250
> URL: https://issues.apache.org/jira/browse/HIVE-13250
> Project: Hive
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>
> When running a query of the form
> select count from table where ts_field = "2016-01-23 00:00:00";
> or
> select count from table where ts_field = 1453507200
> ts_field is of type TIMESTAMP
> The predicate is converted to whatever format is appropriate for TIMESTAMP 
> processing on each and every row group.
> It would be far more efficient to process this once on the client - or even 
> once per task.
> The same applies to ORC split elimination as well - this is applied for each
> stripe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group

2016-03-10 Thread Gopal V (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189842#comment-15189842 ]

Gopal V commented on HIVE-13250:


The int -> timestamp conversion needs docs, thanks to 
{{hive.int.timestamp.conversion.in.seconds=false}}
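For reference on that flag's effect, here is a small standalone sketch (plain {{java.sql.Timestamp}}, not Hive's conversion code; it assumes the documented semantics of the flag, i.e. false reads the integer as epoch milliseconds and true as epoch seconds) showing how the literal {{1453507200}} from the description is interpreted either way:

{code}
// Standalone illustration, not Hive code: the two readings of the integer
// literal 1453507200 under hive.int.timestamp.conversion.in.seconds.
import java.sql.Timestamp;
import java.util.TimeZone;

public class IntToTimestampDemo {
    public static void main(String[] args) {
        TimeZone.setDefault(TimeZone.getTimeZone("UTC")); // keep the output stable
        long literal = 1453507200L;

        // flag = true: interpret the value as epoch seconds
        Timestamp asSeconds = new Timestamp(literal * 1000L);
        // flag = false (the default): interpret the value as epoch milliseconds
        Timestamp asMillis = new Timestamp(literal);

        System.out.println("as seconds: " + asSeconds); // 2016-01-23 00:00:00.0
        System.out.println("as millis:  " + asMillis);  // 1970-01-17 19:45:07.2
    }
}
{code}

So whether the second query form in the description matches any rows at all depends on that setting, which is why it deserves documentation.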

> Compute predicate conversions on the client, instead of per row group
> -
>
> Key: HIVE-13250
> URL: https://issues.apache.org/jira/browse/HIVE-13250
> Project: Hive
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-13250.patch
>
>
> When running a query of the form
> select count from table where ts_field = "2016-01-23 00:00:00";
> or
> select count from table where ts_field = 1453507200
> ts_field is of type TIMESTAMP
> The predicate is converted to whatever format is appropriate for TIMESTAMP 
> processing on each and every row group.
> It would be far more efficient to process this once on the client - or even 
> once per task.
> The same applies to ORC split elimination as well - this is applied for each
> stripe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group

2016-03-19 Thread Ashutosh Chauhan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201819#comment-15201819 ]

Ashutosh Chauhan commented on HIVE-13250:
-

I misunderstood this bug report. Without the patch, the filter expression for {{ts_field = "2016-01-23 00:00:00"}} gets executed as {{(UDFToString(ts_field) = '2016-01-23 00:00:00')}}. In the patch I made changes so that the cast is on the constant, {{(ts_field = UDFTOTimeStamp('2016-01-23 00:00:00'))}}, which gets folded at compile time to {{(ts_field = 2016-01-23 00:00:00.0)}}.
However, this is incorrect. The earlier behavior of casting the column is indeed correct. I tested this on Oracle, MySQL & SQL Server, all of which put the cast on the column, not the constant. [~sseth] Do you have anything else in mind for this one?
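For concreteness, the two shapes side by side as plain Java (illustration only, using {{java.sql.Timestamp}} and a formatter rather than Hive's UDF machinery):

{code}
// Illustration of the two rewrites of  ts_field = '2016-01-23 00:00:00'
// discussed above; not Hive's expression evaluation code.
import java.sql.Timestamp;
import java.text.SimpleDateFormat;

public class PredicateShapes {
    private static final SimpleDateFormat FMT = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    // Cast on the column: the conversion runs for every row that is evaluated.
    static boolean castOnColumn(Timestamp tsField) {
        return FMT.format(tsField).equals("2016-01-23 00:00:00");
    }

    // Cast on the constant: the conversion happens once and can be constant-folded.
    static final Timestamp FOLDED = Timestamp.valueOf("2016-01-23 00:00:00");
    static boolean castOnConstant(Timestamp tsField) {
        return tsField.equals(FOLDED);
    }

    public static void main(String[] args) {
        Timestamp row = Timestamp.valueOf("2016-01-23 00:00:00");
        System.out.println(castOnColumn(row));   // true
        System.out.println(castOnConstant(row)); // true
    }
}
{code}

For an exactly matching literal the two agree; the point above is that they are not equivalent in general, which is why the cast-on-column form is the semantically safe default even though it is the more expensive one.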


> Compute predicate conversions on the client, instead of per row group
> -
>
> Key: HIVE-13250
> URL: https://issues.apache.org/jira/browse/HIVE-13250
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 2.1.0
>Reporter: Siddharth Seth
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-13250.2.patch, HIVE-13250.patch
>
>
> When running a query of the form
> select count from table where ts_field = "2016-01-23 00:00:00";
> or
> select count from table where ts_field = 1453507200
> ts_field is of type TIMESTAMP
> The predicate is converted to whatever format is appropriate for TIMESTAMP 
> processing on each and every row group.
> It would be far more efficient to process this once on the client - or even 
> once per task.
> The same applies to ORC split elimination as well - this is applied for each
> stripe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group

2016-03-19 Thread Hive QA (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201376#comment-15201376 ]

Hive QA commented on HIVE-13250:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12793829/HIVE-13250.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 383 failed/errored test(s), 9833 tests executed
*Failed tests:*
{noformat}
TestSparkCliDriver-groupby3_map.q-sample2.q-auto_join14.q-and-12-more - did not produce a TEST-*.xml file
TestSparkCliDriver-groupby_map_ppr_multi_distinct.q-table_access_keys_stats.q-groupby4_noskew.q-and-12-more - did not produce a TEST-*.xml file
TestSparkCliDriver-join_rc.q-insert1.q-vectorized_rcfile_columnar.q-and-12-more - did not produce a TEST-*.xml file
TestSparkCliDriver-ppd_join4.q-join9.q-ppd_join3.q-and-12-more - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_allcolref_in_udf
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_part
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_archive_excludeHadoop20
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join11
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join12
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join13
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join16
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join27
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join7
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_without_localtask
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_smb_mapjoin_14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_13
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_15
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucket_map_join_spark4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_7
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketsortoptimize_insert_8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cast1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_join
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_auto_join1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_join
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_lineage2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_outer_join_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_semijoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_union
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_semijoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_union
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_column_access_stats
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_combine2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_constprog3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_constprog_dp
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_create_view
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_create_view_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cross_join_merge
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_colname
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cte_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cte_5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cte_mat_1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cte_mat_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_database
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynamic_rdd_cache
org.apache.hadoop.hiv
{noformat}

[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group

2016-03-21 Thread Siddharth Seth (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204809#comment-15204809 ]

Siddharth Seth commented on HIVE-13250:
---

[~ashutoshc] - this is what was observed.

The following exception was seen for every row group in ORC files. Note how the constant is being cast each and every time; the intent was to avoid that. It seems like this is something that can be avoided on the client itself, by casting the constant to whatever format the column requires. Now, with schema evolution, this may not always be possible, hence the suggestion to do it once per task.
{code}
2016-02-10 02:15:43,175 [WARN] [TezChild] |orc.RecordReaderImpl|: Exception when evaluating predicate. Skipping ORC PPD. Exception: java.lang.IllegalArgumentException: ORC SARGS could not convert from String to TIMESTAMP
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.getBaseObjectForComparison(RecordReaderImpl.java:659)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.evaluatePredicateRange(RecordReaderImpl.java:373)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.evaluatePredicateProto(RecordReaderImpl.java:338)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$SargApplier.pickRowGroups(RecordReaderImpl.java:710)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.pickRowGroups(RecordReaderImpl.java:751)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:777)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:986)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1019)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:205)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:598)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1269)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1151)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:249)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:135)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:101)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:149)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
    at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:650)
    at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:621)
    at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
    at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:406)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:128)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:149)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    a
{code}
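A rough sketch of the "convert once" idea (names such as {{normalizeLiteral}} and {{RowGroupStats}} are made up for illustration; this is not the RecordReaderImpl/SARG code path): coerce the predicate literal to the column's type a single time, then compare every row group's min/max against the already-coerced value.

{code}
// Hypothetical sketch, not Hive code: convert the predicate literal once per
// task (or on the client), instead of once per row group.
import java.sql.Timestamp;
import java.util.Arrays;
import java.util.List;

public class ConvertOnceSketch {
    // Imaginary stand-in for a row group's column statistics.
    static class RowGroupStats {
        final Timestamp min, max;
        RowGroupStats(Timestamp min, Timestamp max) { this.min = min; this.max = max; }
    }

    // The conversion that currently happens per row group, done exactly once here.
    static Timestamp normalizeLiteral(Object literal) {
        return literal instanceof String ? Timestamp.valueOf((String) literal)
                                         : (Timestamp) literal;
    }

    // Per-row-group path: only comparisons, no conversions.
    static boolean mightContain(RowGroupStats stats, Timestamp literal) {
        return literal.compareTo(stats.min) >= 0 && literal.compareTo(stats.max) <= 0;
    }

    public static void main(String[] args) {
        Timestamp literal = normalizeLiteral("2016-01-23 00:00:00"); // one-time cost
        List<RowGroupStats> rowGroups = Arrays.asList(
                new RowGroupStats(Timestamp.valueOf("2016-01-01 00:00:00"),
                                  Timestamp.valueOf("2016-01-31 23:59:59")),
                new RowGroupStats(Timestamp.valueOf("2016-03-01 00:00:00"),
                                  Timestamp.valueOf("2016-03-31 23:59:59")));
        for (RowGroupStats rg : rowGroups) {
            System.out.println(mightContain(rg, literal)); // true, then false
        }
    }
}
{code}

With schema evolution the column type seen by a task may differ from what the client assumed, which is why doing this once per task, rather than once on the client, may be the safer variant.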

[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group

2016-03-21 Thread Siddharth Seth (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204812#comment-15204812 ]

Siddharth Seth commented on HIVE-13250:
---

bq. I misunderstood this bug report. Without the patch, the filter expression for {{ts_field = "2016-01-23 00:00:00"}} gets executed as {{(UDFToString(ts_field) = '2016-01-23 00:00:00')}}. In the patch I made changes so that the cast is on the constant, {{(ts_field = UDFTOTimeStamp('2016-01-23 00:00:00'))}}, which gets folded at compile time to {{(ts_field = 2016-01-23 00:00:00.0)}}.

I'd expect the cast to change the value to whatever can be compared directly against storage. However, I think the type promotion system is far more complicated, and this may not always be possible.

> Compute predicate conversions on the client, instead of per row group
> -
>
> Key: HIVE-13250
> URL: https://issues.apache.org/jira/browse/HIVE-13250
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 2.1.0
>Reporter: Siddharth Seth
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-13250.2.patch, HIVE-13250.patch
>
>
> When running a query of the form
> select count from table where ts_field = "2016-01-23 00:00:00";
> or
> select count from table where ts_field = 1453507200
> ts_field is of type TIMESTAMP
> The predicate is converted to whatever format is appropriate for TIMESTAMP 
> processing on each and every row group.
> It would be far more efficient to process this once on the client - or even 
> once per task.
> The same applies to ORC split elimination as well - this is applied for each
> stripe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group

2016-03-21 Thread Gopal V (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204859#comment-15204859 ]

Gopal V commented on HIVE-13250:


bq.  I'd expect the cast to change the value to whatever can be compared 
directly against storage. 

UDFs disable all predicate pushdown. Only comparisons against plain constants are pushed down, so retaining the UDFToString disables all PPD into the storage system.
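To make that concrete, here is a toy model (heavily simplified, not the SearchArgument API): a leaf of the form {{column <op> constant}} can be tested against stripe or row-group min/max statistics, while a leaf whose column side is wrapped in a UDF has nothing valid to compare against those statistics, so the reader must keep everything.

{code}
// Toy model only (not Hive's SearchArgument/PredicateLeaf API): why a UDF over
// the column defeats predicate pushdown.
import java.sql.Timestamp;

public class PpdToyModel {
    // ts_field = <constant>: the constant can be tested against the stripe's
    // min/max timestamps, so whole stripes/row groups can be skipped.
    static boolean canSkip(Timestamp min, Timestamp max, Timestamp constant) {
        return constant.compareTo(min) < 0 || constant.compareTo(max) > 0;
    }

    // UDFToString(ts_field) = '<constant>': the statistics are timestamps but the
    // predicate is over a derived string, so there is nothing to compare against;
    // the reader has to read every stripe (never skip).
    static boolean canSkipWithUdfOnColumn(Timestamp min, Timestamp max, String constant) {
        return false;
    }

    public static void main(String[] args) {
        Timestamp min = Timestamp.valueOf("2016-01-01 00:00:00");
        Timestamp max = Timestamp.valueOf("2016-01-31 23:59:59");
        System.out.println(canSkip(min, max, Timestamp.valueOf("2017-06-01 00:00:00"))); // true
        System.out.println(canSkipWithUdfOnColumn(min, max, "2017-06-01 00:00:00"));     // false
    }
}
{code}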

> Compute predicate conversions on the client, instead of per row group
> -
>
> Key: HIVE-13250
> URL: https://issues.apache.org/jira/browse/HIVE-13250
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 2.1.0
>Reporter: Siddharth Seth
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-13250.2.patch, HIVE-13250.patch
>
>
> When running a query of the form
> select count from table where ts_field = "2016-01-23 00:00:00";
> or
> select count from table where ts_field = 1453507200
> ts_field is of type TIMESTAMP
> The predicate is converted to whatever format is appropriate for TIMESTAMP 
> processing on each and every row group.
> It would be far more efficient to process this once on the client - or even 
> once per task.
> The same applies to ORC split elimination as well - this is applied for each
> stripe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13250) Compute predicate conversions on the client, instead of per row group

2016-03-21 Thread Ashutosh Chauhan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205041#comment-15205041 ]

Ashutosh Chauhan commented on HIVE-13250:
-

I think we can special-case this for equality predicates. I will update the patch for that.
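If it helps, one way such an equality special case could be made safe (my reading of the idea, not necessarily what the updated patch will do; {{roundTrips}} and {{formatLikeColumn}} below are made-up helper names): rewrite {{UDFToString(ts_field) = '<literal>'}} into {{ts_field = UDFTOTimeStamp('<literal>')}} only when the literal survives a round trip through the conversion, so the rewrite cannot change which rows match.

{code}
// Sketch only, not the patch: guard the equality rewrite with a round-trip check.
import java.sql.Timestamp;

public class EqualityRewriteSketch {
    static boolean roundTrips(String literal) {
        try {
            Timestamp t = Timestamp.valueOf(literal);
            // Compare against an approximation of the canonical string form the
            // column side would produce (not Hive's exact formatting).
            return formatLikeColumn(t).equals(literal);
        } catch (IllegalArgumentException e) {
            return false; // not a parseable timestamp literal; leave the predicate as-is
        }
    }

    static String formatLikeColumn(Timestamp t) {
        String s = t.toString();                          // e.g. 2016-01-23 00:00:00.0
        return s.endsWith(".0") ? s.substring(0, s.length() - 2) : s;
    }

    public static void main(String[] args) {
        System.out.println(roundTrips("2016-01-23 00:00:00"));     // true  -> safe to rewrite
        System.out.println(roundTrips("2016-01-23 00:00:00.000")); // false -> keep cast on the column
    }
}
{code}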

> Compute predicate conversions on the client, instead of per row group
> -
>
> Key: HIVE-13250
> URL: https://issues.apache.org/jira/browse/HIVE-13250
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 2.1.0
>Reporter: Siddharth Seth
>Assignee: Ashutosh Chauhan
> Attachments: HIVE-13250.2.patch, HIVE-13250.patch
>
>
> When running a query of the form
> select count from table where ts_field = "2016-01-23 00:00:00";
> or
> select count from table where ts_field = 1453507200
> ts_field is of type TIMESTAMP
> The predicate is converted to whatever format is appropriate for TIMESTAMP 
> processing on each and every row group.
> It would be far more efficient to process this once on the client - or even 
> once per task.
> The same applies to ORC split elimination as well - this is applied for each
> stripe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)