[jira] [Created] (HIVE-18573) Use proper Calcite operator instead of UDFs
slim bouguerra created HIVE-18573: - Summary: Use proper Calcite operator instead of UDFs Key: HIVE-18573 URL: https://issues.apache.org/jira/browse/HIVE-18573 Project: Hive Issue Type: Bug Components: Hive Reporter: slim bouguerra Currently, Hive mostly uses user-defined, black-box SQL operators during query planning. It would be more beneficial to use proper Calcite operators. Also, use a single name for the EXTRACT operator instead of a different name for every unit, and likewise for the FLOOR function. This will allow unifying the treatment per operator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-18331) Renew the Kerberos ticket used by Druid Query runner
slim bouguerra created HIVE-18331: - Summary: Renew the Kerberos ticket used by Druid Query runner Key: HIVE-18331 URL: https://issues.apache.org/jira/browse/HIVE-18331 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra The Druid HTTP client has to renew the current user's Kerberos ticket when it is close to expiry. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
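The renewal described above could be sketched as a pure expiry check plus a scheduled task. Everything here is an illustrative assumption (the 60-second threshold, the `shouldRenew` predicate, the scheduler wiring), not the actual Druid HTTP client API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TicketRenewal {
    // Hypothetical threshold: renew when less than 60s of ticket lifetime remains.
    static final long RENEW_THRESHOLD_MS = 60_000;

    /** True when the ticket is close enough to expiry that it should be renewed. */
    static boolean shouldRenew(long expiryMillis, long nowMillis) {
        return expiryMillis - nowMillis <= RENEW_THRESHOLD_MS;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        long expiry = System.currentTimeMillis() + 30_000; // pretend the ticket expires in 30s
        // Check the ticket and renew it before it expires.
        scheduler.schedule(() -> {
            if (shouldRenew(expiry, System.currentTimeMillis())) {
                System.out.println("renewing Kerberos ticket");
            }
        }, 0, TimeUnit.MILLISECONDS);
        scheduler.shutdown();
        scheduler.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

In a real client the scheduled task would call the Kerberos login/renew API; the point is only that the check must fire before expiry, not after.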
[jira] [Created] (HIVE-18254) Use proper AVG Calcite primitive instead of Other_FUNCTION
slim bouguerra created HIVE-18254: - Summary: Use proper AVG Calcite primitive instead of Other_FUNCTION Key: HIVE-18254 URL: https://issues.apache.org/jira/browse/HIVE-18254 Project: Hive Issue Type: Bug Reporter: slim bouguerra Currently the Hive-Calcite operator tree treats the AVG function as an unknown function with the Calcite SqlKind OTHER_FUNCTION. This can get in the way of rules like {{org.apache.calcite.rel.rules.AggregateReduceFunctionsRule}}. This patch adds the AVG function to the list of known aggregate functions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
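Once AVG carries a known aggregate kind, rules such as AggregateReduceFunctionsRule can rewrite it as SUM divided by COUNT. A minimal sketch of that identity in plain Java (not Calcite code):

```java
public class AvgReduction {
    /** Direct average, as AVG(x) would compute it. */
    static double avg(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    /** The rewrite the rule performs: AVG(x) -> SUM(x) / COUNT(x). */
    static double sumOverCount(double[] xs) {
        double sum = 0;
        long count = 0;
        for (double x : xs) { sum += x; count++; }
        return sum / count;
    }
}
```

The rewrite matters because SUM and COUNT decompose cleanly into partial aggregates, while AVG does not; but the rule can only fire when the planner recognizes the function as AVG.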
[jira] [Created] (HIVE-18226) handle UDF to double/int over aggregate
slim bouguerra created HIVE-18226: - Summary: handle UDF to double/int over aggregate Key: HIVE-18226 URL: https://issues.apache.org/jira/browse/HIVE-18226 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra In cases like the following query, the Hive planner adds an extra UDFToDouble over integer columns. This kind of UDF can be pushed to Druid as a doubleSum instead of a longSum, and vice versa. {code}
PREHOOK: query: EXPLAIN SELECT floor_year(`__time`), SUM(ctinyint)/ count(*) FROM druid_table GROUP BY floor_year(`__time`)
PREHOOK: type: QUERY
POSTHOOK: query: EXPLAIN SELECT floor_year(`__time`), SUM(ctinyint)/ count(*) FROM druid_table GROUP BY floor_year(`__time`)
POSTHOOK: type: QUERY
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: druid_table
            properties:
              druid.query.json {"queryType":"timeseries","dataSource":"default.druid_table","descending":false,"granularity":"year","aggregations":[{"type":"longSum","name":"$f1","fieldName":"ctinyint"},{"type":"count","name":"$f2"}],"intervals":["1900-01-01T00:00:00.000/3000-01-01T00:00:00.000"],"context":{"skipEmptyBuckets":true}}
              druid.query.type timeseries
            Statistics: Num rows: 9173 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: __time (type: timestamp with local time zone), (UDFToDouble($f1) / UDFToDouble($f2)) (type: double)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 9173 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 9173 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code} -- This message was sent by 
Atlassian JIRA (v6.4.14#64029)
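Why the cast can be folded into the aggregator: for integer inputs, summing as long and then casting to double gives the same result as summing as double directly (exact as long as the running sum stays within double's 2^53 integer range). A small plain-Java sketch of the two plan shapes, not Hive code:

```java
public class CastPushdown {
    /** Plan Hive produces today: a longSum over the column, then UDFToDouble on the result. */
    static double castAfterLongSum(int[] xs) {
        long sum = 0;
        for (int x : xs) sum += x;
        return (double) sum; // cast applied after aggregation
    }

    /** Equivalent plan pushable to Druid: a doubleSum aggregator directly. */
    static double doubleSum(int[] xs) {
        double sum = 0;
        for (int x : xs) sum += x; // each int widened before summing
        return sum;
    }
}
```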
[jira] [Created] (HIVE-18197) Fix issue with wrong segments identifier usage.
slim bouguerra created HIVE-18197: - Summary: Fix issue with wrong segments identifier usage. Key: HIVE-18197 URL: https://issues.apache.org/jira/browse/HIVE-18197 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra We have two different issues that can make the load-status check fail for Druid segments. Both are due to use of the wrong segment identifier in a couple of locations. # We are constructing the segment identifier with the UTC timezone, which can be wrong if the segments were built in a different timezone. The fix is to use the segment identifier itself instead of re-making it on the client side. # We are using outdated segment identifiers in the INSERT INTO case. The fix is to use the segment metadata produced by the metadata commit phase. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
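The first problem can be sketched with java.time: an identifier re-made on the client side in UTC does not match one built in another time zone. The identifier layout here (`dataSource_start`) is a deliberate simplification, not Druid's actual segment-identifier format:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class SegmentId {
    static final DateTimeFormatter FMT = DateTimeFormatter.ISO_OFFSET_DATE_TIME;

    /** Hypothetical identifier: dataSource + "_" + interval start, rendered in the given zone. */
    static String makeId(String dataSource, Instant start, ZoneId zone) {
        return dataSource + "_" + ZonedDateTime.ofInstant(start, zone).format(FMT);
    }
}
```

The same instant renders as `2017-07-01T00:00:00Z` in UTC but `2017-06-30T17:00:00-07:00` in America/Los_Angeles, so string comparison against the server-built identifier fails; hence the fix of reusing the identifier the server produced.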
[jira] [Created] (HIVE-18196) Druid Mini Cluster to run Qtests integrations tests.
slim bouguerra created HIVE-18196: - Summary: Druid Mini Cluster to run Qtests integrations tests. Key: HIVE-18196 URL: https://issues.apache.org/jira/browse/HIVE-18196 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: Ashutosh Chauhan The overall goal is to add a new module that can fork a Druid cluster to run integration tests as part of the mini-cluster QTest suite. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-18156) Provide smooth migration path for CTAS when time column is not with timezone
slim bouguerra created HIVE-18156: - Summary: Provide smooth migration path for CTAS when time column is not with timezone Key: HIVE-18156 URL: https://issues.apache.org/jira/browse/HIVE-18156 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra Assignee: Jesus Camacho Rodriguez Currently, the default recommended CTAS and most legacy documentation do not specify that the __time column needs to be of timestamp with local time zone type. Thus the CTAS will fail with {code} 2017-11-27T17:13:10,241 ERROR [e5f708c8-df4e-41a4-b8a1-d18ac13123d2 main] ql.Driver: FAILED: SemanticException No column with timestamp with local time-zone type on query result; one column should be of timestamp with local time-zone type org.apache.hadoop.hive.ql.parse.SemanticException: No column with timestamp with local time-zone type on query result; one column should be of timestamp with local time-zone type at org.apache.hadoop.hive.ql.optimizer.SortedDynPartitionTimeGranularityOptimizer$SortedDynamicPartitionProc.getGranularitySelOp(SortedDynPartitionTimeGranularityOptimizer.java:242) at org.apache.hadoop.hive.ql.optimizer.SortedDynPartitionTimeGranularityOptimizer$SortedDynamicPartitionProc.process(SortedDynPartitionTimeGranularityOptimizer.java:163) at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:158) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120) at org.apache.hadoop.hive.ql.optimizer.SortedDynPartitionTimeGranularityOptimizer.transform(SortedDynPartitionTimeGranularityOptimizer.java:103) at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:250) at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11683) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:298) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:268) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:592) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1457) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1589) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1356) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1346) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:239) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:187) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:409) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:342) at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1300) at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1274) at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:173) at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104) at org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver(TestMiniDruidCliDriver.java:59) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at 
org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:92) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunne
{code}
[jira] [Created] (HIVE-17871) Add non nullability flag to druid time column
slim bouguerra created HIVE-17871: - Summary: Add non nullability flag to druid time column Key: HIVE-17871 URL: https://issues.apache.org/jira/browse/HIVE-17871 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra The Druid time column is never null. Adding the non-nullability flag will enable extra Calcite goodness, like transforming {code} select count(`__time`) from table {code} into {code} select count(*) from table {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
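Why the rewrite is safe: count(col) counts only non-null values, so over a column declared NOT NULL it always equals count(*), the row count. A plain-Java sketch of the two semantics, for illustration only:

```java
import java.util.Arrays;
import java.util.Objects;

public class CountRewrite {
    /** count(col): counts only the non-null values in the column. */
    static long countColumn(Long[] col) {
        return Arrays.stream(col).filter(Objects::nonNull).count();
    }

    /** count(*): counts rows. Equal to count(col) whenever col can never be null. */
    static long countStar(Long[] col) {
        return col.length;
    }
}
```

With the flag set, Calcite can prove the two are equal and pick the cheaper count(*) form, which Druid answers without reading the column.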
[jira] [Created] (HIVE-17653) Druid storage handler CTAS with boolean type columns fails.
slim bouguerra created HIVE-17653: - Summary: Druid storage handler CTAS with boolean type columns fails. Key: HIVE-17653 URL: https://issues.apache.org/jira/browse/HIVE-17653 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Fix For: 3.0.0 Druid storage handler CTAS fails with the exception below when a boolean column is included. A simple workaround is to add a cast to string over the boolean column; this indexes the column as a Druid dimension with the values `true` or `false`. {code} ERROR : Status: Failed ERROR : Vertex failed, vertexName=Reducer 3, vertexId=vertex_1506230948023_0005_9_02, diagnostics=[Task failed, taskId=task_1506230948023_0005_9_02_03, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1506230948023_0005_9_02_03_0:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing vector batch (tag=0) (vectorizedVertexNum 2) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:218) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:172) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at 
org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:110) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing vector batch (tag=0) (vectorizedVertexNum 2) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecordVector(ReduceRecordSource.java:406) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:248) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:319) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:189) ... 15 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing vector batch (tag=0) (vectorizedVertexNum 2) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:492) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecordVector(ReduceRecordSource.java:397) ... 
18 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Dimension bo does not have STRING type: BOOLEAN at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:564) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:664) at org.apache.hadoop.hive.ql.exec.vector.VectorFileSinkOperator.process(VectorFileSinkOperator.java:101) at org.apache.hadoop.hive.ql.exec.Operator.baseForward(Operator.java:955) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:903) at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:145) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:479) ... 19 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Dimension bo does not have STRING type: BOOLEAN at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:272) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:609) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:55
{code}
[jira] [Created] (HIVE-17627) Use druid scan query instead of the select query.
slim bouguerra created HIVE-17627: - Summary: Use druid scan query instead of the select query. Key: HIVE-17627 URL: https://issues.apache.org/jira/browse/HIVE-17627 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra The biggest difference between the select query and the scan query is that the scan query does not retain all rows in memory before they can be returned to the client. The select query causes memory pressure when it has to produce too many rows; the scan query does not have this issue. The scan query can also return all rows without issuing further pagination queries, which is extremely useful when querying a historical or realtime node directly. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
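The memory-behavior difference can be sketched as buffering versus streaming. This is plain Java for illustration, not the Druid client code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class QueryStyles {
    /** Select-style: materialize every row in memory before anything reaches the client. */
    static List<Integer> selectAll(int n) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) rows.add(i); // all n rows held at once
        return rows;
    }

    /** Scan-style: hand rows to the client one at a time, holding only the cursor. */
    static Iterator<Integer> scan(int n) {
        return new Iterator<Integer>() {
            int next = 0;
            public boolean hasNext() { return next < n; }
            public Integer next() { return next++; }
        };
    }
}
```

With the select style, peak memory grows with the result size; with the scan style it stays constant, which is the motivation for switching query types.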
[jira] [Created] (HIVE-17623) Fix Select query, fix Double column serde, and some refactoring
slim bouguerra created HIVE-17623: - Summary: Fix Select query, fix Double column serde, and some refactoring Key: HIVE-17623 URL: https://issues.apache.org/jira/browse/HIVE-17623 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 3.0.0 Reporter: slim bouguerra This PR has two fixes. First, it fixes the limit on results returned by the select query, which used to be capped at 16K rows. Second, it fixes type inference for the double type newly added to Druid, using Jackson polymorphism to infer types and parse results from Druid nodes. It also removes duplicated code from the record readers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17582) Followup of HIVE-15708
slim bouguerra created HIVE-17582: - Summary: Followup of HIVE-15708 Key: HIVE-17582 URL: https://issues.apache.org/jira/browse/HIVE-17582 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 3.0.0 Reporter: slim bouguerra Assignee: slim bouguerra The HIVE-15708 commit be59e024420ed5ca970e87a6dec402fecee21f06 introduced some unwanted bugs: it changed the following code at org.apache.hadoop.hive.druid.io.DruidQueryBasedInputFormat#169 {code} builder.intervals(Arrays.asList(DruidTable.DEFAULT_INTERVAL)); {code} to {code} final List intervals = Arrays.asList(); builder.intervals(intervals); {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17581) Replace some calcite dependencies with native ones
slim bouguerra created HIVE-17581: - Summary: Replace some calcite dependencies with native ones Key: HIVE-17581 URL: https://issues.apache.org/jira/browse/HIVE-17581 Project: Hive Issue Type: Sub-task Components: Druid integration Affects Versions: 3.0.0 Reporter: slim bouguerra Assignee: slim bouguerra This is a follow-up of HIVE-17468. This patch excludes some unwanted Druid-Calcite dependencies. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17523) Insert into druid table hangs Hive server2 in an infinite loop
slim bouguerra created HIVE-17523: - Summary: Insert into druid table hangs Hive server2 in an infinite loop Key: HIVE-17523 URL: https://issues.apache.org/jira/browse/HIVE-17523 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Inserting data via INSERT INTO a table backed by Druid can hang the Hive server. This is due to a bug in the naming of Druid segment partitions. To reproduce the issue: {code}
drop table login_hive;
create table login_hive(`timecolumn` timestamp, `userid` string, `num_l` double);
insert into login_hive values ('2015-01-01 00:00:00', 'user1', 5);
insert into login_hive values ('2015-01-01 01:00:00', 'user2', 4);
insert into login_hive values ('2015-01-01 02:00:00', 'user3', 2);
insert into login_hive values ('2015-01-02 00:00:00', 'user1', 1);
insert into login_hive values ('2015-01-02 01:00:00', 'user2', 2);
insert into login_hive values ('2015-01-02 02:00:00', 'user3', 8);
insert into login_hive values ('2015-01-03 00:00:00', 'user1', 5);
insert into login_hive values ('2015-01-03 01:00:00', 'user2', 9);
insert into login_hive values ('2015-01-03 04:00:00', 'user3', 2);
insert into login_hive values ('2015-03-09 00:00:00', 'user3', 5);
insert into login_hive values ('2015-03-09 01:00:00', 'user1', 0);
insert into login_hive values ('2015-03-09 05:00:00', 'user2', 0);
drop table login_druid;
CREATE TABLE login_druid STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.datasource" = "druid_login_test_tmp", "druid.segment.granularity" = "DAY", "druid.query.granularity" = "HOUR") AS select `timecolumn` as `__time`, `userid`, `num_l` FROM login_hive;
select * FROM login_druid;
insert into login_druid values ('2015-03-09 05:00:00', 'user4', 0);
{code} This patch unifies the logic of pushing and naming segments by using the Druid data segment pusher as much as possible. It also includes some minor code refactoring and test enhancements. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17468) Shade and package appropriate jackson version for druid storage handler
slim bouguerra created HIVE-17468: - Summary: Shade and package appropriate jackson version for druid storage handler Key: HIVE-17468 URL: https://issues.apache.org/jira/browse/HIVE-17468 Project: Hive Issue Type: Bug Reporter: slim bouguerra Fix For: 3.0.0 Currently we are excluding all the Jackson core dependencies coming from Druid. This is wrong in my opinion, since it leads to packaging unwanted Jackson libraries from other projects. As you can see in the file hive-druid-deps.txt, jackson-core currently comes from Calcite at version 2.6.3, which is very different from the 2.4.6 used by Druid. This patch excludes the unwanted jars and makes sure to bring in the Jackson dependency from Druid itself. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17372) update druid dependency to druid 0.10.1
slim bouguerra created HIVE-17372: - Summary: update druid dependency to druid 0.10.1 Key: HIVE-17372 URL: https://issues.apache.org/jira/browse/HIVE-17372 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Update to the most recent Druid version, to be released on August 23. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17303) Mismatch between roaring bitmap library used by druid and the one coming from tez
slim bouguerra created HIVE-17303: - Summary: Mismatch between roaring bitmap library used by druid and the one coming from tez Key: HIVE-17303 URL: https://issues.apache.org/jira/browse/HIVE-17303 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra {code} Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.roaringbitmap.buffer.MutableRoaringBitmap.runOptimize()Z at org.apache.hive.druid.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) at org.apache.hive.druid.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) at org.apache.hive.druid.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) at org.apache.hadoop.hive.druid.io.DruidRecordWriter.pushSegments(DruidRecordWriter.java:165) ... 25 more Caused by: java.lang.NoSuchMethodError: org.roaringbitmap.buffer.MutableRoaringBitmap.runOptimize()Z at org.apache.hive.druid.com.metamx.collections.bitmap.WrappedRoaringBitmap.toImmutableBitmap(WrappedRoaringBitmap.java:65) at org.apache.hive.druid.com.metamx.collections.bitmap.RoaringBitmapFactory.makeImmutableBitmap(RoaringBitmapFactory.java:88) at org.apache.hive.druid.io.druid.segment.StringDimensionMergerV9.writeIndexes(StringDimensionMergerV9.java:348) at org.apache.hive.druid.io.druid.segment.IndexMergerV9.makeIndexFiles(IndexMergerV9.java:218) at org.apache.hive.druid.io.druid.segment.IndexMerger.merge(IndexMerger.java:438) at org.apache.hive.druid.io.druid.segment.IndexMerger.persist(IndexMerger.java:186) at org.apache.hive.druid.io.druid.segment.IndexMerger.persist(IndexMerger.java:152) at org.apache.hive.druid.io.druid.segment.realtime.appenderator.AppenderatorImpl.persistHydrant(AppenderatorImpl.java:996) at org.apache.hive.druid.io.druid.segment.realtime.appenderator.AppenderatorImpl.access$200(AppenderatorImpl.java:93) at 
org.apache.hive.druid.io.druid.segment.realtime.appenderator.AppenderatorImpl$2.doCall(AppenderatorImpl.java:385) at org.apache.hive.druid.io.druid.common.guava.ThreadRenamingCallable.call(ThreadRenamingCallable.java:44) ... 4 more ]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:89, Vertex vertex_1502470020457_0005_12_05 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17302) ReduceRecordSource should not add batch string to Exception message
slim bouguerra created HIVE-17302: - Summary: ReduceRecordSource should not add batch string to Exception message Key: HIVE-17302 URL: https://issues.apache.org/jira/browse/HIVE-17302 Project: Hive Issue Type: Bug Reporter: slim bouguerra ReduceRecordSource is adding the batch data as a string to the exception message; this can lead to an OOM in the query AM when the query fails due to some other issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17160) Adding kerberos Authorization to the Druid hive integration
slim bouguerra created HIVE-17160: - Summary: Adding kerberos Authorization to the Druid hive integration Key: HIVE-17160 URL: https://issues.apache.org/jira/browse/HIVE-17160 Project: Hive Issue Type: New Feature Components: Druid integration Reporter: slim bouguerra The goal of this feature is to allow Hive to query a secured Druid cluster using Kerberos credentials. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-16816) Chained Group by support for druid.
slim bouguerra created HIVE-16816: - Summary: Chained Group by support for druid. Key: HIVE-16816 URL: https://issues.apache.org/jira/browse/HIVE-16816 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra This is more likely to be a Calcite enhancement, but I am logging it here to track it anyway. Currently, queries like {code} select count(distinct dim) from table {code} are pushed only partially to Druid: the group by on dim runs in Druid, followed by a count executed by the Hive query engine. This could be enhanced by using a nested (i.e. chained-execution) group-by query, where the first (inner) group-by query groups by the key and the second (outer) one does the count. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
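The proposed chained execution can be modeled in two stages: the inner group-by yields the set of distinct keys, and the outer stage simply counts them. A plain-Java sketch of that equivalence, not the actual Calcite rewrite:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class ChainedGroupBy {
    /** Inner stage: group by dim, i.e. collapse the column to its distinct keys. */
    static Set<String> groupByKeys(String[] dims) {
        return new LinkedHashSet<>(Arrays.asList(dims));
    }

    /** Outer stage: count the keys produced by the inner stage. */
    static long countDistinct(String[] dims) {
        return groupByKeys(dims).size();
    }
}
```

Pushing both stages to Druid as one nested query avoids shipping every distinct key back to Hive just to count them.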
[jira] [Created] (HIVE-16588) Resource leak by druid http client
slim bouguerra created HIVE-16588: - Summary: Resource leak by druid http client Key: HIVE-16588 URL: https://issues.apache.org/jira/browse/HIVE-16588 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Fix For: 3.0.0 The current implementation of the Druid storage handler leaks resources if creation of the HTTP client fails with a too-many-open-files exception. The leak happens because the cleanup hook is registered after the client starts. To fix this, we will extract the creation of the HTTP client so it is static and reusable, instead of being created per query. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
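The ordering bug can be sketched as follows; `HttpClient` here is a hypothetical stand-in for the real Druid client, kept minimal to show why cleanup registered after a failing start never runs:

```java
public class ClientLifecycle {
    /** Hypothetical stand-in for the Druid HTTP client (for illustration only). */
    public static class HttpClient {
        public boolean open;
        public void start(boolean fail) {
            open = true;                      // acquires file descriptors
            if (fail) throw new RuntimeException("too many open files");
        }
        public void close() { open = false; } // releases them
    }

    /** Leaky order: cleanup would be registered only after start() succeeds. */
    public static void startLeaky(HttpClient c, boolean fail) {
        c.start(fail); // if this throws, nobody ever calls close()
        // cleanup hook registered here -- too late
    }

    /** Fixed order: guarantee close() runs even when start() fails. */
    public static void startSafe(HttpClient c, boolean fail) {
        try {
            c.start(fail);
        } catch (RuntimeException e) {
            c.close(); // release resources before propagating
            throw e;
        }
    }
}
```

A static, shared client (as the issue proposes) goes further: the client and its shutdown hook are set up once, so per-query failures can no longer race the registration.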
[jira] [Created] (HIVE-16522) Hive query timer is not keeping track of the fetch task execution
slim bouguerra created HIVE-16522: - Summary: Hive query timer is not keeping track of the fetch task execution Key: HIVE-16522 URL: https://issues.apache.org/jira/browse/HIVE-16522 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: slim bouguerra Currently the Hive CLI query execution time does not include fetch task execution time. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16519) Fix exception thrown by checkOutputSpecs
slim bouguerra created HIVE-16519: - Summary: Fix exception thrown by checkOutputSpecs Key: HIVE-16519 URL: https://issues.apache.org/jira/browse/HIVE-16519 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Do not throw an exception from checkOutputSpecs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16482) Druid SerDe needs to use the dimension output name in order to work with extraction functions
slim bouguerra created HIVE-16482: - Summary: Druid SerDe needs to use the dimension output name in order to work with extraction functions Key: HIVE-16482 URL: https://issues.apache.org/jira/browse/HIVE-16482 Project: Hive Issue Type: Bug Reporter: slim bouguerra The Druid SerDe needs to use the dimension output name in order to work with extraction functions. Some parts of the SerDe code use the method {code}DimensionSpec.getDimension(){code}, although when an extraction function is in play the name of the dimension is defined by {code}DimensionSpec.getOutputName(){code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
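The distinction can be shown with a minimal model; the two method names mirror the real Druid DimensionSpec interface, but the class body and field layout here are hypothetical:

```java
public class DimensionNames {
    /** Minimal model of Druid's DimensionSpec (illustrative, not the real class). */
    public static class DimensionSpec {
        private final String dimension;   // underlying column, e.g. "__time"
        private final String outputName;  // name after an extraction function is applied
        public DimensionSpec(String dimension, String outputName) {
            this.dimension = dimension;
            this.outputName = outputName;
        }
        public String getDimension()  { return dimension; }
        public String getOutputName() { return outputName; }
    }

    /** The SerDe should key result rows by the output name, not the raw dimension. */
    public static String columnKey(DimensionSpec spec) {
        return spec.getOutputName();
    }
}
```

Without an extraction function the two names coincide, which is why the bug only surfaces once an extraction function renames the column in the result rows.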
[jira] [Created] (HIVE-16404) Renaming of public classes in Calcite 1.12 breaking druid integration
slim bouguerra created HIVE-16404: - Summary: Renaming of public classes in Calcite 1.12 breaking druid integration Key: HIVE-16404 URL: https://issues.apache.org/jira/browse/HIVE-16404 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 2.2.0 Reporter: slim bouguerra Fix For: 3.0.0 The renaming of the Druid rules classes is backward incompatible with the current implementation. https://github.com/apache/calcite/commit/a89c62cd6d6cc181c90881afa0bf099746739a91 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16371) Add bitmap selection strategy for druid storage handler
slim bouguerra created HIVE-16371: - Summary: Add bitmap selection strategy for druid storage handler Key: HIVE-16371 URL: https://issues.apache.org/jira/browse/HIVE-16371 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Currently only the Concise bitmap strategy is supported. This PR makes Roaring bitmap encoding the default, with Concise optional if needed. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16210) Use jvm temporary tmp dir by default
slim bouguerra created HIVE-16210: - Summary: Use jvm temporary tmp dir by default Key: HIVE-16210 URL: https://issues.apache.org/jira/browse/HIVE-16210 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Instead of using "/tmp" by default, it makes more sense to use the JVM default tmp dir. Using "/tmp" can have dramatic consequences if the indexed files are huge. For instance, applications run in containers can be provisioned with a dedicated tmp dir. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
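The suggested change amounts to reading the standard JVM property instead of hard-coding the path; `java.io.tmpdir` is a real JDK property that a container launcher can override with `-Djava.io.tmpdir=...`:

```java
import java.io.File;

public class TmpDir {
    /** Hard-coded default the issue wants to avoid. */
    static File hardCodedTmp() {
        return new File("/tmp");
    }

    /** JVM-provided default: honors -Djava.io.tmpdir, e.g. a per-container scratch dir. */
    static File jvmTmp() {
        return new File(System.getProperty("java.io.tmpdir"));
    }
}
```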
[jira] [Created] (HIVE-16149) Druid query path fails when using LLAP mode
slim bouguerra created HIVE-16149: - Summary: Druid query path fails when using LLAP mode Key: HIVE-16149 URL: https://issues.apache.org/jira/browse/HIVE-16149 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra Assignee: Ashutosh Chauhan {code} hive> select i_item_desc ,i_category ,i_class ,i_current_price ,i_item_id ,sum(ss_ext_sales_price) > as itemrevenue ,sum(ss_ext_sales_price)*100/sum(sum(ss_ext_sales_price)) over (partition by i_class) as revenueratio > from tpcds_store_sales_sold_time_1000_day_all > where (i_category ='Jewelry' or i_category = 'Sports' or i_category ='Books') and `__time` >= cast('2001-01-12' as date) and `__time` <= cast('2001-02-11' as date) > group by i_item_id ,i_item_desc ,i_category ,i_class ,i_current_price order by i_category ,i_class ,i_item_id ,i_item_desc ,revenueratio limit 10; Query ID = sbouguerra_20170308131436_225330b7-1142-4e4e-a05a-46ef544c8ee8 Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1488231257387_1862) -- VERTICES MODESTATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED -- Map 1 llapINITED 1 001 0 0 Reducer 2 llapINITED 2 002 0 0 Reducer 3 llapINITED 1 001 0 0 -- VERTICES: 00/03 [>>--] 0%ELAPSED TIME: 59.68 s -- Status: Failed Dag received [DAG_TERMINATE, SERVICE_PLUGIN_ERROR] in RUNNING state. 
Error reported by TaskScheduler [[2:LLAP]][SERVICE_UNAVAILABLE] No LLAP Daemons are running Vertex killed, vertexName=Reducer 3, vertexId=vertex_1488231257387_1862_3_02, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1, Vertex vertex_1488231257387_1862_3_02 [Reducer 3] killed/failed due to:DAG_TERMINATED] Vertex killed, vertexName=Reducer 2, vertexId=vertex_1488231257387_1862_3_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:2, Vertex vertex_1488231257387_1862_3_01 [Reducer 2] killed/failed due to:DAG_TERMINATED] Vertex killed, vertexName=Map 1, vertexId=vertex_1488231257387_1862_3_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1, Vertex vertex_1488231257387_1862_3_00 [Map 1] killed/failed due to:DAG_TERMINATED] DAG did not succeed due to SERVICE_PLUGIN_ERROR. failedVertices:0 killedVertices:3 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. 
{code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16126) push all the time extraction to druid
slim bouguerra created HIVE-16126: - Summary: push all the time extraction to druid Key: HIVE-16126 URL: https://issues.apache.org/jira/browse/HIVE-16126 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Currently we do not push most of the time extractions down to Druid, which leads to selecting all the data. Bad! -- This message was sent by Atlassian JIRA (v6.3.15#6346)
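For illustration, a minimal sketch (not Hive's actual planner code; the helper names are hypothetical) of what a pushed-down year extraction could look like as a Druid extraction dimension spec with a `timeFormat` extraction function, so Druid returns one row per year instead of every raw row:

```python
def year_extraction_dimension(time_column="__time", output_name="extract_year"):
    """Druid dimension spec that extracts the year from the time column inside Druid."""
    return {
        "type": "extraction",
        "dimension": time_column,
        "outputName": output_name,
        "extractionFn": {"type": "timeFormat", "format": "yyyy"},
    }

def group_by_year_query(datasource, intervals):
    """A groupBy query returning one row per year, instead of a select over all raw rows."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": "all",
        "dimensions": [year_extraction_dimension()],
        "intervals": intervals,
    }

query = group_by_year_query("druid_user_login",
                            ["1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"])
```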
[jira] [Created] (HIVE-16125) Split work between reducers.
slim bouguerra created HIVE-16125: - Summary: Split work between reducers. Key: HIVE-16125 URL: https://issues.apache.org/jira/browse/HIVE-16125 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Split work between reducers. Currently we have one reducer per segment granularity, even if the interval will be partitioned over multiple partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
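The intended split can be sketched as follows; `assign_partitions_to_reducers` is a hypothetical helper showing the partitions of an interval being spread round-robin across all reducers, rather than one reducer handling a whole granularity bucket:

```python
def assign_partitions_to_reducers(partitions, num_reducers):
    """Round-robin partitions across reducers so no reducer owns a whole bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for i, partition in enumerate(partitions):
        buckets[i % num_reducers].append(partition)
    return buckets
```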
[jira] [Created] (HIVE-16124) Drop the segments data as soon as it is pushed to HDFS
slim bouguerra created HIVE-16124: - Summary: Drop the segments data as soon as it is pushed to HDFS Key: HIVE-16124 URL: https://issues.apache.org/jira/browse/HIVE-16124 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Drop the pushed segments from the indexer as soon as the HDFS push is done. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16123) Let user choose the granularity of bucketing.
slim bouguerra created HIVE-16123: - Summary: Let user choose the granularity of bucketing. Key: HIVE-16123 URL: https://issues.apache.org/jira/browse/HIVE-16123 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Currently we index the data with a granularity of NONE, which puts a lot of pressure on the indexer. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
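What "granularity of bucketing" means can be sketched like this (a hypothetical helper, not the storage handler's code): each event timestamp is truncated to its segment bucket, so a coarser user-chosen granularity yields far fewer buckets than NONE:

```python
from datetime import datetime

def bucket_timestamp(ts, granularity):
    """Map an event timestamp to its segment bucket for the chosen granularity."""
    if granularity == "DAY":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "HOUR":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "NONE":
        return ts  # one bucket per distinct timestamp: heavy pressure on the indexer
    raise ValueError(f"unsupported granularity: {granularity}")
```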
[jira] [Created] (HIVE-16122) NPE Hive Druid split introduced by HIVE-15928
slim bouguerra created HIVE-16122: - Summary: NPE Hive Druid split introduced by HIVE-15928 Key: HIVE-16122 URL: https://issues.apache.org/jira/browse/HIVE-16122 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16096) Predicate `__time` In ("date", "date") is not pushed
slim bouguerra created HIVE-16096: - Summary: Predicate `__time` In ("date", "date") is not pushed Key: HIVE-16096 URL: https://issues.apache.org/jira/browse/HIVE-16096 Project: Hive Issue Type: Bug Reporter: slim bouguerra {code} explain select * from login_druid where `__time` in ("2003-1-1", "2004-1-1" ); OK Plan optimized by CBO. Stage-0 Fetch Operator limit:-1 Select Operator [SEL_2] Output:["_col0","_col1","_col2"] Filter Operator [FIL_4] predicate:(__time) IN ('2003-1-1', '2004-1-1') TableScan [TS_0] Output:["__time","userid","num_l"],properties:{"druid.query.json":"{\"queryType\":\"select\",\"dataSource\":\"druid_user_login\",\"descending\":false,\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"dimensions\":[\"userid\"],\"metrics\":[\"num_l\"],\"granularity\":\"all\",\"pagingSpec\":{\"threshold\":16384},\"context\":{\"druid.query.fetch\":false}}","druid.query.type":"select"} {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
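One way such a predicate could be pushed down (an illustrative sketch, not the actual translator; it assumes each listed date should cover a whole day) is to turn the IN list into Druid interval strings, replacing the 1900-3000 catch-all interval in the plan above:

```python
from datetime import datetime, timedelta

def time_in_to_intervals(dates, fmt="%Y-%m-%d"):
    """Translate `__time` IN ('2003-1-1', ...) into Druid interval strings,
    one day-wide closed-open interval per listed date."""
    intervals = []
    for d in dates:
        start = datetime.strptime(d, fmt)
        end = start + timedelta(days=1)
        intervals.append(f"{start.isoformat()}/{end.isoformat()}")
    return intervals
```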
[jira] [Created] (HIVE-16095) Filter generation is not taking into account the column type.
slim bouguerra created HIVE-16095: - Summary: Filter generation is not taking into account the column type. Key: HIVE-16095 URL: https://issues.apache.org/jira/browse/HIVE-16095 Project: Hive Issue Type: Bug Reporter: slim bouguerra We are supposed to get an alphanumeric comparison when we have a cast to a numeric type. This looks to be a Calcite issue. {code} hive> explain select * from login_druid where userid < 2 > ; OK Plan optimized by CBO. Stage-0 Fetch Operator limit:-1 Select Operator [SEL_1] Output:["_col0","_col1","_col2"] TableScan [TS_0] Output:["__time","userid","num_l"],properties:{"druid.query.json":"{\"queryType\":\"select\",\"dataSource\":\"druid_user_login\",\"descending\":false,\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"filter\":{\"type\":\"bound\",\"dimension\":\"userid\",\"upper\":\"2\",\"upperStrict\":true,\"alphaNumeric\":false},\"dimensions\":[\"userid\"],\"metrics\":[\"num_l\"],\"granularity\":\"all\",\"pagingSpec\":{\"threshold\":16384},\"context\":{\"druid.query.fetch\":false}}","druid.query.type":"select"} Time taken: 1.548 seconds, Fetched: 10 row(s) hive> explain select * from login_druid where cast (userid as int) < 2; OK Plan optimized by CBO. Stage-0 Fetch Operator limit:-1 Select Operator [SEL_1] Output:["_col0","_col1","_col2"] TableScan [TS_0] Output:["__time","userid","num_l"],properties:{"druid.query.json":"{\"queryType\":\"select\",\"dataSource\":\"druid_user_login\",\"descending\":false,\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"filter\":{\"type\":\"bound\",\"dimension\":\"userid\",\"upper\":\"2\",\"upperStrict\":true,\"alphaNumeric\":false},\"dimensions\":[\"userid\"],\"metrics\":[\"num_l\"],\"granularity\":\"all\",\"pagingSpec\":{\"threshold\":16384},\"context\":{\"druid.query.fetch\":false}}","druid.query.type":"select"} Time taken: 0.27 seconds, Fetched: 10 row(s) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
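A sketch of the bound filter that appears in both plans above (hypothetical builder, not the generator's code); the point of the bug is that the `alphaNumeric` flag should flip according to whether the comparison is string-ordered or numeric, but both plans emit `false`:

```python
def bound_filter(dimension, upper, numeric_compare):
    """Druid bound filter for `dimension < upper`; `alphaNumeric` selects the ordering.
    The bug: even after CAST(userid AS INT), Hive still emitted alphaNumeric=false."""
    return {
        "type": "bound",
        "dimension": dimension,
        "upper": str(upper),
        "upperStrict": True,
        "alphaNumeric": numeric_compare,
    }
```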
[jira] [Created] (HIVE-16026) Generated query will timeout and/or kill the druid cluster.
slim bouguerra created HIVE-16026: - Summary: Generated query will timeout and/or kill the druid cluster. Key: HIVE-16026 URL: https://issues.apache.org/jira/browse/HIVE-16026 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Grouping by `__time` and another dimension generates a query with granularity NONE and an interval from 1970 to 3000. This will kill the Druid cluster, because the Druid group-by strategy will create a cursor for every millisecond, and there are a lot of milliseconds between 1970 and 3000. Hence such a query can be turned into a select, with the group by done within Hive. This should only happen when we don't know the `__time` granularity. {code} explain select `__time`, userid from login_druid group by `__time`, userid > ; OK Plan optimized by CBO. Stage-0 Fetch Operator limit:-1 Select Operator [SEL_1] Output:["_col0","_col1"] TableScan [TS_0] Output:["__time","userid"],properties:{"druid.query.json":"{\"queryType\":\"groupBy\",\"dataSource\":\"druid_user_login\",\"granularity\":\"NONE\",\"dimensions\":[\"userid\"],\"limitSpec\":{\"type\":\"default\"},\"aggregations\":[{\"type\":\"longSum\",\"name\":\"dummy_agg\",\"fieldName\":\"dummy_agg\"}],\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"]}","druid.query.type":"groupBy"} {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
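To put a number on "a lot of milliseconds": at granularity NONE with the 1900-3000 interval from the plan above, the cursor count is on the order of tens of trillions:

```python
from datetime import datetime

# One Druid group-by cursor per millisecond over the plan's default interval.
start, end = datetime(1900, 1, 1), datetime(3000, 1, 1)
cursor_count = int((end - start).total_seconds() * 1000)  # roughly 3.5e13 cursors
```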
[jira] [Created] (HIVE-16025) Where IN clause throws exception
slim bouguerra created HIVE-16025: - Summary: Where IN clause throws exception Key: HIVE-16025 URL: https://issues.apache.org/jira/browse/HIVE-16025 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Priority: Critical {code} select * from login_druid where userid IN ("user1", "user2"); Exception in thread "main" java.lang.AssertionError: cannot translate filter: IN($1, _UTF-16LE'user1', _UTF-16LE'user2') at org.apache.calcite.adapter.druid.DruidQuery$Translator.translateFilter(DruidQuery.java:886) at org.apache.calcite.adapter.druid.DruidQuery$Translator.access$000(DruidQuery.java:786) at org.apache.calcite.adapter.druid.DruidQuery.getQuery(DruidQuery.java:424) at org.apache.calcite.adapter.druid.DruidQuery.deriveQuerySpec(DruidQuery.java:402) at org.apache.calcite.adapter.druid.DruidQuery.getQuerySpec(DruidQuery.java:351) at org.apache.calcite.adapter.druid.DruidQuery.deriveRowType(DruidQuery.java:271) at org.apache.calcite.rel.AbstractRelNode.getRowType(AbstractRelNode.java:219) at org.apache.calcite.plan.RelOptUtil.verifyTypeEquivalence(RelOptUtil.java:343) at org.apache.calcite.plan.hep.HepRuleCall.transformTo(HepRuleCall.java:57) at org.apache.calcite.plan.RelOptRuleCall.transformTo(RelOptRuleCall.java:225) at org.apache.calcite.adapter.druid.DruidRules$DruidFilterRule.onMatch(DruidRules.java:142) at org.apache.calcite.plan.AbstractRelOptPlanner.fireRule(AbstractRelOptPlanner.java:314) at org.apache.calcite.plan.hep.HepPlanner.applyRule(HepPlanner.java:502) at org.apache.calcite.plan.hep.HepPlanner.applyRules(HepPlanner.java:381) at org.apache.calcite.plan.hep.HepPlanner.executeInstruction(HepPlanner.java:247) at org.apache.calcite.plan.hep.HepInstruction$RuleInstance.execute(HepInstruction.java:125) at org.apache.calcite.plan.hep.HepPlanner.executeProgram(HepPlanner.java:206) at org.apache.calcite.plan.hep.HepPlanner.findBestExp(HepPlanner.java:193) at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.hepPlan(CalcitePlanner.java:1775) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1504) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1260) at org.apache.calcite.tools.Frameworks$1.apply(Frameworks.java:113) at org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:997) at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:149) at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:106) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1068) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.getOptimizedAST(CalcitePlanner.java:1084) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:363) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11026) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:285) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:511) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403) at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:233) at org.apache.hadoop.util.RunJar.main(RunJar.java:148) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
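Rather than asserting, the translator could rewrite the IN as a disjunction of selector filters, which Druid evaluates natively. A hypothetical sketch of that translation (not Calcite's actual code):

```python
def translate_in_filter(dimension, values):
    """Rewrite `dimension IN (v1, v2, ...)` as a Druid OR of selector filters."""
    selectors = [{"type": "selector", "dimension": dimension, "value": v}
                 for v in values]
    if len(selectors) == 1:
        return selectors[0]
    return {"type": "or", "fields": selectors}
```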
[jira] [Created] (HIVE-15951) Make sure base persist directory is unique and deleted
slim bouguerra created HIVE-15951: - Summary: Make sure base persist directory is unique and deleted Key: HIVE-15951 URL: https://issues.apache.org/jira/browse/HIVE-15951 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 2.2.0 Reporter: slim bouguerra Priority: Critical Fix For: 2.2.0 In some cases the base persist directory will contain old data or be shared between reducers in the same physical VM. That will lead to the failure of the job until the directory is cleaned. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
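The fix shape can be sketched as follows (hypothetical helpers, assuming a UUID suffix per task): each reducer gets a unique subdirectory under the base persist directory, and deletes it when done, so tasks on the same VM never collide or see stale data:

```python
import os
import shutil
import uuid

def unique_persist_dir(base):
    """Suffix the base persist dir with a UUID so reducers on one VM never collide."""
    d = os.path.join(base, uuid.uuid4().hex)
    os.makedirs(d)
    return d

def cleanup_persist_dir(d):
    """Delete the task's persist dir so no stale data survives for the next job."""
    shutil.rmtree(d, ignore_errors=True)
```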
[jira] [Created] (HIVE-15877) Upload dependency jars for druid storage handler
slim bouguerra created HIVE-15877: - Summary: Upload dependency jars for druid storage handler Key: HIVE-15877 URL: https://issues.apache.org/jira/browse/HIVE-15877 Project: Hive Issue Type: Bug Reporter: slim bouguerra Upload dependency jars for druid storage handler -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-15809) Typo in the PostgreSQL database name for druid service
slim bouguerra created HIVE-15809: - Summary: Typo in the PostgreSQL database name for druid service Key: HIVE-15809 URL: https://issues.apache.org/jira/browse/HIVE-15809 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 2.2.0 Reporter: slim bouguerra Assignee: slim bouguerra Priority: Trivial Fix For: 2.2.0 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-15785) Add S3 support for druid storage handler
slim bouguerra created HIVE-15785: - Summary: Add S3 support for druid storage handler Key: HIVE-15785 URL: https://issues.apache.org/jira/browse/HIVE-15785 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra Fix For: 2.2.0 Add S3 support for druid storage handler -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-15727) Add pre insert work to give storage handler the possibility to perform pre insert checking
slim bouguerra created HIVE-15727: - Summary: Add pre insert work to give storage handler the possibility to perform pre insert checking Key: HIVE-15727 URL: https://issues.apache.org/jira/browse/HIVE-15727 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Fix For: 2.2.0 Add a pre-insert work stage to give the storage handler the possibility to perform pre-insert checking. For instance, for the Druid storage handler this will block the INSERT INTO statement. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15586) Make Insert and Create statement Transactional
slim bouguerra created HIVE-15586: - Summary: Make Insert and Create statement Transactional Key: HIVE-15586 URL: https://issues.apache.org/jira/browse/HIVE-15586 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Currently insert/create will return the handle to the user without waiting for the data to be loaded by the Druid cluster. To avoid that, we will add a passive wait until the segments are loaded by the historicals, as long as the coordinator is up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
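The passive wait could look like the following sketch (hypothetical, not the committed implementation): poll the datasource's load percentage, as reported by the coordinator's load-status API, until it reaches 100% or a timeout expires. The status source is injected as a callable so the loop itself stays transport-agnostic:

```python
import time

def wait_for_handoff(load_status, timeout_s=300, poll_s=5, sleep=time.sleep):
    """Passively wait until the coordinator reports all segments loaded.

    `load_status` is a callable returning the datasource's loaded percentage
    (e.g. fetched from the coordinator's load-status endpoint); 100.0 means
    the historicals have picked up every segment. Returns True on success,
    False if the timeout expires first."""
    waited = 0
    while waited < timeout_s:
        if load_status() >= 100.0:
            return True
        sleep(poll_s)
        waited += poll_s
    return False
```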
[jira] [Created] (HIVE-15571) Support Insert into for druid storage handler
slim bouguerra created HIVE-15571: - Summary: Support Insert into for druid storage handler Key: HIVE-15571 URL: https://issues.apache.org/jira/browse/HIVE-15571 Project: Hive Issue Type: New Feature Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15439) Support INSERT OVERWRITE for internal druid datasources.
slim bouguerra created HIVE-15439: - Summary: Support INSERT OVERWRITE for internal druid datasources. Key: HIVE-15439 URL: https://issues.apache.org/jira/browse/HIVE-15439 Project: Hive Issue Type: Sub-task Components: Druid integration Affects Versions: 2.2.0 Reporter: slim bouguerra Assignee: slim bouguerra Add support for the SQL statement INSERT OVERWRITE TABLE druid_internal_table. In order to add this support we will need to add a new post-insert hook to update the Druid metadata. Creation of the segments will be the same as for CTAS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15393) Update Guava version
slim bouguerra created HIVE-15393: - Summary: Update Guava version Key: HIVE-15393 URL: https://issues.apache.org/jira/browse/HIVE-15393 Project: Hive Issue Type: Sub-task Components: Druid integration Affects Versions: 2.2.0 Reporter: slim bouguerra Priority: Blocker The Druid code base is using a newer version of Guava (16.0.1) that is not compatible with the current version used by Hive. FYI, the Hadoop project is moving to Guava 18; not sure if it is better to move to Guava 18 or even 19. https://issues.apache.org/jira/browse/HADOOP-10101 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15277) Teach Hive how to create/delete Druid segments
slim bouguerra created HIVE-15277: - Summary: Teach Hive how to create/delete Druid segments Key: HIVE-15277 URL: https://issues.apache.org/jira/browse/HIVE-15277 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 2.2.0 Reporter: slim bouguerra Assignee: slim bouguerra We want to extend the DruidStorageHandler to support CTAS queries. In this implementation Hive will generate the Druid segment files and insert the metadata to signal the handoff to Druid. The syntax will be as follows: CREATE TABLE druid_table_1 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.datasource" = "datasourcename") AS ; This statement stores the results of the query in a Druid datasource named 'datasourcename'. One of the columns of the query needs to be the time dimension, which is mandatory in Druid. In particular, we use the same convention that is used for Druid: there needs to be a column named '__time' in the result of the executed query, which will act as the time dimension column in Druid. Currently, the time dimension column needs to be of 'timestamp' type. Metrics can be of type long, double, and float, while dimensions are strings. Keep in mind that Druid has a clear separation between dimensions and metrics; therefore, if you have a column in Hive that contains numbers and needs to be presented as a dimension, use the cast operator to cast it as string. This initial implementation interacts with the Druid metadata storage to add/remove the table in Druid; the user needs to supply the metadata config as --hiveconf hive.druid.metadata.password=XXX --hiveconf hive.druid.metadata.username=druid --hiveconf hive.druid.metadata.uri=jdbc:mysql://host/druid -- This message was sent by Atlassian JIRA (v6.3.4#6332)
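The dimension/metric split described above can be sketched as a small classification rule (a hypothetical helper mirroring the description, not the storage handler's code): numeric columns become Druid metrics, string columns become dimensions, and `__time` is the mandatory time column:

```python
def classify_column(name, hive_type):
    """Classify a Hive result column for Druid ingestion, per the convention above."""
    if name == "__time":
        return "time"           # mandatory time dimension, currently 'timestamp' type
    if hive_type in ("bigint", "double", "float"):
        return "metric"         # numeric columns are ingested as Druid metrics
    if hive_type == "string":
        return "dimension"      # string columns are ingested as Druid dimensions
    raise ValueError(f"cast {name} to string to present it as a dimension")
```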
[jira] [Created] (HIVE-15274) wrong results on the column __time
slim bouguerra created HIVE-15274: - Summary: wrong results on the column __time Key: HIVE-15274 URL: https://issues.apache.org/jira/browse/HIVE-15274 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: Jesus Camacho Rodriguez Priority: Minor Issuing select * from the table will return wrong values in the time column.
Expected results:
| __time | dimension1 | metric1 |
| Wed Dec 31 2014 16:00:00 GMT-0800 (PST) | value1 | 1 |
| Wed Dec 31 2014 16:00:00 GMT-0800 (PST) | value1.1 | 1 |
| Sun May 31 2015 19:00:00 GMT-0700 (PDT) | value2 | 20.5 |
| Sun May 31 2015 19:00:00 GMT-0700 (PDT) | value2.1 | 32 |
Returned results:
2014-12-31 19:00:00 value1 1.0
2014-12-31 19:00:00 value1.1 1.0
2014-12-31 19:00:00 value2 20.5
2014-12-31 19:00:00 value2.1 32.0
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15273) Http Client not configured correctly
slim bouguerra created HIVE-15273: - Summary: Http Client not configured correctly Key: HIVE-15273 URL: https://issues.apache.org/jira/browse/HIVE-15273 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Priority: Minor The http client currently used by the druid-hive record reader is constructed with default values. The default values of numConnection and ReadTimeout are very small, which can lead to the following exception: " ERROR [2ee34a2b-c8a5-4748-ab91-db3621d2aa5c main] CliDriver: Failed with exception java.io.IOException:java.io.IOException: java.io.IOException: org.apache.hive.druid.org.jboss.netty.channel.ChannelException: Channel disconnected" The full stack can be found here: https://gist.github.com/b-slim/384ca6a96698f5b51ad9b171cff556a2 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
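The general fix shape, sketched with Python's standard http client rather than the Netty-based client the record reader actually uses (so the helper and values here are purely illustrative): construct the connection with an explicit, generous read timeout instead of relying on small defaults; the real fix would also raise the client's connection-pool size. 8082 is Druid's default broker port.

```python
import http.client

def broker_connection(host, port=8082, read_timeout_s=300):
    """Connection to the Druid broker with an explicit read timeout; long-running
    select scans need far more than a small default before the channel is dropped."""
    return http.client.HTTPConnection(host, port, timeout=read_timeout_s)
```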