[jira] [Created] (DRILL-4128) null pointer at org.apache.drill.exec.vector.accessor.AbstractSqlAccessor.getString(AbstractSqlAccessor.java:101)

2015-11-24 Thread Devender Yadav (JIRA)
Devender Yadav  created DRILL-4128:
--

 Summary: null pointer at 
org.apache.drill.exec.vector.accessor.AbstractSqlAccessor.getString(AbstractSqlAccessor.java:101)
 Key: DRILL-4128
 URL: https://issues.apache.org/jira/browse/DRILL-4128
 Project: Apache Drill
  Issue Type: Bug
  Components: Client - JDBC
Affects Versions: 1.3.0, 1.2.0, 1.1.0, 1.0.0
Reporter: Devender Yadav 


The method below throws a NullPointerException because getObject(rowOffset) 
returns null for null values, and calling toString() on null throws.

  @Override
  public String getString(int rowOffset) throws InvalidAccessException {
    return getObject(rowOffset).toString();
  }

It should be something like this (reading the value once instead of calling 
getObject twice):

  @Override
  public String getString(int rowOffset) throws InvalidAccessException {
    Object value = getObject(rowOffset);
    return value == null ? null : value.toString();
  }
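
The null-safe pattern can be tried in isolation with a small stand-in class. 
This is only a sketch: DemoAccessor is a hypothetical name and is not Drill's 
AbstractSqlAccessor; it just mimics an accessor whose getObject may return 
null for SQL NULL values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a SQL accessor; not Drill's AbstractSqlAccessor.
final class DemoAccessor {
  private final List<Object> rows;

  DemoAccessor(Object... rows) {
    // Arrays.asList permits null elements, which model SQL NULL here.
    this.rows = Arrays.asList(rows);
  }

  // Returns the raw value, which may be null.
  Object getObject(int rowOffset) {
    return rows.get(rowOffset);
  }

  // Null-safe: fetch once, then convert only if a value is present.
  String getString(int rowOffset) {
    Object value = getObject(rowOffset);
    return value == null ? null : value.toString();
  }
}
```

With this shape, a NULL cell yields a null String instead of an exception.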



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (DRILL-4035) NPE seen on Functional test run using JDK 8

2015-11-24 Thread Deneche A. Hakim (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deneche A. Hakim reassigned DRILL-4035:
---

Assignee: Deneche A. Hakim  (was: Sudheesh Katkam)

> NPE seen on Functional test run using JDK 8
> ---
>
> Key: DRILL-4035
> URL: https://issues.apache.org/jira/browse/DRILL-4035
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.3.0
> Environment: 4 node cluster CentOS
>Reporter: Khurram Faraaz
>Assignee: Deneche A. Hakim
> Fix For: 1.4.0
>
>
> I am seeing an NPE in the Functional test run using JDK 8 and Drill 1.3.
> The failing test is:
> Functional/partition_pruning/hive/parquet/dynamic_hier_intint/data/parquetCount1.q
> select count(*) from 
> hive.dynamic_partitions.lineitem_parquet_partitioned_hive_hier_intint;
> {code}
> Drill version was, git.commit.id=e4b94a78
> [root@centos drill-1.3.0]# java -version
> openjdk version "1.8.0_65"
> OpenJDK Runtime Environment (build 1.8.0_65-b17)
> OpenJDK 64-Bit Server VM (build 25.65-b01, mixed mode)
> [root@centos drill-1.3.0]# javac -version
> javac 1.8.0_65
> {code}
> {code}
> 2015-11-05 01:37:45 INFO  DrillTestJdbc:76 - running test 
> /root/public_framework/drill-test-framework/framework/resources/Functional/window_functions/last_val/lastValFn_9.q
>  981260622
> 2015-11-05 01:37:45 INFO  DrillResultSetImpl$ResultsListener:1470 - [#137] 
> Query failed:
> oadd.org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
> NullPointerException
> Fragment 0:0
> [Error Id: cefc7238-a646-4f9a-b4f2-0bd102efe393 on centos-01.qa.lab:31010]
> at 
> oadd.org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:118)
> at 
> oadd.org.apache.drill.exec.rpc.user.UserClient.handleReponse(UserClient.java:110)
> at 
> oadd.org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:47)
> at 
> oadd.org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:32)
> at oadd.org.apache.drill.exec.rpc.RpcBus.handle(RpcBus.java:61)
> at 
> oadd.org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:233)
> at 
> oadd.org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:205)
> at 
> oadd.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
> at 
> oadd.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> at 
> oadd.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> at 
> oadd.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
> at 
> oadd.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> at 
> oadd.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> at 
> oadd.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at 
> oadd.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> at 
> oadd.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> at 
> oadd.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
> at 
> oadd.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> at 
> oadd.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> at 
> oadd.io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
> at 
> oadd.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
> at 
> oadd.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
> at 
> oadd.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
> at 
> oadd.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at 
> oadd.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at 
> oadd.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at 
> oadd.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at oadd.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at 
> oadd.io.netty.util.concurrent.SingleThreadEventExec

[jira] [Commented] (DRILL-4111) turn tests off in travis as they don't work there

2015-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15026052#comment-15026052
 ] 

ASF GitHub Bot commented on DRILL-4111:
---

Github user sudheeshkatkam commented on the pull request:

https://github.com/apache/drill/pull/267#issuecomment-159471158
  
:+1:


> turn tests off in travis as they don't work there
> -
>
> Key: DRILL-4111
> URL: https://issues.apache.org/jira/browse/DRILL-4111
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>
> Since the travis build always fails, we should just turn it off for now.





[jira] [Commented] (DRILL-4111) turn tests off in travis as they don't work there

2015-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15026004#comment-15026004
 ] 

ASF GitHub Bot commented on DRILL-4111:
---

Github user julienledem commented on the pull request:

https://github.com/apache/drill/pull/267#issuecomment-159465600
  
travis-ci is green: mvn package -DskipTests=true







[jira] [Commented] (DRILL-2618) BasicFormatMatcher calls getFirstPath(...) without checking # of paths is not zero

2015-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025862#comment-15025862
 ] 

ASF GitHub Bot commented on DRILL-2618:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/270


> BasicFormatMatcher calls getFirstPath(...) without checking # of paths is not 
> zero
> --
>
> Key: DRILL-2618
> URL: https://issues.apache.org/jira/browse/DRILL-2618
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Other
>Reporter: Daniel Barclay (Drill)
>Assignee: Hanifi Gunes
> Fix For: 1.4.0
>
>
> {{BasicFormatMatcher.isReadable(...)}} calls {{getFirstPath(...)}} without 
> checking that there is at least one path.  This can cause an 
> IndexOutOfBoundsException.
> To reproduce, create an empty directory {{/tmp/CaseInsensitiveColumnNames}} 
> and run 
> {{exec/java-exec/src/test/java/org/apache/drill/TestExampleQueries.java}}.
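
One way to picture the missing check is a guard that handles the empty case 
before touching the first element. The sketch below is illustrative only: 
FirstPathGuard, firstPathOrNull and the ".json" suffix test are hypothetical 
and are not Drill's BasicFormatMatcher API.

```java
import java.util.List;

// Hypothetical illustration; names here are not Drill's actual API.
final class FirstPathGuard {
  // Returns the first path, or null when the selection is empty,
  // instead of letting List.get(0) throw IndexOutOfBoundsException.
  static String firstPathOrNull(List<String> paths) {
    return paths.isEmpty() ? null : paths.get(0);
  }

  // A matcher can then treat "no files" as "not readable".
  // The suffix check is a placeholder for a real format test.
  static boolean isReadable(List<String> paths) {
    String first = firstPathOrNull(paths);
    return first != null && first.endsWith(".json");
  }
}
```

An empty directory then simply fails to match rather than raising an 
exception.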





[jira] [Commented] (DRILL-3854) IOB Exception : CONVERT_FROM (sal, int_be)

2015-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025781#comment-15025781
 ] 

ASF GitHub Bot commented on DRILL-3854:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/262


> IOB Exception : CONVERT_FROM (sal, int_be)
> --
>
> Key: DRILL-3854
> URL: https://issues.apache.org/jira/browse/DRILL-3854
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.2.0
> Environment: 4 node cluster CentOS
>Reporter: Khurram Faraaz
>Assignee: Sean Hsuan-Yi Chu
>Priority: Critical
> Fix For: 1.4.0
>
> Attachments: log, run_time_code.txt
>
>
> CONVERT_FROM function results in IOB Exception
> Drill master commit id : b9afcf8f
> {code}
> 0: jdbc:drill:schema=dfs.tmp> select salary from Emp;
> +-+
> | salary  |
> +-+
> | 8   |
> | 9   |
> | 20  |
> | 95000   |
> | 85000   |
> | 9   |
> | 10  |
> | 87000   |
> | 8   |
> | 10  |
> | 99000   |
> +-+
> 11 rows selected (0.535 seconds)
> # create table using above Emp table
> create table tbl_int_be as select convert_to(salary, 'int_be') sal from Emp;
> 0: jdbc:drill:schema=dfs.tmp> alter session set `planner.slice_target`=1;
> +---++
> |  ok   |summary |
> +---++
> | true  | planner.slice_target updated.  |
> +---++
> 1 row selected (0.19 seconds)
> # Below query results in IOB on server.
> 0: jdbc:drill:schema=dfs.tmp> select convert_from(sal, 'int_be') from 
> tbl_int_be order by sal;
> Error: SYSTEM ERROR: IndexOutOfBoundsException: DrillBuf(ridx: 0, widx: 158, 
> cap: 158/158, unwrapped: SlicedByteBuf(ridx: 0, widx: 158, cap: 158/158, 
> unwrapped: UnsafeDirectLittleEndian(PooledUnsafeDirectByteBuf(ridx: 0, widx: 
> 0, cap: 417/417.slice(158, 44)
> Fragment 2:0
> [Error Id: 4ee1361d-9877-45eb-bde6-57d5add9fe5e on centos-04.qa.lab:31010] 
> (state=,code=0)
> # Applying convert_from and also projecting the original column results in 
> # an IOB on the client (inferred because the Error Id is missing).
> 0: jdbc:drill:schema=dfs.tmp> select convert_from(sal, 'int_be'), sal from 
> tbl_int_be;
> Error: Unexpected RuntimeException: java.lang.IndexOutOfBoundsException: 
> DrillBuf(ridx: 0, widx: 114, cap: 114/114, unwrapped: DrillBuf(ridx: 321, 
> widx: 321, cap: 321/321, unwrapped: 
> UnsafeDirectLittleEndian(PooledUnsafeDirectByteBuf(ridx: 0, widx: 0, cap: 
> 321/321.slice(55, 103) (state=,code=0)
> {code}
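
For context, 'int_be' names a 4-byte big-endian integer encoding. The sketch 
below shows that round trip in plain Java with java.nio.ByteBuffer. It is an 
assumption-laden illustration: IntBeDemo is a hypothetical class and this is 
not Drill's convert_to/convert_from implementation, only the byte layout such 
functions are expected to agree on.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper; mirrors the 4-byte big-endian layout only.
final class IntBeDemo {
  // Encodes an int as 4 bytes, most significant byte first.
  static byte[] toIntBe(int value) {
    return ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
  }

  // Decodes exactly 4 big-endian bytes; rejects other lengths up front
  // rather than reading past the end of the buffer.
  static int fromIntBe(byte[] bytes) {
    if (bytes.length != 4) {
      throw new IllegalArgumentException("expected 4 bytes, got " + bytes.length);
    }
    return ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getInt();
  }
}
```

A decoder that assumes a fixed width has to check the available length 
first; reading past the end of a buffer slice is one way to get an 
IndexOutOfBoundsException like the one reported, though the actual cause in 
Drill is not established here.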





[jira] [Commented] (DRILL-4109) NPE in RecordIterator

2015-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025757#comment-15025757
 ] 

ASF GitHub Bot commented on DRILL-4109:
---

GitHub user amithadke opened a pull request:

https://github.com/apache/drill/pull/282

DRILL-4109 Fix NPE in RecordIterator.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/amithadke/drill DRILL-4109

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/282.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #282


commit 7adc9243c17ed7a11f2d791e120731f99c219d75
Author: Amit Hadke 
Date:   2015-11-25T00:00:16Z

DRILL-4109 Fix NPE in RecordIterator.




> NPE in RecordIterator
> -
>
> Key: DRILL-4109
> URL: https://issues.apache.org/jira/browse/DRILL-4109
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Victoria Markman
>Assignee: amit hadke
>Priority: Blocker
> Fix For: 1.4.0
>
> Attachments: 29ac6c1b-9b33-3457-8bc8-9e2dff6ad438.sys.drill, 
> 29b41f37-4803-d7ce-e05f-912d1f65da79.sys.drill, drillbit.log, 
> drillbit.log.debug
>
>
> 4 node cluster
> 36GB of direct memory
> 4GB heap memory
> planner.memory.max_query_memory_per_node=2GB (default)
> planner.enable_hashjoin = false
> Spill directory has 6.4T of disk space available:
> {noformat}
> [Tue Nov 17 18:23:18 /tmp/drill ] # df -H .
> Filesystem   Size  Used Avail Use% Mounted on
> localhost:/mapr  7.7T  1.4T  6.4T  18% /mapr
> {noformat}
> Run query below: 
> framework/resources/Advanced/tpcds/tpcds_sf100/original/query15.sql
> drillbit.log
> {code}
> 2015-11-18 02:22:12,639 [29b41f37-4803-d7ce-e05f-912d1f65da79:frag:3:9] INFO  
> o.a.d.e.p.i.xsort.ExternalSortBatch - Merging and spilling to 
> /tmp/drill/spill/29b41f37-4803-d7ce-e05f-912d1f65da79/major_fragment_3/minor_fragment_9/operator_17/7
> 2015-11-18 02:22:12,770 [29b41f37-4803-d7ce-e05f-912d1f65da79:frag:3:5] INFO  
> o.a.d.e.p.i.xsort.ExternalSortBatch - Merging and spilling to 
> /tmp/drill/spill/29b41f37-4803-d7ce-e05f-912d1f65da79/major_fragment_3/minor_fragment_5/operator_17/7
> 2015-11-18 02:22:13,345 [29b41f37-4803-d7ce-e05f-912d1f65da79:frag:3:17] INFO 
>  o.a.d.e.p.i.xsort.ExternalSortBatch - Completed spilling to 
> /tmp/drill/spill/29b41f37-4803-d7ce-e05f-912d1f65da79/major_fragment_3/minor_fragment_17/operator_17/7
> 2015-11-18 02:22:13,346 [29b41f37-4803-d7ce-e05f-912d1f65da79:frag:3:13] INFO 
>  o.a.d.e.p.i.xsort.ExternalSortBatch - Completed spilling to 
> /tmp/drill/spill/29b41f37-4803-d7ce-e05f-912d1f65da79/major_fragment_3/minor_fragment_13/operator_16/1
> 2015-11-18 02:22:13,346 [29b41f37-4803-d7ce-e05f-912d1f65da79:frag:3:13] WARN 
>  o.a.d.e.p.i.xsort.ExternalSortBatch - Starting to merge. 34 batch groups. 
> Current allocated memory: 2252186
> 2015-11-18 02:22:13,363 [29b41f37-4803-d7ce-e05f-912d1f65da79:frag:3:13] INFO 
>  o.a.d.e.w.fragment.FragmentExecutor - 
> 29b41f37-4803-d7ce-e05f-912d1f65da79:3:13: State change requested RUNNING --> 
> FAILED
> 2015-11-18 02:22:13,370 [29b41f37-4803-d7ce-e05f-912d1f65da79:frag:3:13] INFO 
>  o.a.d.e.w.fragment.FragmentExecutor - 
> 29b41f37-4803-d7ce-e05f-912d1f65da79:3:13: State change requested FAILED --> 
> FINISHED
> 2015-11-18 02:22:13,371 [29b41f37-4803-d7ce-e05f-912d1f65da79:frag:3:13] 
> ERROR o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: NullPointerException
> Fragment 3:13
> [Error Id: c5d67dcb-16aa-4951-89f5-599b4b4eb54d on atsqa4-133.qa.lab:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> NullPointerException
> Fragment 3:13
> [Error Id: c5d67dcb-16aa-4951-89f5-599b4b4eb54d on atsqa4-133.qa.lab:31010]
> at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534)
>  ~[drill-common-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:321)
>  [drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184)
>  [drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290)
>  [drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_71]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.ru

[jira] [Commented] (DRILL-4103) Add additional metadata to Parquet files generated by Drill

2015-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025725#comment-15025725
 ] 

ASF GitHub Bot commented on DRILL-4103:
---

Github user julienledem commented on the pull request:

https://github.com/apache/drill/pull/264#issuecomment-159440630
  
@jaltekruse did you mean to merge this?


> Add additional metadata to Parquet files generated by Drill
> ---
>
> Key: DRILL-4103
> URL: https://issues.apache.org/jira/browse/DRILL-4103
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Reporter: Jacques Nadeau
>Assignee: Julien Le Dem
> Fix For: 1.3.0
>
>
> For future compatibility efforts, it would be good for us to automatically 
> add metadata to Drill generated Parquet files. At a minimum, we should add 
> information about the fact that Drill generated the files and the version of 
> Drill that generated the files.





[jira] [Commented] (DRILL-3845) UnorderedReceiver shouldn't terminate until it receives a final batch

2015-11-24 Thread Deneche A. Hakim (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025562#comment-15025562
 ] 

Deneche A. Hakim commented on DRILL-3845:
-

I changed the UnorderedReceiver to not kill its providers until it receives 
the "last batch" (you can see the change 
[here|https://github.com/adeneche/incubator-drill/commit/5dbd9fdc88b1c802dff3509dee85416efa3dac15]),
but now some queries fail with the following error:
{noformat}
Error: SYSTEM ERROR: IllegalStateException: Cleanup before finished. 0 out of 1 
strams have finished
{noformat}

Fixing the receiver doesn't enforce the protocol. Senders will close their 
fragment as soon as they receive a "kill signal", causing their receivers to 
close before they get the "final batch", which throws the error above.

[~jnadeau] and [~sphillips]: is it valid to change the protocol so that 
receivers can terminate before they get their "final batch" (which is already 
the case sometimes) and senders don't send the "final batch" to receivers that 
have already finished (i.e. that sent a "receiver finished" message)?
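
The proposed sender-side rule can be modeled in a few lines. This is a 
hypothetical sketch, not Drill's partition sender: FinalBatchSender, 
onReceiverFinished and finalBatchTargets are invented names, and receivers 
are identified by plain integer ids for simplicity.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the proposed protocol change; not Drill's sender code.
final class FinalBatchSender {
  private final int receiverCount;
  private final Set<Integer> finishedReceivers = new HashSet<>();

  FinalBatchSender(int receiverCount) {
    this.receiverCount = receiverCount;
  }

  // Called when a receiver reports "receiver finished".
  void onReceiverFinished(int receiverId) {
    finishedReceivers.add(receiverId);
  }

  // Returns the receivers that should still get a "final batch";
  // receivers that already finished are skipped.
  List<Integer> finalBatchTargets() {
    List<Integer> targets = new ArrayList<>();
    for (int id = 0; id < receiverCount; id++) {
      if (!finishedReceivers.contains(id)) {
        targets.add(id);
      }
    }
    return targets;
  }
}
```

Under this rule a receiver that announced it was finished never waits for, 
and is never sent, a trailing batch.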


> UnorderedReceiver shouldn't terminate until it receives a final batch
> -
>
> Key: DRILL-3845
> URL: https://issues.apache.org/jira/browse/DRILL-3845
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Reporter: Deneche A. Hakim
>Assignee: Deneche A. Hakim
> Fix For: 1.4.0
>
> Attachments: 29c45a5b-e2b9-72d6-89f2-d49ba88e2939.sys.drill
>
>
> Even if a receiver has finished and informed the corresponding partition 
> sender, the sender will still try to send a "last batch" to the receiver when 
> it's done. In most cases this is fine as those batches will be silently 
> dropped by the receiving DataServer, but if a receiver finished more than 10 
> minutes ago, DataServer will throw an exception because it cannot find the 
> corresponding FragmentManager (WorkEventBus has a 10-minute recentlyFinished 
> cache).
> DRILL-2274 is a reproduction for this case (after the corresponding fix is 
> applied).





[jira] [Created] (DRILL-4127) HiveSchema.getSubSchema() should use lazy loading of all the table names

2015-11-24 Thread Jinfeng Ni (JIRA)
Jinfeng Ni created DRILL-4127:
-

 Summary: HiveSchema.getSubSchema() should use lazy loading of all 
the table names
 Key: DRILL-4127
 URL: https://issues.apache.org/jira/browse/DRILL-4127
 Project: Apache Drill
  Issue Type: Bug
Reporter: Jinfeng Ni
Assignee: Jinfeng Ni


Currently, HiveSchema.getSubSchema() pre-loads all the table names when it 
constructs the subschema, even though those table names are not requested at 
all. This can cause considerable performance overhead, especially when the 
Hive schema contains a large number of objects (thousands of tables/views are 
not uncommon in some use cases). 

Instead, we should load table names on demand: only when there is a request 
to get all table names do we load them into the Hive schema.

This should help "show schemas", since it only requires the schema name, not 
the table names in the schema. 
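
The on-demand idea can be sketched as a holder that defers the expensive 
lookup until the names are first requested. This is a hypothetical 
illustration, not HiveSchema itself: LazyTableNames and its loader parameter 
are invented names, and the loader stands in for whatever call fetches table 
names from Hive.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of on-demand loading; not Drill's HiveSchema.
final class LazyTableNames {
  private final Supplier<List<String>> loader; // stands in for the Hive lookup
  private List<String> cached;                 // null until first requested

  LazyTableNames(Supplier<List<String>> loader) {
    this.loader = loader;
  }

  // Loads the names on the first call only; later calls reuse the result.
  synchronized List<String> getTableNames() {
    if (cached == null) {
      cached = loader.get();
    }
    return cached;
  }

  // Lets callers (and tests) observe whether the load has happened yet.
  synchronized boolean isLoaded() {
    return cached != null;
  }
}
```

Constructing the holder costs nothing, so an operation that only needs the 
schema name never triggers the table lookup.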






[jira] [Created] (DRILL-4126) Adding HiveMetaStore caching when impersonation is enabled.

2015-11-24 Thread Jinfeng Ni (JIRA)
Jinfeng Ni created DRILL-4126:
-

 Summary: Adding HiveMetaStore caching when impersonation is 
enabled. 
 Key: DRILL-4126
 URL: https://issues.apache.org/jira/browse/DRILL-4126
 Project: Apache Drill
  Issue Type: Bug
Reporter: Jinfeng Ni
Assignee: Jinfeng Ni


Currently, HiveMetaStore caching is used only when impersonation is disabled, 
such that all HiveMetaStore calls go through 
NonCloseableHiveClientWithCaching [1]. However, if impersonation is enabled, 
caching is not used for HiveMetaStore access.

This can significantly increase planning time when the Hive storage plugin is 
enabled, or when running a query against INFORMATION_SCHEMA. Depending on the 
number of databases/tables in the Hive storage plugin, the planning time or 
the INFORMATION_SCHEMA query time can become unacceptable. This becomes even 
worse if the Hive metastore is running on a different node from the drillbit, 
making access to the metastore even slower.

We are seeing that planning, or execution of an INFORMATION_SCHEMA query, can 
take 30 to 60 seconds. Such long planning or execution times prevent Drill 
from acting "interactively" for these queries. 

We should enable caching when impersonation is used. As long as the authorizer 
verifies that the user has access to the databases/tables, we should get the 
data from the cache. By doing that, we should see a reduced number of API 
calls to HiveMetaStore.
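
The "authorize first, then serve from a shared cache" idea might look roughly 
like the following. Everything here is a hypothetical sketch and not Drill's 
DrillHiveMetaStoreClient: AuthorizedTableCache, the metastoreLookup function 
and the authorizer predicate are assumed names standing in for the real 
metastore call and the real authorization check.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiPredicate;
import java.util.function.Function;

// Hypothetical sketch; not Drill's DrillHiveMetaStoreClient.
final class AuthorizedTableCache {
  private final Map<String, List<String>> tablesByDb = new ConcurrentHashMap<>();
  private final Function<String, List<String>> metastoreLookup; // slow remote call
  private final BiPredicate<String, String> authorizer;         // (user, db) -> allowed?

  AuthorizedTableCache(Function<String, List<String>> metastoreLookup,
                       BiPredicate<String, String> authorizer) {
    this.metastoreLookup = metastoreLookup;
    this.authorizer = authorizer;
  }

  // The access check runs on every call and for every user; only the
  // metadata itself is shared, so one user's lookup warms the cache for others.
  List<String> getTables(String user, String db) {
    if (!authorizer.test(user, db)) {
      throw new SecurityException(user + " may not access " + db);
    }
    return tablesByDb.computeIfAbsent(db, metastoreLookup);
  }
}
```

The design choice is that authorization is never cached, so a revoked 
permission takes effect immediately, while the expensive metastore round trip 
happens once per database.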


[1] 
https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/DrillHiveMetaStoreClient.java#L299





[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Zelaine Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025071#comment-15025071
 ] 

Zelaine Fong commented on DRILL-4119:
-

Per discussion at today's Drill hangout, Jacques mentioned that one of the 
differences resulting from the port is how it deals with Java's lack of 
unsigned integer types.
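
For readers unfamiliar with the issue, these are the usual places where a 
port of a C hash function to Java can diverge because Java's integer types 
are signed. The sketch is generic and does not claim to show the actual 
difference in Drill's hash64 code; UnsignedPitfalls and its methods are 
hypothetical names.

```java
// Hypothetical helper illustrating common signed/unsigned porting pitfalls.
final class UnsignedPitfalls {
  // A Java byte is signed (-128..127); masking gives the C "unsigned char" value.
  static int unsignedByte(byte b) {
    return b & 0xFF;
  }

  // Reads 4 little-endian bytes as an unsigned 32-bit value held in a long.
  // Without the 0xFFL masks, bytes >= 0x80 would sign-extend and corrupt the result.
  static long unsignedIntLE(byte[] buf, int offset) {
    return (buf[offset] & 0xFFL)
        | ((buf[offset + 1] & 0xFFL) << 8)
        | ((buf[offset + 2] & 0xFFL) << 16)
        | ((buf[offset + 3] & 0xFFL) << 24);
  }

  // Logical shift, matching C's >> on uint64_t; Java's >> would copy the sign bit.
  static long shiftRightUnsigned(long h, int n) {
    return h >>> n;
  }
}
```

Any one of these (a missing mask, or >> where >>> is needed) changes the 
hash only for inputs that set a high bit, which makes such bugs easy to miss 
in casual testing.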

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03Project(SomeId=[$0])
> 01-04  HashToRandomExchange(dist0=[[$0]])
> 02-01UnorderedMuxExchange
> 03-01  Project(SomeId=[$0], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02HashAgg(group=[{0}])
> 03-03  Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}





[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025061#comment-15025061
 ] 

Aman Sinha commented on DRILL-4119:
---

Sure, if you want to try out the original version go for it...






[jira] [Assigned] (DRILL-4125) Illegal argument exception during merge join

2015-11-24 Thread amit hadke (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

amit hadke reassigned DRILL-4125:
-

Assignee: amit hadke

> Illegal argument exception during merge join 
> -
>
> Key: DRILL-4125
> URL: https://issues.apache.org/jira/browse/DRILL-4125
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Victoria Markman
>Assignee: amit hadke
>Priority: Blocker
> Fix For: 1.4.0
>
> Attachments: 29ac59f2-5d92-7378-bf81-e844a300efd7.sys.drill, 
> drillbit.log
>
>
> Same setup as in DRILL-4109
> Query: framework/resources/Advanced/tpcds/tpcds_sf100/original/query93.sql
> Excerpt from drillbit.log
> {code}
> 2015-11-23 23:50:44,071 [29ac59f2-5d92-7378-bf81-e844a300efd7:frag:5:74] 
> ERROR o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: 
> IllegalArgumentException
> Fragment 5:74
> [Error Id: 1ca9758d-1864-4940-9efa-b8906d4f9b52 on atsqa4-133.qa.lab:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> IllegalArgumentException
> Fragment 5:74
> [Error Id: 1ca9758d-1864-4940-9efa-b8906d4f9b52 on atsqa4-133.qa.lab:31010]
> at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:534)
>  ~[drill-common-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:321)
>  [drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:184)
>  [drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:290)
>  [drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_71]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_71]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
> Caused by: java.lang.IllegalArgumentException: null
> at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:76) 
> ~[guava-14.0.1.jar:na]
> at 
> org.apache.drill.exec.record.RecordIterator.getCurrentPosition(RecordIterator.java:242)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.test.generated.JoinWorkerGen8348.doJoin(JoinTemplate.java:63)
>  ~[na:na]
> at 
> org.apache.drill.exec.physical.impl.join.MergeJoinBatch.innerNext(MergeJoinBatch.java:206)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:132)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.innerNext(ExternalSortBatch.java:276)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
> at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>  ~[drill-java-exec-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
>

[jira] [Commented] (DRILL-4047) Select with options

2015-11-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024967#comment-15024967
 ] 

ASF GitHub Bot commented on DRILL-4047:
---

Github user julienledem commented on the pull request:

https://github.com/apache/drill/pull/246#issuecomment-159355075
  
I ran the full test suite. It's green


> Select with options
> ---
>
> Key: DRILL-4047
> URL: https://issues.apache.org/jira/browse/DRILL-4047
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Relational Operators
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>
> Add a mechanism to pass parameters down to the StoragePlugin when writing a 
> Select statement.
> Some discussion here:
> http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAO%2Bvc4AcGK3%2B3QYvQV1-xPPdpG3Tc%2BfG%3D0xDGEUPrhd6ktHv5Q%40mail.gmail.com%3E
> http://mail-archives.apache.org/mod_mbox/drill-dev/201511.mbox/%3ccao+vc4clzylvjevisfjqtcyxb-zsmfy4bqrm-jhbidwzgqf...@mail.gmail.com%3E





[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Mehant Baid (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024951#comment-15024951
 ] 

Mehant Baid commented on DRILL-4119:


If we are returning different values from the original implementation, then I 
feel we should fix that issue. I can help identify the differences.

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
> Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of 
> length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02  HashAgg(group=[{0}])
> 01-03    Project(SomeId=[$0])
> 01-04      HashToRandomExchange(dist0=[[$0]])
> 02-01        UnorderedMuxExchange
> 03-01          Project(SomeId=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02            HashAgg(group=[{0}])
> 03-03              Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}





[jira] [Comment Edited] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024913#comment-15024913
 ] 

Aman Sinha edited comment on DRILL-4119 at 11/24/15 5:32 PM:
-

Our hash64 implementation looks similar to the original one but I haven't done 
enough analysis to say they are exactly the same.  The only way to check is 
through testing.  Here are 2 values and their corresponding hash from the 
original (note, for some reason the command line utility xxh64sum does not read 
multiple lines from a file, so I had to break up the values into separate 
files): 
{noformat}
$ cat sample1.csv
1a883d005e0ce003b918d737ac697e7c

$ cat sample2.csv
e4b4388e8865819126cb0e4dcaa7261d

$ ./xxh64sum sample1.csv
1213a50f060e0659  sample1.csv

$ ./xxh64sum sample2.csv
e0658433041ce9aa  sample2.csv
{noformat}

These values don't match the value I am getting from Drill  after doing the 
conversion of the long to hex (I used Long.toHexString() method in debugger to 
convert), so it is possible something may have gotten lost in translation. 
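
When comparing the two outputs, it may help to bring both sides into the same 
representation first. The sketch below is only an illustration (the class and 
method names are hypothetical, not part of Drill or xxHash). It pads the hex 
form to 16 digits, because Long.toHexString() drops leading zeros, and it 
parses the xxh64sum digests back into signed Java longs. Separately, if the 
sample files end with a newline, xxh64sum hashes that byte too, whereas the 
column value read by Drill presumably does not contain it, so that is worth 
ruling out before concluding the implementations differ.

```java
public class HashHex {
    // Render a signed Java long as 16 zero-padded lowercase hex digits,
    // the same width that xxh64sum prints.
    public static String toHex16(long hash) {
        return String.format("%016x", hash);
    }

    // Parse a 16-digit hex digest into a signed long. Digests whose first
    // hex digit is 8 or higher come back as negative numbers.
    public static long fromHex(String hex) {
        return Long.parseUnsignedLong(hex, 16);
    }

    public static void main(String[] args) {
        // Digests reported by xxh64sum in this comment.
        String[] digests = {"1213a50f060e0659", "e0658433041ce9aa"};
        for (String d : digests) {
            long asLong = fromHex(d);
            System.out.println(d + " -> " + asLong + " -> " + toHex16(asLong));
        }
    }
}
```

The printed decimal values can then be compared directly with the hash64 
column that Drill returns for the same strings.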


was (Author: amansinha100):
Our hash64 implementation looks similar to the original one but I haven't done 
enough analysis to say they are exactly the same.  The only way to check is 
through testing.  Here are 2 values and their corresponding hash from the 
original (note, for some reason the command line utility xxh64sum does not read 
multiple lines from a file, so I had to break up the values into separate 
files): 
{noformat}
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat > sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample1.csv
1a883d005e0ce003b918d737ac697e7c
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample1.csv
1213a50f060e0659  sample1.csv
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample2.csv
e0658433041ce9aa  sample2.csv
{noformat}

These values don't match the value I am getting from Drill  after doing the 
conversion of the long to hex (I used Long.toHexString() method in debugger to 
convert), so it is possible something may have gotten lost in translation. 



[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024913#comment-15024913
 ] 

Aman Sinha commented on DRILL-4119:
---

Our hash64 implementation looks similar to the original one but I haven't done 
enough analysis to say they are exactly the same.  The only way to check is 
through testing.  Here are 2 values and their corresponding hash from the 
original (note, for some reason the command line utility xxh64sum does not read 
multiple lines from a file, so I had to break up the values into separate 
files): 
{noformat}
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat > sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample1.csv
1a883d005e0ce003b918d737ac697e7c
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample1.csv
1213a50f060e0659  sample1.csv
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample2.csv
e0658433041ce9aa  sample2.csv
{noformat}

These values don't match the value I am getting from Drill  after doing the 
conversion of the long to hex (I used Long.toHexString() method in debugger to 
convert), so it is possible something may have gotten lost in translation. 



[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Zelaine Fong (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024866#comment-15024866
 ] 

Zelaine Fong commented on DRILL-4119:
-

[~amansinha100] - when you say "the original XXHash's C implementation and 
based on an initial analysis that one produces different hash value than our 
implementation and does not seem to have the same 'even number' pattern", are 
you saying that our current hash64 implementation is different from the 
original one?  If yes, does that mean something got lost in the translation 
from C to Java?



[jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data

2015-11-24 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024802#comment-15024802
 ] 

Aman Sinha commented on DRILL-4119:
---

I did some more testing with the sample data.  Here are 3 hash values: 
 - hash64 is the native hash64 computed by XXHash.hash64()
 - hash64_downcast is the same value downcast to int
 - newhash is the new 32-bit hash value computed by the proposed fix 
(combining the first and last 4 bytes of hash64).  
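
The difference between the last two values can be sketched as follows. This 
is not the code of the proposed fix: the comment does not say how the two 
4-byte halves are combined, so XOR is used here purely as an assumption, and 
the class and method names are hypothetical.

```java
public class HashFold {
    // Plain narrowing cast: keeps the low 32 bits and discards the high 32.
    public static int downcast(long hash64) {
        return (int) hash64;
    }

    // Fold both halves into 32 bits. ASSUMPTION: the halves are combined
    // with XOR; the actual fix may combine them differently.
    public static int fold(long hash64) {
        return (int) (hash64 ^ (hash64 >>> 32));
    }

    public static void main(String[] args) {
        // Two inputs that differ only in their high 32 bits.
        long a = 0x0000000A00000000L;
        long b = 0x0000000B00000000L;
        System.out.println(downcast(a) + " " + downcast(b)); // both collapse to the same int
        System.out.println(fold(a) + " " + fold(b));         // the high bits still contribute
    }
}
```

The point of folding is that bits from the upper half of the 64-bit value 
still influence the 32-bit result, which a plain downcast throws away.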

{noformat}
0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0]) as hash64, 
castInt(hash64(columns[0])) as hash64_downcast, hash32(columns[0]) as newhash 
from dfs.`/Users/asinha/data/sample.csv`;
+-----------------------------------+----------------------+------------------+--------------+
| id                                | hash64               | hash64_downcast  | newhash      |
+-----------------------------------+----------------------+------------------+--------------+
| 1a883d005e0ce003b918d737ac697e7c  | 6695077304582944118  | 594687350        | 2140898336   |
| e4b4388e8865819126cb0e4dcaa7261d  | 2614721709087477964  | -2136387380      | -1528820922  |
| 639a06fb09c70cc397666d38a8134af5  | 3943910117127083836  | 359520060        | 601244263    |
| ae03f853f40c307aa24894e414a6dfdc  | 4320987148691340574  | 214334750        | 925976565    |
| 2dd3fdace36431e3810437bee1c7e3f1  | 5657579594883017754  | -1719653350      | -687608144   |
| 00abdb137380e6ea8cb3e67df40c30dd  | 5039129256017100358  | 573406790        | 1740892954   |
| d65d4e30ec96a588e82847aca619e4a0  | 550451582126160076   | 716077260        | 755884032    |
| 956f968866b3151ad472edfcafb579fa  | 39366413145792912    | 1336074640       | 1328101915   |
| 75577f830d12c86fd1de94d45cfa0715  | 6480730101791620276  | -226984780       | -1417129724  |
| 298aa703dbee9e5f303372fe7a764975  | 7844015280248941602  | -2013696990      | -350034316   |
+-----------------------------------+----------------------+------------------+--------------+
10 rows selected (0.228 seconds)
{noformat}

A key observation is that all hash64 values are even numbers.   This is not a 
good thing.  I confirmed the behavior over a larger sample of 100 rows.  
However, this seems specific to strings that are 32 chars (or maybe longer, 
although a simple test for a 64 char string did not show the same pattern).  
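
To illustrate why all-even hash values matter, here is a small sketch. It 
assumes that rows are routed to receivers by taking the hash modulo the 
number of receivers; that routing rule is an assumption for illustration and 
is not confirmed here as what HashToRandomExchange does. The class name is 
hypothetical. Because a narrowing cast from long to int keeps the lowest bit, 
the same effect would survive the castInt step.

```java
import java.util.Arrays;

public class EvenHashSkew {
    // hash64 values (seed 0) copied from the query output above; all are even.
    public static final long[] HASHES = {
        6695077304582944118L, 2614721709087477964L, 3943910117127083836L,
        4320987148691340574L, 5657579594883017754L, 5039129256017100358L,
        550451582126160076L,  39366413145792912L,   6480730101791620276L,
        7844015280248941602L
    };

    // ASSUMED routing rule: bucket = hash mod n (floorMod keeps it non-negative).
    public static int[] bucketCounts(long[] hashes, int n) {
        int[] counts = new int[n];
        for (long h : hashes) {
            counts[(int) Math.floorMod(h, (long) n)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // With an even number of buckets, the odd-numbered buckets stay empty.
        System.out.println(Arrays.toString(bucketCounts(HASHES, 4)));
    }
}
```

Under that assumption, half of the receivers would get no rows at all 
whenever the receiver count is even, which is consistent with the skew 
reported in this issue.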

I then modified the seed value to 1 (default is 0).  This time I got better 
distribution for the hash64:
{noformat}
0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0], 1) as 
hash64, castInt(hash64(columns[0], 1)) as hash64_downcast, hash32(columns[0], 
1) as newhash from dfs.`/Users/asinha/data/sample.csv`;
+-----------------------------------+----------------------+------------------+--------------+
| id                                | hash64               | hash64_downcast  | newhash      |
+-----------------------------------+----------------------+------------------+--------------+
| 1a883d005e0ce003b918d737ac697e7c  | 3877569168361489241  | 1211204441       | 2113824708   |
| e4b4388e8865819126cb0e4dcaa7261d  | 510472474498931      | 567154547        | 1826042916   |
| 639a06fb09c70cc397666d38a8134af5  | 6160367672898924663  | 1827713143       | 965653941    |
| ae03f853f40c307aa24894e414a6dfdc  | 5573714012720216212  | 533608596        | 1385691081   |
| 2dd3fdace36431e3810437bee1c7e3f1  | 4742615352245986962  | 284141202        | 1363050779   |
| 00abdb137380e6ea8cb3e67df40c30dd  | 5870154798330275502  | 185067182        | 1517362206   |
| d65d4e30ec96a588e82847aca619e4a0  | 5469776233948339425  | 828202209        | 2058735712   |
| 956f968866b3151ad472edfcafb579fa  | 8671446365158603789  | -1675527155      | -462006645   |
| 75577f830d12c86fd1de94d45cfa0715  | 3369914886384026207  | 238584415        | 553440739    |
| 298aa703dbee9e5f303372fe7a764975  | 3765901389360033496  | 1811181272       | 1605846404   |
+-----------------------------------+----------------------+------------------+--------------+
10 rows selected (0.263 seconds)
{noformat}

I am thinking we should put in the proposed fix I sent earlier since it 
improves things.  Separately, I think we need to investigate the quality of the 
XXHash.hash64 implementation.   BTW, I also downloaded the original XXHash's C 
implementation and based on an initial analysis that one produces different 
hash value than our implementation and does not seem to have the same 'even 
number' pattern. 

> Skew in hash distribution for varchar (and possibly other) types of data
> 
>
> Key: DRILL-4119
> URL: https://issues.apache.org/jira/browse/DRILL-4119
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Functions - Drill
>Affects Versions: 1.3.0
>Reporter: Aman Sinha
>