[jira] [Created] (DRILL-5023) ExternalSortBatch does not spill fully, throws off spill calculations

2016-11-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5023:
--

 Summary: ExternalSortBatch does not spill fully, throws off spill 
calculations
 Key: DRILL-5023
 URL: https://issues.apache.org/jira/browse/DRILL-5023
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor


The {{ExternalSortBatch}} (ESB) operator sorts records, spilling to disk as 
needed to operate within a defined memory budget.

When needed, ESB spills accumulated record batches to disk. However, when doing 
so, the ESB carves off the first spillable batch and holds it in memory:

{{code}}
// 1 output container is kept in memory, so we want to hold on to it and 
transferClone
// allows keeping ownership
VectorContainer c1 = VectorContainer.getTransferClone(outputContainer, 
oContext);
c1.buildSchema(BatchSchema.SelectionVectorMode.NONE);
c1.setRecordCount(count);
...
BatchGroup newGroup = new BatchGroup(c1, fs, outputFile, oContext);
}}

When the spill batch size gets larger (to fix DRILL-5022), the result is that 
nothing is spilled as the first spillable batch is simply stored back into 
memory on the (supposedly) spilled batches list.

The desired behavior is for all spillable batches to be written to disk. If the 
first batch is held back to work around some issue (to keep a schema, say?), 
then fine a different solution that allows the actual data to spill.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-5022) ExternalSortBatch sets two different limits for "copier" memory

2016-11-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5022:
--

 Summary: ExternalSortBatch sets two different limits for "copier" 
memory
 Key: DRILL-5022
 URL: https://issues.apache.org/jira/browse/DRILL-5022
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor


The {{ExternalSortBatch}} (ESB) operator sorts rows and supports spilling to 
disk to operate within a set memory budget.

A key step in disk-based sorting is to merge "runs" of previously-sorted 
records. ESB does this with a class created from the 
{{PriorityQueueCopierTemplate}}, called the "copier" in the code.

The sort runs are represented by record batches, each with an indirection 
vector (AKA {{SelectionVector}}) that point to the records in sort order.

The copier restructures the incoming runs: copying from the original batches 
(from positions given by the indirection vector) into new output vectors in 
sorted order. To do this work, the copier must allocate new vectors to hold the 
merged data. These vectors consume memory, and must fit into the overall memory 
budget assigned to the ESB.

As it turns out, the ESB code has two conflicting ways of setting the limit. 
One is hard-coded:
{code}
  private static final int COPIER_BATCH_MEM_LIMIT = 256 * 1024;
{code}

The other comes from config parameters:
{code}
  public static final long INITIAL_ALLOCATION = 10_000_000;
  public static final long MAX_ALLOCATION = 20_000_000;

copierAllocator = oAllocator.newChildAllocator(oAllocator.getName() + 
":copier",
PriorityQueueCopier.INITIAL_ALLOCATION, 
PriorityQueueCopier.MAX_ALLOCATION);
{code}

Strangely, the config parameters are used to set aside memory for the copier to 
use. But, the {{COPIER_BATCH_MEM_LIMIT}} is used to determine how large of a 
merged batch to actually create.

The result is that we set aside 10 MB of memory, but use only 256K of it, 
wasting 9 MB.

This ticket asks to:

* Determine the proper merged batch size.
* Use that limit to set the memory allocation for the copier.

Elsewhere in Drill batch sizes tend to be on the order of 32K records. In the 
ESB, the low {{COPIER_BATCH_MEM_LIMIT}} tends to favor smaller batches: A test 
case has a row width of 114 bytes, and produces batches of just 2299 records. 
So, likely the proper choice is the larger 10 MB memory allocator limit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-5021) ExternalSortBatch redundantly redefines the batch schema

2016-11-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5021:
--

 Summary: ExternalSortBatch redundantly redefines the batch schema
 Key: DRILL-5021
 URL: https://issues.apache.org/jira/browse/DRILL-5021
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor


Much code in the {{ExternalSortBatch}} (ESB) deals with building vector batches 
and schemas. However, ESB cannot handle schema changes. The only valid schema 
difference is the same field path in a different position in the vector array. 
Given this restriction, the code can be simplified (and sped up) by exploiting 
the fact that all batches are required to have the same conceptual schema (same 
set of fields, but perhaps in different vector order) and most probably, the 
same physical schema (same fields and same vector order.) Note that, because of 
the way that the {{getValueVectorId()}} method works, each lookup of a value 
vector is an O(n) operation, so that each remapping of vectors is O(n^2).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-5020) ExernalSortBatch has inconsistent notions of the memory limit

2016-11-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5020:
--

 Summary: ExernalSortBatch has inconsistent notions of the memory 
limit
 Key: DRILL-5020
 URL: https://issues.apache.org/jira/browse/DRILL-5020
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor


The memory limit cases for the spill-needed test seem inconsistent:

For the test for in-memory sort:
{code}
long currentlyAvailable =  popConfig.getMaxAllocation() - 
oAllocator.getAllocatedMemory();
{code}

For reaching the memory limit:
{code}
oAllocator.getAllocatedMemory() > .95 * oAllocator.getLimit()
{code}

That is, one uses {{oAllocator.getLimit}} ("the current maximum limit this 
allocator imposes"), the other uses {{popConfig.getMaxAllocation}} ("The 
maximum memory this operator can allocate".)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-5019) ExternalSortBatch spills all batches to disk even if even one spills

2016-11-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5019:
--

 Summary: ExternalSortBatch spills all batches to disk even if even 
one spills
 Key: DRILL-5019
 URL: https://issues.apache.org/jira/browse/DRILL-5019
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor


The ExternalSortBatch (ESB) operator sorts batches while spilling to disk to 
stay within a defined memory budget.

Assume the memory budget is 10 GB. Assume that the actual volume of data to be 
sorted is 10.1 GB. The ESB spills the extra 0.1 GB to disk. (Actually spills 
more than that, say 5 GB.)

At the completion of the run, ESB has read all incoming batches. It must now 
merge those batches. It does so by spilling **all** batches to disk, then doing 
a disk-based merge.

This means that exceeding the memory limit by even a small amount is the same 
as having a very low memory limit: all batches must spill.

This solution is simple, it works, and has some amount of logic.

But, it would be better to have a slightly more advanced solution that spills 
only the smallest possible set of batches to disk, then does a hybrid 
in-memory, on-disk merge, saving the unnecessary write/read cycle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-5018) Metadata cache has duplicate columnTypeInfo values

2016-11-08 Thread Robert Hou (JIRA)
Robert Hou created DRILL-5018:
-

 Summary: Metadata cache has duplicate columnTypeInfo values
 Key: DRILL-5018
 URL: https://issues.apache.org/jira/browse/DRILL-5018
 Project: Apache Drill
  Issue Type: Bug
  Components: Metadata
Affects Versions: 1.8.0
Reporter: Robert Hou
Assignee: Parth Chandra


This lineitem table has duplicate entries in its metadata file, although the 
entries have slightly different values.  This lineitem table uses 
directory-based partitioning on year and month.

  "columnTypeInfo" : {
"L_RETURNFLAG" : {
  "name" : [ "L_RETURNFLAG" ],
  "primitiveType" : "BINARY",
  "originalType" : null,
  "precision" : 0,
  "scale" : 0,
  "repetitionLevel" : 0,
  "definitionLevel" : 1
},

"l_returnflag" : {
  "name" : [ "l_returnflag" ],
  "primitiveType" : "BINARY",
  "originalType" : "UTF8",
  "precision" : 0,
  "scale" : 0,
  "repetitionLevel" : 0,
  "definitionLevel" : 0
},


This lineitem table has two entries in its metadata file for each column, but 
the two entries have different column names (adding a zero).  It also has 
slightly different values.  This lineitem table was created using CTAS with the 
first table above.

"l_shipinstruct" : {
  "name" : [ "l_shipinstruct" ],
  "primitiveType" : "BINARY",
  "originalType" : "UTF8",
  "precision" : 0,
  "scale" : 0,
  "repetitionLevel" : 0,
  "definitionLevel" : 0
},

"L_SHIPINSTRUCT0" : {
  "name" : [ "L_SHIPINSTRUCT0" ],
  "primitiveType" : "BINARY",
  "originalType" : null,
  "precision" : 0,
  "scale" : 0,
  "repetitionLevel" : 0,
  "definitionLevel" : 1
},






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87107869
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
--- End diff --

Should keep the current name. Agree that "drillbit" should be documented 
has having the same form as zk:

drillbit=node1[:port][,node2[:port]]*

The new form is backward compatible with the old form.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #647: DRILL-4935 Allow Drill to advertise a specific host...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/647#discussion_r87103376
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/service/ServiceEngine.java 
---
@@ -140,9 +140,19 @@ private static BufferAllocator newAllocator(
 name, context.getConfig().getLong(initReservation), 
context.getConfig().getLong(maxAllocation));
   }
 
+  private String getHostName() throws UnknownHostException{
+// DRILL_HOST_NAME sets custom host name.  See drill-env.sh for 
details.
+String customHost = System.getenv("DRILL_HOST_NAME");
+if (customHost == null) {
+  return useIP ? InetAddress.getLocalHost().getHostAddress() : 
InetAddress.getLocalHost().getCanonicalHostName();
+} else {
+  return customHost;
+}
--- End diff --

Minor code flow suggestion:

if (customHost == null) {
  return customHost;
}
return useIP ? InetAddress.getLocalHost().getHostAddress() : 
InetAddress.getLocalHost().getCanonicalHostName();
 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87097440
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
+   */
+  public void populateEndpointsList(ArrayList 
endpointList, String drillbits){
--- End diff --

ArrayList -> List

Generally, create a specific list, but pass it around as the generic form. 
(Allows changing the implementation without impacting clients.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87101443
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/client/DrillClientSystemTest.java
 ---
@@ -73,4 +77,90 @@ public void testSubmitPlanTwoNodes() throws Exception {
 }
 client.close();
   }
+
+  @Test
+  public void testPopulateEndpointsList() throws Exception{
+
+ArrayList endpointsList = new ArrayList<>();
+String drillBitConnection;
+DrillClient client = new DrillClient();
+DrillbitEndpoint endpoint;
+Iterator endpointIterator;
+
+
+// Test with single drillbit ip
+drillBitConnection = "10.10.100.161";
+client.populateEndpointsList(endpointsList, drillBitConnection);
+endpoint = endpointsList.iterator().next();
+assert(endpointsList.size() == 1);
+assert(endpoint.getAddress().equalsIgnoreCase(drillBitConnection));
+assert(endpoint.getUserPort() == 
client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT));
+
+// Test with single drillbit ip:port
+endpointsList.clear();
+drillBitConnection = "10.10.100.161:5000";
+String[] ipAndPort = drillBitConnection.split(":");
+client.populateEndpointsList(endpointsList, drillBitConnection);
+assert(endpointsList.size() == 1);
+
+endpoint = endpointsList.iterator().next();
+assert(endpoint.getAddress().equalsIgnoreCase(ipAndPort[0]));
+assert(endpoint.getUserPort() == Integer.parseInt(ipAndPort[1]));
+
+// Test with multiple drillbit ip
+endpointsList.clear();
+drillBitConnection = "10.10.100.161,10.10.100.162";
+client.populateEndpointsList(endpointsList, drillBitConnection);
+assert(endpointsList.size() == 2);
+
+endpointIterator = endpointsList.iterator();
+endpoint = endpointIterator.next();
+assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.161"));
+assert(endpoint.getUserPort() == 
client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT));
--- End diff --

Common practice:

assertEquals( expected, actual )

So:

assert(client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT), 
endpoint.getUserPort());


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87100814
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -245,14 +291,15 @@ public synchronized void connect(String connect, 
Properties props) throws RpcExc
   throw new RpcException("Failure setting up ZK for client.", e);
 }
   }
-
-  final ArrayList endpoints = new 
ArrayList<>(clusterCoordinator.getAvailableEndpoints());
-  checkState(!endpoints.isEmpty(), "No DrillbitEndpoint can be found");
-  // shuffle the collection then get the first endpoint
-  Collections.shuffle(endpoints);
-  endpoint = endpoints.iterator().next();
+  endpoints.addAll(clusterCoordinator.getAvailableEndpoints());
 }
 
+// Make sure we have at least one endpoint in the list
+checkState(!endpoints.isEmpty(), "No DrillbitEndpoint can be found");
+// shuffle the collection then get the first endpoint
+Collections.shuffle(endpoints);
+endpoint = endpoints.iterator().next();
--- End diff --

endpoint = endpoints.get( 0 );


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87098478
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
+   */
+  public void populateEndpointsList(ArrayList 
endpointList, String drillbits){
+
+// If no information about drillbits is provided then just return 
empty list.
+if(drillbits == null || drillbits.length() == 0){
--- End diff --

drillbits.length( ) == 0 --> drillbits.trim( ).isEmpty( )

The trim removes any bogus white space. IsEmpty( ) is a shorthand for 
length( ) == 0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87100318
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
+   */
+  public void populateEndpointsList(ArrayList 
endpointList, String drillbits){
+
+// If no information about drillbits is provided then just return 
empty list.
+if(drillbits == null || drillbits.length() == 0){
+  return;
+}
+final String[] connectInfo = drillbits.split(",");
+
+/* For direct connection we can get URL string having drillbit 
property as below:
+ drillbit=: --- Use the IP and port specified as the 
Foreman IP and port
+ drillbit=--- Use the IP specified as the Foreman IP 
with default port in config file
+ drillbit=:,:... --- Randomly select the 
IP and port pair from the specified
+ list as the Foreman 
IP and port.
+
+   Fetch ip address and port information for each drillbit and 
populate the list
+*/
+for(String info : connectInfo){
+  info = info.trim();
+
+  if(info != null){
+// Split each info to get ip address and port value
+final String[] drillbitInfo = info.split(":");
+
+// Check for malformed ip:port string
+if(drillbitInfo == null || drillbitInfo.length == 0){
+  continue;
+}
+
+/* If port is present use that one else use the configured one
+   Assumptions: 1) IP Address provided in connection string is 
valid
+2) Port without IP address is never specified.
+*/
+final String port = (drillbitInfo.length == 2) ? drillbitInfo[1] : 
config.getString(ExecConstants.INITIAL_USER_PORT);
+final DrillbitEndpoint endpoint = DrillbitEndpoint.newBuilder()
+  .setAddress(drillbitInfo[0])
+  
.setUserPort(Integer.parseInt(port))
+  .build();
+endpointList.add(endpoint);
+  }
+}
+  }
+
   public synchronized void connect(String connect, Properties props) 
throws RpcException {
 if (connected) {
   return;
 }
 
 final DrillbitEndpoint endpoint;
+final ArrayList endpoints = new ArrayList<>();
 if (isDirectConnection) {
-  final String[] connectInfo = 
props.getProperty("drillbit").split(":");
-  final String port = 
connectInfo.length==2?connectInfo[1]:config.getString(ExecConstants.INITIAL_USER_PORT);
-  endpoint = DrillbitEndpoint.newBuilder()
-  .setAddress(connectInfo[0])
-  .setUserPort(Integer.parseInt(port))
-  .build();
+  // Populate the endpoints list with all the drillbit information 
provided in the
+  // connection string
+  populateEndpointsList(endpoints, 
props.getProperty("drillbit").trim());
--- End diff --

What happens if "drillbit" is unset? Will the return from getProperty( ) be 
null? If so, the ".trim( )" will throw an NPE. Solution: push the trim into the 
method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87099094
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
+   */
+  public void populateEndpointsList(ArrayList 
endpointList, String drillbits){
+
+// If no information about drillbits is provided then just return 
empty list.
+if(drillbits == null || drillbits.length() == 0){
+  return;
+}
+final String[] connectInfo = drillbits.split(",");
+
+/* For direct connection we can get URL string having drillbit 
property as below:
+ drillbit=: --- Use the IP and port specified as the 
Foreman IP and port
+ drillbit=--- Use the IP specified as the Foreman IP 
with default port in config file
+ drillbit=:,:... --- Randomly select the 
IP and port pair from the specified
+ list as the Foreman 
IP and port.
+
+   Fetch ip address and port information for each drillbit and 
populate the list
+*/
+for(String info : connectInfo){
+  info = info.trim();
+
+  if(info != null){
+// Split each info to get ip address and port value
+final String[] drillbitInfo = info.split(":");
+
+// Check for malformed ip:port string
+if(drillbitInfo == null || drillbitInfo.length == 0){
--- End diff --

Will never be null. Length won't be 0 if the string is non-empty. But, we 
checked the non-empty case. So, the length will be 1 or more. What if the 
length is greater than 2 (drillbit:10:20:30)?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87100146
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
+   */
+  public void populateEndpointsList(ArrayList 
endpointList, String drillbits){
+
+// If no information about drillbits is provided then just return 
empty list.
+if(drillbits == null || drillbits.length() == 0){
+  return;
+}
+final String[] connectInfo = drillbits.split(",");
+
+/* For direct connection we can get URL string having drillbit 
property as below:
+ drillbit=: --- Use the IP and port specified as the 
Foreman IP and port
+ drillbit=--- Use the IP specified as the Foreman IP 
with default port in config file
+ drillbit=:,:... --- Randomly select the 
IP and port pair from the specified
+ list as the Foreman 
IP and port.
+
+   Fetch ip address and port information for each drillbit and 
populate the list
+*/
+for(String info : connectInfo){
+  info = info.trim();
+
+  if(info != null){
+// Split each info to get ip address and port value
+final String[] drillbitInfo = info.split(":");
+
+// Check for malformed ip:port string
+if(drillbitInfo == null || drillbitInfo.length == 0){
+  continue;
+}
+
+/* If port is present use that one else use the configured one
+   Assumptions: 1) IP Address provided in connection string is 
valid
+2) Port without IP address is never specified.
+*/
+final String port = (drillbitInfo.length == 2) ? drillbitInfo[1] : 
config.getString(ExecConstants.INITIAL_USER_PORT);
+final DrillbitEndpoint endpoint = DrillbitEndpoint.newBuilder()
+  .setAddress(drillbitInfo[0])
+  
.setUserPort(Integer.parseInt(port))
+  .build();
+endpointList.add(endpoint);
+  }
+}
+  }
+
   public synchronized void connect(String connect, Properties props) 
throws RpcException {
 if (connected) {
   return;
 }
 
 final DrillbitEndpoint endpoint;
+final ArrayList endpoints = new ArrayList<>();
--- End diff --

final List = ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87098173
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
--- End diff --

One or more drillbit -> drillbits

(Javadoc practice is to use lower case in param comments. See: 
http://www.oracle.com/technetwork/java/javase/documentation/index-137868.html#styleguide)

Please copy detailed description from method body to here so it shows up in 
Javadoc.

I one port allowed? Or, should the user specify all three? If only one, 
what port numbers are used for the other two? If this is documented elsewhere, 
just point to that location.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87100068
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
+   */
+  public void populateEndpointsList(ArrayList 
endpointList, String drillbits){
+
+// If no information about drillbits is provided then just return 
empty list.
+if(drillbits == null || drillbits.length() == 0){
+  return;
+}
+final String[] connectInfo = drillbits.split(",");
+
+/* For direct connection we can get URL string having drillbit 
property as below:
+ drillbit=: --- Use the IP and port specified as the 
Foreman IP and port
+ drillbit=--- Use the IP specified as the Foreman IP 
with default port in config file
+ drillbit=:,:... --- Randomly select the 
IP and port pair from the specified
+ list as the Foreman 
IP and port.
+
+   Fetch ip address and port information for each drillbit and 
populate the list
+*/
+for(String info : connectInfo){
+  info = info.trim();
+
+  if(info != null){
+// Split each info to get ip address and port value
+final String[] drillbitInfo = info.split(":");
+
+// Check for malformed ip:port string
+if(drillbitInfo == null || drillbitInfo.length == 0){
+  continue;
+}
+
+/* If port is present use that one else use the configured one
+   Assumptions: 1) IP Address provided in connection string is 
valid
+2) Port without IP address is never specified.
+*/
+final String port = (drillbitInfo.length == 2) ? drillbitInfo[1] : 
config.getString(ExecConstants.INITIAL_USER_PORT);
+final DrillbitEndpoint endpoint = DrillbitEndpoint.newBuilder()
+  .setAddress(drillbitInfo[0])
+  
.setUserPort(Integer.parseInt(port))
--- End diff --

parseInt( ) will throw an (unchecked) exception if the port is not an 
integer. That is, will throw an error for "host:foo".

This raises an interesting question, how are we doing error handling? Just 
ignoring invalid entries? Should we throw a uniform exception instead? 
InvalidArgumentException or something Drill-specific?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87099827
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
+   */
+  public void populateEndpointsList(ArrayList 
endpointList, String drillbits){
+
+// If no information about drillbits is provided then just return 
empty list.
+if(drillbits == null || drillbits.length() == 0){
+  return;
+}
+final String[] connectInfo = drillbits.split(",");
+
+/* For direct connection we can get URL string having drillbit 
property as below:
+ drillbit=: --- Use the IP and port specified as the 
Foreman IP and port
+ drillbit=--- Use the IP specified as the Foreman IP 
with default port in config file
+ drillbit=:,:... --- Randomly select the 
IP and port pair from the specified
+ list as the Foreman 
IP and port.
+
+   Fetch ip address and port information for each drillbit and 
populate the list
+*/
+for(String info : connectInfo){
+  info = info.trim();
+
+  if(info != null){
+// Split each info to get ip address and port value
+final String[] drillbitInfo = info.split(":");
+
+// Check for malformed ip:port string
+if(drillbitInfo == null || drillbitInfo.length == 0){
+  continue;
+}
+
+/* If port is present use that one else use the configured one
+   Assumptions: 1) IP Address provided in connection string is 
valid
+2) Port without IP address is never specified.
+*/
+final String port = (drillbitInfo.length == 2) ? drillbitInfo[1] : 
config.getString(ExecConstants.INITIAL_USER_PORT);
--- End diff --

What about "host:" The array will be ["host",""].

And, what about "host: " (space after colon) or "host : 20" (space around 
colon) or "host:20 " (space after port)?

May need a condition that is length == 2 & ! drillbitInfo[1].trim( 
).isEmpty( ).

Also, what happens if length > 2?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87102298
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/client/DrillClientSystemTest.java
 ---
@@ -73,4 +77,90 @@ public void testSubmitPlanTwoNodes() throws Exception {
 }
 client.close();
   }
+
+  @Test
+  public void testPopulateEndpointsList() throws Exception{
+
+ArrayList endpointsList = new ArrayList<>();
+String drillBitConnection;
+DrillClient client = new DrillClient();
+DrillbitEndpoint endpoint;
+Iterator endpointIterator;
+
+
+// Test with single drillbit ip
+drillBitConnection = "10.10.100.161";
+client.populateEndpointsList(endpointsList, drillBitConnection);
+endpoint = endpointsList.iterator().next();
+assert(endpointsList.size() == 1);
+assert(endpoint.getAddress().equalsIgnoreCase(drillBitConnection));
+assert(endpoint.getUserPort() == 
client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT));
+
+// Test with single drillbit ip:port
+endpointsList.clear();
+drillBitConnection = "10.10.100.161:5000";
+String[] ipAndPort = drillBitConnection.split(":");
+client.populateEndpointsList(endpointsList, drillBitConnection);
+assert(endpointsList.size() == 1);
+
+endpoint = endpointsList.iterator().next();
+assert(endpoint.getAddress().equalsIgnoreCase(ipAndPort[0]));
+assert(endpoint.getUserPort() == Integer.parseInt(ipAndPort[1]));
+
+// Test with multiple drillbit ip
+endpointsList.clear();
+drillBitConnection = "10.10.100.161,10.10.100.162";
+client.populateEndpointsList(endpointsList, drillBitConnection);
+assert(endpointsList.size() == 2);
+
+endpointIterator = endpointsList.iterator();
+endpoint = endpointIterator.next();
+assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.161"));
+assert(endpoint.getUserPort() == 
client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT));
+
+endpoint = endpointIterator.next();
+assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.162"));
+assert(endpoint.getUserPort() == 
client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT));
+
+// Test with multiple drillbit ip:port
+endpointsList.clear();
+drillBitConnection = "10.10.100.161:5000,10.10.100.162:5000";
+client.populateEndpointsList(endpointsList, drillBitConnection);
+assert(endpointsList.size() == 2);
+
+endpointIterator = endpointsList.iterator();
+endpoint = endpointIterator.next();
+assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.161"));
+assert(endpoint.getUserPort() == 5000);
+
+endpoint = endpointIterator.next();
+assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.162"));
+assert(endpoint.getUserPort() == 5000);
+
+// Test with multiple drillbit with mix of ip:port and ip
+endpointsList.clear();
+drillBitConnection = "10.10.100.161:5000,10.10.100.162";
+client.populateEndpointsList(endpointsList, drillBitConnection);
+assert(endpointsList.size() == 2);
+
+endpointIterator = endpointsList.iterator();
+endpoint = endpointIterator.next();
+assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.161"));
+assert(endpoint.getUserPort() == 5000);
+
+endpoint = endpointIterator.next();
+assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.162"));
+assert(endpoint.getUserPort() == 
client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT));
+
+// Test with empty string
+endpointsList.clear();
+drillBitConnection = "";
+client.populateEndpointsList(endpointsList, drillBitConnection);
+assert(endpointsList.size() == 0);
+
+
--- End diff --

Test the other strange cases noted in earlier comments: spaces, non-numeric 
ports, missing host, etc:

" foo : 10 ", "foo: ", "foo:", "foo:bar", ":20", etc.

for errors:

{
try { parseEndpoints( "foo:bogus" ); fail( ); }
catch( YourException e ) { }
}



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87098346
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
+   */
+  public void populateEndpointsList(ArrayList 
endpointList, String drillbits){
+
+// If no information about drillbits is provided then just return 
empty list.
--- End diff --

no information about --> no drillbits


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87097059
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
--- End diff --

Thanks for the Javadoc!

Function to populate -> Populate

information of all the -> the list of


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87100765
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -245,14 +291,15 @@ public synchronized void connect(String connect, 
Properties props) throws RpcExc
   throw new RpcException("Failure setting up ZK for client.", e);
 }
   }
-
-  final ArrayList endpoints = new 
ArrayList<>(clusterCoordinator.getAvailableEndpoints());
-  checkState(!endpoints.isEmpty(), "No DrillbitEndpoint can be found");
-  // shuffle the collection then get the first endpoint
-  Collections.shuffle(endpoints);
-  endpoint = endpoints.iterator().next();
+  endpoints.addAll(clusterCoordinator.getAvailableEndpoints());
 }
 
+// Make sure we have at least one endpoint in the list
+checkState(!endpoints.isEmpty(), "No DrillbitEndpoint can be found");
--- End diff --

If this error is shown to the user, perhaps help them out a bit:

No Drillbit endpoints found in either the connect string 
(drillbit=host[:port]) or in the drill-override.conf file.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87101810
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/client/DrillClientSystemTest.java
 ---
@@ -73,4 +77,90 @@ public void testSubmitPlanTwoNodes() throws Exception {
 }
 client.close();
   }
+
+  @Test
+  public void testPopulateEndpointsList() throws Exception{
+
+ArrayList endpointsList = new ArrayList<>();
+String drillBitConnection;
+DrillClient client = new DrillClient();
+DrillbitEndpoint endpoint;
+Iterator endpointIterator;
+
+
+// Test with single drillbit ip
+drillBitConnection = "10.10.100.161";
+client.populateEndpointsList(endpointsList, drillBitConnection);
+endpoint = endpointsList.iterator().next();
+assert(endpointsList.size() == 1);
+assert(endpoint.getAddress().equalsIgnoreCase(drillBitConnection));
+assert(endpoint.getUserPort() == 
client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT));
+
+// Test with single drillbit ip:port
+endpointsList.clear();
--- End diff --

Having to remember `endpointsList.clear( )` every time is error-prone. 
Better:

  {
  String drillBitConnection = "10.10.100.161";
  List endpointList = parseEndpoints( drillbitConnection 
);
  ... // Your tests here
}

The above creates a new block, and new case-specific variables, of each 
specific test. Less chance of one test "bleeding through" into the next one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87098807
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
+   */
+  public void populateEndpointsList(ArrayList 
endpointList, String drillbits){
+
+// If no information about drillbits is provided then just return 
empty list.
+if(drillbits == null || drillbits.length() == 0){
+  return;
+}
+final String[] connectInfo = drillbits.split(",");
+
+/* For direct connection we can get URL string having drillbit 
property as below:
+ drillbit=: --- Use the IP and port specified as the 
Foreman IP and port
+ drillbit=--- Use the IP specified as the Foreman IP 
with default port in config file
+ drillbit=:,:... --- Randomly select the 
IP and port pair from the specified
+ list as the Foreman 
IP and port.
+
+   Fetch ip address and port information for each drillbit and 
populate the list
+*/
+for(String info : connectInfo){
+  info = info.trim();
+
+  if(info != null){
--- End diff --

info will never be null here. Did you mean info.isEmpty( )? That will catch 
"foo,,bar".

info -> drillbit (The word info is a bit generic...)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87098611
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
+   */
+  public void populateEndpointsList(ArrayList 
endpointList, String drillbits){
+
+// If no information about drillbits is provided then just return 
empty list.
+if(drillbits == null || drillbits.length() == 0){
+  return;
+}
+final String[] connectInfo = drillbits.split(",");
+
+/* For direct connection we can get URL string having drillbit 
property as below:
--- End diff --

Cool! Best to put this into the Javadoc comment for this function where it 
can be seen in the Javadoc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/648#discussion_r87100583
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -223,19 +223,65 @@ public void connect(Properties props) throws 
RpcException {
 connect(null, props);
   }
 
+  /**
+   * Function to populate the endpointList with information of all the 
drillbits
+   * provided in the connection string by client
+   * @param endpointList - ArrayList of DrillbitEndpoints
+   * @param drillbits - One or more drillbit ip[:port] provided in 
connection string
+   */
+  public void populateEndpointsList(ArrayList 
endpointList, String drillbits){
+
+// If no information about drillbits is provided then just return 
empty list.
+if(drillbits == null || drillbits.length() == 0){
+  return;
+}
+final String[] connectInfo = drillbits.split(",");
+
+/* For direct connection we can get URL string having drillbit 
property as below:
+ drillbit=: --- Use the IP and port specified as the 
Foreman IP and port
+ drillbit=--- Use the IP specified as the Foreman IP 
with default port in config file
+ drillbit=:,:... --- Randomly select the 
IP and port pair from the specified
+ list as the Foreman 
IP and port.
+
+   Fetch ip address and port information for each drillbit and 
populate the list
+*/
+for(String info : connectInfo){
+  info = info.trim();
+
+  if(info != null){
+// Split each info to get ip address and port value
+final String[] drillbitInfo = info.split(":");
+
+// Check for malformed ip:port string
+if(drillbitInfo == null || drillbitInfo.length == 0){
+  continue;
+}
+
+/* If port is present use that one else use the configured one
+   Assumptions: 1) IP Address provided in connection string is 
valid
+2) Port without IP address is never specified.
+*/
+final String port = (drillbitInfo.length == 2) ? drillbitInfo[1] : 
config.getString(ExecConstants.INITIAL_USER_PORT);
+final DrillbitEndpoint endpoint = DrillbitEndpoint.newBuilder()
+  .setAddress(drillbitInfo[0])
+  
.setUserPort(Integer.parseInt(port))
+  .build();
+endpointList.add(endpoint);
+  }
+}
+  }
+
   public synchronized void connect(String connect, Properties props) 
throws RpcException {
 if (connected) {
   return;
 }
 
 final DrillbitEndpoint endpoint;
+final ArrayList endpoints = new ArrayList<>();
 if (isDirectConnection) {
-  final String[] connectInfo = 
props.getProperty("drillbit").split(":");
-  final String port = 
connectInfo.length==2?connectInfo[1]:config.getString(ExecConstants.INITIAL_USER_PORT);
-  endpoint = DrillbitEndpoint.newBuilder()
-  .setAddress(connectInfo[0])
-  .setUserPort(Integer.parseInt(port))
-  .build();
+  // Populate the endpoints list with all the drillbit information 
provided in the
+  // connection string
+  populateEndpointsList(endpoints, 
props.getProperty("drillbit").trim());
--- End diff --

Actually, the code here would be cleaner as:

final List endpoints = parseEndpoints( 
props.getProperty("drillbit") );


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #648: DRILL-5015: Randomly select the drillbit from the list pro...

2016-11-08 Thread sohami
Github user sohami commented on the issue:

https://github.com/apache/drill/pull/648
  
@kkhatua - The unit test I have written is just for validating the newly 
introduced "populateEndpointsList" method. I am not actually connecting to the 
drillbit from within the test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...

2016-11-08 Thread sohami
GitHub user sohami opened a pull request:

https://github.com/apache/drill/pull/648

DRILL-5015: Randomly select the drillbit from the list provided by user in 
connection string

@sudheeshkatkam - Please help to review the changes

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sohami/drill DRILL-5015

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/648.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #648


commit e5ef1b62233b7f21868a094813d429a7b8afc5b1
Author: Sorabh Hamirwasia 
Date:   2016-11-08T19:27:57Z

DRILL-5015: Randomly select the drillbit from the list provided by user in 
connection string




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #647: DRILL-4935 Allow Drill to advertise a specific hostname to...

2016-11-08 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/647
  
Two ways to test.

First, try the fix manually. Start Drill once without the env var set to 
ensure Drill uses the default. Then, set the env var and ensure that Drill uses 
the correct value.

Second, would be a unit test. I'm not familiar with any existing ZK unit 
tests. Do any of the other committers know if we have such tests? Something 
that validates Drillbit registration, etc? If so, then that test can be 
modified for this use case.

But, there is one gotcha: in Java, env vars are read-only, there is no way 
to set env vars in a test (that I've ever found.)  Anyone have a work-around?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #647: DRILL-4935 Allow Drill to advertise a specific hostname to...

2016-11-08 Thread harrisonmebane
Github user harrisonmebane commented on the issue:

https://github.com/apache/drill/pull/647
  
I have made the changes suggested by @paul-rogers .  Are there any unit 
tests we should add?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (DRILL-5017) Config param drill.exec.sort.external.batch.size is not used

2016-11-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5017:
--

 Summary: Config param drill.exec.sort.external.batch.size is not 
used
 Key: DRILL-5017
 URL: https://issues.apache.org/jira/browse/DRILL-5017
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor


The Drill config file defines the {{drill.exec.sort.external.batch.size}} 
parameter, as does {{ExecConstants}}:

{code}
  String EXTERNAL_SORT_TARGET_BATCH_SIZE = 
"drill.exec.sort.external.batch.size";
{code}

However, this parameter is never used. It seems to be a duplicate of:

{code}
  String EXTERNAL_SORT_TARGET_SPILL_BATCH_SIZE = 
"drill.exec.sort.external.spill.batch.size";
{code}

Which, itself, is never used.

Remove these parameters from {{ExecConstants}}, {{drill-module.conf}}, 
{{drill-override-example.conf}} (if they appear in those files) and from the 
documentation (if they appear in the docs.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-5016) drill.exec.sort.purge.threshold is misnamed

2016-11-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5016:
--

 Summary: drill.exec.sort.purge.threshold is misnamed
 Key: DRILL-5016
 URL: https://issues.apache.org/jira/browse/DRILL-5016
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor


The Drill config system provides a property called 
{{drill.exec.sort.purge.threshold}}. The name suggests that this is a parameter 
related to sorting. Perhaps it controls something having to do with when we 
purge buffered batches from memory in the ExternalSortBatch?

In fact, this is actually {{drill.exec.topn.purge-threshold}} - it affects only 
the Top-N operator, not sort.

To make this change, rename the config attribute in {{ExecConstants}} from

{code}
  String BATCH_PURGE_THRESHOLD = "drill.exec.sort.purge.threshold";
{code}

to:

{code}
  String TOP_N_PURGE_THRESHOLD = "drill.exec.topn.purge-threshold";
{code}

To permit backward compatibility, modify the use in TopNBatch to check the old 
value, use it if set, else use the new value.

{code}
// Check pre x.y config parameter for backward compatibility.

if ( ! context.getConfig( ).isEmpty( "drill.exec.sort.purge.threshold" ) ) {
  batchPurgeThreshold = 
context.getConfig().getInt("drill.exec.sort.purge.threshold");
} else {
  batchPurgeThreshold = context.getConfig().getInt(ExecConstants. 
TOP_N_PURGE_THRESHOLD);
}
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-5015) As per documentation, when issuing a list of drillbits in the connection string, we always attempt to connect only to the first one

2016-11-08 Thread Sorabh Hamirwasia (JIRA)
Sorabh Hamirwasia created DRILL-5015:


 Summary: As per documentation, when issuing a list of drillbits in 
the connection string, we always attempt to connect only to the first one
 Key: DRILL-5015
 URL: https://issues.apache.org/jira/browse/DRILL-5015
 Project: Apache Drill
  Issue Type: Bug
  Components: Client - JDBC
Affects Versions: 1.8.0, 1.9.0
Reporter: Sorabh Hamirwasia
Assignee: Sorabh Hamirwasia


When trying to connect to a Drill cluster by specifying more than 1 drillbits 
to connect to, we always attempt to connect to only the first drillbit.
As an example, we tested against a pair of drillbits, but we always connect to 
the first entry in the CSV list by querying for the 'current' drillbit. The 
remaining entries are never attempted.
[root@pssc-60 agileSqlPerfTests]# /opt/mapr/drill/drill-1.8.0/bin/sqlline  -u  
"jdbc:drill:schema=dfs.tmp;drillbit=pssc-61:31010,pssc-62:31010" -f whereAmI.q  
| grep -v logback

1/1  select * from sys.drillbits where `current`;
+-++---++--+
|hostname | user_port  | control_port  | data_port  | current  |
+-++---++--+
| pssc-61.qa.lab  | 31010  | 31011 | 31012  | true |
+-++---++--+
1 row selected (0.265 seconds)
Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl
apache drill 1.8.0 
"a little sql for your nosql"
This property is meant for use by clients when not wanting to overload the ZK 
for fetching a list of existing Drillbits, but the behaviour doesn't match the 
documentation. 
Making a Direct Drillbit Connection
We ned too raandomly shuffle between this list and If an entry in the shuffled 
list is unreachable, we need to try for the next entry in the list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-5014) ExternalSortBatch spills only half the batches requested in config

2016-11-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5014:
--

 Summary: ExternalSortBatch spills only half the batches requested 
in config
 Key: DRILL-5014
 URL: https://issues.apache.org/jira/browse/DRILL-5014
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor


The ExternalSortBatch (ESB) operator sorts data while spilling to disk to 
remain within a defined memory footprint. Spilling occurs based on a number of 
factors. Among those are two config parameters:

* {{drill.exec.sort.external.spill.group.size}}: The number of batches to spill 
per spill event.
* {{drill.exec.sort.external.spill.threshold}}: The number of batches to 
accumulate in memory before starting a spill event.

The expected behavior would be:

* When the accumulated number of batches exceeds the threshold, spill the 
requested number of batches.

That is if the threshold is 200, and the size is 150, we should accumulate 200 
batches, then spill 150 of them (leaving 50) and repeat.

The actual behavior is:

* When the accumulated records exceeds the threshold and more than "batch size" 
new batches have arrived since the last spill,
* Spill half the accumulated records.

So, the actual behavior for the (threshold=200, size=150) case is:

{code}
Before spilling, buffered batch count: 201
After spilling, buffered batch count: 101
Before spilling, buffered batch count: 251
After spilling, buffered batch count: 126
Before spilling, buffered batch count: 276
After spilling, buffered batch count: 138
Before spilling, buffered batch count: 288
After spilling, buffered batch count: 144
Before spilling, buffered batch count: 294
After spilling, buffered batch count: 147
Before spilling, buffered batch count: 297
After spilling, buffered batch count: 149
Before spilling, buffered batch count: 299
After spilling, buffered batch count: 150
Before spilling, buffered batch count: 300
After spilling, buffered batch count: 150
Before spilling, buffered batch count: 300
{code}

In short, the actual number of batches retained in memory is twice the spill 
size, **NOT** the number set in the threshold. As a result, the ESB operator 
will use more memory than expected.

The work-around is to set a batch size that is half the threshold so that the 
batch size (used in spill decisions) matches the actual spill count (as 
implemented by the code.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Proposed November report for Drill

2016-11-08 Thread Gautam Parai
+1

On Tue, Nov 8, 2016 at 9:59 AM, Jinfeng Ni  wrote:

> +1
>
>
>
> On Tue, Nov 8, 2016 at 9:31 AM, Sudheesh Katkam 
> wrote:
> > +1
> >
> >> On Nov 7, 2016, at 10:35 PM, Parth Chandra  wrote:
> >>
> >> Hi Drill devs,
> >>
> >> Below is the proposed report for the November Apache board meeting.
> Please
> >> provide any comments/corrections. I'll be submitting this Wednesday Nov
> >> 9th, so please provide input before that.
> >>
> >> Thanks
> >>
> >> Parth
> >>
> >> --- Report Begin -
> >>
> >> ## Description:
> >> - Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud
> >> Storage
> >>
> >> ## Issues:
> >>
> >> - There are no issues requiring board attention at this time
> >>
> >> ## Activity:
> >>
> >> - Since the last board report, Drill has released version 1.8
> >> - Drill has added many new features referred to in the last report.
> >> Dynamic UDFs, Parquet reader performance improvements, filter pushdown
> for
> >> Parquet, and improved support for Metadata in the clients has been
> added.
> >> - Improved use of statistics, and security enhancements (including
> support
> >> for Kerberos) continue to be in the works. Also in progress is an
> >> improvement to the data locality algorithm.
> >>
> >> ## Health report:
> >>
> >> - There has been a good increase in the number posts in the dev and jira
> >> lists. This reflects the increased activity on the development front.
> User
> >> list activity is down this period, but not a concern at the moment.
> >>
> >> ## PMC changes:
> >>
> >> - Currently 18 PMC members.
> >> - Sudheesh Katkam was added to the PMC on Wed Oct 05 2016
> >>
> >> ## Committer base changes:
> >>
> >> - Currently 28 committers.
> >> - No new committers added in the last 3 months
> >> - Last committer addition was Hsuan-Yi Chu at Thu Apr 07 2016
> >>
> >> ## Releases:
> >>
> >> - 1.8.0 was released on Mon Aug 29 2016
> >>
> >> ## Mailing list activity:
> >>
> >> - dev@drill.apache.org:
> >>- 436 subscribers (down -10 in the last 3 months):
> >>- 1797 emails sent to list (1231 in previous quarter)
> >>
> >> - iss...@drill.apache.org:
> >>- 20 subscribers (up 0 in the last 3 months):
> >>- 2188 emails sent to list (1550 in previous quarter)
> >>
> >> - u...@drill.apache.org:
> >>- 567 subscribers (down -14 in the last 3 months):
> >>- 436 emails sent to list (824 in previous quarter)
> >>
> >> ## JIRA activity:
> >>
> >> - 173 JIRA tickets created in the last 3 months
> >> - 88 JIRA tickets closed/resolved in the last 3 months
> >>
> >> --- Report End -
> >
>


[jira] [Created] (DRILL-5013) Heap allocation, data copies in UDLE write path for ExternalSortBatch

2016-11-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5013:
--

 Summary: Heap allocation, data copies in UDLE write path for 
ExternalSortBatch
 Key: DRILL-5013
 URL: https://issues.apache.org/jira/browse/DRILL-5013
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor


The ExternalSortBatch (ESB) uses spill-to-disk to sort a large collection of 
records within a limited memory footprint.

As part of writing data to disk, ESB writes each of a target byte buffer to 
disk. Since the vector is stored in direct memory (not visible to an output 
stream), the code path first makes a temporary on-heap copy.

In particular the code in `io.netty.buffer.PooledUnsafeDirectByteBuf` does the 
following:

{code}
@Override
public ByteBuf getBytes(int index, OutputStream out, int length) throws 
IOException {
checkIndex(index, length);
if (length != 0) {
byte[] tmp = new byte[length];
PlatformDependent.copyMemory(addr(index), tmp, 0, length);
out.write(tmp);
}
return this;
}
{code}

The result is that we 1) create a large number of on-heap objects, and 2) copy 
the data twice: once from direct memory to the tmp buffer, and from the tmp 
buffer into the output stream's own buffer.

Two optimizations are possible:

1. Copy the data byte-by-byte from the direct memory buffer to the output 
stream, or
2. Reuse the same tmp buffer across vector writes.

Since the code is in Netty, if we do either of the above, we'd have to write 
our own "getBytes" (misnomer, really write bytes) method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Proposed November report for Drill

2016-11-08 Thread Jinfeng Ni
+1



On Tue, Nov 8, 2016 at 9:31 AM, Sudheesh Katkam  wrote:
> +1
>
>> On Nov 7, 2016, at 10:35 PM, Parth Chandra  wrote:
>>
>> Hi Drill devs,
>>
>> Below is the proposed report for the November Apache board meeting. Please
>> provide any comments/corrections. I'll be submitting this Wednesday Nov
>> 9th, so please provide input before that.
>>
>> Thanks
>>
>> Parth
>>
>> --- Report Begin -
>>
>> ## Description:
>> - Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud
>> Storage
>>
>> ## Issues:
>>
>> - There are no issues requiring board attention at this time
>>
>> ## Activity:
>>
>> - Since the last board report, Drill has released version 1.8
>> - Drill has added many new features referred to in the last report.
>> Dynamic UDFs, Parquet reader performance improvements, filter pushdown for
>> Parquet, and improved support for Metadata in the clients has been added.
>> - Improved use of statistics, and security enhancements (including support
>> for Kerberos) continue to be in the works. Also in progress is an
>> improvement to the data locality algorithm.
>>
>> ## Health report:
>>
>> - There has been a good increase in the number posts in the dev and jira
>> lists. This reflects the increased activity on the development front. User
>> list activity is down this period, but not a concern at the moment.
>>
>> ## PMC changes:
>>
>> - Currently 18 PMC members.
>> - Sudheesh Katkam was added to the PMC on Wed Oct 05 2016
>>
>> ## Committer base changes:
>>
>> - Currently 28 committers.
>> - No new committers added in the last 3 months
>> - Last committer addition was Hsuan-Yi Chu at Thu Apr 07 2016
>>
>> ## Releases:
>>
>> - 1.8.0 was released on Mon Aug 29 2016
>>
>> ## Mailing list activity:
>>
>> - dev@drill.apache.org:
>>- 436 subscribers (down -10 in the last 3 months):
>>- 1797 emails sent to list (1231 in previous quarter)
>>
>> - iss...@drill.apache.org:
>>- 20 subscribers (up 0 in the last 3 months):
>>- 2188 emails sent to list (1550 in previous quarter)
>>
>> - u...@drill.apache.org:
>>- 567 subscribers (down -14 in the last 3 months):
>>- 436 emails sent to list (824 in previous quarter)
>>
>> ## JIRA activity:
>>
>> - 173 JIRA tickets created in the last 3 months
>> - 88 JIRA tickets closed/resolved in the last 3 months
>>
>> --- Report End -
>


[jira] [Created] (DRILL-5012) External Sort Batch does not check free space before writing

2016-11-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5012:
--

 Summary: External Sort Batch does not check free space before 
writing
 Key: DRILL-5012
 URL: https://issues.apache.org/jira/browse/DRILL-5012
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor


The ExternalSortBatch operator sorts data and spills to disk to control memory 
usage. The ESB accepts a configured set of output directories.

The ESB does not check if sufficient space is available in a directory before 
writing to it; ESB requires that the admin ensure space is available.

Desired fix: skip a directory if it does not contain space to hold the target 
spill file. (That is, write to the other directories if that have space.)

The fix is not fail-safe: two threads may both check and both write. However, 
the fix reduces the chances of an out-of-disk error for the most obvious cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Proposed November report for Drill

2016-11-08 Thread Sudheesh Katkam
+1

> On Nov 7, 2016, at 10:35 PM, Parth Chandra  wrote:
> 
> Hi Drill devs,
> 
> Below is the proposed report for the November Apache board meeting. Please
> provide any comments/corrections. I'll be submitting this Wednesday Nov
> 9th, so please provide input before that.
> 
> Thanks
> 
> Parth
> 
> --- Report Begin -
> 
> ## Description:
> - Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud
> Storage
> 
> ## Issues:
> 
> - There are no issues requiring board attention at this time
> 
> ## Activity:
> 
> - Since the last board report, Drill has released version 1.8
> - Drill has added many new features referred to in the last report.
> Dynamic UDFs, Parquet reader performance improvements, filter pushdown for
> Parquet, and improved support for Metadata in the clients has been added.
> - Improved use of statistics, and security enhancements (including support
> for Kerberos) continue to be in the works. Also in progress is an
> improvement to the data locality algorithm.
> 
> ## Health report:
> 
> - There has been a good increase in the number posts in the dev and jira
> lists. This reflects the increased activity on the development front. User
> list activity is down this period, but not a concern at the moment.
> 
> ## PMC changes:
> 
> - Currently 18 PMC members.
> - Sudheesh Katkam was added to the PMC on Wed Oct 05 2016
> 
> ## Committer base changes:
> 
> - Currently 28 committers.
> - No new committers added in the last 3 months
> - Last committer addition was Hsuan-Yi Chu at Thu Apr 07 2016
> 
> ## Releases:
> 
> - 1.8.0 was released on Mon Aug 29 2016
> 
> ## Mailing list activity:
> 
> - dev@drill.apache.org:
>- 436 subscribers (down -10 in the last 3 months):
>- 1797 emails sent to list (1231 in previous quarter)
> 
> - iss...@drill.apache.org:
>- 20 subscribers (up 0 in the last 3 months):
>- 2188 emails sent to list (1550 in previous quarter)
> 
> - u...@drill.apache.org:
>- 567 subscribers (down -14 in the last 3 months):
>- 436 emails sent to list (824 in previous quarter)
> 
> ## JIRA activity:
> 
> - 173 JIRA tickets created in the last 3 months
> - 88 JIRA tickets closed/resolved in the last 3 months
> 
> --- Report End -



[jira] [Created] (DRILL-5011) External Sort Batch memory use depends on record width

2016-11-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5011:
--

 Summary: External Sort Batch memory use depends on record width
 Key: DRILL-5011
 URL: https://issues.apache.org/jira/browse/DRILL-5011
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor


The ExternalSortBatch operator uses spill-to-disk to keep memory needs within a 
defined limit. However, the "copier" (really, the merge operation) can use an 
amount of memory determined not by the operator configuration, but by the width 
of each record.

The copier memory limit appears to be set by the COPIER_BATCH_MEM_LIMIT value.

However, the actual memory use is determined by the number of records that the 
copier is asked to copy. That record comes from an estimate of row width based 
on the type of each column. Note that the row width *is not* based on the 
actual data in each row. Varchar fields, for example, are assumed to be 40 
characters wide. If the sorter is asked to sort records with Varchar fields of, 
say, 1000 characters, then the row width estimate will be a poor estimator of 
actual width.

Memory use is based on a

{code}
target record count = memory limit / estimate row width
{code}
Actual memory use is:
{code}
memory use = target row count * actual row width
{code}
Which is
{code}
memory use = memory limit * actual row width  / estimate row width
{code}

That is, memory use depends on the ratio of actual to estimated width. If the 
estimate is off by 2, then we use twice as much memory as expected.

Not that the memory used for the copier defaults to 20 MB, so even an error of 
4x still means only 80 MB of memory used; small in comparison to the many GB 
typically allocated to ESB storage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] drill issue #644: DRILL-4980: Upgrading of the approach of parquet date corr...

2016-11-08 Thread vdiravka
Github user vdiravka commented on the issue:

https://github.com/apache/drill/pull/644
  
@paul-rogers Could you please review? I answered on the last comment and 
rebased the branch with resolving conflicts (ParquetTableMetadata_v3 was added)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (DRILL-5010) Equality join condition is treated as a MergeJoin and not as a HashJoin.

2016-11-08 Thread Khurram Faraaz (JIRA)
Khurram Faraaz created DRILL-5010:
-

 Summary: Equality join condition is treated as a MergeJoin and not 
as a HashJoin.
 Key: DRILL-5010
 URL: https://issues.apache.org/jira/browse/DRILL-5010
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Flow
Affects Versions: 1.9.0
Reporter: Khurram Faraaz


Equality join condition is treated as a MergeJoin and not as a HashJoin.
Drill 1.9.0 git commit ID: 83513daf

Projecting the join columns results in merge join, whereas it should be doing a 
HashJoin.
{noformat}
0: jdbc:drill:schema=dfs.tmp> explain plan for
. . . . . . . . . . . . . . > select t1.intKey, t2.intKey from `left.json` t1, 
`right_all_nulls.json` t2 
WHERE t1.intKey = t2.intKey OR ( t1.intKey IS NULL AND t2.intKey IS NULL);
+--+--+
| text | json |
+--+--+
| 00-00Screen
00-01  Project(intKey=[$0], intKey0=[$1])
00-02Project(intKey=[$0], intKey0=[$1])
00-03  MergeJoin(condition=[IS NOT DISTINCT FROM($0, $1)], 
joinType=[inner])
00-05SelectionVectorRemover
00-07  Sort(sort0=[$0], dir0=[ASC])
00-09Scan(groupscan=[EasyGroupScan 
[selectionRoot=maprfs:/tmp/left.json, numFiles=1, columns=[`intKey`], 
files=[maprfs:///tmp/left.json]]])
00-04Project(intKey0=[$0])
00-06  SelectionVectorRemover
00-08Sort(sort0=[$0], dir0=[ASC])
00-10  Scan(groupscan=[EasyGroupScan 
[selectionRoot=maprfs:/tmp/right_all_nulls.json, numFiles=1, 
columns=[`intKey`], files=[maprfs:///tmp/right_all_nulls.json]]])
{noformat}

Note that HashAgg and HashJoin were enabled.
{noformat}
| planner.enable_hashagg | BOOLEAN  | SYSTEM  | DEFAULT  | 
null| null| true  | null   |
| planner.enable_hashjoin| BOOLEAN  | SYSTEM  | DEFAULT  | 
null| null| true  | null   |
{noformat}

Doing a SELECT  results in a HashJoin in the query plan

{noformat}
0: jdbc:drill:schema=dfs.tmp> explain plan for
. . . . . . . . . . . . . . > select * from `left.json` t1, 
`right_all_nulls.json` t2 
WHERE t1.intKey = t2.intKey OR ( t1.intKey IS NULL AND t2.intKey IS NULL);
+--+--+
| text | json |
+--+--+
| 00-00Screen
00-01  ProjectAllowDup(*=[$0], *0=[$1])
00-02Project(T46¦¦*=[$0], T47¦¦*=[$2])
00-03  HashJoin(condition=[IS NOT DISTINCT FROM($1, $3)], 
joinType=[inner])
00-04Project(T47¦¦*=[$0], intKey0=[$1])
00-06  Project(T47¦¦*=[$0], intKey=[$1])
00-08Scan(groupscan=[EasyGroupScan 
[selectionRoot=maprfs:/tmp/right_all_nulls.json, numFiles=1, columns=[`*`], 
files=[maprfs:///tmp/right_all_nulls.json]]])
00-05Project(T46¦¦*=[$0], intKey=[$1])
00-07  Scan(groupscan=[EasyGroupScan 
[selectionRoot=maprfs:/tmp/left.json, numFiles=1, columns=[`*`], 
files=[maprfs:///tmp/left.json]]])
{noformat}

Data used in above queries

{noformat}
[root@centos-01 null_eq_joins]# cat left.json
{
"intKey" : 123,
"bgintKey": 1234567,
"strKey": "this is a test string",
"boolKey": true,
"fltKey": 123.786,
"dblKey": 457.984,
"timKey": "18:30:45",
"dtKey": "1997-10-21",
"tmstmpKey": "2007-04-30 13:10:02.047",
"intrvldyKey": "P9DT38833S",
"intrvlyrKey": "P255M"
}
[root@centos-01 null_equality_joins]#

[root@centos-01 null_eq_joins]# cat right_all_nulls.json
{
 "intKey" : null,
 "bgintKey": null,
 "strKey": null,
 "boolKey": null,
 "fltKey": null,
 "dblKey": null,
 "timKey": null,
 "dtKey": null,
 "tmstmpKey": null,
 "intrvldyKey": null,
 "intrvlyrKey": null
}
[root@centos-01 null_eq_joins]#
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Query JSON that has null as value for each key

2016-11-08 Thread Khurram Faraaz
Hi All,

Drill 1.9.0 git commit ID : 83513daf

Drill returns same result with or without `store.json.all_text_mode`=true

[root@cent01 null_eq_joins]# cat right_all_nulls.json
{
 "intKey" : null,
 "bgintKey": null,
 "strKey": null,
 "boolKey": null,
 "fltKey": null,
 "dblKey": null,
 "timKey": null,
 "dtKey": null,
 "tmstmpKey": null,
 "intrvldyKey": null,
 "intrvlyrKey": null
}
[root@cent01 null_eq_joins]#

Querying the above JSON file results in null as query result.
 -  We should see each of the keys in the JSON as a column in query result.
 -  And in each column the value should be a null value.
Current behavior does not look right.

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `right_all_nulls.json`;
+---+
|   *   |
+---+
| null  |
+---+
1 row selected (0.313 seconds)
{noformat}

Thanks,
Khurram