[jira] [Created] (DRILL-5023) ExternalSortBatch does not spill fully, throws off spill calculations
Paul Rogers created DRILL-5023: -- Summary: ExternalSortBatch does not spill fully, throws off spill calculations Key: DRILL-5023 URL: https://issues.apache.org/jira/browse/DRILL-5023 Project: Apache Drill Issue Type: Bug Affects Versions: 1.8.0 Reporter: Paul Rogers Priority: Minor The {{ExternalSortBatch}} (ESB) operator sorts records, spilling to disk as needed to operate within a defined memory budget. When needed, ESB spills accumulated record batches to disk. However, when doing so, the ESB carves off the first spillable batch and holds it in memory: {{code}} // 1 output container is kept in memory, so we want to hold on to it and transferClone // allows keeping ownership VectorContainer c1 = VectorContainer.getTransferClone(outputContainer, oContext); c1.buildSchema(BatchSchema.SelectionVectorMode.NONE); c1.setRecordCount(count); ... BatchGroup newGroup = new BatchGroup(c1, fs, outputFile, oContext); }} When the spill batch size gets larger (to fix DRILL-5022), the result is that nothing is spilled as the first spillable batch is simply stored back into memory on the (supposedly) spilled batches list. The desired behavior is for all spillable batches to be written to disk. If the first batch is held back to work around some issue (to keep a schema, say?), then fine a different solution that allows the actual data to spill. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-5022) ExternalSortBatch sets two different limits for "copier" memory
Paul Rogers created DRILL-5022: -- Summary: ExternalSortBatch sets two different limits for "copier" memory Key: DRILL-5022 URL: https://issues.apache.org/jira/browse/DRILL-5022 Project: Apache Drill Issue Type: Bug Affects Versions: 1.8.0 Reporter: Paul Rogers Priority: Minor The {{ExternalSortBatch}} (ESB) operator sorts rows and supports spilling to disk to operate within a set memory budget. A key step in disk-based sorting is to merge "runs" of previously-sorted records. ESB does this with a class created from the {{PriorityQueueCopierTemplate}}, called the "copier" in the code. The sort runs are represented by record batches, each with an indirection vector (AKA {{SelectionVector}}) that point to the records in sort order. The copier restructures the incoming runs: copying from the original batches (from positions given by the indirection vector) into new output vectors in sorted order. To do this work, the copier must allocate new vectors to hold the merged data. These vectors consume memory, and must fit into the overall memory budget assigned to the ESB. As it turns out, the ESB code has two conflicting ways of setting the limit. One is hard-coded: {code} private static final int COPIER_BATCH_MEM_LIMIT = 256 * 1024; {code} The other comes from config parameters: {code} public static final long INITIAL_ALLOCATION = 10_000_000; public static final long MAX_ALLOCATION = 20_000_000; copierAllocator = oAllocator.newChildAllocator(oAllocator.getName() + ":copier", PriorityQueueCopier.INITIAL_ALLOCATION, PriorityQueueCopier.MAX_ALLOCATION); {code} Strangely, the config parameters are used to set aside memory for the copier to use. But, the {{COPIER_BATCH_MEM_LIMIT}} is used to determine how large of a merged batch to actually create. The result is that we set aside 10 MB of memory, but use only 256K of it, wasting 9 MB. This ticket asks to: * Determine the proper merged batch size. * Use that limit to set the memory allocation for the copier. Elsewhere in Drill batch sizes tend to be on the order of 32K records. In the ESB, the low {{COPIER_BATCH_MEM_LIMIT}} tends to favor smaller batches: A test case has a row width of 114 bytes, and produces batches of just 2299 records. So, likely the proper choice is the larger 10 MB memory allocator limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-5021) ExternalSortBatch redundantly redefines the batch schema
Paul Rogers created DRILL-5021: -- Summary: ExternalSortBatch redundantly redefines the batch schema Key: DRILL-5021 URL: https://issues.apache.org/jira/browse/DRILL-5021 Project: Apache Drill Issue Type: Bug Affects Versions: 1.8.0 Reporter: Paul Rogers Priority: Minor Much code in the {{ExternalSortBatch}} (ESB) deals with building vector batches and schemas. However, ESB cannot handle schema changes. The only valid schema difference is the same field path in a different position in the vector array. Given this restriction, the code can be simplified (and sped up) by exploiting the fact that all batches are required to have the same conceptual schema (same set of fields, but perhaps in different vector order) and most probably, the same physical schema (same fields and same vector order.) Note that, because of the way that the {{getValueVectorId()}} method works, each lookup of a value vector is an O(n) operation, so that each remapping of vectors is O(n^2). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-5020) ExernalSortBatch has inconsistent notions of the memory limit
Paul Rogers created DRILL-5020: -- Summary: ExernalSortBatch has inconsistent notions of the memory limit Key: DRILL-5020 URL: https://issues.apache.org/jira/browse/DRILL-5020 Project: Apache Drill Issue Type: Bug Affects Versions: 1.8.0 Reporter: Paul Rogers Priority: Minor The memory limit cases for the spill-needed test seem inconsistent: For the test for in-memory sort: {code} long currentlyAvailable = popConfig.getMaxAllocation() - oAllocator.getAllocatedMemory(); {code} For reaching the memory limit: {code} oAllocator.getAllocatedMemory() > .95 * oAllocator.getLimit() {code} That is, one uses {{oAllocator.getLimit}} ("the current maximum limit this allocator imposes"), the other uses {{popConfig.getMaxAllocation}} ("The maximum memory this operator can allocate".) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-5019) ExternalSortBatch spills all batches to disk even if even one spills
Paul Rogers created DRILL-5019: -- Summary: ExternalSortBatch spills all batches to disk even if even one spills Key: DRILL-5019 URL: https://issues.apache.org/jira/browse/DRILL-5019 Project: Apache Drill Issue Type: Improvement Affects Versions: 1.8.0 Reporter: Paul Rogers Priority: Minor The ExternalSortBatch (ESB) operator sorts batches while spilling to disk to stay within a defined memory budget. Assume the memory budget is 10 GB. Assume that the actual volume of data to be sorted is 10.1 GB. The ESB spills the extra 0.1 GB to disk. (Actually spills more than that, say 5 GB.) At the completion of the run, ESB has read all incoming batches. It must now merge those batches. It does so by spilling **all** batches to disk, then doing a disk-based merge. This means that exceeding the memory limit by even a small amount is the same as having a very low memory limit: all batches must spill. This solution is simple, it works, and has some amount of logic. But, it would be better to have a slightly more advanced solution that spills only the smallest possible set of batches to disk, then does a hybrid in-memory, on-disk merge, saving the unnecessary write/read cycle. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-5018) Metadata cache has duplicate columnTypeInfo values
Robert Hou created DRILL-5018: - Summary: Metadata cache has duplicate columnTypeInfo values Key: DRILL-5018 URL: https://issues.apache.org/jira/browse/DRILL-5018 Project: Apache Drill Issue Type: Bug Components: Metadata Affects Versions: 1.8.0 Reporter: Robert Hou Assignee: Parth Chandra This lineitem table has duplicate entries in its metadata file, although the entries have slightly different values. This lineitem table uses directory-based partitioning on year and month. "columnTypeInfo" : { "L_RETURNFLAG" : { "name" : [ "L_RETURNFLAG" ], "primitiveType" : "BINARY", "originalType" : null, "precision" : 0, "scale" : 0, "repetitionLevel" : 0, "definitionLevel" : 1 }, "l_returnflag" : { "name" : [ "l_returnflag" ], "primitiveType" : "BINARY", "originalType" : "UTF8", "precision" : 0, "scale" : 0, "repetitionLevel" : 0, "definitionLevel" : 0 }, This lineitem table has two entries in its metadata file for each column, but the two entries have different column names (adding a zero). It also has slightly different values. This lineitem table was created using CTAS with the first table above. "l_shipinstruct" : { "name" : [ "l_shipinstruct" ], "primitiveType" : "BINARY", "originalType" : "UTF8", "precision" : 0, "scale" : 0, "repetitionLevel" : 0, "definitionLevel" : 0 }, "L_SHIPINSTRUCT0" : { "name" : [ "L_SHIPINSTRUCT0" ], "primitiveType" : "BINARY", "originalType" : null, "precision" : 0, "scale" : 0, "repetitionLevel" : 0, "definitionLevel" : 1 }, -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87107869 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string --- End diff -- Should keep the current name. Agree that "drillbit" should be documented has having the same form as zk: drillbit=node1[:port][,node2[:port]]* The new form is backward compatible with the old form. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #647: DRILL-4935 Allow Drill to advertise a specific host...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/647#discussion_r87103376 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/service/ServiceEngine.java --- @@ -140,9 +140,19 @@ private static BufferAllocator newAllocator( name, context.getConfig().getLong(initReservation), context.getConfig().getLong(maxAllocation)); } + private String getHostName() throws UnknownHostException{ +// DRILL_HOST_NAME sets custom host name. See drill-env.sh for details. +String customHost = System.getenv("DRILL_HOST_NAME"); +if (customHost == null) { + return useIP ? InetAddress.getLocalHost().getHostAddress() : InetAddress.getLocalHost().getCanonicalHostName(); +} else { + return customHost; +} --- End diff -- Minor code flow suggestion: if (customHost == null) { return customHost; } return useIP ? InetAddress.getLocalHost().getHostAddress() : InetAddress.getLocalHost().getCanonicalHostName(); --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87097440 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string + */ + public void populateEndpointsList(ArrayList endpointList, String drillbits){ --- End diff -- ArrayList -> List Generally, create a specific list, but pass it around as the generic form. (Allows changing the implementation without impacting clients.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87101443 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/client/DrillClientSystemTest.java --- @@ -73,4 +77,90 @@ public void testSubmitPlanTwoNodes() throws Exception { } client.close(); } + + @Test + public void testPopulateEndpointsList() throws Exception{ + +ArrayList endpointsList = new ArrayList<>(); +String drillBitConnection; +DrillClient client = new DrillClient(); +DrillbitEndpoint endpoint; +Iterator endpointIterator; + + +// Test with single drillbit ip +drillBitConnection = "10.10.100.161"; +client.populateEndpointsList(endpointsList, drillBitConnection); +endpoint = endpointsList.iterator().next(); +assert(endpointsList.size() == 1); +assert(endpoint.getAddress().equalsIgnoreCase(drillBitConnection)); +assert(endpoint.getUserPort() == client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT)); + +// Test with single drillbit ip:port +endpointsList.clear(); +drillBitConnection = "10.10.100.161:5000"; +String[] ipAndPort = drillBitConnection.split(":"); +client.populateEndpointsList(endpointsList, drillBitConnection); +assert(endpointsList.size() == 1); + +endpoint = endpointsList.iterator().next(); +assert(endpoint.getAddress().equalsIgnoreCase(ipAndPort[0])); +assert(endpoint.getUserPort() == Integer.parseInt(ipAndPort[1])); + +// Test with multiple drillbit ip +endpointsList.clear(); +drillBitConnection = "10.10.100.161,10.10.100.162"; +client.populateEndpointsList(endpointsList, drillBitConnection); +assert(endpointsList.size() == 2); + +endpointIterator = endpointsList.iterator(); +endpoint = endpointIterator.next(); +assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.161")); +assert(endpoint.getUserPort() == client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT)); --- End diff -- Common practice: assertEquals( expected, actual ) So: assert(client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT), endpoint.getUserPort()); --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87100814 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -245,14 +291,15 @@ public synchronized void connect(String connect, Properties props) throws RpcExc throw new RpcException("Failure setting up ZK for client.", e); } } - - final ArrayList endpoints = new ArrayList<>(clusterCoordinator.getAvailableEndpoints()); - checkState(!endpoints.isEmpty(), "No DrillbitEndpoint can be found"); - // shuffle the collection then get the first endpoint - Collections.shuffle(endpoints); - endpoint = endpoints.iterator().next(); + endpoints.addAll(clusterCoordinator.getAvailableEndpoints()); } +// Make sure we have at least one endpoint in the list +checkState(!endpoints.isEmpty(), "No DrillbitEndpoint can be found"); +// shuffle the collection then get the first endpoint +Collections.shuffle(endpoints); +endpoint = endpoints.iterator().next(); --- End diff -- endpoint = endpoints.get( 0 ); --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87098478 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string + */ + public void populateEndpointsList(ArrayList endpointList, String drillbits){ + +// If no information about drillbits is provided then just return empty list. +if(drillbits == null || drillbits.length() == 0){ --- End diff -- drillbits.length( ) == 0 --> drillbits.trim( ).isEmpty( ) The trim removes any bogus white space. IsEmpty( ) is a shorthand for length( ) == 0 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87100318 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string + */ + public void populateEndpointsList(ArrayList endpointList, String drillbits){ + +// If no information about drillbits is provided then just return empty list. +if(drillbits == null || drillbits.length() == 0){ + return; +} +final String[] connectInfo = drillbits.split(","); + +/* For direct connection we can get URL string having drillbit property as below: + drillbit=: --- Use the IP and port specified as the Foreman IP and port + drillbit=--- Use the IP specified as the Foreman IP with default port in config file + drillbit=:,:... --- Randomly select the IP and port pair from the specified + list as the Foreman IP and port. + + Fetch ip address and port information for each drillbit and populate the list +*/ +for(String info : connectInfo){ + info = info.trim(); + + if(info != null){ +// Split each info to get ip address and port value +final String[] drillbitInfo = info.split(":"); + +// Check for malformed ip:port string +if(drillbitInfo == null || drillbitInfo.length == 0){ + continue; +} + +/* If port is present use that one else use the configured one + Assumptions: 1) IP Address provided in connection string is valid +2) Port without IP address is never specified. +*/ +final String port = (drillbitInfo.length == 2) ? drillbitInfo[1] : config.getString(ExecConstants.INITIAL_USER_PORT); +final DrillbitEndpoint endpoint = DrillbitEndpoint.newBuilder() + .setAddress(drillbitInfo[0]) + .setUserPort(Integer.parseInt(port)) + .build(); +endpointList.add(endpoint); + } +} + } + public synchronized void connect(String connect, Properties props) throws RpcException { if (connected) { return; } final DrillbitEndpoint endpoint; +final ArrayList endpoints = new ArrayList<>(); if (isDirectConnection) { - final String[] connectInfo = props.getProperty("drillbit").split(":"); - final String port = connectInfo.length==2?connectInfo[1]:config.getString(ExecConstants.INITIAL_USER_PORT); - endpoint = DrillbitEndpoint.newBuilder() - .setAddress(connectInfo[0]) - .setUserPort(Integer.parseInt(port)) - .build(); + // Populate the endpoints list with all the drillbit information provided in the + // connection string + populateEndpointsList(endpoints, props.getProperty("drillbit").trim()); --- End diff -- What happens if "drillbit" is unset? Will the return from getProperty( ) be null? If so, the ".trim( )" will throw an NPE. Solution: push the trim into the method. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87099094 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string + */ + public void populateEndpointsList(ArrayList endpointList, String drillbits){ + +// If no information about drillbits is provided then just return empty list. +if(drillbits == null || drillbits.length() == 0){ + return; +} +final String[] connectInfo = drillbits.split(","); + +/* For direct connection we can get URL string having drillbit property as below: + drillbit=: --- Use the IP and port specified as the Foreman IP and port + drillbit=--- Use the IP specified as the Foreman IP with default port in config file + drillbit=:,:... --- Randomly select the IP and port pair from the specified + list as the Foreman IP and port. + + Fetch ip address and port information for each drillbit and populate the list +*/ +for(String info : connectInfo){ + info = info.trim(); + + if(info != null){ +// Split each info to get ip address and port value +final String[] drillbitInfo = info.split(":"); + +// Check for malformed ip:port string +if(drillbitInfo == null || drillbitInfo.length == 0){ --- End diff -- Will never be null. Length won't be 0 if the string is non-empty. But, we checked the non-empty case. So, the length will be 1 or more. What if the length is greater than 2 (drillbit:10:20:30)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87100146 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string + */ + public void populateEndpointsList(ArrayList endpointList, String drillbits){ + +// If no information about drillbits is provided then just return empty list. +if(drillbits == null || drillbits.length() == 0){ + return; +} +final String[] connectInfo = drillbits.split(","); + +/* For direct connection we can get URL string having drillbit property as below: + drillbit=: --- Use the IP and port specified as the Foreman IP and port + drillbit=--- Use the IP specified as the Foreman IP with default port in config file + drillbit=:,:... --- Randomly select the IP and port pair from the specified + list as the Foreman IP and port. + + Fetch ip address and port information for each drillbit and populate the list +*/ +for(String info : connectInfo){ + info = info.trim(); + + if(info != null){ +// Split each info to get ip address and port value +final String[] drillbitInfo = info.split(":"); + +// Check for malformed ip:port string +if(drillbitInfo == null || drillbitInfo.length == 0){ + continue; +} + +/* If port is present use that one else use the configured one + Assumptions: 1) IP Address provided in connection string is valid +2) Port without IP address is never specified. +*/ +final String port = (drillbitInfo.length == 2) ? drillbitInfo[1] : config.getString(ExecConstants.INITIAL_USER_PORT); +final DrillbitEndpoint endpoint = DrillbitEndpoint.newBuilder() + .setAddress(drillbitInfo[0]) + .setUserPort(Integer.parseInt(port)) + .build(); +endpointList.add(endpoint); + } +} + } + public synchronized void connect(String connect, Properties props) throws RpcException { if (connected) { return; } final DrillbitEndpoint endpoint; +final ArrayList endpoints = new ArrayList<>(); --- End diff -- final List = ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87098173 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string --- End diff -- One or more drillbit -> drillbits (Javadoc practice is to use lower case in param comments. See: http://www.oracle.com/technetwork/java/javase/documentation/index-137868.html#styleguide) Please copy detailed description from method body to here so it shows up in Javadoc. I one port allowed? Or, should the user specify all three? If only one, what port numbers are used for the other two? If this is documented elsewhere, just point to that location. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87100068 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string + */ + public void populateEndpointsList(ArrayList endpointList, String drillbits){ + +// If no information about drillbits is provided then just return empty list. +if(drillbits == null || drillbits.length() == 0){ + return; +} +final String[] connectInfo = drillbits.split(","); + +/* For direct connection we can get URL string having drillbit property as below: + drillbit=: --- Use the IP and port specified as the Foreman IP and port + drillbit=--- Use the IP specified as the Foreman IP with default port in config file + drillbit=:,:... --- Randomly select the IP and port pair from the specified + list as the Foreman IP and port. + + Fetch ip address and port information for each drillbit and populate the list +*/ +for(String info : connectInfo){ + info = info.trim(); + + if(info != null){ +// Split each info to get ip address and port value +final String[] drillbitInfo = info.split(":"); + +// Check for malformed ip:port string +if(drillbitInfo == null || drillbitInfo.length == 0){ + continue; +} + +/* If port is present use that one else use the configured one + Assumptions: 1) IP Address provided in connection string is valid +2) Port without IP address is never specified. +*/ +final String port = (drillbitInfo.length == 2) ? drillbitInfo[1] : config.getString(ExecConstants.INITIAL_USER_PORT); +final DrillbitEndpoint endpoint = DrillbitEndpoint.newBuilder() + .setAddress(drillbitInfo[0]) + .setUserPort(Integer.parseInt(port)) --- End diff -- parseInt( ) will throw an (unchecked) exception if the port is not an integer. That is, will throw an error for "host:foo". This raises an interesting question, how are we doing error handling? Just ignoring invalid entries? Should we throw a uniform exception instead? InvalidArgumentException or something Drill-specific? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87099827 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string + */ + public void populateEndpointsList(ArrayList endpointList, String drillbits){ + +// If no information about drillbits is provided then just return empty list. +if(drillbits == null || drillbits.length() == 0){ + return; +} +final String[] connectInfo = drillbits.split(","); + +/* For direct connection we can get URL string having drillbit property as below: + drillbit=: --- Use the IP and port specified as the Foreman IP and port + drillbit=--- Use the IP specified as the Foreman IP with default port in config file + drillbit=:,:... --- Randomly select the IP and port pair from the specified + list as the Foreman IP and port. + + Fetch ip address and port information for each drillbit and populate the list +*/ +for(String info : connectInfo){ + info = info.trim(); + + if(info != null){ +// Split each info to get ip address and port value +final String[] drillbitInfo = info.split(":"); + +// Check for malformed ip:port string +if(drillbitInfo == null || drillbitInfo.length == 0){ + continue; +} + +/* If port is present use that one else use the configured one + Assumptions: 1) IP Address provided in connection string is valid +2) Port without IP address is never specified. +*/ +final String port = (drillbitInfo.length == 2) ? drillbitInfo[1] : config.getString(ExecConstants.INITIAL_USER_PORT); --- End diff -- What about "host:" The array will be ["host",""]. And, what about "host: " (space after colon) or "host : 20" (space around colon) or "host:20 " (space after port)? May need a condition that is length == 2 & ! drillbitInfo[1].trim( ).isEmpty( ). Also, what happens if length > 2? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87102298 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/client/DrillClientSystemTest.java --- @@ -73,4 +77,90 @@ public void testSubmitPlanTwoNodes() throws Exception { } client.close(); } + + @Test + public void testPopulateEndpointsList() throws Exception{ + +ArrayList endpointsList = new ArrayList<>(); +String drillBitConnection; +DrillClient client = new DrillClient(); +DrillbitEndpoint endpoint; +Iterator endpointIterator; + + +// Test with single drillbit ip +drillBitConnection = "10.10.100.161"; +client.populateEndpointsList(endpointsList, drillBitConnection); +endpoint = endpointsList.iterator().next(); +assert(endpointsList.size() == 1); +assert(endpoint.getAddress().equalsIgnoreCase(drillBitConnection)); +assert(endpoint.getUserPort() == client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT)); + +// Test with single drillbit ip:port +endpointsList.clear(); +drillBitConnection = "10.10.100.161:5000"; +String[] ipAndPort = drillBitConnection.split(":"); +client.populateEndpointsList(endpointsList, drillBitConnection); +assert(endpointsList.size() == 1); + +endpoint = endpointsList.iterator().next(); +assert(endpoint.getAddress().equalsIgnoreCase(ipAndPort[0])); +assert(endpoint.getUserPort() == Integer.parseInt(ipAndPort[1])); + +// Test with multiple drillbit ip +endpointsList.clear(); +drillBitConnection = "10.10.100.161,10.10.100.162"; +client.populateEndpointsList(endpointsList, drillBitConnection); +assert(endpointsList.size() == 2); + +endpointIterator = endpointsList.iterator(); +endpoint = endpointIterator.next(); +assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.161")); +assert(endpoint.getUserPort() == client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT)); + +endpoint = endpointIterator.next(); +assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.162")); +assert(endpoint.getUserPort() == client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT)); + +// Test with multiple drillbit ip:port +endpointsList.clear(); +drillBitConnection = "10.10.100.161:5000,10.10.100.162:5000"; +client.populateEndpointsList(endpointsList, drillBitConnection); +assert(endpointsList.size() == 2); + +endpointIterator = endpointsList.iterator(); +endpoint = endpointIterator.next(); +assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.161")); +assert(endpoint.getUserPort() == 5000); + +endpoint = endpointIterator.next(); +assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.162")); +assert(endpoint.getUserPort() == 5000); + +// Test with multiple drillbit with mix of ip:port and ip +endpointsList.clear(); +drillBitConnection = "10.10.100.161:5000,10.10.100.162"; +client.populateEndpointsList(endpointsList, drillBitConnection); +assert(endpointsList.size() == 2); + +endpointIterator = endpointsList.iterator(); +endpoint = endpointIterator.next(); +assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.161")); +assert(endpoint.getUserPort() == 5000); + +endpoint = endpointIterator.next(); +assert(endpoint.getAddress().equalsIgnoreCase("10.10.100.162")); +assert(endpoint.getUserPort() == client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT)); + +// Test with empty string +endpointsList.clear(); +drillBitConnection = ""; +client.populateEndpointsList(endpointsList, drillBitConnection); +assert(endpointsList.size() == 0); + + --- End diff -- Test the other strange cases noted in earlier comments: spaces, non-numeric ports, missing host, etc: " foo : 10 ", "foo: ", "foo:", "foo:bar", ":20", etc. for errors: { try { parseEndpoints( "foo:bogus" ); fail( ); } catch( YourException e ) { } } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87098346 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string + */ + public void populateEndpointsList(ArrayList endpointList, String drillbits){ + +// If no information about drillbits is provided then just return empty list. --- End diff -- no information about --> no drillbits --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87097059 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits --- End diff -- Thanks for the Javadoc! Function to populate -> Populate information of all the -> the list of --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87100765 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -245,14 +291,15 @@ public synchronized void connect(String connect, Properties props) throws RpcExc throw new RpcException("Failure setting up ZK for client.", e); } } - - final ArrayList endpoints = new ArrayList<>(clusterCoordinator.getAvailableEndpoints()); - checkState(!endpoints.isEmpty(), "No DrillbitEndpoint can be found"); - // shuffle the collection then get the first endpoint - Collections.shuffle(endpoints); - endpoint = endpoints.iterator().next(); + endpoints.addAll(clusterCoordinator.getAvailableEndpoints()); } +// Make sure we have at least one endpoint in the list +checkState(!endpoints.isEmpty(), "No DrillbitEndpoint can be found"); --- End diff -- If this error is shown to the user, perhaps help them out a bit: No Drillbit endpoints found in either the connect string (drillbit=host[:port]) or in the drill-override.conf file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87101810 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/client/DrillClientSystemTest.java --- @@ -73,4 +77,90 @@ public void testSubmitPlanTwoNodes() throws Exception { } client.close(); } + + @Test + public void testPopulateEndpointsList() throws Exception{ + +ArrayList endpointsList = new ArrayList<>(); +String drillBitConnection; +DrillClient client = new DrillClient(); +DrillbitEndpoint endpoint; +Iterator endpointIterator; + + +// Test with single drillbit ip +drillBitConnection = "10.10.100.161"; +client.populateEndpointsList(endpointsList, drillBitConnection); +endpoint = endpointsList.iterator().next(); +assert(endpointsList.size() == 1); +assert(endpoint.getAddress().equalsIgnoreCase(drillBitConnection)); +assert(endpoint.getUserPort() == client.getConfig().getInt(ExecConstants.INITIAL_USER_PORT)); + +// Test with single drillbit ip:port +endpointsList.clear(); --- End diff -- Having to remember `endpointsList.clear( )` every time is error-prone. Better: { String drillBitConnection = "10.10.100.161"; List endpointList = parseEndpoints( drillbitConnection ); ... // Your tests here } The above creates a new block, and new case-specific variables, of each specific test. Less chance of one test "bleeding through" into the next one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87098807 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string + */ + public void populateEndpointsList(ArrayList endpointList, String drillbits){ + +// If no information about drillbits is provided then just return empty list. +if(drillbits == null || drillbits.length() == 0){ + return; +} +final String[] connectInfo = drillbits.split(","); + +/* For direct connection we can get URL string having drillbit property as below: + drillbit=: --- Use the IP and port specified as the Foreman IP and port + drillbit=--- Use the IP specified as the Foreman IP with default port in config file + drillbit=:,:... --- Randomly select the IP and port pair from the specified + list as the Foreman IP and port. + + Fetch ip address and port information for each drillbit and populate the list +*/ +for(String info : connectInfo){ + info = info.trim(); + + if(info != null){ --- End diff -- info will never be null here. Did you mean info.isEmpty( )? That will catch "foo,,bar". info -> drillbit (The word info is a bit generic...) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87098611 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string + */ + public void populateEndpointsList(ArrayList endpointList, String drillbits){ + +// If no information about drillbits is provided then just return empty list. +if(drillbits == null || drillbits.length() == 0){ + return; +} +final String[] connectInfo = drillbits.split(","); + +/* For direct connection we can get URL string having drillbit property as below: --- End diff -- Cool! Best to put this into the Javadoc comment for this function where it can be seen in the Javadoc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/648#discussion_r87100583 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -223,19 +223,65 @@ public void connect(Properties props) throws RpcException { connect(null, props); } + /** + * Function to populate the endpointList with information of all the drillbits + * provided in the connection string by client + * @param endpointList - ArrayList of DrillbitEndpoints + * @param drillbits - One or more drillbit ip[:port] provided in connection string + */ + public void populateEndpointsList(ArrayList endpointList, String drillbits){ + +// If no information about drillbits is provided then just return empty list. +if(drillbits == null || drillbits.length() == 0){ + return; +} +final String[] connectInfo = drillbits.split(","); + +/* For direct connection we can get URL string having drillbit property as below: + drillbit=: --- Use the IP and port specified as the Foreman IP and port + drillbit=--- Use the IP specified as the Foreman IP with default port in config file + drillbit=:,:... --- Randomly select the IP and port pair from the specified + list as the Foreman IP and port. + + Fetch ip address and port information for each drillbit and populate the list +*/ +for(String info : connectInfo){ + info = info.trim(); + + if(info != null){ +// Split each info to get ip address and port value +final String[] drillbitInfo = info.split(":"); + +// Check for malformed ip:port string +if(drillbitInfo == null || drillbitInfo.length == 0){ + continue; +} + +/* If port is present use that one else use the configured one + Assumptions: 1) IP Address provided in connection string is valid +2) Port without IP address is never specified. +*/ +final String port = (drillbitInfo.length == 2) ? drillbitInfo[1] : config.getString(ExecConstants.INITIAL_USER_PORT); +final DrillbitEndpoint endpoint = DrillbitEndpoint.newBuilder() + .setAddress(drillbitInfo[0]) + .setUserPort(Integer.parseInt(port)) + .build(); +endpointList.add(endpoint); + } +} + } + public synchronized void connect(String connect, Properties props) throws RpcException { if (connected) { return; } final DrillbitEndpoint endpoint; +final ArrayList endpoints = new ArrayList<>(); if (isDirectConnection) { - final String[] connectInfo = props.getProperty("drillbit").split(":"); - final String port = connectInfo.length==2?connectInfo[1]:config.getString(ExecConstants.INITIAL_USER_PORT); - endpoint = DrillbitEndpoint.newBuilder() - .setAddress(connectInfo[0]) - .setUserPort(Integer.parseInt(port)) - .build(); + // Populate the endpoints list with all the drillbit information provided in the + // connection string + populateEndpointsList(endpoints, props.getProperty("drillbit").trim()); --- End diff -- Actually, the code here would be cleaner as: final List endpoints = parseEndpoints( props.getProperty("drillbit") ); --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #648: DRILL-5015: Randomly select the drillbit from the list pro...
Github user sohami commented on the issue: https://github.com/apache/drill/pull/648 @kkhatua - The unit test I have written is just for validating the newly introduced "populateEndpointsList" method. I am not actually connecting to the drillbit from within the test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #648: DRILL-5015: Randomly select the drillbit from the l...
GitHub user sohami opened a pull request: https://github.com/apache/drill/pull/648 DRILL-5015: Randomly select the drillbit from the list provided by user in connection string @sudheeshkatkam - Please help to review the changes You can merge this pull request into a Git repository by running: $ git pull https://github.com/sohami/drill DRILL-5015 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/648.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #648 commit e5ef1b62233b7f21868a094813d429a7b8afc5b1 Author: Sorabh HamirwasiaDate: 2016-11-08T19:27:57Z DRILL-5015: Randomly select the drillbit from the list provided by user in connection string --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #647: DRILL-4935 Allow Drill to advertise a specific hostname to...
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/647 Two ways to test. First, try the fix manually. Start Drill once without the env var set to ensure Drill uses the default. Then, set the env var and ensure that Drill uses the correct value. Second, would be a unit test. I'm not familiar with any existing ZK unit tests. Do any of the other committers know if we have such tests? Something that validates Drillbit registration, etc? If so, then that test can be modified for this use case. But, there is one gotcha: in Java, env vars are read-only, there is no way to set env vars in a test (that I've ever found.) Anyone have a work-around? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #647: DRILL-4935 Allow Drill to advertise a specific hostname to...
Github user harrisonmebane commented on the issue: https://github.com/apache/drill/pull/647 I have made the changes suggested by @paul-rogers . Are there any unit tests we should add? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Created] (DRILL-5017) Config param drill.exec.sort.external.batch.size is not used
Paul Rogers created DRILL-5017: -- Summary: Config param drill.exec.sort.external.batch.size is not used Key: DRILL-5017 URL: https://issues.apache.org/jira/browse/DRILL-5017 Project: Apache Drill Issue Type: Bug Affects Versions: 1.8.0 Reporter: Paul Rogers Priority: Minor The Drill config file defines the {{drill.exec.sort.external.batch.size}} parameter, as does {{ExecConstants}}: {code} String EXTERNAL_SORT_TARGET_BATCH_SIZE = "drill.exec.sort.external.batch.size"; {code} However, this parameter is never used. It seems to be a duplicate of: {code} String EXTERNAL_SORT_TARGET_SPILL_BATCH_SIZE = "drill.exec.sort.external.spill.batch.size"; {code} Which, itself, is never used. Remove these parameters from {{ExecConstants}}, {{drill-module.conf}}, {{drill-override-example.conf}} (if they appear in those files) and from the documentation (if they appear in the docs.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-5016) drill.exec.sort.purge.threshold is misnamed
Paul Rogers created DRILL-5016: -- Summary: drill.exec.sort.purge.threshold is misnamed Key: DRILL-5016 URL: https://issues.apache.org/jira/browse/DRILL-5016 Project: Apache Drill Issue Type: Bug Affects Versions: 1.8.0 Reporter: Paul Rogers Priority: Minor The Drill config system provides a property called {{drill.exec.sort.purge.threshold}}. The name suggests that this is a parameter related to sorting. Perhaps it controls something having to do with when we purge buffered batches from memory in the ExternalSortBatch? In fact, this is actually {{drill.exec.topn.purge-threshold}} - it affects only the Top-N operator, not sort. To make this change, rename the config attribute in {{ExecConstants}} from {code} String BATCH_PURGE_THRESHOLD = "drill.exec.sort.purge.threshold"; {code} to: {code} String TOP_N_PURGE_THRESHOLD = "drill.exec.topn.purge-threshold"; {code} To permit backward compatibility, modify the use in TopNBatch to check the old value, use it if set, else use the new value. {code} // Check pre x.y config parameter for backward compatibility. if ( ! context.getConfig( ).isEmpty( "drill.exec.sort.purge.threshold" ) ) { batchPurgeThreshold = context.getConfig().getInt("drill.exec.sort.purge.threshold"); } else { batchPurgeThreshold = context.getConfig().getInt(ExecConstants. TOP_N_PURGE_THRESHOLD); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-5015) As per documentation, when issuing a list of drillbits in the connection string, we always attempt to connect only to the first one
Sorabh Hamirwasia created DRILL-5015: Summary: As per documentation, when issuing a list of drillbits in the connection string, we always attempt to connect only to the first one Key: DRILL-5015 URL: https://issues.apache.org/jira/browse/DRILL-5015 Project: Apache Drill Issue Type: Bug Components: Client - JDBC Affects Versions: 1.8.0, 1.9.0 Reporter: Sorabh Hamirwasia Assignee: Sorabh Hamirwasia When trying to connect to a Drill cluster by specifying more than 1 drillbits to connect to, we always attempt to connect to only the first drillbit. As an example, we tested against a pair of drillbits, but we always connect to the first entry in the CSV list by querying for the 'current' drillbit. The remaining entries are never attempted. [root@pssc-60 agileSqlPerfTests]# /opt/mapr/drill/drill-1.8.0/bin/sqlline -u "jdbc:drill:schema=dfs.tmp;drillbit=pssc-61:31010,pssc-62:31010" -f whereAmI.q | grep -v logback 1/1 select * from sys.drillbits where `current`; +-++---++--+ |hostname | user_port | control_port | data_port | current | +-++---++--+ | pssc-61.qa.lab | 31010 | 31011 | 31012 | true | +-++---++--+ 1 row selected (0.265 seconds) Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl apache drill 1.8.0 "a little sql for your nosql" This property is meant for use by clients when not wanting to overload the ZK for fetching a list of existing Drillbits, but the behaviour doesn't match the documentation. Making a Direct Drillbit Connection We ned too raandomly shuffle between this list and If an entry in the shuffled list is unreachable, we need to try for the next entry in the list. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-5014) ExternalSortBatch spills only half the batches requested in config
Paul Rogers created DRILL-5014: -- Summary: ExternalSortBatch spills only half the batches requested in config Key: DRILL-5014 URL: https://issues.apache.org/jira/browse/DRILL-5014 Project: Apache Drill Issue Type: Bug Affects Versions: 1.8.0 Reporter: Paul Rogers Priority: Minor The ExternalSortBatch (ESB) operator sorts data while spilling to disk to remain within a defined memory footprint. Spilling occurs based on a number of factors. Among those are two config parameters: * {{drill.exec.sort.external.spill.group.size}}: The number of batches to spill per spill event. * {{drill.exec.sort.external.spill.threshold}}: The number of batches to accumulate in memory before starting a spill event. The expected behavior would be: * When the accumulated number of batches exceeds the threshold, spill the requested number of batches. That is if the threshold is 200, and the size is 150, we should accumulate 200 batches, then spill 150 of them (leaving 50) and repeat. The actual behavior is: * When the accumulated records exceeds the threshold and more than "batch size" new batches have arrived since the last spill, * Spill half the accumulated records. So, the actual behavior for the (threshold=200, size=150) case is: {code} Before spilling, buffered batch count: 201 After spilling, buffered batch count: 101 Before spilling, buffered batch count: 251 After spilling, buffered batch count: 126 Before spilling, buffered batch count: 276 After spilling, buffered batch count: 138 Before spilling, buffered batch count: 288 After spilling, buffered batch count: 144 Before spilling, buffered batch count: 294 After spilling, buffered batch count: 147 Before spilling, buffered batch count: 297 After spilling, buffered batch count: 149 Before spilling, buffered batch count: 299 After spilling, buffered batch count: 150 Before spilling, buffered batch count: 300 After spilling, buffered batch count: 150 Before spilling, buffered batch count: 300 {code} In short, the actual number of batches retained in memory is twice the spill size, **NOT** the number set in the threshold. As a result, the ESB operator will use more memory than expected. The work-around is to set a batch size that is half the threshold so that the batch size (used in spill decisions) matches the actual spill count (as implemented by the code.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Proposed November report for Drill
+1 On Tue, Nov 8, 2016 at 9:59 AM, Jinfeng Niwrote: > +1 > > > > On Tue, Nov 8, 2016 at 9:31 AM, Sudheesh Katkam > wrote: > > +1 > > > >> On Nov 7, 2016, at 10:35 PM, Parth Chandra wrote: > >> > >> Hi Drill devs, > >> > >> Below is the proposed report for the November Apache board meeting. > Please > >> provide any comments/corrections. I'll be submitting this Wednesday Nov > >> 9th, so please provide input before that. > >> > >> Thanks > >> > >> Parth > >> > >> --- Report Begin - > >> > >> ## Description: > >> - Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud > >> Storage > >> > >> ## Issues: > >> > >> - There are no issues requiring board attention at this time > >> > >> ## Activity: > >> > >> - Since the last board report, Drill has released version 1.8 > >> - Drill has added many new features referred to in the last report. > >> Dynamic UDFs, Parquet reader performance improvements, filter pushdown > for > >> Parquet, and improved support for Metadata in the clients has been > added. > >> - Improved use of statistics, and security enhancements (including > support > >> for Kerberos) continue to be in the works. Also in progress is an > >> improvement to the data locality algorithm. > >> > >> ## Health report: > >> > >> - There has been a good increase in the number posts in the dev and jira > >> lists. This reflects the increased activity on the development front. > User > >> list activity is down this period, but not a concern at the moment. > >> > >> ## PMC changes: > >> > >> - Currently 18 PMC members. > >> - Sudheesh Katkam was added to the PMC on Wed Oct 05 2016 > >> > >> ## Committer base changes: > >> > >> - Currently 28 committers. > >> - No new committers added in the last 3 months > >> - Last committer addition was Hsuan-Yi Chu at Thu Apr 07 2016 > >> > >> ## Releases: > >> > >> - 1.8.0 was released on Mon Aug 29 2016 > >> > >> ## Mailing list activity: > >> > >> - dev@drill.apache.org: > >>- 436 subscribers (down -10 in the last 3 months): > >>- 1797 emails sent to list (1231 in previous quarter) > >> > >> - iss...@drill.apache.org: > >>- 20 subscribers (up 0 in the last 3 months): > >>- 2188 emails sent to list (1550 in previous quarter) > >> > >> - u...@drill.apache.org: > >>- 567 subscribers (down -14 in the last 3 months): > >>- 436 emails sent to list (824 in previous quarter) > >> > >> ## JIRA activity: > >> > >> - 173 JIRA tickets created in the last 3 months > >> - 88 JIRA tickets closed/resolved in the last 3 months > >> > >> --- Report End - > > >
[jira] [Created] (DRILL-5013) Heap allocation, data copies in UDLE write path for ExternalSortBatch
Paul Rogers created DRILL-5013: -- Summary: Heap allocation, data copies in UDLE write path for ExternalSortBatch Key: DRILL-5013 URL: https://issues.apache.org/jira/browse/DRILL-5013 Project: Apache Drill Issue Type: Improvement Affects Versions: 1.8.0 Reporter: Paul Rogers Priority: Minor The ExternalSortBatch (ESB) uses spill-to-disk to sort a large collection of records within a limited memory footprint. As part of writing data to disk, ESB writes each of a target byte buffer to disk. Since the vector is stored in direct memory (not visible to an output stream), the code path first makes a temporary on-heap copy. In particular the code in `io.netty.buffer.PooledUnsafeDirectByteBuf` does the following: {code} @Override public ByteBuf getBytes(int index, OutputStream out, int length) throws IOException { checkIndex(index, length); if (length != 0) { byte[] tmp = new byte[length]; PlatformDependent.copyMemory(addr(index), tmp, 0, length); out.write(tmp); } return this; } {code} The result is that we 1) create a large number of on-heap objects, and 2) copy the data twice: once from direct memory to the tmp buffer, and from the tmp buffer into the output stream's own buffer. Two optimizations are possible: 1. Copy the data byte-by-byte from the direct memory buffer to the output stream, or 2. Reuse the same tmp buffer across vector writes. Since the code is in Netty, if we do either of the above, we'd have to write our own "getBytes" (misnomer, really write bytes) method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Proposed November report for Drill
+1 On Tue, Nov 8, 2016 at 9:31 AM, Sudheesh Katkamwrote: > +1 > >> On Nov 7, 2016, at 10:35 PM, Parth Chandra wrote: >> >> Hi Drill devs, >> >> Below is the proposed report for the November Apache board meeting. Please >> provide any comments/corrections. I'll be submitting this Wednesday Nov >> 9th, so please provide input before that. >> >> Thanks >> >> Parth >> >> --- Report Begin - >> >> ## Description: >> - Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud >> Storage >> >> ## Issues: >> >> - There are no issues requiring board attention at this time >> >> ## Activity: >> >> - Since the last board report, Drill has released version 1.8 >> - Drill has added many new features referred to in the last report. >> Dynamic UDFs, Parquet reader performance improvements, filter pushdown for >> Parquet, and improved support for Metadata in the clients has been added. >> - Improved use of statistics, and security enhancements (including support >> for Kerberos) continue to be in the works. Also in progress is an >> improvement to the data locality algorithm. >> >> ## Health report: >> >> - There has been a good increase in the number posts in the dev and jira >> lists. This reflects the increased activity on the development front. User >> list activity is down this period, but not a concern at the moment. >> >> ## PMC changes: >> >> - Currently 18 PMC members. >> - Sudheesh Katkam was added to the PMC on Wed Oct 05 2016 >> >> ## Committer base changes: >> >> - Currently 28 committers. >> - No new committers added in the last 3 months >> - Last committer addition was Hsuan-Yi Chu at Thu Apr 07 2016 >> >> ## Releases: >> >> - 1.8.0 was released on Mon Aug 29 2016 >> >> ## Mailing list activity: >> >> - dev@drill.apache.org: >>- 436 subscribers (down -10 in the last 3 months): >>- 1797 emails sent to list (1231 in previous quarter) >> >> - iss...@drill.apache.org: >>- 20 subscribers (up 0 in the last 3 months): >>- 2188 emails sent to list (1550 in previous quarter) >> >> - u...@drill.apache.org: >>- 567 subscribers (down -14 in the last 3 months): >>- 436 emails sent to list (824 in previous quarter) >> >> ## JIRA activity: >> >> - 173 JIRA tickets created in the last 3 months >> - 88 JIRA tickets closed/resolved in the last 3 months >> >> --- Report End - >
[jira] [Created] (DRILL-5012) External Sort Batch does not check free space before writing
Paul Rogers created DRILL-5012: -- Summary: External Sort Batch does not check free space before writing Key: DRILL-5012 URL: https://issues.apache.org/jira/browse/DRILL-5012 Project: Apache Drill Issue Type: Bug Affects Versions: 1.8.0 Reporter: Paul Rogers Priority: Minor The ExternalSortBatch operator sorts data and spills to disk to control memory usage. The ESB accepts a configured set of output directories. The ESB does not check if sufficient space is available in a directory before writing to it; ESB requires that the admin ensure space is available. Desired fix: skip a directory if it does not contain space to hold the target spill file. (That is, write to the other directories if that have space.) The fix is not fail-safe: two threads may both check and both write. However, the fix reduces the chances of an out-of-disk error for the most obvious cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Proposed November report for Drill
+1 > On Nov 7, 2016, at 10:35 PM, Parth Chandrawrote: > > Hi Drill devs, > > Below is the proposed report for the November Apache board meeting. Please > provide any comments/corrections. I'll be submitting this Wednesday Nov > 9th, so please provide input before that. > > Thanks > > Parth > > --- Report Begin - > > ## Description: > - Drill is a Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud > Storage > > ## Issues: > > - There are no issues requiring board attention at this time > > ## Activity: > > - Since the last board report, Drill has released version 1.8 > - Drill has added many new features referred to in the last report. > Dynamic UDFs, Parquet reader performance improvements, filter pushdown for > Parquet, and improved support for Metadata in the clients has been added. > - Improved use of statistics, and security enhancements (including support > for Kerberos) continue to be in the works. Also in progress is an > improvement to the data locality algorithm. > > ## Health report: > > - There has been a good increase in the number posts in the dev and jira > lists. This reflects the increased activity on the development front. User > list activity is down this period, but not a concern at the moment. > > ## PMC changes: > > - Currently 18 PMC members. > - Sudheesh Katkam was added to the PMC on Wed Oct 05 2016 > > ## Committer base changes: > > - Currently 28 committers. > - No new committers added in the last 3 months > - Last committer addition was Hsuan-Yi Chu at Thu Apr 07 2016 > > ## Releases: > > - 1.8.0 was released on Mon Aug 29 2016 > > ## Mailing list activity: > > - dev@drill.apache.org: >- 436 subscribers (down -10 in the last 3 months): >- 1797 emails sent to list (1231 in previous quarter) > > - iss...@drill.apache.org: >- 20 subscribers (up 0 in the last 3 months): >- 2188 emails sent to list (1550 in previous quarter) > > - u...@drill.apache.org: >- 567 subscribers (down -14 in the last 3 months): >- 436 emails sent to list (824 in previous quarter) > > ## JIRA activity: > > - 173 JIRA tickets created in the last 3 months > - 88 JIRA tickets closed/resolved in the last 3 months > > --- Report End -
[jira] [Created] (DRILL-5011) External Sort Batch memory use depends on record width
Paul Rogers created DRILL-5011: -- Summary: External Sort Batch memory use depends on record width Key: DRILL-5011 URL: https://issues.apache.org/jira/browse/DRILL-5011 Project: Apache Drill Issue Type: Bug Affects Versions: 1.8.0 Reporter: Paul Rogers Priority: Minor The ExternalSortBatch operator uses spill-to-disk to keep memory needs within a defined limit. However, the "copier" (really, the merge operation) can use an amount of memory determined not by the operator configuration, but by the width of each record. The copier memory limit appears to be set by the COPIER_BATCH_MEM_LIMIT value. However, the actual memory use is determined by the number of records that the copier is asked to copy. That record comes from an estimate of row width based on the type of each column. Note that the row width *is not* based on the actual data in each row. Varchar fields, for example, are assumed to be 40 characters wide. If the sorter is asked to sort records with Varchar fields of, say, 1000 characters, then the row width estimate will be a poor estimator of actual width. Memory use is based on a {code} target record count = memory limit / estimate row width {code} Actual memory use is: {code} memory use = target row count * actual row width {code} Which is {code} memory use = memory limit * actual row width / estimate row width {code} That is, memory use depends on the ratio of actual to estimated width. If the estimate is off by 2, then we use twice as much memory as expected. Not that the memory used for the copier defaults to 20 MB, so even an error of 4x still means only 80 MB of memory used; small in comparison to the many GB typically allocated to ESB storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] drill issue #644: DRILL-4980: Upgrading of the approach of parquet date corr...
Github user vdiravka commented on the issue: https://github.com/apache/drill/pull/644 @paul-rogers Could you please review? I answered on the last comment and rebased the branch with resolving conflicts (ParquetTableMetadata_v3 was added) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Created] (DRILL-5010) Equality join condition is treated as a MergeJoin and not as a HashJoin.
Khurram Faraaz created DRILL-5010: - Summary: Equality join condition is treated as a MergeJoin and not as a HashJoin. Key: DRILL-5010 URL: https://issues.apache.org/jira/browse/DRILL-5010 Project: Apache Drill Issue Type: Bug Components: Execution - Flow Affects Versions: 1.9.0 Reporter: Khurram Faraaz Equality join condition is treated as a MergeJoin and not as a HashJoin. Drill 1.9.0 git commit ID: 83513daf Projecting the join columns results in merge join, whereas it should be doing a HashJoin. {noformat} 0: jdbc:drill:schema=dfs.tmp> explain plan for . . . . . . . . . . . . . . > select t1.intKey, t2.intKey from `left.json` t1, `right_all_nulls.json` t2 WHERE t1.intKey = t2.intKey OR ( t1.intKey IS NULL AND t2.intKey IS NULL); +--+--+ | text | json | +--+--+ | 00-00Screen 00-01 Project(intKey=[$0], intKey0=[$1]) 00-02Project(intKey=[$0], intKey0=[$1]) 00-03 MergeJoin(condition=[IS NOT DISTINCT FROM($0, $1)], joinType=[inner]) 00-05SelectionVectorRemover 00-07 Sort(sort0=[$0], dir0=[ASC]) 00-09Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/tmp/left.json, numFiles=1, columns=[`intKey`], files=[maprfs:///tmp/left.json]]]) 00-04Project(intKey0=[$0]) 00-06 SelectionVectorRemover 00-08Sort(sort0=[$0], dir0=[ASC]) 00-10 Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/tmp/right_all_nulls.json, numFiles=1, columns=[`intKey`], files=[maprfs:///tmp/right_all_nulls.json]]]) {noformat} Note that HashAgg and HashJoin were enabled. {noformat} | planner.enable_hashagg | BOOLEAN | SYSTEM | DEFAULT | null| null| true | null | | planner.enable_hashjoin| BOOLEAN | SYSTEM | DEFAULT | null| null| true | null | {noformat} Doing a SELECT results in a HashJoin in the query plan {noformat} 0: jdbc:drill:schema=dfs.tmp> explain plan for . . . . . . . . . . . . . . > select * from `left.json` t1, `right_all_nulls.json` t2 WHERE t1.intKey = t2.intKey OR ( t1.intKey IS NULL AND t2.intKey IS NULL); +--+--+ | text | json | +--+--+ | 00-00Screen 00-01 ProjectAllowDup(*=[$0], *0=[$1]) 00-02Project(T46¦¦*=[$0], T47¦¦*=[$2]) 00-03 HashJoin(condition=[IS NOT DISTINCT FROM($1, $3)], joinType=[inner]) 00-04Project(T47¦¦*=[$0], intKey0=[$1]) 00-06 Project(T47¦¦*=[$0], intKey=[$1]) 00-08Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/tmp/right_all_nulls.json, numFiles=1, columns=[`*`], files=[maprfs:///tmp/right_all_nulls.json]]]) 00-05Project(T46¦¦*=[$0], intKey=[$1]) 00-07 Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/tmp/left.json, numFiles=1, columns=[`*`], files=[maprfs:///tmp/left.json]]]) {noformat} Data used in above queries {noformat} [root@centos-01 null_eq_joins]# cat left.json { "intKey" : 123, "bgintKey": 1234567, "strKey": "this is a test string", "boolKey": true, "fltKey": 123.786, "dblKey": 457.984, "timKey": "18:30:45", "dtKey": "1997-10-21", "tmstmpKey": "2007-04-30 13:10:02.047", "intrvldyKey": "P9DT38833S", "intrvlyrKey": "P255M" } [root@centos-01 null_equality_joins]# [root@centos-01 null_eq_joins]# cat right_all_nulls.json { "intKey" : null, "bgintKey": null, "strKey": null, "boolKey": null, "fltKey": null, "dblKey": null, "timKey": null, "dtKey": null, "tmstmpKey": null, "intrvldyKey": null, "intrvlyrKey": null } [root@centos-01 null_eq_joins]# {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Query JSON that has null as value for each key
Hi All, Drill 1.9.0 git commit ID : 83513daf Drill returns same result with or without `store.json.all_text_mode`=true [root@cent01 null_eq_joins]# cat right_all_nulls.json { "intKey" : null, "bgintKey": null, "strKey": null, "boolKey": null, "fltKey": null, "dblKey": null, "timKey": null, "dtKey": null, "tmstmpKey": null, "intrvldyKey": null, "intrvlyrKey": null } [root@cent01 null_eq_joins]# Querying the above JSON file results in null as query result. - We should see each of the keys in the JSON as a column in query result. - And in each column the value should be a null value. Current behavior does not look right. {noformat} 0: jdbc:drill:schema=dfs.tmp> select * from `right_all_nulls.json`; +---+ | * | +---+ | null | +---+ 1 row selected (0.313 seconds) {noformat} Thanks, Khurram