[jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions
[ https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170728#comment-14170728 ]

Remi Catherinot commented on PIG-3259:
--------------------------------------

Make SanityChecker thread-safe. The current implementation is stateful (because of the numDots field) and is not used within a synchronized block, so it is not thread-safe. Change sanityCheckIntegerLongDecimal so that it returns a byte: 0 would mean long/integer/byte/short, 1 would mean double, 2 would mean NaN. Doing so would make it thread-safe and would not slow the implementation down.

Another small speed-up: a test such as

    if (str.charAt(i)>='0' && str.charAt(i)<='9' || charAt(i) ... || charAt(i) ...)

can be replaced by declaring a char before the test and using it in the test:

    char c;
    if ((c=str.charAt(i))>='0' && c<='9' || ... c ... || c ...)

because this code calls charAt only once.

Also beware: it seems to me that you are changing the contract of the method. The current one tries its best to find a Long; if that fails, it relies fully on the JVM parsing (and so on the full specs), which causes the performance degradation on badly formatted input (mostly because of the exception). In the optimized one, if the check fails, null is returned. We can only do this if we are fully confident that the checker strictly follows all of the JVM number format specs (for example octal long values, or hexadecimal values, which use 'p' rather than 'e' as their exponent operator).

Maybe a good way would be to take the code from the src.jar shipped with the JVM and replace the throw NumberFormatException behavior with a return null, plus rounding for the implicit double-to-long cast, which is what you want to achieve. The JVM is slow on badly formatted input because of the exception, but it is the fastest on well-formed input. So just change that behaviour.
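As a rough illustration of the suggestion above, a stateless check might look like the sketch below. The class name NumberShape, the method name classify and the constant names are hypothetical and do not come from the Pig code base. The sketch accepts the same characters as the check posted on this issue (digits, a leading '-', at most one '.'), so it does not address the contract concern about exponents or hexadecimal input.

```java
final class NumberShape {
    // Hypothetical result codes, following the 0/1/2 scheme proposed in the comment.
    static final byte INTEGRAL = 0;     // long/integer/byte/short syntax
    static final byte DECIMAL = 1;      // exactly one '.', to be parsed as a double
    static final byte NOT_A_NUMBER = 2; // contains a character the check rejects

    static byte classify(String number) {
        int numDots = 0; // local variable: no state is shared between callers
        for (int i = 0; i < number.length(); i++) {
            char c = number.charAt(i); // charAt is called once per position
            boolean digit = c >= '0' && c <= '9';
            boolean leadingMinus = i == 0 && c == '-';
            boolean firstDot = c == '.' && ++numDots < 2;
            if (!(digit || leadingMinus || firstDot)) {
                return NOT_A_NUMBER;
            }
        }
        return numDots == 1 ? DECIMAL : INTEGRAL;
    }

    public static void main(String[] args) {
        System.out.println(classify("1234"));     // prints 0
        System.out.println(classify("-12.5"));    // prints 1
        System.out.println(classify("1234abcd")); // prints 2
    }
}
```

Because the dot counter lives on the stack, two threads calling classify at the same time cannot see each other's state, and the caller gets the integer/decimal distinction from the return value instead of a second call.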
Optimize byte to Long/Integer conversions
-----------------------------------------

Key: PIG-3259
URL: https://issues.apache.org/jira/browse/PIG-3259
Project: Pig
Issue Type: Improvement
Affects Versions: 0.11, 0.11.1
Reporter: Prashant Kommireddi
Assignee: Prashant Kommireddi
Fix For: 0.15.0
Attachments: byteToLong.xlsx

These conversions could perform better. If the input is not numeric (1234abcd), the code calls Double.valueOf(String) regardless, before finally returning null. Any script that inadvertently (user's mistake or not) tries to cast a non-numeric column to int or long would result in many wasteful calls. We can avoid this by handling only the cases where we find the input to be a decimal number (1234.56), and returning null otherwise, before even trying Double.valueOf(String).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions
[ https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614137#comment-13614137 ]

Thejas M Nair commented on PIG-3259:
------------------------------------

bq. How do we determine the number of non-numbers without making calls to sanityCheck..()?

By counting the number of times an exception has been thrown so far by .valueOf(). Once a threshold has been crossed, we can introduce the sanity check for each new value. This puts a limit on the worst ('incorrect') case performance without degrading the 'correct' case performance by much.

I wonder if there are good libraries that we can use for the sanity checks, as the decimal check seems a bit more complicated.

Optimize byte to Long/Integer conversions
-----------------------------------------

Key: PIG-3259
URL: https://issues.apache.org/jira/browse/PIG-3259
Project: Pig
Issue Type: Bug
Affects Versions: 0.11, 0.11.1
Reporter: Prashant Kommireddi
Assignee: Prashant Kommireddi
Fix For: 0.12
Attachments: byteToLong.xlsx

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
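The exception-counting idea in the comment above could be sketched roughly as follows. Everything named here (AdaptiveLongParser, looksNumeric, the threshold parameter) is hypothetical, and the thread does not settle on a threshold value. The sketch parses without any check until the number of failed conversions reaches the threshold, and only then starts pre-screening inputs.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class AdaptiveLongParser {
    private final int threshold; // hypothetical tuning knob, not a value from the thread
    private final AtomicInteger failures = new AtomicInteger();

    AdaptiveLongParser(int threshold) {
        this.threshold = threshold;
    }

    Long parse(String number) {
        // Pre-screen only once enough bad values have been seen.
        if (failures.get() >= threshold && !looksNumeric(number)) {
            return null; // rejected without paying for an exception
        }
        try {
            return Long.valueOf(number);
        } catch (NumberFormatException e) {
            // Not a plain long; fall through and try the double route.
        }
        try {
            return Long.valueOf(Double.valueOf(number).longValue());
        } catch (NumberFormatException e) {
            failures.incrementAndGet();
            return null;
        }
    }

    int failureCount() {
        return failures.get();
    }

    // Digits, a leading '-', and at most one '.', as in the check posted on this issue.
    static boolean looksNumeric(String number) {
        int numDots = 0;
        for (int i = 0; i < number.length(); i++) {
            char c = number.charAt(i);
            boolean ok = (c >= '0' && c <= '9')
                    || (i == 0 && c == '-')
                    || (c == '.' && ++numDots < 2);
            if (!ok) {
                return false;
            }
        }
        return true;
    }
}
```

One consequence of this design is worth noting: once the pre-screen is active, inputs that Double.valueOf would accept but the simple check rejects (for example "1e3") start returning null, so the result for the same value can differ before and after the threshold is crossed.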
[jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions
[ https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614940#comment-13614940 ]

Prashant Kommireddi commented on PIG-3259:
------------------------------------------

{quote}
By counting the number of times exception has so far been thrown by .valueOf()
{quote}
I see what you mean. That could be an approach, though the heuristic for determining the threshold could be tricky.

{quote}I wonder if there are good libraries that we can use for the sanity checks, as the decimal check seems bit more complicated{quote}
I will look into whether any such libraries are available. There's a method for checking for a Double in the Javadoc you pointed to before, but it could be more expensive than we want: http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#valueOf%28java.lang.String%29

Optimize byte to Long/Integer conversions
-----------------------------------------

Key: PIG-3259
URL: https://issues.apache.org/jira/browse/PIG-3259
Project: Pig
Issue Type: Bug
Affects Versions: 0.11, 0.11.1
Reporter: Prashant Kommireddi
Assignee: Prashant Kommireddi
Fix For: 0.12
Attachments: byteToLong.xlsx
[jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions
[ https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613354#comment-13613354 ]

Thejas M Nair commented on PIG-3259:
------------------------------------

Sounds like a good idea.

The check you have here does not accept all valid double string representations (see http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#valueOf(java.lang.String) ), e.g. those with an exponent, or the hexadecimal representation starting with 0x.

But if we can avoid the performance degradation for the 'correct' [1] case (which seems to be in the range of 2-8% in the micro-benchmark, which ran for at least a few seconds), that would be better.

One way to avoid the performance degradation for the 'correct' case would be to start by calling .valueOf() without checks, then use the number of non-numbers encountered to decide whether we want to make the sanityCheckIntegerLongDecimal() calls.

[1] By 'correct' I mean the case where a field declared as an integer or a double has a correct representation.

Optimize byte to Long/Integer conversions
-----------------------------------------

Key: PIG-3259
URL: https://issues.apache.org/jira/browse/PIG-3259
Project: Pig
Issue Type: Bug
Affects Versions: 0.11, 0.11.1
Reporter: Prashant Kommireddi
Assignee: Prashant Kommireddi
Fix For: 0.12
Attachments: byteToLong.xlsx
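To make the gap described in the comment above concrete, the sketch below compares a character-level check with what Double.valueOf(String) actually accepts. The class and method names (DoubleFormats, passesSimpleCheck, jvmAccepts) are made up for illustration; passesSimpleCheck uses the same character rules as the check posted on this issue.

```java
final class DoubleFormats {
    // Digits, a leading '-', and at most one '.'.
    static boolean passesSimpleCheck(String number) {
        int numDots = 0;
        for (int i = 0; i < number.length(); i++) {
            char c = number.charAt(i);
            boolean ok = (c >= '0' && c <= '9')
                    || (i == 0 && c == '-')
                    || (c == '.' && ++numDots < 2);
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    // True when the JVM itself can parse the string as a double.
    static boolean jvmAccepts(String number) {
        try {
            Double.valueOf(number);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"1234.56", "1e3", "0x1p3", "NaN", "1234abcd"};
        for (String s : samples) {
            System.out.println(s + " simple=" + passesSimpleCheck(s) + " jvm=" + jvmAccepts(s));
        }
    }
}
```

Of these samples, "1e3", "0x1p3" and "NaN" are accepted by Double.valueOf but rejected by the simple check, which is exactly the set of inputs whose result would change if the check were applied unconditionally.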
[jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions
[ https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611905#comment-13611905 ]

Prashant Kommireddi commented on PIG-3259:
------------------------------------------

I have some results from preliminary testing. Basically, the idea is to include a check for whether the bytearray is numeric and make valueOf calls accordingly.

*Test code*
{code}
static void testBytesToLong() throws IOException {
    String input = "114121.1321";
    int n = 1;

    // Time the optimized conversion.
    long start = System.currentTimeMillis();
    for (int i = 0; i < n; i++) {
        Long l = bytesToLongOptimized(input);
    }
    System.out.println("Elapsed (optimized): " + (System.currentTimeMillis() - start));

    // Time the current conversion.
    start = System.currentTimeMillis();
    for (int i = 0; i < n; i++) {
        Long l = bytesToLong(input);
    }
    System.out.println("Elapsed (current): " + (System.currentTimeMillis() - start));
}
{code}

*Current implementation, logic the same minus logging*
{code}
static Long bytesToLong(String number) {
    if (sanityCheckIntegerLong(number)) {
        return Long.valueOf(number);
    }
    try {
        return Long.valueOf(Double.valueOf(number).longValue());
    } catch (NumberFormatException e) {
        return null;
    }
}
{code}

*Optimized code*
{code}
static Long bytesToLongOptimized(String number) {
    if (SanityChecker.sanityCheckIntegerLongDecimal(number)) {
        if (!SanityChecker.isDecimal()) {
            return Long.valueOf(number);
        }
        return Long.valueOf(Double.valueOf(number).longValue());
    }
    return null;
}

private static class SanityChecker {
    // This is a counter on number of dots (period) in the string
    static int numDots = 0;

    private static boolean sanityCheckIntegerLongDecimal(String number) {
        // Reset counter on each call
        reset();
        for (int i = 0; i < number.length(); i++) {
            if (number.charAt(i) >= '0' && number.charAt(i) <= '9'
                    || i == 0 && number.charAt(i) == '-'
                    || (number.charAt(i) == '.' && ++numDots < 2)) {
                // valid one
            } else {
                // contains invalid characters, must not be a integer or long or decimal.
                return false;
            }
        }
        return true;
    }

    private static void reset() {
        numDots = 0;
    }

    private static boolean isDecimal() {
        return numDots == 1;
    }
}
{code}

There is not much difference in runtime between the current and optimized versions for valid Long numbers; however, the delta is significant for invalid Longs (e.g. 123foo, 10.2.3.10). I will attach my findings soon.

Optimize byte to Long/Integer conversions
-----------------------------------------

Key: PIG-3259
URL: https://issues.apache.org/jira/browse/PIG-3259
Project: Pig
Issue Type: Bug
Affects Versions: 0.11, 0.11.1
Reporter: Prashant Kommireddi
Fix For: 0.12

These conversions could perform better. If the input is not numeric (1234abcd), the code calls Double.valueOf(String) regardless, before finally returning null. Any script that inadvertently (user's mistake or not) tries to cast an alpha-numeric column to int or long would result in many wasteful calls. We can avoid this by handling only the cases where we find the input to be a decimal number (1234.56), and returning null otherwise, before even trying Double.valueOf(String).