[
https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611905#comment-13611905
]
Prashant Kommireddi commented on PIG-3259:
------------------------------------------
I have some results from preliminary testing. The idea is to check up front
whether the bytearray is numeric and make the valueOf calls accordingly.
*Test code*
{code}
static void testBytesToLong() throws IOException {
    String input = "114121.1321";
    int n = 100000000;
    long start = System.currentTimeMillis();
    for (int i = 0; i < n; i++) {
        Long l = bytesToLongOptimized(input);
    }
    System.out.println("Elapsed (optimized): "
            + (System.currentTimeMillis() - start));
    start = System.currentTimeMillis();
    for (int i = 0; i < n; i++) {
        Long l = bytesToLong(input);
    }
    System.out.println("Elapsed (current): "
            + (System.currentTimeMillis() - start));
}
{code}
*Current implementation, logic the same minus logging*
{code}
static Long bytesToLong(String number) {
    if (sanityCheckIntegerLong(number)) {
        return Long.valueOf(number);
    }
    try {
        return Long.valueOf(Double.valueOf(number).longValue());
    } catch (NumberFormatException e) {
        return null;
    }
}
{code}
*Optimized code*
{code}
static Long bytesToLongOptimized(String number) {
    if (SanityChecker.sanityCheckIntegerLongDecimal(number)) {
        if (!SanityChecker.isDecimal()) {
            return Long.valueOf(number);
        }
        return Long.valueOf(Double.valueOf(number).longValue());
    }
    return null;
}
private static class SanityChecker {
    // Number of dots (periods) seen in the string by the last check.
    // Note: this is shared static state, so the class is not thread-safe.
    static int numDots = 0;
    private static boolean sanityCheckIntegerLongDecimal(String number) {
        // Reset counter on each call
        reset();
        for (int i = 0; i < number.length(); i++) {
            char c = number.charAt(i);
            if ((c >= '0' && c <= '9')
                    || (i == 0 && c == '-')
                    || (c == '.' && ++numDots < 2)) {
                // valid one
            } else {
                // contains invalid characters, must not be an integer,
                // long or decimal.
                return false;
            }
        }
        return true;
    }
    private static void reset() {
        numDots = 0;
    }
    private static boolean isDecimal() {
        return numDots == 1;
    }
}
{code}
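One thing to keep in mind with the check above: inputs such as "", "-" or "." contain no invalid characters, so they pass the scan and then fail inside valueOf. Below is a minimal sketch of a stateless variant that also requires at least one digit. The names NumericCheck, Kind, classify and bytesToLongStateless are hypothetical, and returning null for integers outside the long range is an assumption, not something the current code does.

```java
final class NumericCheck {
    // Outcome of a single pass over the candidate string.
    enum Kind { INVALID, INTEGER, DECIMAL }

    // Classifies the input without keeping any shared state.
    static Kind classify(String s) {
        if (s == null) {
            return Kind.INVALID;
        }
        int dots = 0;
        int digits = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') {
                digits++;
            } else if (c == '.') {
                if (++dots > 1) {
                    return Kind.INVALID; // more than one decimal point
                }
            } else if (!(c == '-' && i == 0)) {
                return Kind.INVALID; // anything else except a leading sign
            }
        }
        if (digits == 0) {
            return Kind.INVALID; // "", "-", "." and "-." have no digits
        }
        return dots == 0 ? Kind.INTEGER : Kind.DECIMAL;
    }

    static Long bytesToLongStateless(String s) {
        switch (classify(s)) {
            case INTEGER:
                try {
                    return Long.valueOf(s);
                } catch (NumberFormatException e) {
                    // Assumption: digits only, but outside the long range.
                    return null;
                }
            case DECIMAL:
                return Long.valueOf(Double.valueOf(s).longValue());
            default:
                return null;
        }
    }
}
```

Because classify returns its result instead of storing a dot counter in a static field, concurrent callers would not interfere with each other.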
There is not much difference in runtime between the current and optimized
versions for valid Long numbers. However, the delta is significant for
invalid Longs (e.g. "123foo", "10.2.3.10"). I will attach my findings soon.
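The likely reason for the gap on invalid input is that the current path only learns the string is bad when Double.valueOf throws a NumberFormatException, whereas a character scan can reject it without raising anything. A rough harness for comparing the two on invalid inputs could look like the sketch below. InvalidInputTiming, viaException, viaPrecheck and timeNanos are hypothetical names, and the loop is a naive timing with no JIT warm-up, so treat any numbers it prints as indicative only.

```java
import java.util.function.Function;

final class InvalidInputTiming {
    // Exception-driven path: rely on Double.valueOf to reject bad input.
    static Long viaException(String s) {
        try {
            return Long.valueOf(Double.valueOf(s).longValue());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Pre-check path: reject bad input with a character scan, no exception.
    static Long viaPrecheck(String s) {
        int dots = 0;
        int digits = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') {
                digits++;
            } else if (c == '.' && ++dots < 2) {
                // single decimal point, allowed
            } else if (c == '-' && i == 0) {
                // leading sign, allowed
            } else {
                return null;
            }
        }
        if (digits == 0) {
            return null;
        }
        return Long.valueOf(Double.valueOf(s).longValue());
    }

    // Naive wall-clock timing of n calls to f on the same input.
    static long timeNanos(Function<String, Long> f, String input, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            f.apply(input);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        String[] inputs = {"123foo", "10.2.3.10"};
        int n = 1_000_000;
        for (String in : inputs) {
            System.out.println(in + " exception path (ns): "
                    + timeNanos(InvalidInputTiming::viaException, in, n));
            System.out.println(in + " pre-check path (ns): "
                    + timeNanos(InvalidInputTiming::viaPrecheck, in, n));
        }
    }
}
```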
> Optimize byte to Long/Integer conversions
> -----------------------------------------
>
> Key: PIG-3259
> URL: https://issues.apache.org/jira/browse/PIG-3259
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11, 0.11.1
> Reporter: Prashant Kommireddi
> Fix For: 0.12
>
>
> These conversions could perform better. If the input is not numeric
> (e.g. 1234abcd), the code still calls Double.valueOf(String) before finally
> returning null. Any script that inadvertently (user's mistake or not) tries
> to cast an alphanumeric column to int or long would result in many wasteful
> calls.
> We can avoid this by calling Double.valueOf(String) only when the input is
> found to be a decimal number (e.g. 1234.56) and returning null otherwise,
> without attempting the conversion at all.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira