[jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions

2014-10-14 Thread Remi Catherinot (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170728#comment-14170728
 ] 

Remi Catherinot commented on PIG-3259:
--

Make SanityChecker thread-safe. The current implementation is stateful (because
of the numDots field) and is not used within a synchronized block, so it is not
thread-safe. Change sanityCheckIntegerLongDecimal so that it returns a byte: 0
would mean long/integer/byte/short, 1 would mean double, and 2 would mean NaN
(not a number). Doing so would make it thread-safe and would not slow down the
implementation.

Another small speed-up: when doing
if (str.charAt(i)>='0' && str.charAt(i)<='9' || charAt(i) ... charAt(i) )
this can be replaced by declaring a char before the test and using it in the
test:
char c;
if ( (c=str.charAt(i))>='0' && c<='9' || ... c ... c )
because this code only calls charAt once.
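
The two suggestions above could be combined roughly as follows. This is only a sketch: the class name StatelessSanityChecker and the three constant names are assumptions made for illustration, and the accepted characters (digits, a leading '-', at most one '.') simply mirror the SanityChecker posted earlier in this thread. The dot counter becomes a local variable, so nothing is shared between callers, and each character is read once.

```java
class StatelessSanityChecker {
    // Hypothetical result codes following the proposal in the comment.
    static final byte INTEGRAL = 0;     // long/integer/byte/short
    static final byte DECIMAL = 1;      // double
    static final byte NOT_A_NUMBER = 2; // contains an invalid character

    static byte sanityCheckIntegerLongDecimal(String number) {
        int numDots = 0; // local state only, so concurrent calls cannot interfere
        for (int i = 0; i < number.length(); i++) {
            char c = number.charAt(i); // charAt is called once per position
            if (c >= '0' && c <= '9') {
                continue; // digit
            }
            if (i == 0 && c == '-') {
                continue; // leading minus sign
            }
            if (c == '.' && ++numDots < 2) {
                continue; // first (and only) decimal point
            }
            return NOT_A_NUMBER;
        }
        return numDots == 1 ? DECIMAL : INTEGRAL;
    }

    public static void main(String[] args) {
        System.out.println(sanityCheckIntegerLongDecimal("1234"));      // 0
        System.out.println(sanityCheckIntegerLongDecimal("-12.5"));     // 1
        System.out.println(sanityCheckIntegerLongDecimal("123foo"));    // 2
        System.out.println(sanityCheckIntegerLongDecimal("10.2.3.10")); // 2
    }
}
```

Like the posted checker, this sketch only inspects characters: strings such as "-" or "." still pass and would have to be handled by the caller.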

Also beware: it seems to me that you change the contract of the method. The
current one tries its best to find a Long; if it fails, it relies fully on the
JVM parsing (and so on the full specs), which causes the performance
degradation in case of a bad format (mostly because of the exception). In the
optimized one, if the check fails, null is returned. We can only do this if we
are fully confident that the checker strictly follows all the JVM number
format specs (for example octal long values, hexadecimal values which use 'p'
rather than 'e' as their exponent operator, etc.).

Maybe a good approach would be to take the code from the src.jar shipped with
the JVM and replace the throw NumberFormatException behavior with a return
null, plus rounding for the implicit double-to-long cast, which is what you
want to achieve. The JVM is slow on badly formatted input because of the
exception, but it is the fastest on well-formed input, so only that behaviour
would need to change.

 Optimize byte to Long/Integer conversions
 -

 Key: PIG-3259
 URL: https://issues.apache.org/jira/browse/PIG-3259
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.11, 0.11.1
Reporter: Prashant Kommireddi
Assignee: Prashant Kommireddi
 Fix For: 0.15.0

 Attachments: byteToLong.xlsx


 These conversions could perform better. If the input is not numeric
 (1234abcd), the code still calls Double.valueOf(String) before finally
 returning null. Any script that inadvertently (user's mistake or not) tries
 to cast a non-numeric column to int or long would result in many wasteful
 calls.
 We can avoid this by handling only the cases where we find the input to be a
 decimal number (1234.56) and returning null otherwise, even before trying
 Double.valueOf(String).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions

2013-03-26 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614137#comment-13614137
 ] 

Thejas M Nair commented on PIG-3259:


bq.  How do we determine the number of non-numbers without making calls to 
sanityCheck..()?
By counting the number of times an exception has so far been thrown by
.valueOf(). Once a threshold has been crossed, we can introduce the sanity
check for each new value. This will put a limit on the worst ('incorrect') case
performance without degrading the 'correct' case performance by much.

I wonder if there are good libraries that we can use for the sanity checks, as
the decimal check seems a bit more complicated.
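
The threshold idea could look roughly like the sketch below. Everything named here (AdaptiveLongParser, threshold, failures, looksNumeric) is hypothetical and not taken from the Pig code base; the threshold value is left to the caller because the thread does not settle on a heuristic. The instance keeps a counter, so it is meant to be used by a single thread.

```java
class AdaptiveLongParser {
    private final int threshold; // hypothetical tuning knob
    private int failures = 0;    // inputs rejected by valueOf() so far

    AdaptiveLongParser(int threshold) {
        this.threshold = threshold;
    }

    // Stateless character check: digits, a leading '-', at most one '.'.
    static boolean looksNumeric(String s) {
        int dots = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean ok = (c >= '0' && c <= '9')
                    || (i == 0 && c == '-')
                    || (c == '.' && ++dots < 2);
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    Long parse(String s) {
        // Pay for the pre-check only after enough bad inputs have been seen.
        if (failures >= threshold && !looksNumeric(s)) {
            return null;
        }
        try {
            return Long.valueOf(s);
        } catch (NumberFormatException e) {
            // not a plain integer; fall through to the double path
        }
        try {
            return Long.valueOf(Double.valueOf(s).longValue());
        } catch (NumberFormatException e) {
            failures++;
            return null;
        }
    }

    public static void main(String[] args) {
        AdaptiveLongParser parser = new AdaptiveLongParser(2);
        System.out.println(parser.parse("1e3"));    // 1000 (check not active yet)
        System.out.println(parser.parse("abc"));    // null, first failure
        System.out.println(parser.parse("123foo")); // null, second failure
        System.out.println(parser.parse("1e3"));    // null (check now active)
        System.out.println(parser.parse("12.7"));   // 12
    }
}
```

Note the trade-off visible in the last lines of main: once the check is active, valid but unusual formats such as 1e3 are rejected, so the result for the same input can change over the life of the parser.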

 Optimize byte to Long/Integer conversions
 -

 Key: PIG-3259
 URL: https://issues.apache.org/jira/browse/PIG-3259
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11, 0.11.1
Reporter: Prashant Kommireddi
Assignee: Prashant Kommireddi
 Fix For: 0.12

 Attachments: byteToLong.xlsx



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions

2013-03-26 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614940#comment-13614940
 ] 

Prashant Kommireddi commented on PIG-3259:
--

{quote}By counting the number of times an exception has so far been thrown by
.valueOf(){quote}
I see what you mean. That could be an approach, though choosing the heuristic
for the threshold could be tricky.

{quote}I wonder if there are good libraries that we can use for the sanity
checks, as the decimal check seems a bit more complicated{quote}
I will look into whether any such libraries are available. There's a method to
check for a Double in the javadoc you pointed to earlier, but it could be more
expensive than we want:
http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#valueOf%28java.lang.String%29

 Optimize byte to Long/Integer conversions
 -

 Key: PIG-3259
 URL: https://issues.apache.org/jira/browse/PIG-3259
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11, 0.11.1
Reporter: Prashant Kommireddi
Assignee: Prashant Kommireddi
 Fix For: 0.12

 Attachments: byteToLong.xlsx





[jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions

2013-03-25 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613354#comment-13613354
 ] 

Thejas M Nair commented on PIG-3259:


Sounds like a good idea.
The check you have here does not accept all valid double string representations
(e.g. with an exponent, or a hexadecimal representation starting with 0x). See
http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#valueOf(java.lang.String)

But if we can avoid the performance degradation for the 'correct' [1] case
(which seems to be in the range of 2-8% in the micro benchmark that ran for at
least a few seconds), that would be better. One way to avoid performance
degradation for the 'correct' case would be to start by doing .valueOf()
without checks, then use the number of non-numbers encountered to decide
whether we want to make the sanityCheckIntegerLongDecimal() calls.

[1] By 'correct' I mean the case where a field declared as an integer or a
double has a correct representation.

 Optimize byte to Long/Integer conversions
 -

 Key: PIG-3259
 URL: https://issues.apache.org/jira/browse/PIG-3259
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11, 0.11.1
Reporter: Prashant Kommireddi
Assignee: Prashant Kommireddi
 Fix For: 0.12

 Attachments: byteToLong.xlsx





[jira] [Commented] (PIG-3259) Optimize byte to Long/Integer conversions

2013-03-23 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13611905#comment-13611905
 ] 

Prashant Kommireddi commented on PIG-3259:
--

I have some results from preliminary testing. Basically, the idea is to include 
a check for whether the bytearray is numeric and make valueOf calls accordingly.

*Test code*
{code}
static void testBytesToLong() throws IOException {
    String input = "114121.1321";
    int n = 1; // number of iterations
    long start = System.currentTimeMillis();
    for (int i = 0; i < n; i++) {
        Long l = bytesToLongOptimized(input);
    }
    System.out.println("Elapsed (optimized): " +
            (System.currentTimeMillis() - start));

    start = System.currentTimeMillis();
    for (int i = 0; i < n; i++) {
        Long l = bytesToLong(input);
    }
    System.out.println("Elapsed (current): " +
            (System.currentTimeMillis() - start));
}
{code}

*Current implementation, logic the same minus logging*
{code}
static Long bytesToLong(String number) {
    if (sanityCheckIntegerLong(number)) {
        return Long.valueOf(number);
    }

    try {
        return Long.valueOf(Double.valueOf(number).longValue());
    } catch (NumberFormatException e) {
        return null;
    }
}
{code}

*Optimized code*
{code}
static Long bytesToLongOptimized(String number) {
    if (SanityChecker.sanityCheckIntegerLongDecimal(number)) {
        if (!SanityChecker.isDecimal()) {
            return Long.valueOf(number);
        }
        return Long.valueOf(Double.valueOf(number).longValue());
    }

    return null;
}

private static class SanityChecker {
    // This is a counter on number of dots (period) in the string
    static int numDots = 0;

    private static boolean sanityCheckIntegerLongDecimal(String number) {
        // Reset counter on each call
        reset();
        for (int i = 0; i < number.length(); i++) {
            if (number.charAt(i) >= '0' && number.charAt(i) <= '9'
                    || i == 0 && number.charAt(i) == '-'
                    || (number.charAt(i) == '.' && ++numDots < 2)) {
                // valid one
            } else {
                // contains invalid characters, must not be an integer or
                // long or decimal.
                return false;
            }
        }
        return true;
    }

    private static void reset() {
        numDots = 0;
    }

    private static boolean isDecimal() {
        return numDots == 1;
    }
}
{code}

There is not much difference in runtime between the current and optimized
versions for valid Long numbers; however, the delta is significant for invalid
Longs (e.g. 123foo, 10.2.3.10). I will attach my findings soon.


 Optimize byte to Long/Integer conversions
 -

 Key: PIG-3259
 URL: https://issues.apache.org/jira/browse/PIG-3259
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11, 0.11.1
Reporter: Prashant Kommireddi
 Fix For: 0.12


 These conversions could perform better. If the input is not numeric
 (1234abcd), the code still calls Double.valueOf(String) before finally
 returning null. Any script that inadvertently (user's mistake or not) tries
 to cast an alpha-numeric column to int or long would result in many wasteful
 calls.
 We can avoid this by handling only the cases where we find the input to be a
 decimal number (1234.56) and returning null otherwise, even before trying
 Double.valueOf(String).
