Gopal V created HIVE-3992: ----------------------------- Summary: Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks Key: HIVE-3992 URL: https://issues.apache.org/jira/browse/HIVE-3992 Project: Hive Issue Type: Bug Environment: Ubuntu x86_64/java-1.6/hadoop-2.0.3 Reporter: Gopal V
The following function does some bad I/O {code} public synchronized void sync(long position) throws IOException { ... try { seek(position + 4); // skip escape in.readFully(syncCheck); int syncLen = sync.length; for (int i = 0; in.getPos() < end; i++) { int j = 0; for (; j < syncLen; j++) { if (sync[j] != syncCheck[(i + j) % syncLen]) { break; } } if (j == syncLen) { in.seek(in.getPos() - SYNC_SIZE); // position before // sync return; } syncCheck[i % syncLen] = in.readByte(); } } ... } {code} This causes a rather large number of readByte() calls which are passed onto a ByteBuffer via a single byte array. This results in rather a large amount of CPU being burnt in a the linear search for the sync pattern in the input RCFile (upto 92% for a skewed example - a trivial map-join + limit 100). This behaviour should be avoided at best or at least replaced by a rolling hash for efficient comparison, since it has a known byte-width of 16 bytes. Attached the stack trace from a Yourkit profile. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira