[ https://issues.apache.org/jira/browse/SPARK-23347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354907#comment-16354907 ]
Ted Yu commented on SPARK-23347:
--------------------------------

See JDK 1.8 code:
{code}
class DeflaterOutputStream {
    public void write(int b) throws IOException {
        byte[] buf = new byte[1];
        buf[0] = (byte)(b & 0xff);
        write(buf, 0, 1);
    }

    public void write(byte[] b, int off, int len) throws IOException {
        if (def.finished()) {
            throw new IOException("write beyond end of stream");
        }
        if ((off | len | (off + len) | (b.length - (off + len))) < 0) {
            throw new IndexOutOfBoundsException();
        } else if (len == 0) {
            return;
        }
        if (!def.finished()) {
            def.setInput(b, off, len);
            while (!def.needsInput()) {
                deflate();
            }
        }
    }
}

class GZIPOutputStream extends DeflaterOutputStream {
    public synchronized void write(byte[] buf, int off, int len) throws IOException {
        super.write(buf, off, len);
        crc.update(buf, off, len);
    }
}

class Deflater {
    private native int deflateBytes(long addr, byte[] b, int off, int len, int flush);
}

class CRC32 {
    public void update(byte[] b, int off, int len) {
        if (b == null) {
            throw new NullPointerException();
        }
        if (off < 0 || len < 0 || off > b.length - len) {
            throw new ArrayIndexOutOfBoundsException();
        }
        crc = updateBytes(crc, b, off, len);
    }

    private native static int updateBytes(int crc, byte[] b, int off, int len);
}
{code}
For each data byte written through write(int), the code above has to allocate a single-byte array, acquire several locks, and call two native JNI methods (Deflater.deflateBytes and CRC32.updateBytes).

> Introduce buffer between Java data stream and gzip stream
> ---------------------------------------------------------
>
>                 Key: SPARK-23347
>                 URL: https://issues.apache.org/jira/browse/SPARK-23347
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Ted Yu
>            Priority: Minor
>
> Currently GZIPOutputStream is used directly around ByteArrayOutputStream
> e.g.
> from KVStoreSerializer:
> {code}
> ByteArrayOutputStream bytes = new ByteArrayOutputStream();
> GZIPOutputStream out = new GZIPOutputStream(bytes);
> {code}
> This seems inefficient.
> GZIPOutputStream does not override the single-byte write(int) method. It only
> overrides write(byte[], offset, len), which calls the corresponding JNI zlib
> function, so every single-byte write goes through that path.
> BufferedOutputStream can be introduced wrapping GZIPOutputStream for better
> performance.
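
A minimal sketch of the proposed wrapping, for illustration only. The class and helper names (BufferedGzipSketch, compress, decompress) are hypothetical and are not taken from KVStoreSerializer; the sketch assumes the caller writes one byte at a time, which is the pattern that hits the single-byte path shown above. With the buffer in place, single-byte writes are collected in a byte[] and reach GZIPOutputStream only as bulk write(byte[], int, int) calls.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class BufferedGzipSketch {
    // Hypothetical helper: gzip-compresses data using single-byte writes,
    // optionally placing a BufferedOutputStream in front of the gzip stream.
    public static byte[] compress(byte[] data, boolean buffered) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        OutputStream out = new GZIPOutputStream(bytes);
        if (buffered) {
            // Single-byte writes fill the buffer (8192 bytes by default);
            // the gzip stream is invoked only when the buffer is flushed.
            out = new BufferedOutputStream(out);
        }
        // Closing the outermost stream flushes the buffer and finishes the gzip trailer.
        try (OutputStream o = out) {
            for (byte b : data) {
                o.write(b);
            }
        }
        return bytes.toByteArray();
    }

    // Hypothetical helper: inflates a gzip byte array back to the original bytes.
    public static byte[] decompress(byte[] gz) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1 << 16];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i % 251);
        }
        // Both variants must round-trip to the same bytes; only the write path differs.
        boolean plainOk = Arrays.equals(data, decompress(compress(data, false)));
        boolean bufferedOk = Arrays.equals(data, decompress(compress(data, true)));
        System.out.println("unbuffered round trip ok: " + plainOk);
        System.out.println("buffered round trip ok: " + bufferedOk);
    }
}
```

Note that the buffer must be flushed or closed before reading the underlying ByteArrayOutputStream; otherwise the trailing bytes and the gzip trailer may be missing. Callers that already write large byte arrays gain little from the extra buffer.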