[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-11-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18805


---




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-10-31 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r148184755
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
+ * details see - http://facebook.github.io/zstd/
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
--- End diff --

done.


---




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-10-30 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r147629043
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
+ * details see - http://facebook.github.io/zstd/
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
--- End diff --

This should be private. The intent was to lift both config values out of the method, so `level` can go here too.
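
For reference, a minimal sketch of the shape the reviewers are converging on: both properties read once at construction time and kept private. The buffered zstd-jni stream wrappers and the `getInt` accessor are assumptions drawn from the diffs quoted elsewhere in this thread, not the merged code.

import java.io.{BufferedInputStream, BufferedOutputStream, InputStream, OutputStream}

import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}

import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {

  // Read both properties once at construction time rather than on every
  // stream creation, and keep them private so they are not public API.
  private val level = conf.getInt("spark.io.compression.zstd.level", 1)
  private val bufferSize =
    conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt

  override def compressedOutputStream(s: OutputStream): OutputStream = {
    // Buffering avoids a JNI call into zstd for every small write.
    new BufferedOutputStream(new ZstdOutputStream(s, level), bufferSize)
  }

  override def compressedInputStream(s: InputStream): InputStream = {
    new BufferedInputStream(new ZstdInputStream(s), bufferSize)
  }
}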



---




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-10-29 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r147618796
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
+ * details see - http://facebook.github.io/zstd/
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+    // Default compression level for zstd compression to 1 because it is
+    // fastest of all with reasonably high compression ratio.
+    val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
--- End diff --

Good eye, fixed. 


---




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-10-27 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r147340655
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
+ * details see - http://facebook.github.io/zstd/
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+    // Default compression level for zstd compression to 1 because it is
+    // fastest of all with reasonably high compression ratio.
+    val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
--- End diff --

Should this be `getInt` instead of `getSizeAsBytes`?
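
For context: `getSizeAsBytes` parses byte-size strings like "32k", while the level is a plain integer. A minimal sketch of the accessor viirya is suggesting, assuming the config keys from the diff above:

import org.apache.spark.SparkConf

val conf = new SparkConf()

// The compression level is a plain integer, so getInt is the right accessor;
// getSizeAsBytes is meant for byte-size values with suffixes such as "32k".
val level = conf.getInt("spark.io.compression.zstd.level", 1)
val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt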


---




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-10-26 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r147215825
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
+ * details see - http://facebook.github.io/zstd/
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+    // Default compression level for zstd compression to 1 because it is
+    // fastest of all with reasonably high compression ratio.
+    val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
+    val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
--- End diff --

Sorry, somehow missed these comments. Will address.


---




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-10-26 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r147213697
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
+ * details see - http://facebook.github.io/zstd/
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+    // Default compression level for zstd compression to 1 because it is
+    // fastest of all with reasonably high compression ratio.
+    val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
+    val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
--- End diff --

@sitalkedia how about comments like this?


---




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-09-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r139898945
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
+ * details see - http://facebook.github.io/zstd/
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+    // Default compression level for zstd compression to 1 because it is
+    // fastest of all with reasonably high compression ratio.
+    val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
+    val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
--- End diff --

Agree, it's simpler and cleaner, as it avoids duplicating this property in this file.


---




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-09-19 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r139635152
  
--- Diff: LICENSE ---
@@ -269,6 +269,7 @@ The text of each license is also included at licenses/LICENSE-[project].txt.
  (BSD 3 Clause) d3.min.js (https://github.com/mbostock/d3/blob/master/LICENSE)
  (BSD 3 Clause) DPark (https://github.com/douban/dpark/blob/master/LICENSE)
  (BSD 3 Clause) CloudPickle (https://github.com/cloudpipe/cloudpickle/blob/master/LICENSE)
+ (BSD 2 Clause) Zstd-jni (https://github.com/luben/zstd-jni/blob/master/LICENSE)
--- End diff --

ZStandard itself is also included, not just zstd-jni. We'd need an entry for it, and its LICENSE as below.
https://github.com/luben/zstd-jni/tree/master/src/main/native


---




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-09-19 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r139634333
  
--- Diff: dev/deps/spark-deps-hadoop-2.6 ---
@@ -186,3 +186,4 @@ xercesImpl-2.9.1.jar
 xmlenc-0.52.jar
 xz-1.0.jar
 zookeeper-3.4.6.jar
+zstd-jni-1.3.0-1.jar
--- End diff --

These need to be updated, it seems


---




[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-02 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130892528
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,33 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]]. For more
+ * details see - http://facebook.github.io/zstd/
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+    // Default compression level for zstd compression to 1 because it is
+    // fastest of all with reasonably high compression ratio.
+    val level = conf.getSizeAsBytes("spark.io.compression.zstd.level", "1").toInt
+    val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt
--- End diff --

Would it be better to make this a private variable, so that the property is read only once?


---



[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-02 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130889998
  
--- Diff: docs/configuration.md ---
@@ -886,6 +887,23 @@ Apart from these, the following properties are also available, and may be useful
   
 
 
+  spark.io.compression.zstd.level
+  1
+  
+Compression leve for Zstd compression codec. Increasing the compression level will result in better
--- End diff --

nit: leve -> level


---



[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-02 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130889423
  
--- Diff: docs/configuration.md ---
@@ -866,7 +866,8 @@ Apart from these, the following properties are also available, and may be useful
 e.g.
 org.apache.spark.io.LZ4CompressionCodec,
 org.apache.spark.io.LZFCompressionCodec,
-and org.apache.spark.io.SnappyCompressionCodec.
+org.apache.spark.io.SnappyCompressionCodec.
--- End diff --

nit: '.' -> ','


---



[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130787262
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -50,13 +51,14 @@ private[spark] object CompressionCodec {
 
   private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
     (codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
-      || codec.isInstanceOf[LZ4CompressionCodec])
+      || codec.isInstanceOf[LZ4CompressionCodec] || codec.isInstanceOf[ZStandardCompressionCodec])
   }
 
   private val shortCompressionCodecNames = Map(
     "lz4" -> classOf[LZ4CompressionCodec].getName,
     "lzf" -> classOf[LZFCompressionCodec].getName,
-    "snappy" -> classOf[SnappyCompressionCodec].getName)
+    "snappy" -> classOf[SnappyCompressionCodec].getName,
+    "zstd" -> classOf[SnappyCompressionCodec].getName)
--- End diff --

Ah, my bad. Fixed it.


---



[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130787287
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+    val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt
--- End diff --

done.


---



[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130787269
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
--- End diff --

done.


---



[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130787205
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+    val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt
+    val compressionBuffer = conf.getSizeAsBytes("spark.io.compression.lz4.blockSize", "32k").toInt
--- End diff --

You are right, we should not share the config with lz4; I created a new one.
Let's keep the default at 32kb, which is aligned with the block size used by the other compression codecs.
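
A sketch of the separation described above, with the dedicated key taken from the later diffs in this thread; the 32k default mirrors the block size of the other codecs:

import org.apache.spark.SparkConf

val conf = new SparkConf()

// Before: the zstd codec piggybacked on the lz4 block-size property, so
// tuning lz4 would silently change zstd behavior as well:
//   conf.getSizeAsBytes("spark.io.compression.lz4.blockSize", "32k")

// After: a dedicated property for the zstd buffer size.
val bufferSize = conf.getSizeAsBytes("spark.io.compression.zstd.bufferSize", "32k").toInt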


---



[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130769482
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -50,13 +51,14 @@ private[spark] object CompressionCodec {
 
   private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
     (codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
-      || codec.isInstanceOf[LZ4CompressionCodec])
+      || codec.isInstanceOf[LZ4CompressionCodec] || codec.isInstanceOf[ZStandardCompressionCodec])
   }
 
   private val shortCompressionCodecNames = Map(
     "lz4" -> classOf[LZ4CompressionCodec].getName,
     "lzf" -> classOf[LZFCompressionCodec].getName,
-    "snappy" -> classOf[SnappyCompressionCodec].getName)
+    "snappy" -> classOf[SnappyCompressionCodec].getName,
+    "zstd" -> classOf[SnappyCompressionCodec].getName)
--- End diff --

you mean `ZStandardCompressionCodec` ?
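
The fix, for clarity: the `zstd` short name should map to the new codec class rather than Snappy (class name as it stands at this point in the thread; it was later renamed to `ZStdCompressionCodec`):

import org.apache.spark.io.{LZ4CompressionCodec, LZFCompressionCodec, SnappyCompressionCodec, ZStandardCompressionCodec}

private val shortCompressionCodecNames = Map(
  "lz4" -> classOf[LZ4CompressionCodec].getName,
  "lzf" -> classOf[LZFCompressionCodec].getName,
  "snappy" -> classOf[SnappyCompressionCodec].getName,
  // was: classOf[SnappyCompressionCodec].getName (copy-paste bug)
  "zstd" -> classOf[ZStandardCompressionCodec].getName)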


---



[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130769858
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+    val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt
+    val compressionBuffer = conf.getSizeAsBytes("spark.io.compression.lz4.blockSize", "32k").toInt
--- End diff --

- Wondering if we should share this config value or have a new one.
- Do you want to set the default to something higher, like 1mb or 4mb?



---



[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130769646
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
+ *
+ * @note The wire protocol for this codec is not guaranteed to be compatible across versions
+ * of Spark. This is intended for use as an internal compression utility within a single Spark
+ * application.
+ */
+@DeveloperApi
+class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {
+
+  override def compressedOutputStream(s: OutputStream): OutputStream = {
+    val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "1").toInt
--- End diff --

Please add a comment explaining why we chose level 1 over the other levels.


---



[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18805#discussion_r130769548
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -216,3 +218,30 @@ private final class SnappyOutputStreamWrapper(os: SnappyOutputStream) extends Ou
 }
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ * ZStandard implementation of [[org.apache.spark.io.CompressionCodec]].
--- End diff --

Would be good to add a link to the spec: http://facebook.github.io/zstd/


---



[GitHub] spark pull request #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-08-01 Thread sitalkedia
GitHub user sitalkedia opened a pull request:

https://github.com/apache/spark/pull/18805

[SPARK-19112][CORE] Support for ZStandard codec

## What changes were proposed in this pull request?

Using zstd compression for Spark jobs spilling 100s of TBs of data, we could reduce the amount of data written to disk by as much as 50%. This translates to a significant latency gain because of reduced disk I/O. CPU time degrades by 2-5% because of the zstd compression overhead, but for jobs that are bottlenecked by disk I/O, this hit is acceptable.

## How was this patch tested?

Tested by running a few jobs that spill large amounts of data on the cluster; the amount of intermediate data written to disk was reduced by as much as 50%.
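
For anyone trying this out, a hedged example of enabling the new codec from application code; the `zstd` short name comes from the `shortCompressionCodecNames` change in this PR, and the two tuning properties are shown with the defaults discussed in this thread:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  // Use zstd for Spark's internal compression (shuffle spill, broadcast, etc.).
  .set("spark.io.compression.codec", "zstd")
  // Optional knobs added by this PR, shown with their defaults.
  .set("spark.io.compression.zstd.level", "1")
  .set("spark.io.compression.zstd.bufferSize", "32k")

val spark = SparkSession.builder().config(conf).getOrCreate()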


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sitalkedia/spark skedia/upstream_zstd

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18805.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18805


commit cff558b6873a8ec184159a9df3c1e83c9cd0a6e7
Author: Sital Kedia 
Date:   2017-08-02T00:41:27Z

[SPARK-19112][CORE] Support for ZStandard codec




---