[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...

2016-06-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13531


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...

2016-06-09 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/13531#discussion_r66539899
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala
 ---
@@ -340,6 +340,40 @@ class FileSourceStrategySuite extends QueryTest with 
SharedSQLContext with Predi
 }
   }
 
+  test("SPARK-15654 do not split non-splittable files") {
--- End diff --

Updated


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...

2016-06-09 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/13531#discussion_r66538048
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala
 ---
@@ -340,6 +340,40 @@ class FileSourceStrategySuite extends QueryTest with 
SharedSQLContext with Predi
 }
   }
 
+  test("SPARK-15654 do not split non-splittable files") {
--- End diff --

Should we also test the bin-packing the gzipped file?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...

2016-06-09 Thread clockfly
Github user clockfly commented on a diff in the pull request:

https://github.com/apache/spark/pull/13531#discussion_r66533473
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala
 ---
@@ -298,6 +309,28 @@ trait FileFormat {
 }
 
 /**
+ * The base class file format that is based on text file.
+ */
+abstract class TextBasedFileFormat extends FileFormat {
+  private var codecFactory: CompressionCodecFactory = null
+  override def isSplitable(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  path: Path): Boolean = {
+if (codecFactory == null) {
+  synchronized {
--- End diff --

I am not sure we need "synchronized" here or not, do we want to ensure 
FileFormat is thread safe?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...

2016-06-09 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/13531#discussion_r66478617
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala
 ---
@@ -298,6 +309,28 @@ trait FileFormat {
 }
 
 /**
+ * The base class file format that is based on text file.
+ */
+abstract class TextBasedFileFormat extends FileFormat {
+  private var codecFactory: CompressionCodecFactory = null
+  override def isSplitable(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  path: Path): Boolean = {
+if (codecFactory == null) {
+  synchronized {
+if (codecFactory == null) {
+  codecFactory = new CompressionCodecFactory(
--- End diff --

There could be "io.compression.codecs" in options, it's used by getCodec().

Since all other APIs have that, it's better to have that here too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...

2016-06-09 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13531#discussion_r66451629
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala
 ---
@@ -298,6 +309,28 @@ trait FileFormat {
 }
 
 /**
+ * The base class file format that is based on text file.
+ */
+abstract class TextBasedFileFormat extends FileFormat {
+  private var codecFactory: CompressionCodecFactory = null
+  override def isSplitable(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  path: Path): Boolean = {
+if (codecFactory == null) {
+  synchronized {
+if (codecFactory == null) {
+  codecFactory = new CompressionCodecFactory(
--- End diff --

sorry. It seems we can use `sparkSession.sessionState.newHadoopConf()` 
instread of `sparkSession.sessionState.newHadoopConfWithOptions(options)` (I 
checked  the `FileSourceStrategySuite` test passed without passing `options` in 
`CompressionCodecFactory`).
So, we need `options` in the arguments of `isSplitable`?

```
if (codecFactory == null) {
  codecFactory = new 
CompressionCodecFactory(sparkSession.sessionState.newHadoopConf())
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...

2016-06-08 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/13531#discussion_r66383788
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala
 ---
@@ -298,6 +309,28 @@ trait FileFormat {
 }
 
 /**
+ * The base class file format that is based on text file.
+ */
+abstract class TextBasedFileFormat extends FileFormat {
+  private var codecFactory: CompressionCodecFactory = null
+  override def isSplitable(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  path: Path): Boolean = {
+if (codecFactory == null) {
+  synchronized {
+if (codecFactory == null) {
+  codecFactory = new CompressionCodecFactory(
--- End diff --

Sorry, I do not understand your question or suggestion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...

2016-06-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13531#discussion_r66016111
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala
 ---
@@ -298,6 +309,28 @@ trait FileFormat {
 }
 
 /**
+ * The base class file format that is based on text file.
+ */
+abstract class TextBasedFileFormat extends FileFormat {
+  private var codecFactory: CompressionCodecFactory = null
+  override def isSplitable(
+  sparkSession: SparkSession,
+  options: Map[String, String],
+  path: Path): Boolean = {
+if (codecFactory == null) {
+  synchronized {
+if (codecFactory == null) {
+  codecFactory = new CompressionCodecFactory(
--- End diff --

We need to pass `options` into `CompressionCodecFactory` here? It seems 
`getCodec` only checks suffixes in `input`.

https://github.com/apache/hadoop-common/blob/HADOOP-3628/src/core/org/apache/hadoop/io/compress/CompressionCodecFactory.java#L153


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13531: [SPARK-15654] [SQL] fix non-splitable files for t...

2016-06-06 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/13531

[SPARK-15654] [SQL] fix non-splitable files for text based file formats

## What changes were proposed in this pull request?

This PR is based on #13442 , fix the bug for non-splittable files for all 
text based file formats.

Closes #13442 

## How was this patch tested?

add regression tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark fix_split

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13531.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13531


commit 975a1423e6650dae8eb9bf5f22b986ade39702a4
Author: Davies Liu 
Date:   2016-06-06T19:13:07Z

fix non-splitable files for text based file formats

commit dad01936a4e891da8a3dfafd1a7e06d5952b52c2
Author: Davies Liu 
Date:   2016-06-06T20:14:50Z

add regression test




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org