[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18654 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r128134922 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala --- @@ -236,7 +236,10 @@ object FileFormatWriter extends Logging { committer.setupTask(taskAttemptContext) val writeTask = - if (description.partitionColumns.isEmpty && description.bucketIdExpression.isEmpty) { + if (sparkPartitionId != 0 && !iterator.hasNext) { --- End diff -- cc @yhuai too who reviewed my similar PR before. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127888746 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempPath { dir => --- End diff -- More clear :) No need to create source files in real. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127876852 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempPath { dir => --- End diff -- Could we maybe just do as below? ```scala withTempPath { path => spark.range(100).repartition(10).where("id = 50").write.parquet(path) val partFiles = path.listFiles() .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_")) assert(partFiles.length === 2) } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127872988 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { +override def accept(dir: File, name: String): Boolean = { + !name.startsWith(".") && !name.startsWith("_") +} + }) + assert(allFiles.length == 10) + + withTempDir { dst_dir => +dst_dir.delete() +df.where("id = 50").write.parquet(dst_dir.toString) +val allFiles = dst_dir.listFiles(new FilenameFilter { + override def accept(dir: File, name: String): Boolean = { +!name.startsWith(".") && !name.startsWith("_") + } +}) +// First partition file and the data file --- End diff -- Can't agree more, firstly I try to implement like this but the `FileFormatWriter.write` can only see the iterator of each task self. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127869754 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { +override def accept(dir: File, name: String): Boolean = { + !name.startsWith(".") && !name.startsWith("_") +} + }) + assert(allFiles.length == 10) + + withTempDir { dst_dir => +dst_dir.delete() +df.where("id = 50").write.parquet(dst_dir.toString) --- End diff -- I mean.. for example, if we happen to have a single partition in the `df` in any event, I guess this test will become invalid ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127868378 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { +override def accept(dir: File, name: String): Boolean = { + !name.startsWith(".") && !name.startsWith("_") +} + }) + assert(allFiles.length == 10) + + withTempDir { dst_dir => +dst_dir.delete() +df.where("id = 50").write.parquet(dst_dir.toString) --- End diff -- I was thinking just in order to make sure the (previous) number of files written out. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127867549 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { +override def accept(dir: File, name: String): Boolean = { + !name.startsWith(".") && !name.startsWith("_") +} + }) + assert(allFiles.length == 10) + + withTempDir { dst_dir => +dst_dir.delete() +df.where("id = 50").write.parquet(dst_dir.toString) +val allFiles = dst_dir.listFiles(new FilenameFilter { + override def accept(dir: File, name: String): Boolean = { +!name.startsWith(".") && !name.startsWith("_") + } +}) +// First partition file and the data file --- End diff -- Ideally we only need the first partition file if all other partitions are empty, but this is hard to do right now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127867486 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { +override def accept(dir: File, name: String): Boolean = { + !name.startsWith(".") && !name.startsWith("_") +} + }) + assert(allFiles.length == 10) + + withTempDir { dst_dir => +dst_dir.delete() +df.where("id = 50").write.parquet(dst_dir.toString) --- End diff -- why we need repartition? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127867380 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { --- End diff -- +1 for the shorter one --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127867341 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => --- End diff -- +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127867290 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala --- @@ -236,7 +236,10 @@ object FileFormatWriter extends Logging { committer.setupTask(taskAttemptContext) val writeTask = - if (description.partitionColumns.isEmpty && description.bucketIdExpression.isEmpty) { + if (sparkPartitionId != 0 && !iterator.hasNext) { --- End diff -- cc @hvanhovell --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127867254 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala --- @@ -236,7 +236,10 @@ object FileFormatWriter extends Logging { committer.setupTask(taskAttemptContext) val writeTask = - if (description.partitionColumns.isEmpty && description.bucketIdExpression.isEmpty) { + if (sparkPartitionId != 0 && !iterator.hasNext) { --- End diff -- This is a little hacky but is the simplest fix I think. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127865091 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { +override def accept(dir: File, name: String): Boolean = { + !name.startsWith(".") && !name.startsWith("_") +} + }) + assert(allFiles.length == 10) --- End diff -- OK, I'll remove this assert and leave a note. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127860788 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { +override def accept(dir: File, name: String): Boolean = { + !name.startsWith(".") && !name.startsWith("_") +} + }) + assert(allFiles.length == 10) --- End diff -- but I guess this one (the latter) does not test this change? If this test passes regardless of this PR change, I would rather remove this one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127860327 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { --- End diff -- Yea. If both are okay, let's go for the shorter one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127856720 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { +override def accept(dir: File, name: String): Boolean = { + !name.startsWith(".") && !name.startsWith("_") +} + }) + assert(allFiles.length == 10) --- End diff -- Just make sure the source dir have many files, and the output dir only have 2 files. If this make people confuse, just leave a notes and delete the assert? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127856498 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { --- End diff -- Both is ok I think, just copy this from `HadoopFsRelationSuite`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127724518 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { +override def accept(dir: File, name: String): Boolean = { + !name.startsWith(".") && !name.startsWith("_") +} + }) + assert(allFiles.length == 10) --- End diff -- Could I ask what this test targets? I think I am lost around here ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127722941 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => --- End diff -- `withTempPath` can be used instead I believe. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127724839 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { +override def accept(dir: File, name: String): Boolean = { + !name.startsWith(".") && !name.startsWith("_") +} + }) + assert(allFiles.length == 10) + + withTempDir { dst_dir => +dst_dir.delete() +df.where("id = 50").write.parquet(dst_dir.toString) --- End diff -- I would explicitly repartition here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127723973 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import java.io.{File, FilenameFilter} + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempDir { dir => + dir.delete() + spark.range(1).repartition(10).write.parquet(dir.toString) + val df = spark.read.parquet(dir.toString) + val allFiles = dir.listFiles(new FilenameFilter { --- End diff -- Can we just do this simpler? for example, ``` .listFiles().filter { f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_") } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127722734 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala --- @@ -236,7 +236,10 @@ object FileFormatWriter extends Logging { committer.setupTask(taskAttemptContext) val writeTask = - if (description.partitionColumns.isEmpty && description.bucketIdExpression.isEmpty) { + if (sparkPartitionId != 0 && !iterator.hasNext) { --- End diff -- I guess this might be okay in that sense we are guaranteed partitions to be started from 0 up to my knowledge. Could you take a look and see if it makes sense to you cc @cloud-fan if you don't mind? I am not confident enough to proceed reviewing and leave a sign-off. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
GitHub user xuanyuanking opened a pull request: https://github.com/apache/spark/pull/18654 [SPARK-21435][SQL] Empty files should be skipped while write to file ## What changes were proposed in this pull request? Add EmptyDirectoryWriteTask for empty task while writing files. ## How was this patch tested? Add new test in `FileFormatWriterSuite ` You can merge this pull request into a Git repository by running: $ git pull https://github.com/xuanyuanking/spark SPARK-21435 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18654.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18654 commit ff92ba3ae3abaae0d8ce8bb8c08465f43c14906f Author: xuanyuanking Date: 2017-07-17T07:38:56Z empty files should be skipped while write to file commit e08fb1939d5267a38f3318af7506b6ed8628ebbf Author: xuanyuanking Date: 2017-07-17T10:20:34Z Handle the empty result of parquet commit 6153001bc42deee197030ad91fbb4f72bd1aa5d3 Author: xuanyuanking Date: 2017-07-17T12:09:08Z Modify UT and add more notes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org