[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19868
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r182077103

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---

```
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
     }
   }

+  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
```

--- End diff --

nit: we usually write `withSQLConf(SQLConf.HIVE_VERIFY_PARTITION_PATH.key -> "false")`
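As an aside for readers outside the Spark codebase: `withSQLConf` is a test helper from `SQLTestUtils` that sets the given SQL confs for the duration of a block and restores the previous values afterwards. A minimal sketch of that pattern follows; the details are simplified assumptions, not Spark's exact implementation:

```scala
// Illustrative sketch of a withSQLConf-style helper; the real one lives in
// SQLTestUtils and differs in details.
def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
  val conf = spark.sessionState.conf
  // Remember previous values so the block leaves no trace behind.
  val oldValues = pairs.map { case (k, _) =>
    k -> (if (conf.contains(k)) Some(conf.getConfString(k)) else None)
  }
  pairs.foreach { case (k, v) => conf.setConfString(k, v) }
  try f finally {
    oldValues.foreach {
      case (k, Some(v)) => conf.setConfString(k, v)
      case (k, None) => conf.unsetConf(k)
    }
  }
}
```

The `key -> "false"` spelling suggested in the nit is plain Scala tuple syntax, equivalent to `(key, "false")`; it is preferred purely for readability.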
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r182075854

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---

```
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
     }
   }

+  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
+      sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
+      val testData = sparkContext.parallelize(
+        (1 to 10).map(i => TestData(i, i.toString))).toDF()
+      testData.createOrReplaceTempView("testData")
+
+      val tmpDir = Files.createTempDir()
+      // create the table for test
+      sql(s"CREATE TABLE table_with_partition(key int,value string) " +
+        s"PARTITIONED by (ds string) location '${tmpDir.toURI}' ")
+      sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='1') " +
+        "SELECT key,value FROM testData")
+      sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='2') " +
+        "SELECT key,value FROM testData")
+      sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='3') " +
+        "SELECT key,value FROM testData")
+      sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='4') " +
+        "SELECT key,value FROM testData")
+
+      // test for the exist path
+      checkAnswer(sql("select key,value from table_with_partition"),
+        testData.toDF.collect ++ testData.toDF.collect
```

--- End diff --

`testData.union(testData).union(testData).union(testData)`
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r182075379

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---

```
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
     }
   }

+  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
+      sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
+      val testData = sparkContext.parallelize(
+        (1 to 10).map(i => TestData(i, i.toString))).toDF()
+      testData.createOrReplaceTempView("testData")
+
+      val tmpDir = Files.createTempDir()
+      // create the table for test
+      sql(s"CREATE TABLE table_with_partition(key int,value string) " +
```

--- End diff --

we should call `withTable`
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r182075222

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---

```
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
     }
   }

+  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
+      sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
+      val testData = sparkContext.parallelize(
+        (1 to 10).map(i => TestData(i, i.toString))).toDF()
+      testData.createOrReplaceTempView("testData")
+
+      val tmpDir = Files.createTempDir()
```

--- End diff --

we should call `withTempDir`
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r182075129

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---

```
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
     }
   }

+  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
+    withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "false")) {
+      sparkContext.conf.set(IGNORE_MISSING_FILES.key, "true")
+      val testData = sparkContext.parallelize(
+        (1 to 10).map(i => TestData(i, i.toString))).toDF()
+      testData.createOrReplaceTempView("testData")
```

--- End diff --

we should call `withTempView`
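Taken together, the three `withTable`/`withTempDir`/`withTempView` suggestions above would shape the test roughly as in the sketch below, so the view, directory, and table are cleaned up even when an assertion fails. This is an assumed restructuring for illustration, not the code that was merged:

```scala
// Hypothetical restructuring of the test using Spark's SQLTestUtils helpers,
// which each run a block and clean up their resource afterwards.
withTempView("testData") {
  withTempDir { tmpDir =>
    withTable("table_with_partition") {
      val testData = sparkContext.parallelize(
        (1 to 10).map(i => TestData(i, i.toString))).toDF()
      testData.createOrReplaceTempView("testData")
      sql(s"CREATE TABLE table_with_partition(key int, value string) " +
        s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI}'")
      // ... INSERT OVERWRITE each partition and checkAnswer as before ...
    }
  }
}
```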
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181939704

--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---

```
@@ -195,6 +205,10 @@ class NewHadoopRDD[K, V](
               e)
             finished = true
             null
+          case e: FileNotFoundException if ignoreMissingFiles =>
```

--- End diff --

Yes, I should change this.
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181939661

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```
@@ -276,6 +292,12 @@ class HadoopRDD[K, V](
       try {
        finished = !reader.next(key, value)
       } catch {
+        case e: FileNotFoundException if ignoreMissingFiles =>
+          logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+          finished = true
+          null
```

--- End diff --

I removed this.
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181788258

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```
@@ -276,6 +292,12 @@ class HadoopRDD[K, V](
       try {
        finished = !reader.next(key, value)
       } catch {
+        case e: FileNotFoundException if ignoreMissingFiles =>
+          logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+          finished = true
+          null
```

--- End diff --

the return value is not read by anyone
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181788194

--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---

```
@@ -195,6 +205,10 @@ class NewHadoopRDD[K, V](
               e)
             finished = true
             null
+          case e: FileNotFoundException if ignoreMissingFiles =>
```

--- End diff --

shall we also apply https://github.com/apache/spark/pull/19868/files#diff-83eb37f7b0ebed3c14ccb7bff0d577c2R273 here?
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181787994

--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---

```
@@ -218,6 +232,10 @@ class NewHadoopRDD[K, V](
               s"Skipped the rest content in the corrupted file: ${split.serializableHadoopSplit}",
               e)
             finished = true
+          case e: FileNotFoundException if ignoreMissingFiles =>
+            logWarning(s"Skipped missing file: ${split.serializableHadoopSplit}", e)
+            finished = true
+            null
```

--- End diff --

ditto
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181787857

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```
@@ -276,6 +292,12 @@ class HadoopRDD[K, V](
       try {
        finished = !reader.next(key, value)
       } catch {
+        case e: FileNotFoundException if ignoreMissingFiles =>
+          logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+          finished = true
+          null
```

--- End diff --

the return value is not read by anyone
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181718520

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```
@@ -279,6 +293,10 @@ class HadoopRDD[K, V](
         case e: IOException if ignoreCorruptFiles =>
           logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
           finished = true
+        case e: FileNotFoundException if ignoreMissingFiles =>
+          logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+          finished = true
+          null
```

--- End diff --

Same logic as FileScanRDD: setting `finished = true` makes `NextIterator.hasNext` return false. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/NextIterator.scala#L31
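For readers unfamiliar with the class being cited: `NextIterator` is Spark's base class for iterators whose end is only discovered while computing the next element. A condensed sketch of the relevant mechanics, simplified from `org.apache.spark.util.NextIterator` (details such as resource cleanup are elided):

```scala
// Condensed sketch of NextIterator: subclasses implement getNext() and set
// `finished = true` when there is nothing more to read, which is exactly what
// the new FileNotFoundException case does for a missing file.
abstract class NextIterator[U] extends Iterator[U] {
  private var gotNext = false
  private var nextValue: U = _
  protected var finished = false

  protected def getNext(): U

  override def hasNext: Boolean = {
    if (!finished && !gotNext) {
      nextValue = getNext()
      gotNext = !finished
    }
    !finished
  }

  override def next(): U = {
    if (!hasNext) throw new NoSuchElementException("End of stream")
    gotNext = false
    nextValue
  }
}
```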
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181716009

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```
@@ -197,17 +200,24 @@ class HadoopRDD[K, V](
     val jobConf = getJobConf()
     // add the credentials here as this can be called before SparkContext initialized
     SparkHadoopUtil.get.addCredentials(jobConf)
-    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
-    val inputSplits = if (ignoreEmptySplits) {
-      allInputSplits.filter(_.getLength > 0)
-    } else {
-      allInputSplits
-    }
-    val array = new Array[Partition](inputSplits.size)
-    for (i <- 0 until inputSplits.size) {
-      array(i) = new HadoopPartition(id, i, inputSplits(i))
+    try {
+      val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
+      val inputSplits = if (ignoreEmptySplits) {
+        allInputSplits.filter(_.getLength > 0)
+      } else {
+        allInputSplits
+      }
+      val array = new Array[Partition](inputSplits.size)
+      for (i <- 0 until inputSplits.size) {
+        array(i) = new HadoopPartition(id, i, inputSplits(i))
+      }
+      array
+    } catch {
+      case e: InvalidInputException if ignoreMissingFiles =>
```

--- End diff --

Yes, when data dir doesn't exist.
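The quoted diff is truncated right at the new `catch` case. Presumably the handler logs the missing input directory and yields an empty partition array; a sketch of that assumed completion follows (the message text and the `FileInputFormat.INPUT_DIR` lookup are guesses, not quoted from the PR):

```scala
// Assumed completion of the truncated catch case: warn about the missing input
// directory and return no partitions instead of failing the job.
} catch {
  case e: InvalidInputException if ignoreMissingFiles =>
    logWarning(s"${jobConf.get(FileInputFormat.INPUT_DIR)} doesn't exist and no " +
      "partitions were returned from this path.", e)
    Array.empty[Partition]
}
```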
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181607994

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---

```
@@ -70,6 +71,45 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl
     }
   }

+  test("Replace spark.sql.hive.verifyPartitionPath by spark.files.ignoreMissingFiles") {
```

--- End diff --

we should add some documentation for `spark.sql.hive.verifyPartitionPath`, saying it's replaced by `spark.files.ignoreMissingFiles` and will be removed in future releases.
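A sketch of how such a note could look on the `SQLConf` entry; the doc wording here is illustrative, not the text that was eventually merged:

```scala
// Illustrative SQLConf entry carrying the suggested deprecation note.
val HIVE_VERIFY_PARTITION_PATH = buildConf("spark.sql.hive.verifyPartitionPath")
  .doc("When true, check all the partition paths under the table's root directory " +
    "when reading data stored in HDFS. This configuration is replaced by " +
    "spark.files.ignoreMissingFiles and may be removed in future releases.")
  .booleanConf
  .createWithDefault(false)
```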
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181607863

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```
@@ -279,6 +293,10 @@ class HadoopRDD[K, V](
         case e: IOException if ignoreCorruptFiles =>
           logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
           finished = true
+        case e: FileNotFoundException if ignoreMissingFiles =>
+          logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+          finished = true
+          null
```

--- End diff --

why returning null here?
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181607797

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```
@@ -260,6 +270,10 @@ class HadoopRDD[K, V](
             logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
             finished = true
             null
+          case e: FileNotFoundException if ignoreMissingFiles =>
```

--- End diff --

`FileNotFoundException` extends `IOException`
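The implication of this remark: because `FileNotFoundException` is a subclass of `IOException`, a preceding `case e: IOException if ignoreCorruptFiles` would also match missing files whenever `ignoreCorruptFiles` is on, so the more specific case must come first. A minimal sketch of the required ordering (guards and messages follow the diffs above; surrounding code elided):

```scala
// Case order matters in Scala pattern matching: put the more specific
// FileNotFoundException case before the general IOException case, otherwise
// corrupt-file handling would swallow missing files whenever it is enabled.
try {
  finished = !reader.next(key, value)
} catch {
  case e: FileNotFoundException if ignoreMissingFiles =>
    logWarning(s"Skipped missing file: ${split.inputSplit}", e)
    finished = true
  case e: IOException if ignoreCorruptFiles =>
    logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
    finished = true
}
```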
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181607746

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```
@@ -197,17 +200,24 @@ class HadoopRDD[K, V](
     val jobConf = getJobConf()
     // add the credentials here as this can be called before SparkContext initialized
     SparkHadoopUtil.get.addCredentials(jobConf)
-    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
-    val inputSplits = if (ignoreEmptySplits) {
-      allInputSplits.filter(_.getLength > 0)
-    } else {
-      allInputSplits
-    }
-    val array = new Array[Partition](inputSplits.size)
-    for (i <- 0 until inputSplits.size) {
-      array(i) = new HadoopPartition(id, i, inputSplits(i))
+    try {
+      val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
+      val inputSplits = if (ignoreEmptySplits) {
+        allInputSplits.filter(_.getLength > 0)
+      } else {
+        allInputSplits
+      }
+      val array = new Array[Partition](inputSplits.size)
+      for (i <- 0 until inputSplits.size) {
+        array(i) = new HadoopPartition(id, i, inputSplits(i))
+      }
+      array
+    } catch {
+      case e: InvalidInputException if ignoreMissingFiles =>
```

--- End diff --

when will this happen? the root path doesn't exist?
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181571713

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---

```
@@ -197,17 +200,24 @@ class HadoopRDD[K, V](
     val jobConf = getJobConf()
     // add the credentials here as this can be called before SparkContext initialized
     SparkHadoopUtil.get.addCredentials(jobConf)
-    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
-    val inputSplits = if (ignoreEmptySplits) {
-      allInputSplits.filter(_.getLength > 0)
-    } else {
-      allInputSplits
-    }
-    val array = new Array[Partition](inputSplits.size)
-    for (i <- 0 until inputSplits.size) {
-      array(i) = new HadoopPartition(id, i, inputSplits(i))
+    try {
```

--- End diff --

Put original logic in try-catch.
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181571746

--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---

```
@@ -124,17 +126,25 @@ class NewHadoopRDD[K, V](
         configurable.setConf(_conf)
       case _ =>
     }
-    val allRowSplits = inputFormat.getSplits(new JobContextImpl(_conf, jobId)).asScala
-    val rawSplits = if (ignoreEmptySplits) {
-      allRowSplits.filter(_.getLength > 0)
-    } else {
-      allRowSplits
-    }
-    val result = new Array[Partition](rawSplits.size)
-    for (i <- 0 until rawSplits.size) {
-      result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
+    try {
```

--- End diff --

Put original logic in try-catch.
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181538066

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

```
@@ -176,12 +176,13 @@ class HadoopTableReader(
         val matches = fs.globStatus(pathPattern)
         matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
       }
-      // convert /demo/data/year/month/day to /demo/data/*/*/*/
+      // convert /demo/data/year/month/day to /demo/data/year/month/*/
```

--- End diff --

Sure, that would be great. Is there an existing PR/JIRA working on that? If not, I can make the change :)
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181371382

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

```
@@ -176,12 +176,13 @@ class HadoopTableReader(
         val matches = fs.globStatus(pathPattern)
         matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
       }
-      // convert /demo/data/year/month/day to /demo/data/*/*/*/
+      // convert /demo/data/year/month/day to /demo/data/year/month/*/
```

--- End diff --

can we make `spark.sql.files.ignoreMissingFiles` work for hive scan? it seems a better solution than this check.
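Concretely, "making it work for Hive scan" means the Hadoop RDDs that back Hive table reads must consult the flag themselves. A sketch of the wiring, assuming the core config key `spark.files.ignoreMissingFiles` (the `IGNORE_MISSING_FILES` constant the test earlier in this thread already references):

```scala
// Sketch: HadoopRDD/NewHadoopRDD read the core flag once on the driver; the
// guard conditions in getPartitions and compute then consult this field.
private val ignoreMissingFiles =
  sparkContext.conf.getBoolean("spark.files.ignoreMissingFiles", defaultValue = false)
```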
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181294385

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

```
@@ -176,12 +176,13 @@ class HadoopTableReader(
         val matches = fs.globStatus(pathPattern)
         matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
       }
-      // convert /demo/data/year/month/day to /demo/data/*/*/*/
+      // convert /demo/data/year/month/day to /demo/data/year/month/*/
```

--- End diff --

I'm not sure. `spark.sql.hive.verifyPartitionPath` is only for Hive table scan and `spark.sql.files.ignoreMissingFiles` is only for datasource scan, isn't it?
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181272430

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

```
@@ -176,12 +176,13 @@ class HadoopTableReader(
         val matches = fs.globStatus(pathPattern)
         matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
       }
-      // convert /demo/data/year/month/day to /demo/data/*/*/*/
+      // convert /demo/data/year/month/day to /demo/data/year/month/*/
```

--- End diff --

Actually do we still need this check? now we have something like `spark.sql.files.ignoreMissingFiles`
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181014951

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

```
@@ -176,12 +176,13 @@ class HadoopTableReader(
         val matches = fs.globStatus(pathPattern)
         matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
       }
-      // convert /demo/data/year/month/day to /demo/data/*/*/*/
+      // convert /demo/data/year/month/day to /demo/data/year/month/*/
```

--- End diff --

@cloud-fan @jiangxb1987 Thanks a lot for the review.

> Em... It seems we have to check all the levels unless we have specified a value for each partition column. We can make some improvement here but seems that require more complicated approach.

Yes, true. In this change, I only optimize the case where the user specifies a value for each partition column, which is very common in production; our users typically write: `select xxx from yyy where year=yy and month=mm and day=dd`

I'm not sure which you would prefer: leave the current logic as it is (at least the code is very simple now)? Implement a more complicated approach that defends against as many cases as possible? Or improve on this PR and cover the very common cases?

Thanks again for the review :)
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r180962121

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

```
@@ -176,12 +176,13 @@ class HadoopTableReader(
         val matches = fs.globStatus(pathPattern)
         matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
       }
-      // convert /demo/data/year/month/day to /demo/data/*/*/*/
+      // convert /demo/data/year/month/day to /demo/data/year/month/*/
```

--- End diff --

If the partition columns are `A/B/C/D`, unless you specify a value for `A`, you have to check that level. The same goes for `B`/`C`/`D`.
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r180961987

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

```
@@ -176,12 +176,13 @@ class HadoopTableReader(
         val matches = fs.globStatus(pathPattern)
         matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
       }
-      // convert /demo/data/year/month/day to /demo/data/*/*/*/
+      // convert /demo/data/year/month/day to /demo/data/year/month/*/
```

--- End diff --

Em... It seems we have to check all the levels unless we have specified a value for each partition column. We can make some improvement here, but that seems to require a more complicated approach.
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r180741999

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

```
@@ -176,12 +176,13 @@ class HadoopTableReader(
         val matches = fs.globStatus(pathPattern)
         matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
       }
-      // convert /demo/data/year/month/day to /demo/data/*/*/*/
+      // convert /demo/data/year/month/day to /demo/data/year/month/*/
```

--- End diff --

how about `select * from test where A='00'`? Shall we check one more level?
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r180742133

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

```
@@ -176,12 +176,13 @@ class HadoopTableReader(
         val matches = fs.globStatus(pathPattern)
         matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
       }
-      // convert /demo/data/year/month/day to /demo/data/*/*/*/
+      // convert /demo/data/year/month/day to /demo/data/year/month/*/
```

--- End diff --

and more trickily, `select * from test where B='01'`?
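To make the edge cases above concrete, here is a hypothetical sketch of how a partition path collapses to a glob pattern depending on which partition columns the query constrains. The helper and names are illustrative, not code from the PR:

```scala
// Hypothetical illustration: build the glob pattern for a table partitioned by
// (A, B), wildcarding only the levels the query does not constrain.
val partitionCols = Seq("A", "B")

def pathPattern(base: String, specified: Map[String, String]): String =
  partitionCols
    .map(col => specified.get(col).map(v => s"$col=$v").getOrElse("*"))
    .mkString(base + "/", "/", "")

pathPattern("/data", Map("A" -> "00", "B" -> "01")) // /data/A=00/B=01 (fully pinned)
pathPattern("/data", Map("A" -> "00"))              // /data/A=00/*   (must glob level B)
pathPattern("/data", Map("B" -> "01"))              // /data/*/B=01   (must glob level A)
```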
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r180733361

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

```
@@ -176,12 +176,13 @@ class HadoopTableReader(
         val matches = fs.globStatus(pathPattern)
         matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
       }
-      // convert /demo/data/year/month/day to /demo/data/*/*/*/
+      // convert /demo/data/year/month/day to /demo/data/year/month/*/
```

--- End diff --

@cloud-fan Thanks a lot for the review. As I mentioned in the description, "/demo/data/year/month/day" is no longer converted to "/demo/data/*/*/*/"; instead it's converted to "/demo/data/year/month/*/" as a path pattern. When this path pattern is passed to `updateExistPathSetByPathPattern`, fewer matched paths are returned.
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r180713699

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---

```
@@ -176,12 +176,13 @@ class HadoopTableReader(
         val matches = fs.globStatus(pathPattern)
         matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString)
       }
-      // convert /demo/data/year/month/day to /demo/data/*/*/*/
+      // convert /demo/data/year/month/day to /demo/data/year/month/*/
```

--- End diff --

This is pretty old logic. Can you explain what's going on here and why your change works? It will help other people understand your change quickly.
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
GitHub user jinxing64 opened a pull request: https://github.com/apache/spark/pull/19868

[SPARK-22676] Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true

## What changes were proposed in this pull request?

The current code scans all partition paths when spark.sql.hive.verifyPartitionPath=true. e.g. for a table like below:

```
CREATE TABLE `test`(
  `id` int,
  `age` int,
  `name` string)
PARTITIONED BY (
  `A` string,
  `B` string)

load data local inpath '/tmp/data0' into table test partition(A='00', B='00')
load data local inpath '/tmp/data1' into table test partition(A='01', B='01')
load data local inpath '/tmp/data2' into table test partition(A='10', B='10')
load data local inpath '/tmp/data3' into table test partition(A='11', B='11')
```

If I query with SQL like "select * from test where A='00' and B='00'", the current code scans all partition paths, including '/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10', '/data/A=11/B=11'. This costs a lot of time and memory.

This PR proposes to avoid iterating all partition paths: convert /demo/data/year/month/day to /demo/data/year/month/*/ when generating the path pattern.

## How was this patch tested?

Manually tested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-22676

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19868.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19868

commit 57676609faed4512291979a8d639e3be1ec80578
Author: jinxing
Date: 2017-12-03T07:07:12Z

    [SPARK-22676] Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true
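To make the "old logic" discussed in the review concrete: the pre-existing helper replaces every trailing partition level with a wildcard, while this PR keeps the levels the query pins down. A hypothetical before/after sketch of the idea (function shape assumed, not copied from `HadoopTableReader`):

```scala
import org.apache.hadoop.fs.Path

// Assumed shape of the pre-PR helper: strip parNum trailing partition levels
// and wildcard all of them, e.g. /demo/data/year/month/day -> /demo/data/*/*/*/
def oldPathPattern(parNum: Int, partPath: Path): String = {
  var base = partPath
  (1 to parNum).foreach(_ => base = base.getParent)
  base.toString + (1 to parNum).map(_ => "*").mkString("/", "/", "/")
}

// The PR's idea: wildcard only the levels the query leaves unspecified, so the
// same input becomes /demo/data/year/month/*/ when year and month are pinned,
// and globStatus touches far fewer directories.
```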