[GitHub] spark pull request #19882: [SPARK-22672][SQL][TEST] Refactor ORC Tests
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19882#discussion_r155166536

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcPartitionDiscoverySuite.scala ---
@@ -186,8 +159,8 @@ class OrcPartitionDiscoverySuite extends QueryTest with TestHiveSingleton with B
         makePartitionDir(base, defaultPartitionName, "pi" -> pi, "ps" -> ps))
     }

-      read
-        .option(ConfVars.DEFAULTPARTITIONNAME.varname, defaultPartitionName)
+      spark.read
+        .option("hive.exec.default.partition.name", defaultPartitionName)
--- End diff --

the new ORC didn't change these config names?
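For readers following along, a minimal sketch of what the refactored test exercises; the path is hypothetical and stands in for the directory that makePartitionDir produces:

```scala
// Hedged sketch: read a partitioned ORC directory while overriding the name
// used for null/default partition values, mirroring the test change above.
val df = spark.read
  .option("hive.exec.default.partition.name", "__HIVE_DEFAULT_PARTITION__")
  .format("orc")
  .load("/tmp/orc_partitioned_base")
df.show()
```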
[GitHub] spark issue #19901: [SPARK-22705][SQL] Case, Coalesce, and In use less globa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19901 **[Test build #84534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84534/testReport)** for PR 19901 at commit [`c81d795`](https://github.com/apache/spark/commit/c81d795565ae347243c3571885820b5ac15bf494).
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r155165795

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

These changes are pretty safe. In case we move ORC to another location, these entries will still resolve to the right one.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r155165558

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

The `org.apache.spark.sql.execution.datasources.*` path is meant to be private too, but I think it's okay to leave it for practical use cases, with some comments.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r155165388

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -587,7 +601,8 @@ object DataSource extends Logging {
     if (provider1.toLowerCase(Locale.ROOT) == "orc" ||
--- End diff --

`provider1` can't be "orc" anymore. It can only be `classOf[OrcFileFormat].getCanonicalName` or `"org.apache.spark.sql.hive.orc.OrcFileFormat"`.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r155165377

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

I think we should rename the variable and/or fix the comments there the next time we touch the code around it, to prevent confusion.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r155164921

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

For Parquet, this is for historical reasons, see #13311. Previously you could use Parquet via `org.apache.spark.sql.execution.datasources.parquet` and `org.apache.spark.sql.execution.datasources.parquet.DefaultSource`, so it is also for backward compatibility. This new ORC is not the same case.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r155163332

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

This is for safety. We also do it for Parquet.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r155162795

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

When I received [the advice](https://github.com/apache/spark/pull/19871#discussion_r154580007), I thought it was for consistency.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r155162572

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

Like `USING org.apache.spark.sql.hive.orc`, we want to use `USING org.apache.spark.sql.execution.datasources.orc`, don't we?
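To make the two spellings in question concrete, a hedged sketch (table names are hypothetical, and whether the fully qualified native path should resolve at all is exactly what this thread is deciding):

```scala
// Hive-based ORC through its fully qualified (nominally private) class path:
spark.sql("CREATE TABLE t_hive(i INT) USING org.apache.spark.sql.hive.orc")
// The analogous spelling for the new native ORC source:
spark.sql("CREATE TABLE t_native(i INT) USING org.apache.spark.sql.execution.datasources.orc")
```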
[GitHub] spark issue #19905: [SPARK-22710] ConfigBuilder.fallbackConf should trigger ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19905 **[Test build #84533 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84533/testReport)** for PR 19905 at commit [`028e0fa`](https://github.com/apache/spark/commit/028e0faf87f15c9a25acc3e4f6755d3d49bb7904).
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r155160863

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

ah good catch! Sounds like we don't need a compatibility rule for the new ORC.
[GitHub] spark issue #19854: [SPARK-22660][BUILD] Use position() and limit() to fix a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19854 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84525/ Test PASSed.
[GitHub] spark issue #19854: [SPARK-22660][BUILD] Use position() and limit() to fix a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19854 Merged build finished. Test PASSed.
[GitHub] spark issue #19854: [SPARK-22660][BUILD] Use position() and limit() to fix a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19854

**[Test build #84525 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84525/testReport)** for PR 19854 at commit [`dbe6b3f`](https://github.com/apache/spark/commit/dbe6b3fdfce50f25872b559018bfb1b77e81d04f).

 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19905: [SPARK-22710] ConfigBuilder.fallbackConf should trigger ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19905 cc @vanzin
[GitHub] spark pull request #19905: [SPARK-22710] ConfigBuilder.fallbackConf should t...
GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/19905

    [SPARK-22710] ConfigBuilder.fallbackConf should trigger onCreate function

## What changes were proposed in this pull request?

I was looking at the config code today and found that configs defined using ConfigBuilder.fallbackConf didn't trigger the onCreate function. This patch fixes it. This doesn't require backporting since we currently have no configs that use it.

## How was this patch tested?

Added a test case for all the config final creator functions in ConfigEntrySuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-22710

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19905.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19905

commit 028e0faf87f15c9a25acc3e4f6755d3d49bb7904
Author: Reynold Xin
Date: 2017-12-06T07:10:38Z

    [SPARK-22710] ConfigBuilder.fallbackConf should trigger onCreate function
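A minimal sketch of the pattern the patch fixes, assuming Spark's private `ConfigBuilder` API (`onCreate` is the callback `SQLConf` uses to register new entries; before this patch, entries created via `fallbackConf` bypassed it):

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Parent entry, created through a normal terminal method; onCreate fires.
val parent = ConfigBuilder("spark.test.parent").intConf.createWithDefault(1)

// Child entry that falls back to the parent when unset. The bug: this
// creation path never invoked the onCreate callback, so such entries were
// silently left unregistered.
val child = ConfigBuilder("spark.test.child")
  .onCreate(entry => println(s"registered ${entry.key}"))
  .fallbackConf(parent)
```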
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r155160106

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -553,6 +557,8 @@ object DataSource extends Logging {
       "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
       "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
       "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.sql.execution.datasources.orc.DefaultSource" -> nativeOrc,
+      "org.apache.spark.sql.execution.datasources.orc" -> nativeOrc,
--- End diff --

This map is for backward compatibility in case we move data sources around. I think this `datasources.orc` is newly added. Why do we need to add them here?
[GitHub] spark pull request #19864: [SPARK-22673][SQL] InMemoryRelation should utiliz...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19864#discussion_r155159490

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -94,14 +94,16 @@ class CacheManager extends Logging {
       logWarning("Asked to cache already cached data.")
     } else {
       val sparkSession = query.sparkSession
-      cachedData.add(CachedData(
-        planToCache,
-        InMemoryRelation(
-          sparkSession.sessionState.conf.useCompression,
-          sparkSession.sessionState.conf.columnBatchSize,
-          storageLevel,
-          sparkSession.sessionState.executePlan(planToCache).executedPlan,
-          tableName)))
+      val inMemoryRelation = InMemoryRelation(
+        sparkSession.sessionState.conf.useCompression,
+        sparkSession.sessionState.conf.columnBatchSize,
+        storageLevel,
+        sparkSession.sessionState.executePlan(planToCache).executedPlan,
+        tableName)
+      if (planToCache.conf.cboEnabled && planToCache.stats.rowCount.isDefined) {
--- End diff --

The statistics from the relation are based on file size; will that easily cause an OOM issue? I think in cases other than a cached query, we still use the relation's statistics. If this is an issue, doesn't it also affect the other cases?
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r155159185

--- Diff: python/pyspark/sql/group.py ---
@@ -89,8 +89,15 @@ def agg(self, *exprs):
         else:
             # Columns
             assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
-            jdf = self._jgd.agg(exprs[0]._jc,
-                                _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
+            if isinstance(exprs[0], UDFColumn):
+                assert all(isinstance(c, UDFColumn) for c in exprs)
--- End diff --

Ah, so what you're saying is that you don't support mixed Python & Java UDAFs? That's certainly something which needs to be communicated in both the documentation and the error message. Is there a reason why we don't support this?
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19903 **[Test build #84532 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84532/testReport)** for PR 19903 at commit [`d7d268d`](https://github.com/apache/spark/commit/d7d268d19e69b1477e2f3ca9cc8d3f1221d3663c).
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19903 Retest this please.
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19903 Merged build finished. Test FAILed.
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19903 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84528/ Test FAILed.
[GitHub] spark pull request #19901: [SPARK-22705][SQL] Case, Coalesce, and In use les...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19901#discussion_r155158174

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
@@ -237,53 +237,58 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate {
     val javaDataType = ctx.javaType(value.dataType)
     val valueGen = value.genCode(ctx)
     val listGen = list.map(_.genCode(ctx))
-    ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.value)
-    ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.isNull)
+    // inValue -1:isNull, 0:false, 1:true
+    val inValue = ctx.freshName("value")
     val valueArg = ctx.freshName("valueArg")
     // All the blocks are meant to be inside a do { ... } while (false); loop.
     // The evaluation of variables can be stopped when we find a matching value.
     val listCode = listGen.map(x =>
       s"""
          |${x.code}
          |if (${x.isNull}) {
-         |  ${ev.isNull} = true;
+         |  $inValue = -1; // isNull = true
         |} else if (${ctx.genEqual(value.dataType, valueArg, x.value)}) {
-         |  ${ev.isNull} = false;
-         |  ${ev.value} = true;
+         |  $inValue = 1; // value = TRUE
          |  continue;
          |}
       """.stripMargin)

     val codes = ctx.splitExpressionsWithCurrentInputs(
       expressions = listCode,
       funcName = "valueIn",
-      extraArguments = (javaDataType, valueArg) :: Nil,
-      makeSplitFunction = body =>
+      extraArguments = (javaDataType, valueArg) :: (ctx.JAVA_BYTE, inValue) :: Nil,
+      returnType = ctx.JAVA_BYTE,
+      makeSplitFunction = { body =>
         s"""
            |do {
            |  $body
            |} while (false);
-         """.stripMargin,
-      foldFunctions = _.map { funcCall =>
-        s"""
-           |$funcCall;
-           |if (${ev.value}) {
-           |  continue;
-           |}
+           |return $inValue;
         """.stripMargin
-      }.mkString("\n"))
+      },
+      foldFunctions = { funcCalls =>
+        funcCalls.map(funcCall =>
+          s"""
+             |$inValue = $funcCall;
+             |if ($inValue == 1) {
+             |  continue;
+             |}
           """.stripMargin).mkString("\n")
+      }
+    )
     ev.copy(code =
       s"""
         |${valueGen.code}
-        |${ev.value} = false;
-        |${ev.isNull} = ${valueGen.isNull};
-        |if (!${ev.isNull}) {
+        |byte $inValue = (byte)(${valueGen.isNull} ? -1 : 0); // isNull or FALSE
+        |if ($inValue != -1) {
        |  $javaDataType $valueArg = ${valueGen.value};
--- End diff --

`${valueGen.value}` may contain a constant such as `1.0D`.
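As a plain-Scala illustration (not the actual generated Java) of the tri-state byte this patch packs the old pair of mutable global booleans into:

```scala
// -1 encodes NULL, 0 encodes FALSE, 1 encodes TRUE; a single local byte
// carries both the old ev.isNull and ev.value states through the loop.
def encode(v: Option[Boolean]): Byte = v match {
  case None        => -1
  case Some(true)  => 1
  case Some(false) => 0
}

def decode(b: Byte): Option[Boolean] = if (b == -1) None else Some(b == 1)
```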
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19903

**[Test build #84528 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84528/testReport)** for PR 19903 at commit [`d7d268d`](https://github.com/apache/spark/commit/d7d268d19e69b1477e2f3ca9cc8d3f1221d3663c).

 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19882: [SPARK-22672][SQL][TEST] Refactor ORC Tests
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19882 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84524/ Test PASSed.
[GitHub] spark issue #19882: [SPARK-22672][SQL][TEST] Refactor ORC Tests
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19882 Merged build finished. Test PASSed.
[GitHub] spark issue #19882: [SPARK-22672][SQL][TEST] Refactor ORC Tests
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19882

**[Test build #84524 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84524/testReport)** for PR 19882 at commit [`baec5fe`](https://github.com/apache/spark/commit/baec5fece95896a28127cd2ac014582874ed5e28).

 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
 * `class OrcFilterSuite extends OrcTest with SharedSQLContext `
 * `implicit class IntToBinary(int: Int) `
 * `case class OrcParData(intField: Int, stringField: String)`
 * `case class OrcParDataWithKey(intField: Int, pi: Int, stringField: String, ps: String)`
 * `abstract class OrcPartitionDiscoveryTest extends OrcTest `
 * `case class AllDataTypesWithNonPrimitiveType(`
 * `case class BinaryData(binaryData: Array[Byte])`
 * `case class Contact(name: String, phone: String)`
 * `case class Person(name: String, age: Int, contacts: Seq[Contact])`
 * `abstract class OrcQueryTest extends OrcTest `
 * ` test(\"Creating case class RDD table\") `
 * ` test(\"save and load case class RDD with `None`s as orc\") `
 * `class OrcQuerySuite extends OrcQueryTest with SharedSQLContext `
 * `case class OrcData(intField: Int, stringField: String)`
 * `abstract class OrcSuite extends OrcTest with BeforeAndAfterAll `
 * `class OrcSourceSuite extends OrcSuite with SharedSQLContext `
 * `abstract class OrcTest extends QueryTest with SQLTestUtils with BeforeAndAfterAll `
 * `class HiveOrcFilterSuite extends OrcTest with TestHiveSingleton `
 * `implicit class IntToBinary(int: Int) `
 * `class HiveOrcPartitionDiscoverySuite extends OrcPartitionDiscoveryTest with TestHiveSingleton `
 * `class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton `
 * `class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton `
 * `class HiveOrcHadoopFsRelationSuite extends OrcHadoopFsRelationSuite `
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19873 Thanks! @gatorsmile @cloud-fan
[GitHub] spark issue #18581: [SPARK-21289][SQL][ML] Supports custom line separator fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18581 **[Test build #84531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84531/testReport)** for PR 18581 at commit [`265dd48`](https://github.com/apache/spark/commit/265dd48ce16fd62058f4515a9e91c67942b45ed7).
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19903 Merged build finished. Test PASSed.
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19903 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84526/ Test PASSed.
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19903

**[Test build #84526 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84526/testReport)** for PR 19903 at commit [`ba1e3ae`](https://github.com/apache/spark/commit/ba1e3ae39116c53a8f0d1ed7c71f737fa2e9d0b6).

 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19903 Thanks, @gatorsmile.
[GitHub] spark issue #19899: [SPARK-22704][SQL] Least and Greatest use less global va...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19899 Merged build finished. Test FAILed.
[GitHub] spark issue #19899: [SPARK-22704][SQL] Least and Greatest use less global va...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19899 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84527/ Test FAILed.
[GitHub] spark issue #19899: [SPARK-22704][SQL] Least and Greatest use less global va...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19899

**[Test build #84527 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84527/testReport)** for PR 19899 at commit [`facaf1c`](https://github.com/apache/spark/commit/facaf1cda0fae41c86a860265b562f38acd63605).

 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r155151425

--- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala ---
@@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
       spark.sql("DROP TABLE IF EXISTS libsvmTable")
     }
   }
+
+  def testLineSeparator(lineSep: String): Unit = {
+    test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") {
+      val data = Seq(
+        "1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep)
+      val dataWithTrailingLineSep = s"$data$lineSep"
+
+      Seq(data, dataWithTrailingLineSep).foreach { lines =>
+        val path0 = new File(tempDir.getCanonicalPath, "write0")
+        val path1 = new File(tempDir.getCanonicalPath, "write1")
+        try {
+          // Read
+          java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8))
+          val df = spark.read
+            .option("lineSep", lineSep)
+            .format("libsvm")
+            .load(path0.getAbsolutePath)
+
+          assert(df.columns(0) == "label")
+          assert(df.columns(1) == "features")
+          val row1 = df.first()
+          assert(row1.getDouble(0) == 1.0)
+          val v = row1.getAs[SparseVector](1)
+          assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
+
--- End diff --

OK, let me update it. It's easy to change anyway.
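For reference, a hedged sketch of the option the test exercises once this PR lands; the path and separator are hypothetical:

```scala
// Read LibSVM records split on a custom separator instead of '\n'.
val df = spark.read
  .option("lineSep", "\u0001")
  .format("libsvm")
  .load("/tmp/libsvm_custom_sep")
```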
[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19873
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19903 LGTM pending Jenkins on #84528
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19873 Merged to master.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19873 LGTM
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/19876 You can see some of the past discussion on https://github.com/apache/spark/pull/9207 cc @sethah @MLnick
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19717 **[Test build #84530 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84530/testReport)** for PR 19717 at commit [`0e8ca01`](https://github.com/apache/spark/commit/0e8ca012dc6d5dfd5645f45b6d1bb84e1e9e72a5).
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19903 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84521/ Test PASSed.
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19903 Merged build finished. Test PASSed.
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19903

**[Test build #84521 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84521/testReport)** for PR 19903 at commit [`d6cdde7`](https://github.com/apache/spark/commit/d6cdde748fb25e51f7f100a9bbd48a1d17547ccf).

 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84520/ Test PASSed.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Merged build finished. Test PASSed.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19873

**[Test build #84520 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84520/testReport)** for PR 19873 at commit [`d2375e0`](https://github.com/apache/spark/commit/d2375e0185185cd3516b25c14b48c12840957cbf).

 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Merged build finished. Test PASSed.
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19903 @cloud-fan, @HyukjinKwon, @gatorsmile. The PR is updated again. Please let me know if I need to update more. Thank you!
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84518/ Test PASSed.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19873

**[Test build #84518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84518/testReport)** for PR 19873 at commit [`4775a02`](https://github.com/apache/spark/commit/4775a0249b34be16e7bc15dbd4b291fddfe92656).

 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
 * `case class ReqAndHandler(req: Request, handler: MemberHandler)`
 * `trait TypeCoercionRule extends Rule[LogicalPlan] with Logging `
[GitHub] spark issue #19899: [SPARK-22704][SQL] Least and Greatest use less global va...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19899 **[Test build #84529 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84529/testReport)** for PR 19899 at commit [`d467192`](https://github.com/apache/spark/commit/d46719234bbed9ded30d6f7db6a6a403e9e25f93).
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r155148270

--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/Client.scala ---
@@ -0,0 +1,234 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s.submit
+
+import java.util.{Collections, UUID}
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import io.fabric8.kubernetes.api.model._
+import io.fabric8.kubernetes.client.KubernetesClient
+
+import org.apache.spark.SparkConf
+import org.apache.spark.deploy.SparkApplication
+import org.apache.spark.deploy.k8s.Config._
+import org.apache.spark.deploy.k8s.Constants._
+import org.apache.spark.deploy.k8s.SparkKubernetesClientFactory
+import org.apache.spark.deploy.k8s.submit.steps.DriverConfigurationStep
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Utils
+
+/**
+ * Encapsulates arguments to the submission client.
+ *
+ * @param mainAppResource the main application resource if any
+ * @param mainClass the main class of the application to run
+ * @param driverArgs arguments to the driver
+ */
+private[spark] case class ClientArguments(
+    mainAppResource: Option[MainAppResource],
+    mainClass: String,
+    driverArgs: Array[String])
+
+private[spark] object ClientArguments {
+
+  def fromCommandLineArgs(args: Array[String]): ClientArguments = {
+    var mainAppResource: Option[MainAppResource] = None
+    var mainClass: Option[String] = None
+    val driverArgs = mutable.ArrayBuffer.empty[String]
+
+    args.sliding(2, 2).toList.foreach {
+      case Array("--primary-java-resource", primaryJavaResource: String) =>
+        mainAppResource = Some(JavaMainAppResource(primaryJavaResource))
+      case Array("--main-class", clazz: String) =>
+        mainClass = Some(clazz)
+      case Array("--arg", arg: String) =>
+        driverArgs += arg
+      case other =>
+        val invalid = other.mkString(" ")
+        throw new RuntimeException(s"Unknown arguments: $invalid")
+    }
+
+    require(mainClass.isDefined, "Main class must be specified via --main-class")
+
+    ClientArguments(
+      mainAppResource,
+      mainClass.get,
+      driverArgs.toArray)
+  }
+}
+
+/**
+ * Submits a Spark application to run on Kubernetes by creating the driver pod and starting a
+ * watcher that monitors and logs the application status. Waits for the application to terminate if
+ * spark.kubernetes.submission.waitAppCompletion is true.
+ *
+ * @param submissionSteps steps that collectively configure the driver
+ * @param submissionSparkConf the submission client Spark configuration
+ * @param kubernetesClient the client to talk to the Kubernetes API server
+ * @param waitForAppCompletion a flag indicating whether the client should wait for the application
+ *                             to complete
+ * @param appName the application name
+ * @param loggingPodStatusWatcher a watcher that monitors and logs the application status
+ */
+private[spark] class Client(
+    submissionSteps: Seq[DriverConfigurationStep],
+    submissionSparkConf: SparkConf,
+    kubernetesClient: KubernetesClient,
+    waitForAppCompletion: Boolean,
+    appName: String,
+    loggingPodStatusWatcher: LoggingPodStatusWatcher) extends Logging {
+
+  private val driverJavaOptions = submissionSparkConf.get(
+    org.apache.spark.internal.config.DRIVER_JAVA_OPTIONS)
+
+  /**
+   * Run command that initializes a DriverSpec that will be updated after each
+   * DriverConfigurationStep in the sequence that is passed in. The final KubernetesDriverSpec
+   * will be used to build the Driver Container, Driver Pod, and Kubernetes Resources
+   */
+  def run(): Unit = {
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18853 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84519/ Test PASSed.
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18853 Merged build finished. Test PASSed.
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18853

**[Test build #84519 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84519/testReport)** for PR 18853 at commit [`7802483`](https://github.com/apache/spark/commit/7802483075e4328b3a2299413376cf2c6440a2b0).

 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #19864: [SPARK-22673][SQL] InMemoryRelation should utiliz...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/19864#discussion_r155147478

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -94,14 +94,16 @@ class CacheManager extends Logging {
       logWarning("Asked to cache already cached data.")
     } else {
       val sparkSession = query.sparkSession
-      cachedData.add(CachedData(
-        planToCache,
-        InMemoryRelation(
-          sparkSession.sessionState.conf.useCompression,
-          sparkSession.sessionState.conf.columnBatchSize,
-          storageLevel,
-          sparkSession.sessionState.executePlan(planToCache).executedPlan,
-          tableName)))
+      val inMemoryRelation = InMemoryRelation(
+        sparkSession.sessionState.conf.useCompression,
+        sparkSession.sessionState.conf.columnBatchSize,
+        storageLevel,
+        sparkSession.sessionState.executePlan(planToCache).executedPlan,
+        tableName)
+      if (planToCache.conf.cboEnabled && planToCache.stats.rowCount.isDefined) {
--- End diff --

@viirya you're right! Thanks for clearing up the confusion. However, to avoid using the relation's stats, which can be much smaller than the in-memory size and lead to a potential OOM error, we should still keep this condition here (we can remove cboEnabled, though), right?
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r155146548

--- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala ---
@@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
       spark.sql("DROP TABLE IF EXISTS libsvmTable")
     }
   }
+
+  def testLineSeparator(lineSep: String): Unit = {
+    test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") {
+      val data = Seq(
+        "1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep)
+      val dataWithTrailingLineSep = s"$data$lineSep"
+
+      Seq(data, dataWithTrailingLineSep).foreach { lines =>
+        val path0 = new File(tempDir.getCanonicalPath, "write0")
+        val path1 = new File(tempDir.getCanonicalPath, "write1")
+        try {
+          // Read
+          java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8))
+          val df = spark.read
+            .option("lineSep", lineSep)
+            .format("libsvm")
+            .load(path0.getAbsolutePath)
+
+          assert(df.columns(0) == "label")
+          assert(df.columns(1) == "features")
+          val row1 = df.first()
+          assert(row1.getDouble(0) == 1.0)
+          val v = row1.getAs[SparseVector](1)
+          assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
+
--- End diff --

Here we change how each line is handled during iteration. I think either comparing a single line or repeating multiple lines is fine. Many tests here already check only the first line, I think.
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r155146209

--- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala ---
@@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
       spark.sql("DROP TABLE IF EXISTS libsvmTable")
     }
   }
+
+  def testLineSeparator(lineSep: String): Unit = {
+    test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") {
+      val data = Seq(
+        "1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep)
+      val dataWithTrailingLineSep = s"$data$lineSep"
+
+      Seq(data, dataWithTrailingLineSep).foreach { lines =>
+        val path0 = new File(tempDir.getCanonicalPath, "write0")
+        val path1 = new File(tempDir.getCanonicalPath, "write1")
+        try {
+          // Read
+          java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8))
+          val df = spark.read
+            .option("lineSep", lineSep)
+            .format("libsvm")
+            .load(path0.getAbsolutePath)
+
+          assert(df.columns(0) == "label")
+          assert(df.columns(1) == "features")
+          val row1 = df.first()
+          assert(row1.getDouble(0) == 1.0)
+          val v = row1.getAs[SparseVector](1)
+          assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
--- End diff --

OK, but let me use `==`. It seems to be what the test cases in this file use.
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19903 **[Test build #84528 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84528/testReport)** for PR 19903 at commit [`d7d268d`](https://github.com/apache/spark/commit/d7d268d19e69b1477e2f3ca9cc8d3f1221d3663c).
[GitHub] spark pull request #19899: [SPARK-22704][SQL] Least and Greatest use less gl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19899#discussion_r155145718

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala ---
@@ -668,22 +681,35 @@ case class Greatest(children: Seq[Expression]) extends Expression {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     val evalChildren = children.map(_.genCode(ctx))
-    ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.isNull)
-    ctx.addMutableState(ctx.javaType(dataType), ev.value)
-    def updateEval(eval: ExprCode): String = {
+    val isNull = ctx.freshName("isNull")
+    ctx.addMutableState(ctx.JAVA_BOOLEAN, isNull)
+    val evals = evalChildren.map(eval =>
       s"""
         ${eval.code}
-        if (!${eval.isNull} && (${ev.isNull} ||
+        if (!${eval.isNull} && (${isNull} ||
          ${ctx.genGreater(dataType, eval.value, ev.value)})) {
-          ${ev.isNull} = false;
+          $isNull = false;
          ${ev.value} = ${eval.value};
        }
      """
-    }
-    val codes = ctx.splitExpressionsWithCurrentInputs(evalChildren.map(updateEval))
+    )
+
+    val resultType = ctx.javaType(dataType)
+    val codes = ctx.splitExpressionsWithCurrentInputs(
+      expressions = evals,
+      funcName = "least",
--- End diff --

This should be `greatest`.
[GitHub] spark pull request #19899: [SPARK-22704][SQL] Least and Greatest use less gl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19899#discussion_r155145590

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala ---
@@ -602,23 +602,36 @@ case class Least(children: Seq[Expression]) extends Expression {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     val evalChildren = children.map(_.genCode(ctx))
-    ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.isNull)
-    ctx.addMutableState(ctx.javaType(dataType), ev.value)
-    def updateEval(eval: ExprCode): String = {
+    val isNull = ctx.freshName("isNull")
--- End diff --

val greatestTmpIsNull = ctx.freshName("greatestTmpIsNull")
[GitHub] spark pull request #19899: [SPARK-22704][SQL] Least and Greatest use less gl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19899#discussion_r155145566

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala ---
@@ -602,23 +602,36 @@ case class Least(children: Seq[Expression]) extends Expression {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     val evalChildren = children.map(_.genCode(ctx))
-    ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.isNull)
-    ctx.addMutableState(ctx.javaType(dataType), ev.value)
-    def updateEval(eval: ExprCode): String = {
+    val isNull = ctx.freshName("leastTmpIsNull")
--- End diff --

`isNull`, `ev.isNull`, and `eval.isNull` look too close.
[GitHub] spark pull request #19899: [SPARK-22704][SQL] Least and Greatest use less gl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19899#discussion_r155145502

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala ---
@@ -602,23 +602,36 @@ case class Least(children: Seq[Expression]) extends Expression {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     val evalChildren = children.map(_.genCode(ctx))
-    ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.isNull)
-    ctx.addMutableState(ctx.javaType(dataType), ev.value)
-    def updateEval(eval: ExprCode): String = {
+    val isNull = ctx.freshName("leastTmpIsNull")
--- End diff --

`val leastTmpIsNull = ctx.freshName("leastTmpIsNull")`
[GitHub] spark pull request #19899: [SPARK-22704][SQL] Least and Greatest use less gl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19899#discussion_r155145453

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala ---
@@ -602,23 +602,36 @@ case class Least(children: Seq[Expression]) extends Expression {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     val evalChildren = children.map(_.genCode(ctx))
-    ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.isNull)
-    ctx.addMutableState(ctx.javaType(dataType), ev.value)
-    def updateEval(eval: ExprCode): String = {
+    val isNull = ctx.freshName("isNull")
--- End diff --

`greatestTmpIsNull`?
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19903 Yep. The test case will be updated along with it.
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19783 Merged build finished. Test PASSed.
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19783 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84517/ Test PASSed.
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19783

**[Test build #84517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84517/testReport)** for PR 19783 at commit [`d06`](https://github.com/apache/spark/commit/d06d172897dde096c40ce70ad2584bbefc03).

 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable ex...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19903#discussion_r155145012

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -602,7 +602,8 @@ object DataSource extends Logging {
       provider1.startsWith("org.apache.spark.sql.hive.orc")) {
       throw new AnalysisException(
         "Hive-based ORC data source must be used with Hive support enabled. " +
--- End diff --

Yep.
[GitHub] spark pull request #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable ex...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19903#discussion_r155144800

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -602,7 +602,8 @@ object DataSource extends Logging {
       provider1.startsWith("org.apache.spark.sql.hive.orc")) {
       throw new AnalysisException(
         "Hive-based ORC data source must be used with Hive support enabled. " +
-        "Please use native ORC data source instead")
+        "Please use native ORC data source instead by setting 'spark.sql.orc.impl' " +
+        "configuration to 'native'")
--- End diff --

Yep.
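A hedged sketch of the workaround the revised message points users at, assuming a session built without Hive support; the path is hypothetical:

```scala
// Switch ORC handling to the new native implementation so format("orc")
// works without enabling Hive support.
spark.conf.set("spark.sql.orc.impl", "native")
val df = spark.read.format("orc").load("/tmp/some_orc_data")
```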
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84523/ Test PASSed.
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Merged build finished. Test PASSed.
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904

**[Test build #84523 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84523/testReport)** for PR 19904 at commit [`8bfcc94`](https://github.com/apache/spark/commit/8bfcc948696dd88a55d808322079abae994dbaa6).

 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19899: [SPARK-22704][SQL] Least and Greatest use less global va...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19899 **[Test build #84527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84527/testReport)** for PR 19899 at commit [`facaf1c`](https://github.com/apache/spark/commit/facaf1cda0fae41c86a860265b562f38acd63605).
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19903 Do you have a test case?
[GitHub] spark pull request #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable ex...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19903#discussion_r155144262

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -602,7 +602,8 @@ object DataSource extends Logging {
       provider1.startsWith("org.apache.spark.sql.hive.orc")) {
       throw new AnalysisException(
         "Hive-based ORC data source must be used with Hive support enabled. " +
--- End diff --

Hive built-in ORC data source must be used with Hive support enabled.
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84522/ Test PASSed.
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Merged build finished. Test PASSed.
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904

**[Test build #84522 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84522/testReport)** for PR 19904 at commit [`7725fd8`](https://github.com/apache/spark/commit/7725fd8a86dddba6c61c7d053dfa510a114bebb8).

 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable ex...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19903#discussion_r155144120

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -602,7 +602,8 @@ object DataSource extends Logging {
       provider1.startsWith("org.apache.spark.sql.hive.orc")) {
       throw new AnalysisException(
         "Hive-based ORC data source must be used with Hive support enabled. " +
-        "Please use native ORC data source instead")
+        "Please use native ORC data source instead by setting 'spark.sql.orc.impl' " +
+        "configuration to 'native'")
--- End diff --

Please use the native ORC data source by setting 'spark.sql.orc.impl' to 'native'
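To make the wording being debated concrete, a minimal sketch of the switch the message recommends (assuming the `spark.sql.orc.impl` key named in the diff; the path is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("orc-native").getOrCreate()

// Pick the new native ORC implementation instead of the Hive-based one
spark.conf.set("spark.sql.orc.impl", "native")

// With the native implementation, this read does not need Hive support enabled
val df = spark.read.format("orc").load("/tmp/orc-data")
```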
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r155143739

--- Diff: python/pyspark/sql/readwriter.py ---
@@ -313,11 +313,13 @@ def text(self, paths):
         Each line in the text file is a new row in the resulting DataFrame.

         :param paths: string, or list of strings, for input path(s).
+        :param wholetext: if true, read each file from input path(s) as a single row.

         >>> df = spark.read.text('python/test_support/sql/text-test.txt')
         >>> df.collect()
         [Row(value=u'hello'), Row(value=u'this')]
         """
--- End diff --

Can you add a doctest for `wholetext` too?
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r15514

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala ---
@@ -97,14 +109,26 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
     assert(
       requiredSchema.length <= 1,
       "Text data source only produces a single data column named \"value\".")
-
+    val textOptions = new TextOptions(options)
     val broadcastedHadoopConf =
       sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
+    readToUnsafeMem(broadcastedHadoopConf, requiredSchema, textOptions.wholeText)
+  }
+
+  private def readToUnsafeMem(conf: Broadcast[SerializableConfiguration],
+      requiredSchema: StructType, wholeTextMode: Boolean):
+      (PartitionedFile) => Iterator[UnsafeRow] = {
+    (file: PartitionedFile) => {
-      val reader = new HadoopFileLinesReader(file, broadcastedHadoopConf.value.value)
+      val confValue = conf.value.value
+      var reader: Iterator[Text] with Closeable = null
+      if (!wholeTextMode) {
+        reader = new HadoopFileLinesReader(file, confValue)
+      } else {
+        reader = new HadoopFileWholeTextReader(file, confValue)
+      }
--- End diff --

We can avoid using `var`:

```scala
val reader = if (!wholeTextMode) {
  new HadoopFileLinesReader(file, confValue)
} else {
  new HadoopFileWholeTextReader(file, confValue)
}
```
[GitHub] spark issue #18610: [SPARK-21386] ML LinearRegression supports warm start fr...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/18610

@sethah I don't think we need to save the initial model right now, and if it's going to slow down the PR's progress, maybe dropping it makes sense -- but I guess it depends on how complicated that is? For the second point, I also think creating an initial model from a set of coefficients could be useful, but I think that could be addressed in a follow-up issue -- what do you think?
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r155141609

--- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala ---
@@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
     spark.sql("DROP TABLE IF EXISTS libsvmTable")
   }
 }
+
+  def testLineSeparator(lineSep: String): Unit = {
+    test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") {
+      val data = Seq(
+        "1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep)
+      val dataWithTrailingLineSep = s"$data$lineSep"
+
+      Seq(data, dataWithTrailingLineSep).foreach { lines =>
+        val path0 = new File(tempDir.getCanonicalPath, "write0")
+        val path1 = new File(tempDir.getCanonicalPath, "write1")
+        try {
+          // Read
+          java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8))
+          val df = spark.read
+            .option("lineSep", lineSep)
+            .format("libsvm")
+            .load(path0.getAbsolutePath)
+
+          assert(df.columns(0) == "label")
+          assert(df.columns(1) == "features")
+          val row1 = df.first()
+          assert(row1.getDouble(0) == 1.0)
+          val v = row1.getAs[SparseVector](1)
+          assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
--- End diff --

so I prefer to keep it consistent with the others.
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r155141551

--- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala ---
@@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
     spark.sql("DROP TABLE IF EXISTS libsvmTable")
   }
 }
+
+  def testLineSeparator(lineSep: String): Unit = {
+    test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") {
+      val data = Seq(
+        "1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep)
+      val dataWithTrailingLineSep = s"$data$lineSep"
+
+      Seq(data, dataWithTrailingLineSep).foreach { lines =>
+        val path0 = new File(tempDir.getCanonicalPath, "write0")
+        val path1 = new File(tempDir.getCanonicalPath, "write1")
+        try {
+          // Read
+          java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8))
+          val df = spark.read
+            .option("lineSep", lineSep)
+            .format("libsvm")
+            .load(path0.getAbsolutePath)
+
+          assert(df.columns(0) == "label")
+          assert(df.columns(1) == "features")
+          val row1 = df.first()
+          assert(row1.getDouble(0) == 1.0)
+          val v = row1.getAs[SparseVector](1)
+          assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
+
--- End diff --

The following test only includes checking `df` and `readbackDF` equality, but it seems we also need to test the whole loaded `df` against the raw file content.
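A hedged sketch of the extra assertion being asked for here, written as a fragment that would sit inside the quoted test and reusing its names (`path0`, `lineSep`, `df`); this illustrates the idea and is not code from the PR:

```scala
import java.nio.charset.StandardCharsets
import java.util.regex.Pattern

// Re-read the raw bytes and derive the expected record count straight from
// the file content, then check the loaded DataFrame against it
val raw = new String(
  java.nio.file.Files.readAllBytes(path0.toPath), StandardCharsets.UTF_8)
val expectedRecords = raw.split(Pattern.quote(lineSep)).filter(_.nonEmpty)
assert(df.count() == expectedRecords.length)
```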
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19903 Thank you, @HyukjinKwon and @cloud-fan .
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19903 **[Test build #84526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84526/testReport)** for PR 19903 at commit [`ba1e3ae`](https://github.com/apache/spark/commit/ba1e3ae39116c53a8f0d1ed7c71f737fa2e9d0b6).
[GitHub] spark pull request #19864: [SPARK-22673][SQL] InMemoryRelation should utiliz...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19864#discussion_r155141140

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -94,14 +94,16 @@ class CacheManager extends Logging {
       logWarning("Asked to cache already cached data.")
     } else {
       val sparkSession = query.sparkSession
-      cachedData.add(CachedData(
-        planToCache,
-        InMemoryRelation(
-          sparkSession.sessionState.conf.useCompression,
-          sparkSession.sessionState.conf.columnBatchSize,
-          storageLevel,
-          sparkSession.sessionState.executePlan(planToCache).executedPlan,
-          tableName)))
+      val inMemoryRelation = InMemoryRelation(
+        sparkSession.sessionState.conf.useCompression,
+        sparkSession.sessionState.conf.columnBatchSize,
+        storageLevel,
+        sparkSession.sessionState.executePlan(planToCache).executedPlan,
+        tableName)
+      if (planToCache.conf.cboEnabled && planToCache.stats.rowCount.isDefined) {
--- End diff --

If a catalog table doesn't have statistics in its metadata, we will fill it with `defaultSizeInBytes`.

https://github.com/apache/spark/blob/326f1d6728a7734c228d8bfaa69442a1c7b92e9b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L121-L134
[GitHub] spark issue #19903: [SPARK-20728][SQL][FOLLOWUP] Use an actionable exception...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19903 LGTM
[GitHub] spark issue #19854: [SPARK-22660][BUILD] Use position() and limit() to fix a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19854 **[Test build #84525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84525/testReport)** for PR 19854 at commit [`dbe6b3f`](https://github.com/apache/spark/commit/dbe6b3fdfce50f25872b559018bfb1b77e81d04f).
[GitHub] spark pull request #19864: [SPARK-22673][SQL] InMemoryRelation should utiliz...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19864#discussion_r155140756

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -94,14 +94,16 @@ class CacheManager extends Logging {
       logWarning("Asked to cache already cached data.")
     } else {
       val sparkSession = query.sparkSession
-      cachedData.add(CachedData(
-        planToCache,
-        InMemoryRelation(
-          sparkSession.sessionState.conf.useCompression,
-          sparkSession.sessionState.conf.columnBatchSize,
-          storageLevel,
-          sparkSession.sessionState.executePlan(planToCache).executedPlan,
-          tableName)))
+      val inMemoryRelation = InMemoryRelation(
+        sparkSession.sessionState.conf.useCompression,
+        sparkSession.sessionState.conf.columnBatchSize,
+        storageLevel,
+        sparkSession.sessionState.executePlan(planToCache).executedPlan,
+        tableName)
+      if (planToCache.conf.cboEnabled && planToCache.stats.rowCount.isDefined) {
--- End diff --

`LogicalRelation` uses the statistics from the `relation` only when there is no given `catalogTable`. In this case, it doesn't consider whether CBO is enabled or not. Only `catalogTable` considers CBO when computing its statistics in `toPlanStats`. It doesn't refer to the relation's statistics, IIUC.
[GitHub] spark issue #19882: [SPARK-22672][SQL][TEST] Refactor ORC Tests
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19882 **[Test build #84524 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84524/testReport)** for PR 19882 at commit [`baec5fe`](https://github.com/apache/spark/commit/baec5fece95896a28127cd2ac014582874ed5e28).
[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize exist...
Github user CodingCat commented on the issue: https://github.com/apache/spark/pull/19864 @viirya sure will add
[GitHub] spark pull request #19864: [SPARK-22673][SQL] InMemoryRelation should utiliz...
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/19864#discussion_r155140125

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -94,14 +94,16 @@ class CacheManager extends Logging {
       logWarning("Asked to cache already cached data.")
     } else {
       val sparkSession = query.sparkSession
-      cachedData.add(CachedData(
-        planToCache,
-        InMemoryRelation(
-          sparkSession.sessionState.conf.useCompression,
-          sparkSession.sessionState.conf.columnBatchSize,
-          storageLevel,
-          sparkSession.sessionState.executePlan(planToCache).executedPlan,
-          tableName)))
+      val inMemoryRelation = InMemoryRelation(
+        sparkSession.sessionState.conf.useCompression,
+        sparkSession.sessionState.conf.columnBatchSize,
+        storageLevel,
+        sparkSession.sessionState.executePlan(planToCache).executedPlan,
+        tableName)
+      if (planToCache.conf.cboEnabled && planToCache.stats.rowCount.isDefined) {
--- End diff --

no, if CBO is disabled, the relation's sizeInBytes is the file size

https://github.com/apache/spark/blob/5c3a1f3fad695317c2fff1243cdb9b3ceb25c317/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala#L85
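For context on the branch being discussed, a hedged sketch of how the CBO path obtains a defined `rowCount` in the first place (the table name and session setup are illustrative, not from the PR):

```scala
// Enable cost-based optimization so logical plans carry row counts
spark.sql("SET spark.sql.cbo.enabled=true")

// ANALYZE stores sizeInBytes and rowCount in the catalog; without it the plan
// falls back to the file size (or spark.sql.defaultSizeInBytes)
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")

// With CBO on and rowCount defined, the new code path can reuse these stats
spark.table("events").cache()
```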
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r155139949

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala ---
@@ -39,6 +39,54 @@ class TextSuite extends QueryTest with SharedSQLContext {
     verifyFrame(spark.read.text(testFile))
   }

+  test("reading text file with option wholetext=true") {
+    val df = spark.read.option("wholetext", "true")
+      .format("text").load(testFile)
+    // schema
+    assert(df.schema == new StructType().add("value", StringType))
+
+    // verify content
+    val data = df.collect()
+    assert(data(0) ==
+      Row(
+        // scalastyle:off nonascii
+        """This is a test file for the text data source
+          |1+1
+          |数据砖头
+          |"doh"
+          |""".stripMargin))
+    // scalastyle:on
--- End diff --

nit: // scalastyle:on nonascii
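A small, hedged usage sketch of the `wholetext` option being tested above, assuming an existing `spark` session (the path is illustrative):

```scala
// One output row per file, instead of one row per line
val whole = spark.read.option("wholetext", true).text("/tmp/text-test.txt")
whole.show(truncate = false)
```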