[GitHub] spark issue #19178: [SPARK-21966][SQL]ResolveMissingReference rule should no...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/19178

Thanks, I will give it a try in our private repository, as there are several such cases and the users want them handled in a seamless way. It is really complicated to support this in general. Should I close this PR now? @gatorsmile @maropu
[GitHub] spark issue #16158: [SPARK-18724][ML] Add TuningSummary for TrainValidationS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16158

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81622/
[GitHub] spark issue #16158: [SPARK-18724][ML] Add TuningSummary for TrainValidationS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16158

Merged build finished. Test PASSed.
[GitHub] spark issue #16158: [SPARK-18724][ML] Add TuningSummary for TrainValidationS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16158

**[Test build #81622 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81622/testReport)** for PR 16158 at commit [`297091f`](https://github.com/apache/spark/commit/297091fcb8562bba856fadb5848c2e66e0f30bfb).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19106: [SPARK-21770][ML] ProbabilisticClassificationMode...
Github user smurching commented on a diff in the pull request: https://github.com/apache/spark/pull/19106#discussion_r137988907

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala ---
@@ -245,6 +245,13 @@ private[ml] object ProbabilisticClassificationModel {
         v.values(i) /= sum
         i += 1
       }
+    } else {
+      var i = 0
+      val size = v.size
+      while (i < size) {
+        v.values(i) = 1.0 / size
--- End diff --

You could use `java.util.Arrays.fill` to update `v.values` in-place, but I'm not sure that it'll make a huge difference.
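For reference, a minimal sketch of the suggested alternative, assuming `v.values` is the mutable `Array[Double]` backing a dense vector (the wrapper object below is only for illustration):

```scala
import java.util.Arrays

object UniformFillExample {
  def main(args: Array[String]): Unit = {
    // Stand-in for v.values: the mutable array backing the vector.
    val values = new Array[Double](4)
    // One call replaces the manual while loop; every entry becomes 1.0 / size.
    Arrays.fill(values, 1.0 / values.length)
    println(values.mkString(", ")) // 0.25, 0.25, 0.25, 0.25
  }
}
```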
[GitHub] spark issue #18704: [SPARK-20783][SQL] Create ColumnVector to abstract exist...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/18704

ping @cloud-fan
[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19184#discussion_r137988200

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -104,6 +124,10 @@ public void loadNext() throws IOException {
     if (taskContext != null) {
       taskContext.killTaskIfInterrupted();
     }
+    if (this.din == null) {
+      // Good time to init (if all files are opened, we can get Too Many files exception)
+      initStreams();
+    }
--- End diff --

A valid fix should introduce a new config to control the number of concurrently opened spill files; that also means you should use some data structure to keep track of requests to open spill files.
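One way to read this suggestion is a shared gate that callers must pass before opening a spill file. A minimal sketch, assuming a fixed limit (the object, limit, and wiring below are illustrative, not Spark code):

```scala
import java.util.concurrent.Semaphore

object SpillFileGate {
  // Hypothetical configurable cap on concurrently open spill files.
  private val permits = new Semaphore(64)

  // Block until an open slot is free; call before opening a spill file.
  def acquireOpenSlot(): Unit = permits.acquire()

  // Release the slot; call right after closing the spill file's stream.
  def releaseOpenSlot(): Unit = permits.release()
}
```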
[GitHub] spark issue #19174: [SPARK-21963][CORE][TEST]Create temp file should be dele...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19174

This does not seem like a big problem; all the temp files are created under `target/tmp`, and they can be cleaned by `mvn clean` or `sbt clean`.
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137988061

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JacksonGeneratorSuite.scala ---
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.json
+
+import java.io.CharArrayWriter
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, GenericArrayData}
+import org.apache.spark.sql.types._
+
+class JacksonGeneratorSuite extends SparkFunSuite {
+
+  val gmtId = DateTimeUtils.TimeZoneGMT.getID
+  val option = new JSONOptions(Map.empty, gmtId)
+
+  test("initial with StructType and write out a row") {
+    val dataType = StructType(StructField("a", IntegerType) :: Nil)
+    val input = InternalRow(1)
+    val writer = new CharArrayWriter()
+    val gen = new JacksonGenerator(dataType, writer, option)
+    gen.write(input)
+    gen.flush()
+    assert(writer.toString === """{"a":1}""")
+  }
+
+  test("initial with StructType and write out rows") {
+    val dataType = StructType(StructField("a", IntegerType) :: Nil)
+    val input = new GenericArrayData(InternalRow(1) :: InternalRow(2) :: Nil)
+    val writer = new CharArrayWriter()
+    val gen = new JacksonGenerator(dataType, writer, option)
+    gen.write(input)
+    gen.flush()
+    assert(writer.toString === """[{"a":1},{"a":2}]""")
+  }
+
+  test("initial with StructType and write out an array with single empty row") {
+    val dataType = StructType(StructField("a", IntegerType) :: Nil)
+    val input = new GenericArrayData(InternalRow(null) :: Nil)
+    val writer = new CharArrayWriter()
+    val gen = new JacksonGenerator(dataType, writer, option)
+    gen.write(input)
+    gen.flush()
+    assert(writer.toString === """[{}]""")
+  }
+
+  test("initial with StructType and write out an empty array") {
+    val dataType = StructType(StructField("a", IntegerType) :: Nil)
+    val input = new GenericArrayData(Nil)
+    val writer = new CharArrayWriter()
+    val gen = new JacksonGenerator(dataType, writer, option)
+    gen.write(input)
+    gen.flush()
+    assert(writer.toString === """[]""")
+  }
+
+  test("initial with Map and write out a map data") {
+    val dataType = MapType(StringType, IntegerType)
+    val input = ArrayBasedMapData(Map("a" -> 1))
+    val writer = new CharArrayWriter()
+    val gen = new JacksonGenerator(dataType, writer, option)
+    gen.write(input)
+    gen.flush()
+    assert(writer.toString === """{"a":1}""")
+  }
+
+  test("initial with Map and write out an array of maps") {
+    val dataType = MapType(StringType, IntegerType)
+    val input = new GenericArrayData(
+      ArrayBasedMapData(Map("a" -> 1)) :: ArrayBasedMapData(Map("b" -> 2)) :: Nil)
+    val writer = new CharArrayWriter()
+    val gen = new JacksonGenerator(dataType, writer, option)
+    gen.write(input)
+    gen.flush()
+    assert(writer.toString === """[{"a":1},{"b":2}]""")
+  }
--- End diff --

We need negative test cases, i.e., `JacksonGenerator` initialized with `StructType` but used to write a map or an array of maps, and likewise `JacksonGenerator` initialized with `MapType` but used to write a struct or an array of structs. (A sketch of one such case follows.)
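A minimal sketch of one such negative case, following the suite's conventions; the expected exception type is an assumption here (the discussion below settles on `UnsupportedOperationException` for a mismatched root type):

```scala
test("initialized with StructType but asked to write a map should fail") {
  val dataType = StructType(StructField("a", IntegerType) :: Nil)
  val input = ArrayBasedMapData(Map("a" -> 1))
  val writer = new CharArrayWriter()
  val gen = new JacksonGenerator(dataType, writer, option)
  // A struct-typed generator should reject map input outright.
  intercept[UnsupportedOperationException] {
    gen.write(input)
  }
}
```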
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137987843

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -193,14 +228,32 @@ private[sql] class JacksonGenerator(
    *
    * @param row The row to convert
    */
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+    case st: StructType =>
+      writeObject(writeFields(row, st, rootFieldWriters))
+    case _ => throw new UnsupportedOperationException(
--- End diff --

It will throw `ClassCastException`.
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18875

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81616/
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137987691

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -193,14 +228,32 @@ private[sql] class JacksonGenerator(
    *
    * @param row The row to convert
    */
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+    case st: StructType =>
+      writeObject(writeFields(row, st, rootFieldWriters))
+    case _ => throw new UnsupportedOperationException(
--- End diff --

Hmm, what type of exception is thrown now if we call `write(row)` when the generator is initialized with `MapType`?
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18875

Merged build finished. Test PASSed.
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875

**[Test build #81616 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81616/testReport)** for PR 18875 at commit [`b8bf38c`](https://github.com/apache/spark/commit/b8bf38cda5f3b03473f5bd1f1c4c8599e342b869).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19118: [SPARK-21882][CORE] OutputMetrics doesn't count written ...
Github user awarrior commented on the issue: https://github.com/apache/spark/pull/19118

I ran into trouble when writing a test case. It seems that this issue won't be triggered with only one node: I found that the driver node does createPathFromString, so there is no problem there.

> java.lang.Thread.State: RUNNABLE
>     at org.apache.hadoop.fs.FileSystem.getStatistics(FileSystem.java:3271)
>     - locked <0x1211> (a java.lang.Class)
>     at org.apache.hadoop.fs.FileSystem.initialize(FileSystem.java:202)
>     at org.apache.hadoop.fs.RawLocalFileSystem.initialize(RawLocalFileSystem.java:92)
>     at org.apache.hadoop.fs.LocalFileSystem.initialize(LocalFileSystem.java:47)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>     at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:55)

Does anyone know how to test this case?
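One hedged suggestion: run the test with a `local-cluster` master so tasks execute in separate executor JVMs rather than in the driver, which may exercise the executor-side `FileSystem` path. A sketch only, not a verified reproduction:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB] starts real
// executor processes, unlike plain "local[*]" where everything runs in-process.
val conf = new SparkConf()
  .setMaster("local-cluster[2, 1, 1024]")
  .setAppName("output-metrics-bytes-written-test")
val sc = new SparkContext(conf)
```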
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137987296

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -193,14 +228,32 @@ private[sql] class JacksonGenerator(
    *
    * @param row The row to convert
    */
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+    case st: StructType =>
+      writeObject(writeFields(row, st, rootFieldWriters))
+    case _ => throw new UnsupportedOperationException(
--- End diff --

Oh, I got you wrong. I thought you meant the matching in `rootFieldWriters`. So we need to keep both of them?
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137986494

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -193,14 +228,32 @@ private[sql] class JacksonGenerator(
    *
    * @param row The row to convert
    */
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+    case st: StructType =>
+      writeObject(writeFields(row, st, rootFieldWriters))
+    case _ => throw new UnsupportedOperationException(
--- End diff --

I saw you removed the pattern matching?
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137986264

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -677,14 +696,25 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
     case _: StructType | ArrayType(_: StructType, _) =>
       try {
-        JacksonUtils.verifySchema(rowSchema)
+        JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case _: MapType | ArrayType(_: MapType, _) =>
+      // TODO: let `JacksonUtils.verifySchema` verify a `MapType`
+      try {
+        val st = StructType(StructField("a", rowSchema.asInstanceOf[MapType]) :: Nil)
+        JacksonUtils.verifySchema(st)
         TypeCheckResult.TypeCheckSuccess
       } catch {
         case e: UnsupportedOperationException =>
           TypeCheckResult.TypeCheckFailure(e.getMessage)
       }
     case _ => TypeCheckResult.TypeCheckFailure(
-      s"Input type ${child.dataType.simpleString} must be a struct or array of structs.")
+      s"Input type ${child.dataType.simpleString} must be a struct, array of structs or " +
+        s"an arbitrary map.")
--- End diff --

`an arbitrary map` -> `a map or array of maps.`
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137986124

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -677,14 +696,42 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
     case _: StructType | ArrayType(_: StructType, _) =>
       try {
-        JacksonUtils.verifySchema(rowSchema)
+        JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case ArrayType(mt: MapType, _) =>
+      try {
+        val st = StructType(StructField("a", mt) :: Nil)
+        JacksonUtils.verifySchema(st)
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case MapType(_: DataType, st: StructType, _: Boolean) =>
--- End diff --

Yeah, since you support arbitrary maps now. Not all map types, actually; it is left to `JacksonUtils` to verify the schema.
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875

**[Test build #81625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81625/testReport)** for PR 18875 at commit [`b71b56a`](https://github.com/apache/spark/commit/b71b56ab49ba048f397459a18f946422d9440b29).
[GitHub] spark issue #19132: [SPARK-21922] Fix duration always updating when task fai...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19132

I still have a question about the history server: was your event log incomplete or complete when you hit this issue?
[GitHub] spark pull request #18592: [SPARK-21368][SQL] TPCDSQueryBenchmark can't refe...
Github user sarutak commented on a diff in the pull request: https://github.com/apache/spark/pull/18592#discussion_r137985658

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala ---
@@ -99,6 +95,20 @@ object TPCDSQueryBenchmark {
   }

   def main(args: Array[String]): Unit = {
+    if (args.length < 1) {
--- End diff --

We can pass the argument through the run configuration even when we use an IDE like IntelliJ, right? Or how about giving `dataLocation` through a new property?
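A sketch of the "new property" idea: fall back to a JVM system property when no CLI argument is given. The property name `spark.sql.tpcds.dataLocation` is hypothetical, chosen only for illustration:

```scala
// Prefer the positional argument; otherwise read the assumed system property,
// e.g. -Dspark.sql.tpcds.dataLocation=/path/to/tpcds-data.
val dataLocation: String = args.headOption.getOrElse {
  sys.props.getOrElse("spark.sql.tpcds.dataLocation",
    sys.error("Set a data location argument or -Dspark.sql.tpcds.dataLocation=..."))
}
```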
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137984767

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -677,14 +696,42 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
     case _: StructType | ArrayType(_: StructType, _) =>
       try {
-        JacksonUtils.verifySchema(rowSchema)
+        JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case ArrayType(mt: MapType, _) =>
+      try {
+        val st = StructType(StructField("a", mt) :: Nil)
+        JacksonUtils.verifySchema(st)
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case MapType(_: DataType, st: StructType, _: Boolean) =>
--- End diff --

@viirya I think that if we have `case mt: MapType`, we don't need this pattern to verify the schema, right?
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875

**[Test build #81624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81624/testReport)** for PR 18875 at commit [`f567679`](https://github.com/apache/spark/commit/f567679ec724052266587e35cdf25868587de750).
[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19186

**[Test build #81623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81623/testReport)** for PR 19186 at commit [`f8fa957`](https://github.com/apache/spark/commit/f8fa9573a1b40ff236e9c52cf429e2742c8f2bd0).
[GitHub] spark pull request #19186: [SPARK-21972][ML] Add param handlePersistence
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19186

[SPARK-21972][ML] Add param handlePersistence

## What changes were proposed in this pull request?

Add param handlePersistence

## How was this patch tested?

existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark fix_double_cache

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19186.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19186

commit 45adf7e35175228821cc8f84a63855b851e25d55
Author: Zheng RuiFeng
Date: 2017-03-31T03:10:45Z
    recreate pr

commit 208e1009037aed780ca194658a68dec2221db3bc
Author: Zheng RuiFeng
Date: 2017-07-06T05:06:04Z
    del unused equation

commit 936d46634c69a8f0f8a546475edd3bfbd501d1e1
Author: Zheng RuiFeng
Date: 2017-08-29T09:52:28Z
    revert Predictor

commit df4d263d21c3e4413b4d74aaabcd3a10348a0001
Author: Zheng RuiFeng
Date: 2017-09-04T05:27:56Z
    create param HasHandlePersistence

commit 971e52c4a18b4261d82ac14fefa6bb849367562c
Author: Zheng RuiFeng
Date: 2017-09-04T05:57:13Z
    fix

commit f8fa9573a1b40ff236e9c52cf429e2742c8f2bd0
Author: Zheng RuiFeng
Date: 2017-09-04T06:26:18Z
    fix mima
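Based on the commit titled "create param HasHandlePersistence", a rough sketch of what such a shared param could look like; the trait shape and doc string below are assumptions, not the PR's actual code:

```scala
import org.apache.spark.ml.param.{BooleanParam, Params}

trait HasHandlePersistence extends Params {
  // Whether the algorithm should cache the input dataset itself, letting
  // callers opt out when they manage persistence externally.
  final val handlePersistence: BooleanParam = new BooleanParam(this,
    "handlePersistence", "whether to cache the input dataset internally")

  final def getHandlePersistence: Boolean = $(handlePersistence)
}
```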
[GitHub] spark pull request #17014: [SPARK-18608][ML] Fix double-caching in ML algori...
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/17014
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014

@smurching OK, I will close this PR and resubmit it to the new ticket.
[GitHub] spark issue #16158: [SPARK-18724][ML] Add TuningSummary for TrainValidationS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16158

**[Test build #81622 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81622/testReport)** for PR 16158 at commit [`297091f`](https://github.com/apache/spark/commit/297091fcb8562bba856fadb5848c2e66e0f30bfb).
[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19184#discussion_r137982497

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -104,6 +124,10 @@ public void loadNext() throws IOException {
     if (taskContext != null) {
       taskContext.killTaskIfInterrupted();
     }
+    if (this.din == null) {
+      // Good time to init (if all files are opened, we can get Too Many files exception)
+      initStreams();
+    }
--- End diff --

IIUC, this PR does not reduce the total number of open files. Since it opens files only when they are required, it may reduce the possibility of hitting a too-many-open-files error. As @viirya pointed out, it is necessary to provide a feature to control the number of files open at one point (e.g. via the priority queue).
[GitHub] spark pull request #19169: [SPARK-21957][SQL] Add current_user function
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/19169#discussion_r137982471

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CurrentUser.scala ---
@@ -0,0 +1,47 @@
+/*
--- End diff --

Add this to https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala?
[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19184#discussion_r137981999

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -104,6 +124,10 @@ public void loadNext() throws IOException {
     if (taskContext != null) {
       taskContext.killTaskIfInterrupted();
    }
+    if (this.din == null) {
+      // Good time to init (if all files are opened, we can get Too Many files exception)
+      initStreams();
+    }
--- End diff --

I agree with @viirya: we're using a priority queue to do the merge sort, so all the readers in the priority queue end up opened, and this still cannot solve the issue. I think a valid fix is to control the number of concurrently merged files, like MR's `io.sort.factor`. We also still need to address similar issues in `ExternalSorter` and other places in the shuffle code.
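A sketch of the kind of knob being suggested, analogous to MR's `io.sort.factor`; the config name and default below are assumptions, not an existing Spark setting:

```scala
import org.apache.spark.SparkConf

object SpillMergeFactorSketch {
  // Hypothetical setting capping how many spill files are merged (and thus
  // held open) at once.
  def mergeFactor(conf: SparkConf): Int =
    conf.getInt("spark.shuffle.spill.mergeFactor", 100)
}
```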
[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19184#discussion_r137981905

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -104,6 +124,10 @@ public void loadNext() throws IOException {
     if (taskContext != null) {
       taskContext.killTaskIfInterrupted();
     }
+    if (this.din == null) {
+      // Good time to init (if all files are opened, we can get Too Many files exception)
--- End diff --

This comment looks confusing. Maybe: `This is the time to initialize and hold the input stream of the spill file for loading records. Opening the input stream too early will very likely lead to a too-many-open-files issue.`
[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19184#discussion_r137981565

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -72,12 +77,27 @@ public UnsafeSorterSpillReader(
       bufferSizeBytes = DEFAULT_BUFFER_SIZE_BYTES;
     }

+    try (InputStream bs = new NioBufferedFileInputStream(file, (int) bufferSizeBytes);
+         DataInputStream dataIn = new DataInputStream(serializerManager.wrapStream(blockId, bs))) {
+      this.numRecords = dataIn.readInt();
+      this.numRecordsRemaining = numRecords;
+    }
+
+    this.buffSize = bufferSizeBytes;
+    this.file = file;
+    this.blockId = blockId;
+    this.serializerManager = serializerManager;
+
+    logger.debug("bufSize: {}, file: {}, records: {}", buffSize, file, this.numRecords);
--- End diff --

Is this log useful? If there are many spill readers, I guess we don't want that much log output.
[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19184#discussion_r137981394

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -72,12 +77,27 @@ public UnsafeSorterSpillReader(
       bufferSizeBytes = DEFAULT_BUFFER_SIZE_BYTES;
     }

+    try (InputStream bs = new NioBufferedFileInputStream(file, (int) bufferSizeBytes);
--- End diff --

Please add a comment here to say we don't need to hold the file open until we actually want to load the records.
[GitHub] spark issue #19178: [SPARK-21966][SQL]ResolveMissingReference rule should no...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19178

yea, thanks!
[GitHub] spark issue #19178: [SPARK-21966][SQL]ResolveMissingReference rule should no...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19178

Yes. We do not plan to introduce extra complexity for improving this rule.
[GitHub] spark issue #19178: [SPARK-21966][SQL]ResolveMissingReference rule should no...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19178

Aha, you mean this is an expected design and we won't change this logic in `ResolveMissingReference` now, right?
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16677

ping @jiangxb1987 @cloud-fan This can be a useful feature for production, as @wzhfy suggested. Can you help review this if possible? Thanks.
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19184

cc @cloud-fan @jiangxb1987
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19184

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81614/
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19184

Merged build finished. Test PASSed.
[GitHub] spark issue #19068: [SPARK-21428][SQL][FOLLOWUP] Reused state should respect...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19068

cc @cloud-fan @jiangxb1987
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19184

**[Test build #81614 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81614/testReport)** for PR 19184 at commit [`dcc2960`](https://github.com/apache/spark/commit/dcc2960d5f60add9bfd9446df59b0d0d06365947).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19147

**[Test build #81621 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81621/testReport)** for PR 19147 at commit [`dbc6dd2`](https://github.com/apache/spark/commit/dbc6dd2138b427d8436eb0d8bdc4ba134f254e35).
[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19184#discussion_r137980253

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -104,6 +124,10 @@ public void loadNext() throws IOException {
     if (taskContext != null) {
       taskContext.killTaskIfInterrupted();
     }
+    if (this.din == null) {
+      // Good time to init (if all files are opened, we can get Too Many files exception)
+      initStreams();
+    }
--- End diff --

I think you first need to describe in the PR description, in more detail, how this fixes the issue.
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875

**[Test build #81620 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81620/testReport)** for PR 18875 at commit [`b8ad442`](https://github.com/apache/spark/commit/b8ad442437390611c63d8b0705d02c3ca3ba3c9c).
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137980084

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -677,14 +696,27 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
     case _: StructType | ArrayType(_: StructType, _) =>
       try {
-        JacksonUtils.verifySchema(rowSchema)
+        JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case ArrayType(_: MapType, _) =>
+      TypeCheckResult.TypeCheckSuccess
+    case MapType(_: DataType, st: StructType, _: Boolean) =>
+      try {
+        JacksonUtils.verifySchema(st)
         TypeCheckResult.TypeCheckSuccess
       } catch {
         case e: UnsupportedOperationException =>
           TypeCheckResult.TypeCheckFailure(e.getMessage)
       }
+    case _: MapType =>
--- End diff --

OK. I'll just leave a TODO comment for it.
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16677

**[Test build #81619 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81619/testReport)** for PR 16677 at commit [`f2a7aac`](https://github.com/apache/spark/commit/f2a7aacc22caf27c1a8af612c9432586e6a86d17).
[GitHub] spark pull request #19178: [SPARK-21966][SQL]ResolveMissingReference rule sh...
Github user DonnyZone commented on a diff in the pull request: https://github.com/apache/spark/pull/19178#discussion_r137979887

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1115,6 +1115,8 @@ class Analyzer(
           g.copy(join = true, child = addMissingAttr(g.child, missing))
         case d: Distinct =>
           throw new AnalysisException(s"Can't add $missingAttrs to $d")
+        case u: Union =>
+          u.withNewChildren(u.children.map(addMissingAttr(_, missingAttrs)))
--- End diff --

Yeah, I agree with you. The current implementation only checks UnaryNode; it is necessary to take all node types into consideration. Thanks for the suggestion, I will work on a general solution.
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19086

Thanks for working on it! @jinxing64 Our current behavior is based on what the other traditional DBMS systems do. Moving tables across different databases is rare; for example, DB2 and Oracle do not support it.

- https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_9019.htm
- https://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r980.html

So far, we do not have a clean way to support both our current behavior and Hive's behavior, so we want to hold this until the community has more requests. The quality of this PR is very close to mergeable; you can apply it as a patch to your local fork.
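For context, the kind of statement under discussion, which Hive permits but Oracle, DB2, and (currently) Spark reject; the database and table names are placeholders, and a `SparkSession` named `spark` is assumed:

```scala
// Renaming a table into a different database moves it across databases.
spark.sql("ALTER TABLE db1.tbl RENAME TO db2.tbl")
```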
[GitHub] spark issue #9183: [SPARK-11215] [ML] Add multiple columns support to String...
Github user minixalpha commented on the issue: https://github.com/apache/spark/pull/9183

Thanks for your work! @WeichenXu123
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137979202

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -677,14 +696,27 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
     case _: StructType | ArrayType(_: StructType, _) =>
       try {
-        JacksonUtils.verifySchema(rowSchema)
+        JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case ArrayType(_: MapType, _) =>
+      TypeCheckResult.TypeCheckSuccess
+    case MapType(_: DataType, st: StructType, _: Boolean) =>
+      try {
+        JacksonUtils.verifySchema(st)
         TypeCheckResult.TypeCheckSuccess
       } catch {
         case e: UnsupportedOperationException =>
           TypeCheckResult.TypeCheckFailure(e.getMessage)
       }
+    case _: MapType =>
--- End diff --

Basically, I don't want to increase the size of this PR any more.
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137979158

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -677,14 +696,27 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
     case _: StructType | ArrayType(_: StructType, _) =>
       try {
-        JacksonUtils.verifySchema(rowSchema)
+        JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case ArrayType(_: MapType, _) =>
+      TypeCheckResult.TypeCheckSuccess
+    case MapType(_: DataType, st: StructType, _: Boolean) =>
+      try {
+        JacksonUtils.verifySchema(st)
         TypeCheckResult.TypeCheckSuccess
       } catch {
         case e: UnsupportedOperationException =>
           TypeCheckResult.TypeCheckFailure(e.getMessage)
       }
+    case _: MapType =>
--- End diff --

Yeah, it is. Maybe we should revise `JacksonUtils` to be able to verify a `MapType` directly, but I don't see `JacksonUtils.verifySchema` used anywhere else. We can leave it as a TODO if you want.
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137979014

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -677,14 +696,27 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
     case _: StructType | ArrayType(_: StructType, _) =>
       try {
-        JacksonUtils.verifySchema(rowSchema)
+        JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case ArrayType(_: MapType, _) =>
+      TypeCheckResult.TypeCheckSuccess
+    case MapType(_: DataType, st: StructType, _: Boolean) =>
+      try {
+        JacksonUtils.verifySchema(st)
         TypeCheckResult.TypeCheckSuccess
       } catch {
         case e: UnsupportedOperationException =>
          TypeCheckResult.TypeCheckFailure(e.getMessage)
       }
+    case _: MapType =>
--- End diff --

Oh yes. It's a tricky way, but it makes sense. =D
[GitHub] spark issue #9183: [SPARK-11215] [ML] Add multiple columns support to String...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/9183

@minixalpha Sorry for the delay; I've been too busy recently. I will try to finish and submit my new PR once I get time. Thanks!
[GitHub] spark issue #19172: [SPARK-21856] Add probability and rawPrediction to MLPC ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19172

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81618/
[GitHub] spark issue #19172: [SPARK-21856] Add probability and rawPrediction to MLPC ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19172

Merged build finished. Test PASSed.
[GitHub] spark issue #19172: [SPARK-21856] Add probability and rawPrediction to MLPC ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19172

**[Test build #81618 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81618/testReport)** for PR 19172 at commit [`d2ccc1b`](https://github.com/apache/spark/commit/d2ccc1b5b9fc74563635a6fdc6153080661924c1).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137978728

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -193,14 +228,32 @@ private[sql] class JacksonGenerator(
    *
    * @param row The row to convert
    */
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+    case st: StructType =>
+      writeObject(writeFields(row, st, rootFieldWriters))
+    case _ => throw new UnsupportedOperationException(
--- End diff --

OK, got it.
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137978375

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -677,14 +696,27 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
     case _: StructType | ArrayType(_: StructType, _) =>
       try {
-        JacksonUtils.verifySchema(rowSchema)
+        JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case ArrayType(_: MapType, _) =>
+      TypeCheckResult.TypeCheckSuccess
+    case MapType(_: DataType, st: StructType, _: Boolean) =>
+      try {
+        JacksonUtils.verifySchema(st)
         TypeCheckResult.TypeCheckSuccess
       } catch {
         case e: UnsupportedOperationException =>
           TypeCheckResult.TypeCheckFailure(e.getMessage)
       }
+    case _: MapType =>
--- End diff --

Maybe we can just do something like:

    case mt: MapType =>
      val st = StructType(StructField("a", mt) :: Nil)
      JacksonUtils.verifySchema(st)
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137978096

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -677,14 +696,27 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
     case _: StructType | ArrayType(_: StructType, _) =>
       try {
-        JacksonUtils.verifySchema(rowSchema)
+        JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case ArrayType(_: MapType, _) =>
+      TypeCheckResult.TypeCheckSuccess
+    case MapType(_: DataType, st: StructType, _: Boolean) =>
+      try {
+        JacksonUtils.verifySchema(st)
         TypeCheckResult.TypeCheckSuccess
       } catch {
         case e: UnsupportedOperationException =>
           TypeCheckResult.TypeCheckFailure(e.getMessage)
       }
+    case _: MapType =>
--- End diff --

@viirya But `JacksonUtils.verifySchema` only verifies a `StructType`, so we need to add a new one?
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137977638

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -193,14 +228,32 @@ private[sql] class JacksonGenerator(
    *
    * @param row The row to convert
    */
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+    case st: StructType =>
+      writeObject(writeFields(row, st, rootFieldWriters))
+    case _ => throw new UnsupportedOperationException(
--- End diff --

`ClassCastException` may not be as informative as `UnsupportedOperationException` here. I'd prefer to see `UnsupportedOperationException`. Let's keep this pattern matching for now.
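The diff hunk above is truncated at the `throw`; a completed sketch of the shape being kept might look like the following, where the message text is illustrative rather than the PR's exact wording:

```scala
def write(row: InternalRow): Unit = dataType match {
  case st: StructType =>
    writeObject(writeFields(row, st, rootFieldWriters))
  case _ => throw new UnsupportedOperationException(
    // Fail loudly with the root type instead of surfacing a ClassCastException.
    s"`JacksonGenerator` with schema $dataType cannot write an InternalRow")
}
```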
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137977643

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JacksonGeneratorSuite.scala ---
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.json
+
+import java.io.CharArrayWriter
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, GenericArrayData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class JacksonGeneratorSuite extends SparkFunSuite {
+
+  val gmtId = DateTimeUtils.TimeZoneGMT.getID
+  val option = new JSONOptions(Map.empty, gmtId)
+  val writer = new CharArrayWriter()
+
+  def getAndReset(gen: JacksonGenerator): UTF8String = {
+    gen.flush()
+    val json = writer.toString
+    writer.reset()
+    UTF8String.fromString(json)
+  }
+
+  test("initial with StructType and write out a row") {
+    val dataType = StructType(StructField("a", IntegerType) :: Nil)
+    val input = InternalRow(1)
+    val gen = new JacksonGenerator(dataType, writer, option)
+    gen.write(input)
+    assert(getAndReset(gen) === UTF8String.fromString("""{"a":1}"""))
--- End diff --

@viirya I think we also need to do `gen.flush()`, right? So maybe we can keep `getAndReset` and modify it as below. Would that be better?

```
def getAndReset(gen: JacksonGenerator, writer: Writer): String = {
  gen.flush()
  writer.toString
}
```
[GitHub] spark issue #19172: [SPARK-21856] Add probability and rawPrediction to MLPC ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19172

Merged build finished. Test PASSed.
[GitHub] spark issue #19172: [SPARK-21856] Add probability and rawPrediction to MLPC ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19172

**[Test build #81617 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81617/testReport)** for PR 19172 at commit [`fac9c6e`](https://github.com/apache/spark/commit/fac9c6eeace6a5ea21a017cd40e09a1c8d44e5c7).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137977826

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JacksonGeneratorSuite.scala ---
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.json
+
+import java.io.CharArrayWriter
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, GenericArrayData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class JacksonGeneratorSuite extends SparkFunSuite {
+
+  val gmtId = DateTimeUtils.TimeZoneGMT.getID
+  val option = new JSONOptions(Map.empty, gmtId)
+  val writer = new CharArrayWriter()
+
+  def getAndReset(gen: JacksonGenerator): UTF8String = {
+    gen.flush()
+    val json = writer.toString
+    writer.reset()
+    UTF8String.fromString(json)
+  }
+
+  test("initial with StructType and write out a row") {
+    val dataType = StructType(StructField("a", IntegerType) :: Nil)
+    val input = InternalRow(1)
+    val gen = new JacksonGenerator(dataType, writer, option)
+    gen.write(input)
+    assert(getAndReset(gen) === UTF8String.fromString("""{"a":1}"""))
--- End diff --

Yeah, but it's just two lines; it's up to you whether to keep a function. At least it shouldn't be called `getAndReset` now.
[GitHub] spark issue #19172: [SPARK-21856] Add probability and rawPrediction to MLPC ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19172

**[Test build #81618 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81618/testReport)** for PR 19172 at commit [`d2ccc1b`](https://github.com/apache/spark/commit/d2ccc1b5b9fc74563635a6fdc6153080661924c1).
[GitHub] spark issue #19172: [SPARK-21856] Add probability and rawPrediction to MLPC ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19172 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81617/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137977281

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -193,14 +228,32 @@ private[sql] class JacksonGenerator(
    *
    * @param row The row to convert
    */
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+    case st: StructType =>
+      writeObject(writeFields(row, st, rootFieldWriters))
+    case _ => throw new UnsupportedOperationException(
--- End diff --

Do you mean `dataType` is now a map type but `writeFields`'s 2nd parameter is a `StructType`?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137977153

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JacksonGeneratorSuite.scala ---
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.json
+
+import java.io.CharArrayWriter
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, GenericArrayData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class JacksonGeneratorSuite extends SparkFunSuite {
+
+  val gmtId = DateTimeUtils.TimeZoneGMT.getID
+  val option = new JSONOptions(Map.empty, gmtId)
+  val writer = new CharArrayWriter()
+  def getAndReset(gen: JacksonGenerator): UTF8String = {
+    gen.flush()
+    val json = writer.toString
+    writer.reset()
+    UTF8String.fromString(json)
+  }
+
+  test("initial with StructType and write out a row") {
+    val dataType = StructType(StructField("a", IntegerType) :: Nil)
+    val input = InternalRow(1)
+    val gen = new JacksonGenerator(dataType, writer, option)
+    gen.write(input)
+    assert(getAndReset(gen) === UTF8String.fromString("""{"a":1}"""))
--- End diff --

Yes, maybe we can just use `writer.toString` and assert against the plain string.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137977056

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JacksonGeneratorSuite.scala ---
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.json
+
+import java.io.CharArrayWriter
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, GenericArrayData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class JacksonGeneratorSuite extends SparkFunSuite {
+
+  val gmtId = DateTimeUtils.TimeZoneGMT.getID
+  val option = new JSONOptions(Map.empty, gmtId)
+  val writer = new CharArrayWriter()
+  def getAndReset(gen: JacksonGenerator): UTF8String = {
+    gen.flush()
+    val json = writer.toString
+    writer.reset()
--- End diff --

Ok, I got it.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137976953

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -193,14 +228,32 @@ private[sql] class JacksonGenerator(
    *
    * @param row The row to convert
    */
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+    case st: StructType =>
+      writeObject(writeFields(row, st, rootFieldWriters))
+    case _ => throw new UnsupportedOperationException(
--- End diff --

@viirya I found that if we don't check it here, a `ClassCastException` is thrown from `writeObject()` first. So maybe we don't need to check the type in `rootFieldWriters`?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #9183: [SPARK-11215] [ML] Add multiple columns support to String...
Github user minixalpha commented on the issue: https://github.com/apache/spark/pull/9183 @WeichenXu123 Any update on the new PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19172: [SPARK-21856] Add probability and rawPrediction to MLPC ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19172 **[Test build #81617 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81617/testReport)** for PR 19172 at commit [`fac9c6e`](https://github.com/apache/spark/commit/fac9c6eeace6a5ea21a017cd40e09a1c8d44e5c7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18592: [SPARK-21368][SQL] TPCDSQueryBenchmark can't refer query...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/18592 thanks, I'll make a pr later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms
Github user smurching commented on the issue: https://github.com/apache/spark/pull/17014 Hi @zhengruifeng, thanks for your work on this! Now that we're introducing a new handlePersistence parameter (a new public API), it'd be good to track work in a separate JIRA/PR as @jkbradley suggested so others are aware of the proposed change. I've created a new JIRA ticket for adding the handlePersistence param here: [SPARK-21972](https://issues.apache.org/jira/browse/SPARK-21972). Would you mind resubmitting your work as a new PR that addresses the new JIRA ticket (SPARK-21972)? Thanks & sorry for the inconvenience! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19185 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19185 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81615/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19185 **[Test build #81615 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81615/testReport)** for PR 19185 at commit [`53ac68e`](https://github.com/apache/spark/commit/53ac68e24731e405d44d50be70b1c25f2ab9df3d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137976262

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -677,14 +696,27 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
     case _: StructType | ArrayType(_: StructType, _) =>
       try {
-        JacksonUtils.verifySchema(rowSchema)
+        JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case ArrayType(_: MapType, _) =>
+      TypeCheckResult.TypeCheckSuccess
+    case MapType(_: DataType, st: StructType, _: Boolean) =>
+      try {
+        JacksonUtils.verifySchema(st)
         TypeCheckResult.TypeCheckSuccess
       } catch {
         case e: UnsupportedOperationException =>
           TypeCheckResult.TypeCheckFailure(e.getMessage)
       }
+    case _: MapType =>
--- End diff --

We need to revise `JacksonUtils` to check map type too.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137976065

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -677,14 +696,27 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
     case _: StructType | ArrayType(_: StructType, _) =>
       try {
-        JacksonUtils.verifySchema(rowSchema)
+        JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+        TypeCheckResult.TypeCheckSuccess
+      } catch {
+        case e: UnsupportedOperationException =>
+          TypeCheckResult.TypeCheckFailure(e.getMessage)
+      }
+    case ArrayType(_: MapType, _) =>
+      TypeCheckResult.TypeCheckSuccess
+    case MapType(_: DataType, st: StructType, _: Boolean) =>
+      try {
+        JacksonUtils.verifySchema(st)
         TypeCheckResult.TypeCheckSuccess
       } catch {
         case e: UnsupportedOperationException =>
          TypeCheckResult.TypeCheckFailure(e.getMessage)
       }
+    case _: MapType =>
--- End diff --

Not all map types are valid for JSON. We still need to verify the type.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
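To make the reviewers' point concrete: `to_json` support hinges on every reachable value type having a JSON writer, so the schema check has to recurse into map value types as well. Below is a hedged sketch of such a check — `JacksonUtilsSketch` and its `MapType` overload are hypothetical illustrations of the suggestion, not the code that was merged:

```scala
import org.apache.spark.sql.types._

object JacksonUtilsSketch {
  // Stand-in for the existing per-field check: reject any type that the
  // JSON generator has no writer for, recursing through nested types.
  private def verifyType(name: String, dataType: DataType): Unit = dataType match {
    case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType |
         FloatType | DoubleType | StringType | TimestampType | DateType |
         BinaryType | _: DecimalType => // supported leaf types
    case st: StructType => st.foreach(f => verifyType(f.name, f.dataType))
    case at: ArrayType  => verifyType(name, at.elementType)
    case mt: MapType    => verifyType(name, mt.valueType)
    case _ => throw new UnsupportedOperationException(
      s"Unable to convert column $name of type ${dataType.simpleString} to JSON.")
  }

  def verifySchema(schema: StructType): Unit =
    schema.foreach(field => verifyType(field.name, field.dataType))

  // Hypothetical overload: a map root is valid for JSON when its value type is.
  def verifySchema(schema: MapType): Unit = verifyType("root", schema.valueType)
}
```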
[GitHub] spark pull request #19172: [SPARK-21856] Add probability and rawPrediction t...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19172#discussion_r137975691

--- Diff: python/pyspark/ml/tests.py ---
@@ -1655,6 +1655,26 @@ def test_multinomial_logistic_regression_with_bound(self):
             np.allclose(model.interceptVector.toArray(), [-0.9057, -1.1392, -0.0033], atol=1E-4))
 
+class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
+
+    def test_multilayer_perceptron_classifier(self):
--- End diff --

Rename to ```test_raw_and_probability_prediction```?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137975716

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -193,14 +228,32 @@ private[sql] class JacksonGenerator(
    *
    * @param row The row to convert
    */
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+    case st: StructType =>
+      writeObject(writeFields(row, st, rootFieldWriters))
+    case _ => throw new UnsupportedOperationException(
--- End diff --

Ok, I'll remove it.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137975624

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JacksonGeneratorSuite.scala ---
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.json
+
+import java.io.CharArrayWriter
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, GenericArrayData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class JacksonGeneratorSuite extends SparkFunSuite {
+
+  val gmtId = DateTimeUtils.TimeZoneGMT.getID
+  val option = new JSONOptions(Map.empty, gmtId)
+  val writer = new CharArrayWriter()
+  def getAndReset(gen: JacksonGenerator): UTF8String = {
+    gen.flush()
+    val json = writer.toString
+    writer.reset()
--- End diff --

I'd rather tests not depend on each other. I think we can just have a writer for each test, instead of reusing one.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18592: [SPARK-21368][SQL] TPCDSQueryBenchmark can't refer query...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18592 Also sounds good to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137975414

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JacksonGeneratorSuite.scala ---
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.json
+
+import java.io.CharArrayWriter
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, GenericArrayData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class JacksonGeneratorSuite extends SparkFunSuite {
+
+  val gmtId = DateTimeUtils.TimeZoneGMT.getID
+  val option = new JSONOptions(Map.empty, gmtId)
+  val writer = new CharArrayWriter()
+  def getAndReset(gen: JacksonGenerator): UTF8String = {
+    gen.flush()
+    val json = writer.toString
+    writer.reset()
+    UTF8String.fromString(json)
+  }
+
+  test("initial with StructType and write out a row") {
+    val dataType = StructType(StructField("a", IntegerType) :: Nil)
+    val input = InternalRow(1)
+    val gen = new JacksonGenerator(dataType, writer, option)
+    gen.write(input)
+    assert(getAndReset(gen) === UTF8String.fromString("""{"a":1}"""))
--- End diff --

Can we simply compare the strings? I.e., we don't need to do `UTF8String.fromString` in `getAndReset` and here.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
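Taking the two suggestions from this sub-thread together — a fresh `CharArrayWriter` per test, and plain-string assertions with no `UTF8String` round-trip — a test could end up looking like the sketch below. It reuses the imports and the `gmtId` value from the quoted suite, so treat it as an illustration of the review feedback rather than the final test code:

```scala
test("initial with StructType and write out a row") {
  // Fresh writer per test: no shared state, no reset bookkeeping.
  val writer = new CharArrayWriter()
  val dataType = StructType(StructField("a", IntegerType) :: Nil)
  val gen = new JacksonGenerator(dataType, writer, new JSONOptions(Map.empty, gmtId))
  gen.write(InternalRow(1))
  gen.flush()
  // Compare plain strings; the helper function is gone entirely.
  assert(writer.toString === """{"a":1}""")
}
```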
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875 **[Test build #81616 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81616/testReport)** for PR 18875 at commit [`b8bf38c`](https://github.com/apache/spark/commit/b8bf38cda5f3b03473f5bd1f1c4c8599e342b869). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/19185 @gatorsmile Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19185 **[Test build #81615 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81615/testReport)** for PR 19185 at commit [`53ac68e`](https://github.com/apache/spark/commit/53ac68e24731e405d44d50be70b1c25f2ab9df3d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19185 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19185 @yanboliang --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137975214

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -193,14 +228,30 @@ private[sql] class JacksonGenerator(
    *
    * @param row The row to convert
    */
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+    case st: StructType =>
+      writeObject(writeFields(row, st, rootFieldWriters))
+    case _ => throw new UnsupportedOperationException(
+      s"`JacksonGenerator` can only be used to write out a row when initialized with `StructType`.")
+  }
 
   /**
-   * Transforms multiple `InternalRow`s to JSON array using Jackson
+   * Transforms multiple `InternalRow`s or `MapData`s to JSON array using Jackson
    *
-   * @param array The array of rows to convert
+   * @param array The array of rows or maps to convert
    */
   def write(array: ArrayData): Unit = writeArray(writeArrayData(array, arrElementWriter))
 
+  /**
+   * Transforms a `MapData` to JSON object using Jackson
+   *
+   * @param map a map to convert
+   */
+  def write(map: MapData): Unit = dataType match {
+    case mt: MapType => writeObject(writeMapData(map, mt, mapElementWriter))
+    case _ => throw new UnsupportedOperationException(
--- End diff --

Similar to https://github.com/apache/spark/pull/18875/files#r137975134, we can avoid this. Doing pattern matching on every write seems like too much of a burden.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137975134

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -193,14 +228,32 @@ private[sql] class JacksonGenerator(
    *
    * @param row The row to convert
    */
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+    case st: StructType =>
+      writeObject(writeFields(row, st, rootFieldWriters))
+    case _ => throw new UnsupportedOperationException(
--- End diff --

Rethinking this, I think we can avoid matching on the data type and simply do `writeObject(writeFields(row, schema, rootFieldWriters))` like before. An exception is then thrown when `rootFieldWriters` is accessed.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
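A condensed sketch of the shape being suggested in this sub-thread: the type check moves out of the `write` methods and into lazily initialized writers, so the match runs once on first access instead of on every call, while a wrong initialization type still fails fast with a clear error. The class body below is a simplified stand-in (the real `ValueWriter`s and `makeWriter` are elided), not the merged implementation:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.MapData
import org.apache.spark.sql.types._

class JacksonGeneratorSketch(dataType: DataType) {
  type ValueWriter = Any => Unit

  private def makeWriter(dt: DataType): ValueWriter = _ => () // elided for brevity

  // Matched once, on first access; write(row) itself stays match-free.
  private lazy val rootFieldWriters: Array[ValueWriter] = dataType match {
    case st: StructType => st.map(_.dataType).map(makeWriter).toArray
    case _ => throw new UnsupportedOperationException(
      s"Initial type ${dataType.catalogString} must be a struct")
  }

  private lazy val mapElementWriter: ValueWriter = dataType match {
    case mt: MapType => makeWriter(mt.valueType)
    case _ => throw new UnsupportedOperationException(
      s"Initial type ${dataType.catalogString} must be a map")
  }

  def write(row: InternalRow): Unit = {
    val writers = rootFieldWriters // throws here if initialized with a non-struct
    // ... write each field of `row` with `writers` ...
  }

  def write(map: MapData): Unit = {
    val writer = mapElementWriter // throws here if initialized with a non-map
    // ... write the entries of `map` with `writer` ...
  }
}
```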
[GitHub] spark issue #19185: [Spark-21854] Added LogisticRegressionTrainingSummary fo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19185 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...
GitHub user jmwdpk opened a pull request:

    https://github.com/apache/spark/pull/19185

    [Spark-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API

## What changes were proposed in this pull request?

Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in the Python API.

## How was this patch tested?

Added a unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jmwdpk/spark SPARK-21854

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19185.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19185

commit 50cfafe9e0cda8702fa43fd27ea75bfbd042268e
Author: Ming Jiang
Date: 2017-09-04T04:22:33Z

    added probabilityCol to LogisticRegressionSummary

commit 60579d5f36ef26f6e3ec675a795ccc86d6a71c8d
Author: Ming Jiang
Date: 2017-09-05T05:08:15Z

    modified LogisticRegressionSummary and LogisticRegressionModel in classification.py

commit 1a73e6c6d20d6374379ec2fb237e7f596e77bc62
Author: Ming Jiang
Date: 2017-09-07T22:39:08Z

    test on multiclass summary

commit 53ac68e24731e405d44d50be70b1c25f2ab9df3d
Author: Ming Jiang
Date: 2017-09-08T05:53:47Z

    fixed numClasses

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137974036

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -193,14 +228,32 @@ private[sql] class JacksonGenerator(
    *
    * @param row The row to convert
    */
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+    case st: StructType =>
+      writeObject(writeFields(row, st, rootFieldWriters))
+    case _ => throw new UnsupportedOperationException(
+      s"`JacksonGenerator` can only be used to write out a row when initialized with `StructType`.")
+  }
 
   /**
-   * Transforms multiple `InternalRow`s to JSON array using Jackson
+   * Transforms multiple `InternalRow`s or `MapData`s to JSON array using Jackson
    *
-   * @param array The array of rows to convert
+   * @param array The array of rows or maps to convert
    */
-  def write(array: ArrayData): Unit = writeArray(writeArrayData(array, arrElementWriter))
+  def write(array: ArrayData): Unit = {
+    writeArray(writeArrayData(array, arrElementWriter))
+  }
--- End diff --

We don't need to change this line.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/19184#discussion_r137973976

--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -104,6 +124,10 @@ public void loadNext() throws IOException {
     if (taskContext != null) {
       taskContext.killTaskIfInterrupted();
     }
+    if (this.din == null) {
+      // Good time to init (if all files are opened, we can get Too Many files exception)
+      initStreams();
+    }
--- End diff --

Good point. The PR has been tried with queries involving window functions (e.g. Q67), for which it worked fine. During spill merges (especially `getSortedIterator`), it is still possible to hit the too-many-open-files issue.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
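Restating the hunk above as a self-contained sketch (in Scala for consistency with the other examples here, though the real `UnsafeSorterSpillReader` is Java, and the class and method names below are hypothetical): the file descriptor is opened on the first `loadNext` rather than at construction, so a wide merge does not pin every spill file open up front.

```scala
import java.io.{BufferedInputStream, DataInputStream, File, FileInputStream}

class LazySpillReaderSketch(file: File) {
  private var din: DataInputStream = _

  // Open the descriptor on first read instead of in the constructor, so
  // readers created for a k-way merge don't exhaust the open-file limit
  // before any of them is actually consumed.
  private def initStreams(): Unit = {
    din = new DataInputStream(new BufferedInputStream(new FileInputStream(file)))
  }

  def loadNext(): Int = {
    if (din == null) initStreams()
    din.read()
  }

  def close(): Unit = if (din != null) din.close()
}
```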
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137973919

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---
@@ -26,20 +26,55 @@ import org.apache.spark.sql.catalyst.expressions.SpecializedGetters
 import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, MapData}
 import org.apache.spark.sql.types._
 
+/**
+ * `JackGenerator` can only be initialized with a `StructType` or a `MapType`.
+ * Once it is initialized with `StructType`, it can be used to write out a struct or an array of
+ * struct. Once it is initialized with `MapType`, it can be used to write out a map. An exception
--- End diff --

`Once it is initialized with ``MapType``, it can be used to write out a map or an array of map.`

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19174: [SPARK-21963][CORE][TEST]Create temp file should be dele...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/19174 @kiszk Thank you for the review. I checked the other test cases that use `File.createTempFile` and `Files.createTempFile`. Except for the ones described in this PR, the other test cases already use `File.deleteOnExit` or `File.delete`. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
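For readers skimming the thread, the cleanup pattern being discussed is simply this (a minimal sketch combining both mechanisms; the prefix, suffix, and variable names are arbitrary):

```scala
import java.io.File

val tmp = File.createTempFile("spark-test", ".tmp")
tmp.deleteOnExit() // fallback: the file is removed when the JVM exits
try {
  // ... exercise the code under test with `tmp` ...
} finally {
  tmp.delete() // eager cleanup as soon as the test finishes
}
```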