[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238077135

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---

@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
   private var univocityGenerator: Option[UnivocityGenerator] = None

-  override def write(row: InternalRow): Unit = {
-    val gen = univocityGenerator.getOrElse {
-      val charset = Charset.forName(params.charset)
-      val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
-      val newGen = new UnivocityGenerator(dataSchema, os, params)
-      univocityGenerator = Some(newGen)
-      newGen
-    }
+  if (params.headerFlag) {
+    val gen = getGen()
+    gen.writeHeaders()
+  }
+
+  private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+    val charset = Charset.forName(params.charset)
+    val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+    val newGen = new UnivocityGenerator(dataSchema, os, params)
+    univocityGenerator = Some(newGen)
+    newGen
+  }
+
+  override def write(row: InternalRow): Unit = {
+    val gen = getGen()

--- End diff --

i will revert this change to lazy val for now since it doesnt have anything to do with this pullreq or jira: the Option approach was created in another pullreq.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238068538

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---

@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
   private var univocityGenerator: Option[UnivocityGenerator] = None

-  override def write(row: InternalRow): Unit = {
-    val gen = univocityGenerator.getOrElse {
-      val charset = Charset.forName(params.charset)
-      val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
-      val newGen = new UnivocityGenerator(dataSchema, os, params)
-      univocityGenerator = Some(newGen)
-      newGen
-    }
+  if (params.headerFlag) {
+    val gen = getGen()
+    gen.writeHeaders()
+  }
+
+  private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+    val charset = Charset.forName(params.charset)
+    val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+    val newGen = new UnivocityGenerator(dataSchema, os, params)
+    univocityGenerator = Some(newGen)
+    newGen
+  }
+
+  override def write(row: InternalRow): Unit = {
+    val gen = getGen()

--- End diff --

ok i changed it to lazy val and flag

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
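For readers following the diff above, here is a minimal sketch of the "lazy val plus header flag" shape being discussed. It is simplified and hypothetical, not the actual Spark source: stub types stand in for the real `UnivocityGenerator` and CSV options, and only the initialization order matters.

```scala
// Stub generator standing in for the real univocity-backed one.
class UnivocityGenerator {
  def writeHeaders(): Unit = println("x,y")                      // pretend header line
  def write(row: Seq[String]): Unit = println(row.mkString(","))
  def close(): Unit = ()
}

class CsvOutputWriter(headerFlag: Boolean) {
  // the generator is created lazily, on first use, so a task that writes
  // nothing does not open a stream by accident
  private lazy val gen: UnivocityGenerator = new UnivocityGenerator

  // if headers are requested, force the generator up front and emit them,
  // which is what makes an otherwise empty output file still carry a header
  if (headerFlag) gen.writeHeaders()

  def write(row: Seq[String]): Unit = gen.write(row)
  def close(): Unit = gen.close()
}

new CsvOutputWriter(headerFlag = true).close()                   // emits only the header line
```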
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237716913

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala ---

@@ -57,6 +57,9 @@ abstract class OutputWriterFactory extends Serializable {
  * executor side. This instance is used to persist rows to this single output file.
  */
 abstract class OutputWriter {
+  /** Initializes before writing any rows. Invoked on executor side. */
+  def init(): Unit = {}

--- End diff --

do the init logic in the constructor for CsvOutputWriter instead?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237687091 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") --- End diff -- i can do that, but i think when i write it out and read it back in it will come back in as 1 partition (one part file with header) because of SPARK-23271. is that worth checking for? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237663865 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") --- End diff -- that doesnt seem to be what is happening. if i do a .repartition(4) on empty dataframe it still only writes one part file with header if i do a .repartition(4) on a dataframe with 2 elements then it writes 2 part files with headers so it seems empty partitions get pruned, except when all partitions are empty then it writes a single partition thanks to SPARK-23271 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
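A rough sketch of the round-trip check being discussed, as hypothetical test code: it assumes the usual CSVSuite helpers (`withTempPath`, `spark`, `spark.implicits._`) are in scope, and the assertions are illustrative rather than copied from the final patch.

```scala
test("SPARK-26208: write and read empty data to csv file with header") {
  withTempPath { path =>
    val df1 = Seq.empty[(String, String)].toDF("x", "y")
    df1.write.option("header", true).csv(path.getAbsolutePath)

    val df2 = spark.read.option("header", true).csv(path.getAbsolutePath)
    assert(df2.schema.fieldNames.toSeq === Seq("x", "y"))  // header survives the round trip
    assert(df2.count() === 0)
    // per the comment above, all-empty input collapses to a single part file
    // (SPARK-23271), so asserting on the partition count adds little value
  }
}
```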
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/23052 it is pretty common for us to write empty dataframe to parquet and later read it back in same for writing to csv with header and reading it back in (with type inference disabled, we assume all strings) would this break those behaviors? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
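Spelling out the round-trip pattern described here (illustrative code; the paths and column names are made up, and a `SparkSession` named `spark` with its implicits is assumed to be in scope):

```scala
import spark.implicits._

// write an empty dataframe, then read it back expecting the same schema
val empty = Seq.empty[(String, Int)].toDF("name", "count")
empty.write.mode("overwrite").parquet("/tmp/empty-parquet")
val back = spark.read.parquet("/tmp/empty-parquet")  // relies on some file with schema metadata existing

// same idea for csv with a header, with inference disabled so columns come back as strings
empty.write.option("header", true).mode("overwrite").csv("/tmp/empty-csv")
val backCsv = spark.read.option("header", true).option("inferSchema", false).csv("/tmp/empty-csv")
```

The concern raised is that if empty partitions produce no files at all, neither read-back has anything left to recover a schema from.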
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237579324

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala ---

@@ -57,6 +57,9 @@ abstract class OutputWriterFactory extends Serializable {
  * executor side. This instance is used to persist rows to this single output file.
  */
 abstract class OutputWriter {
+  /** Initializes before writing any rows. Invoked on executor side. */
+  def init(): Unit

--- End diff --

yeah makes sense, i will make that change

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/23173 i was not aware of SPARK-15473. thanks. let me look at @HyukjinKwon pullreq and mark my jira as a duplicate. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/23173 [SPARK-26208][SQL] add headers to empty csv files when header=true ## What changes were proposed in this pull request? Add headers to empty csv files when header=true, because otherwise these files are invalid when reading. ## How was this patch tested? Added test for roundtrip of empty dataframe to csv file with headers and back in CSVSuite Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata-opensource/spark feat-empty-csv-with-header Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23173.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23173 commit 3192656ad91d326824360f4d4890dc1f6c3f6393 Author: Koert Kuipers Date: 2018-11-28T19:03:20Z write headers to empty csv files when header=true commit aad5f09710d4b6d4aafa810307b3cae9c965babf Author: Koert Kuipers Date: 2018-11-28T20:00:22Z Merge branch 'master' into feat-empty-csv-with-header --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 it would provide a workaround i think, yes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 @HyukjinKwon see https://github.com/apache/spark/pull/22312 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22312: [SPARK-17916][SQL] Fix new behavior when quote is...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/22312 [SPARK-17916][SQL] Fix new behavior when quote is set and fix old behavior when quote is unset ## What changes were proposed in this pull request? 1) Set nullValue to quoted empty string respecting quote value 2) Fall back to old behavior of unquoted null if quote is not set ## How was this patch tested? Two new tests that will fail without these fixes Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata-opensource/spark feat-csv-null-unquoted Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22312.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22312 commit ad3a11d5c6ead4133195e81f59852db669de5b56 Author: Koert Kuipers Date: 2018-09-01T18:59:35Z fix new behavior when quote is changed and fix old behavior when quote is unset --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 i would suggest at least that when the quote character is changed, the empty value should change accordingly. an empty value of ```""``` makes no sense if the quote character is not ```"```. also if we could agree on a quote character that means no quotes at all, then i would suggest to change the empty value back to null if that particular quote character is set, because a quoted empty string makes no sense if the user is trying to write out unquoted values. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
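To make the suggestion concrete, here is a hypothetical user-side example. The `quote` option is a standard CSV write option; exactly how the empty value is rendered in each case is the behaviour under debate in this thread, so the comments describe the proposal rather than settled behaviour.

```scala
import spark.implicits._

val df = Seq(("a", ""), ("b", "x")).toDF("k", "v")

// with the default quote character " the empty value discussed above is rendered as ""
df.write.option("header", true).csv("/tmp/out-default-quote")

// with a different quote character the suggestion is that the empty value should
// follow it (i.e. '' rather than ""), and with quoting effectively disabled it
// should fall back to writing nothing at all
df.write.option("header", true).option("quote", "'").csv("/tmp/out-single-quote")
```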
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r211309642 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1603,6 +1603,39 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema"))) } + test("SPARK-25134: check header on parsing of dataset with projection and column pruning") { --- End diff -- it seems enforceSchema always kind of "works" because it simply means it ignores the headers. what do we expect to verify in the test? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21345: [SPARK-24159] [SS] Enable no-data micro batches for stre...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21345 we are testing spark 2.4 internally and had some unit tests break because of this change i believe. i am not suggesting this should be changed or undone, just wanted to point out that it might have minor implications for people upgrading. so this is just an FYI. it seems that our unit tests for logic that uses flatMapGroupsWithState with GroupStateTimeout.ProcessingTimeTimeout now will hang if query.processAllAvailable() is called. so i am looking for an alternative way to test now that does not involve usage of processAllAvailable. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
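A rough sketch of the kind of query such tests exercise (hypothetical code; it assumes an existing `SparkSession` named `spark`, and the state type, timeout duration, and sink are made up for illustration):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

import spark.implicits._
implicit val sqlCtx: SQLContext = spark.sqlContext

val input = MemoryStream[(String, Int)]

val counts = input.toDS()
  .groupByKey(_._1)
  .flatMapGroupsWithState[Int, (String, Int)](
    OutputMode.Update(), GroupStateTimeout.ProcessingTimeTimeout()) {
    (key: String, values: Iterator[(String, Int)], state: GroupState[Int]) =>
      val total = state.getOption.getOrElse(0) + values.map(_._2).sum
      state.update(total)
      state.setTimeoutDuration("10 seconds")
      Iterator((key, total))
  }

val query = counts.writeStream
  .outputMode("update")
  .format("memory")
  .queryName("counts")
  .start()

input.addData(("a", 1), ("a", 2))
// With SPARK-24159, no-data micro-batches may keep running to service the processing-time
// timeout; that is the interaction the comment above describes as making a blanket
// processAllAvailable() problematic in these tests.
query.processAllAvailable()
query.stop()
```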
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 @HyukjinKwon see the jira for the example code that reproduces the issue. let me know if you need anything else. best, koert --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 to summarize my findings from jira: this breaks any usage without quoting. for example we remove all characters from our values that need to be quoted (delimiters, newlines) so we know we will always write unquoted csv, but now we suddenly find these empty quoted strings in our output. the systems we write to cannot handle these quoted values. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/22123 ``` Test Result (1 failure / +1) org.apache.spark.sql.streaming.FlatMapGroupsWithStateSuite.flatMapGroupsWithState - streaming with processing time timeout - state format version 1 ``` failure seems unrelated to this pullreq --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r210801081 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1603,6 +1603,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema"))) } + test("SPARK-25134: check header on parsing of dataset with projection and column pruning") { +withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") { + withTempPath { path => +val dir = path.getAbsolutePath +Seq(("a", "b")).toDF("columnA", "columnB").write + .format("csv") + .option("header", true) + .save(dir) +checkAnswer(spark.read + .format("csv") + .option("header", true) + .option("enforceSchema", false) + .load(dir) + .select("columnA"), + Row("a")) + } +} + } + + test("SPARK-25134: check header on parsing of dataset with projection and no column pruning") { +withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "false") { --- End diff -- ok will remove --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/22123 [SPARK-25134][SQL] Csv column pruning with checking of headers throws incorrect error ## What changes were proposed in this pull request? When column pruning is turned on, the checking of headers in the csv should only be for the fields in the requiredSchema, not the dataSchema, because column pruning means only requiredSchema is read. ## How was this patch tested? Added 2 unit tests where column pruning is turned on/off and csv headers are checked against the schema Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata-opensource/spark feat-csv-column-pruning-and-check-header Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22123.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22123 commit dcd9ac45673af31e59dcfb633a2b87f76f2bee03 Author: Koert Kuipers Date: 2018-08-16T15:35:16Z if csv column-pruning is turned on header should be checked with requiredSchema not dataSchema commit c4179a9f0a85b412178323e6cb881385fa644051 Author: Koert Kuipers Date: 2018-08-16T15:52:02Z update jira reference in unit test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21296: [SPARK-24244][SQL] Passing only required columns to the ...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21296 if i do not specify a schema (and i use inferSchema), and i do a select for only a few columns, does this push down the column selection into the reading of data (for schema inference and for the actual data read)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
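For reference, the usage pattern being asked about probably looks something like this (illustrative path and column name):

```scala
// no user-supplied schema: the schema comes from inference, then only one column is selected;
// the question is whether the select narrows both the inference pass and the actual data read
val df = spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("/data/wide.csv")
  .select("columnA")
```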
[GitHub] spark issue #18714: [SPARK-20236][SQL] dynamic partition overwrite
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/18714 @cloud-fan i created [SPARK-24860](https://issues.apache.org/jira/browse/SPARK-24860) for this --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21818: [SPARK-24860][SQL] Support setting of partitionOv...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/21818 [SPARK-24860][SQL] Support setting of partitionOverWriteMode in output options for writing DataFrame ## What changes were proposed in this pull request? Besides spark setting spark.sql.sources.partitionOverwriteMode also allow setting partitionOverWriteMode per write ## How was this patch tested? Added unit test in InsertSuite Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata-opensource/spark feat-partition-overwrite-mode-per-write Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21818.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21818 commit 7dd2eabfd4f5ca18354df85f5bf5285e3e23359d Author: Koert Kuipers Date: 2018-07-19T17:42:15Z support setting of partitionOverWriteMode in output options for writing DataFrame --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
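From the user side, the proposed per-write setting would look roughly like this (a sketch based on the PR description, assuming an existing DataFrame `df` with a `date` partition column; the option key mirrors the session conf `spark.sql.sources.partitionOverwriteMode`):

```scala
df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")  // per-write override of the session-level setting
  .partitionBy("date")
  .parquet("/warehouse/events")
```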
[GitHub] spark issue #18714: [SPARK-20236][SQL] dynamic partition overwrite
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/18714 @cloud-fan OK, that works just as well --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18714: [SPARK-20236][SQL] dynamic partition overwrite
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/18714 should this be exposed per write instead of as a global variable? e.g. dataframe.write.csv.partitionMode(Dynamic).partitionBy(...).save(...) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #609: SPARK-1691: Support quoted arguments inside of spark-submi...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/609 ```OPTS+=" --driver-java-options \"-Da=b -Dx=y\""``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #609: SPARK-1691: Support quoted arguments inside of spark-submi...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/609 @ganeshm25 it seems to work in newer spark versions. i havent tried in spark 1.4.2. however its still very tricky to get it right and i would prefer a simpler solution. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17660: [SPARK-20359][SQL] catch NPE in EliminateOuterJoin optim...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/17660 @cloud-fan switching to lazy vals to avoid these predicates being evaluated when they are not used seems to work. so i think this is a better (more targeted) solution for now, and i removed my try/catch logic. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
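This is not the actual optimizer code, just a self-contained illustration of why making the two predicates lazy helps: the side that is not consulted for a given join type is never evaluated, so a null-intolerant function never sees the all-null "empty row".

```scala
// stand-in for a user function that throws a NullPointerException on null input
val nullIntolerant: String => Boolean = s => s.length > 0

def hasNonNullPredicateForNeededSide(isLeftOuter: Boolean,
                                     leftValue: String,
                                     rightValue: String): Boolean = {
  // as eager vals both lines would run and the null side would throw;
  // as lazy vals only the one that is actually used gets evaluated
  lazy val leftHasNonNullPredicate = nullIntolerant(leftValue)
  lazy val rightHasNonNullPredicate = nullIntolerant(rightValue)
  if (isLeftOuter) rightHasNonNullPredicate else leftHasNonNullPredicate
}

hasNonNullPredicateForNeededSide(isLeftOuter = true, leftValue = null, rightValue = "x")  // true, no NPE
```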
[GitHub] spark issue #17660: [SPARK-20359][SQL] catch NPE in EliminateOuterJoin optim...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/17660 I see. let me check if making leftHasNonNullPredicate and rightHasNonNullPredicate lazy solves it then On Apr 17, 2017 23:44, "Wenchen Fan" <notificati...@github.com> wrote: > I think the root problem is, in EliminateOuterJoin.buildNewJoinType, we > always build leftHasNonNullPredicate and rightHasNonNullPredicate. If > it's left join, only rightHasNonNullPredicate is used, and when building > leftHasNonNullPredicate, we may pass null values to a UDF that is not > supposed to run on null values. > > â > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/17660#issuecomment-294667709>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AAyIJHOCvy6X0JVDoID_H3cDGYMZKGLUks5rxDG4gaJpZM4M_gRb> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17660: [SPARK-20359][SQL] catch NPE in EliminateOuterJoi...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/17660#discussion_r111842598 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -124,7 +125,15 @@ case class EliminateOuterJoin(conf: SQLConf) extends Rule[LogicalPlan] with Pred val emptyRow = new GenericInternalRow(attributes.length) val boundE = BindReferences.bindReference(e, attributes) if (boundE.find(_.isInstanceOf[Unevaluable]).isDefined) return false -val v = boundE.eval(emptyRow) +val v = try { + boundE.eval(emptyRow) --- End diff -- yeah sure i can do a scan for similar problems --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17660: [SPARK-20359][SQL] catch NPE in EliminateOuterJoi...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/17660 [SPARK-20359][SQL] catch NPE in EliminateOuterJoin optimization catch NPE in EliminateOuterJoin and add test in DataFrameSuite to confirm NPE is no longer thrown ## What changes were proposed in this pull request? catch possible NPE in this line ```val v = boundE.eval(emptyRow)``` and conclude the optimization can not be performed. ## How was this patch tested? Added test in DataFrameSuite that failed before this fix and now succeeds. Note that a test in catalyst project would be better but i am unsure how to do this. Will look into it now. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-catch-npe-in-eliminate-outer-join Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17660.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17660 commit 7a759cca3fb302d55b5758e3e8cb85deca460112 Author: Koert Kuipers <ko...@tresata.com> Date: 2017-04-17T00:11:00Z catch NPE in EliminateOuterJoin and add test in DataFrameSuite to confirm NPE is no longer thrown --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17639: [SPARK-19716][SQL][follow-up] UnresolvedMapObjects shoul...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/17639 @cloud-fan thanks for doing this --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16889: [SPARK-17668][SQL] Use Expressions for conversion...
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/16889 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16889: [SPARK-17668][SQL] Use Expressions for conversions to/fr...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/16889 i am going to close this for now since i dont think this is an optimal solution --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16889: [SPARK-17668][SQL] Use Expressions for conversion...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/16889 [SPARK-17668][SQL] Use Expressions for conversions to/from user types in UDFs ## What changes were proposed in this pull request? do not merge this is a first attempt at trying to address SPARK-17688. but i do no expect it to be sufficient. things that bother me: * i do not use codegen for the encoder expressions. instead i rely on fromRow and toRow in ExpressionEncoder. that seems inefficient. * some unnecessary wrapping in InternalRows. this is probably related to the usage of fromRow and toRow ## How was this patch tested? added TypedUDFSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-typed-udf Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16889.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16889 commit e73c84ff75e66dee9a395c9913109a965d7d68f4 Author: Koert Kuipers <ko...@tresata.com> Date: 2017-02-09T20:04:36Z got something working but not sure how good this is yet commit bd111ae1f1a721ae2664d0e8a5810f018eb4c935 Author: Koert Kuipers <ko...@tresata.com> Date: 2017-02-10T01:16:34Z pattern match to create expr1 commit 1a257c366aa1c0181dd7791fdd6eb05140c48d7e Author: Koert Kuipers <ko...@tresata.com> Date: 2017-02-10T02:04:53Z fix bug with internal row getting re-used and check results in unit tests commit e1c337ae9e747a3dc02b939a87c9bb5c8605b86c Author: Koert Kuipers <ko...@tresata.com> Date: 2017-02-10T03:26:40Z deal with annoying style rules commit ac09ad519437fe8efb071f354cc4387a4a95c206 Author: Koert Kuipers <ko...@tresata.com> Date: 2017-02-10T19:43:19Z Merge branch 'master' into feat-typed-udf --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #9565: [SPARK-11593][SQL] Replace catalyst converter with RowEnc...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/9565 i think this would be very helpful. the difference in behaviour of scala udfs and scala functions used in dataset transformations is a constant source of confusion for my users. for example the lack of support for Option to declare nullable input types, and the need to use untyped Row objects in UDFs for structs are inconsistent with how things are done when Encoders are used. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16479: [SPARK-19085][SQL] cleanup OutputWriterFactory and Outpu...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/16479 i will just copy the conversion code over for now thx --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16479: [SPARK-19085][SQL] cleanup OutputWriterFactory and Outpu...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/16479 how "internal" are these interfaces really? every time a change like this is made spark-avro breaks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16143: [SPARK-18711][SQL] should disable subexpression eliminat...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/16143 thanks for getting this fixed so fast --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15979 admittedly the result looks weird. it really should be: +---++ |key|count(1)| +---++ | null| 1| | [1,1]| 1| +---++ is that a separate bug or related? i remember running into this before, because serializing and then deserializing None comes back out as Some((null, null)), which causes NPE in codegen. i ran into this with Aggregator buffers. On Sun, Dec 4, 2016 at 12:13 PM, Koert Kuipers <ko...@tresata.com> wrote: > spark 2.0.x does not have mapValues. but this works: > > scala> Seq(("a", Some((1, 1))), ("a", None)).toDS.groupByKey(_._2). > count.show > +---++ > |key|count(1)| > +---++ > |[null,null]| 1| > | [1,1]| 1| > +---++ > > > > On Sun, Dec 4, 2016 at 9:59 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> Yes it worked before >> >> On Dec 4, 2016 02:33, "Wenchen Fan" <notificati...@github.com> wrote: >> >>> val x: Dataset[String, Option[(String, String)]] = ... >>> x.groupByKey(_._1).mapValues(_._2).agg(someAgg) >>> >>> Does it work before? >>> >>> Please see the discussion in the JIRA: https://issues.apache.org/jira >>> /browse/SPARK-18251 >>> Ideally we have a map between type T and catalyst schema, and Option[T] >>> maps to the same catalyst schema with T, with additional null handling. >>> We shouldn't change this mapping, which means we can't use a single field >>> struct type to represent Option[T]. >>> >>> It's still possible to support Option[T] completely(without breaking >>> backward compatibility), but that may need a lof of hacky code and special >>> handling, I don't think it worth, as we can easy work around it, by >>> Tuple1. >>> >>> â >>> You are receiving this because you commented. >>> Reply to this email directly, view it on GitHub >>> <https://github.com/apache/spark/pull/15979#issuecomment-264689198>, or mute >>> the thread >>> <https://github.com/notifications/unsubscribe-auth/AAyIJD-_dmJODKn5_k8MHRFaJkHvL9uRks5rEmzCgaJpZM4K5fEL> >>> . >>> >> > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15979 spark 2.0.x does not have mapValues. but this works: scala> Seq(("a", Some((1, 1))), ("a", None)).toDS.groupByKey(_._2).count.show +---++ |key|count(1)| +---++ |[null,null]| 1| | [1,1]| 1| +---++ On Sun, Dec 4, 2016 at 9:59 AM, Koert Kuipers <ko...@tresata.com> wrote: > Yes it worked before > > On Dec 4, 2016 02:33, "Wenchen Fan" <notificati...@github.com> wrote: > >> val x: Dataset[String, Option[(String, String)]] = ... >> x.groupByKey(_._1).mapValues(_._2).agg(someAgg) >> >> Does it work before? >> >> Please see the discussion in the JIRA: https://issues.apache.org/jira >> /browse/SPARK-18251 >> Ideally we have a map between type T and catalyst schema, and Option[T] >> maps to the same catalyst schema with T, with additional null handling. >> We shouldn't change this mapping, which means we can't use a single field >> struct type to represent Option[T]. >> >> It's still possible to support Option[T] completely(without breaking >> backward compatibility), but that may need a lof of hacky code and special >> handling, I don't think it worth, as we can easy work around it, by >> Tuple1. >> >> â >> You are receiving this because you commented. >> Reply to this email directly, view it on GitHub >> <https://github.com/apache/spark/pull/15979#issuecomment-264689198>, or mute >> the thread >> <https://github.com/notifications/unsubscribe-auth/AAyIJD-_dmJODKn5_k8MHRFaJkHvL9uRks5rEmzCgaJpZM4K5fEL> >> . >> > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15979 Yes it worked before On Dec 4, 2016 02:33, "Wenchen Fan" <notificati...@github.com> wrote: > val x: Dataset[String, Option[(String, String)]] = ... > x.groupByKey(_._1).mapValues(_._2).agg(someAgg) > > Does it work before? > > Please see the discussion in the JIRA: https://issues.apache.org/ > jira/browse/SPARK-18251 > Ideally we have a map between type T and catalyst schema, and Option[T] > maps to the same catalyst schema with T, with additional null handling. > We shouldn't change this mapping, which means we can't use a single field > struct type to represent Option[T]. > > It's still possible to support Option[T] completely(without breaking > backward compatibility), but that may need a lof of hacky code and special > handling, I don't think it worth, as we can easy work around it, by Tuple1 > . > > â > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/15979#issuecomment-264689198>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AAyIJD-_dmJODKn5_k8MHRFaJkHvL9uRks5rEmzCgaJpZM4K5fEL> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15979 this means anything that uses an encoder can no longer use Option[_ <: Product]. encoders are not just used for the top level Dataset creation. Dataset.groupByKey[K] requires an encoder for K. KeyValueGroupedDataset.mapValues[W] requires an encoder for W. Aggregator[A, B, C] requires encoders for B and C. none of these always create top level row objects (for which this pullreq creates the restriction that they cannot be null). for an aggregator it is sometimes the case. ```dataset.select(aggregator)``` does create a top level row object, but ```dataset.groupByKey(...).agg(aggregator)``` does not. so i am not sure it makes sense to put this restriction on the encoder. it seems to belong on the dataset. another example of something that won't work anymore: ``` val x: Dataset[(String, Option[(String, String)])] = ... x.groupByKey(_._1).mapValues(_._2).agg(someAgg) ``` in this case the mapValues requires ```Encoder[Option[(String, String)]]``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15979: [SPARK-18251][SQL] the type of Dataset can't be O...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/15979#discussion_r90770855 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala --- @@ -47,16 +47,26 @@ object ExpressionEncoder { // We convert the not-serializable TypeTag into StructType and ClassTag. val mirror = typeTag[T].mirror val tpe = typeTag[T].tpe + +if (ScalaReflection.optionOfProductType(tpe)) { + throw new UnsupportedOperationException( +"Cannot create encoder for Option of Product type, because Product type is represented " + --- End diff -- this strikes me more as a limitation on Dataset[X] than on Encoder[X] --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15979: [SPARK-18251][SQL] the type of Dataset can't be O...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/15979#discussion_r90770824 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala --- @@ -47,16 +47,26 @@ object ExpressionEncoder { // We convert the not-serializable TypeTag into StructType and ClassTag. val mirror = typeTag[T].mirror val tpe = typeTag[T].tpe + +if (ScalaReflection.optionOfProductType(tpe)) { + throw new UnsupportedOperationException( +"Cannot create encoder for Option of Product type, because Product type is represented " + --- End diff -- this also means an Aggregator cannot use an Option of Product Type for its intermediate type. e.g. Aggregator[Int, Option[(Int, Int)], Int] is now invalid. but i see no good reason why such an Aggregator wouldnt exist? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15918: [SPARK-18122][SQL][WIP]Fallback to Kryo for unsupported ...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15918 It can be done with shapeless (which perhaps uses macros under hood, I don't know). On Dec 1, 2016 19:56, "Michael Armbrust" <notificati...@github.com> wrote: I don't think you can limit the implicit. What type would pick up case classes, but not case classes that contain invalid things? I think you would need a macros for this kind of introspection. (I'd be happy to be proven wrong with a PR.) I'd recommend you only import the implicits you need rather than using the wildcard. â You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://github.com/apache/spark/pull/15918#issuecomment-264342773>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAyIJLwL-MWdzQGb6Ioe2fr_GgP0rP05ks5rD2zNgaJpZM4K1D7U> . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15918: [SPARK-18122][SQL][WIP]Fallback to Kryo for unsupported ...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15918 if we do a flag i would also prefer it if the current implicits are more narrow if the flag is not set, if possible. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15918: [SPARK-18122][SQL][WIP]Fallback to Kryo for unsupported ...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15918 @srowen and @rxin what is the default behavior that is changed here? i see a current situation where an implicit encoder is provided that simply cannot handle the task at hand and this leads to failure. either the implicits for ExpressionEncoder need to be more narrow so that they do not claim types they cannot handle (and then other implicit encoders can be used), or they need to be able to handle these types, for example by falling back to kryo as is suggested in this JIRA. currrently ```implicitly[Encoder[Option[Set[Int``` gives you an ExpressionEncoder that cannot handle it. that is undesired and makes it difficult to provide an alternative implicit by the user. i proposed making the ExpressionEncoders more narrow (that seemed the easier fix to me at first) but @marmbrus preferred the approach of falling back to kryo and broadening it. see: http://apache-spark-developers-list.1001551.n3.nabble.com/getting-encoder-implicits-to-be-more-accurate-td19561.html --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
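For readers of this thread, the explicit workaround that exists today looks roughly like this (assuming an existing `SparkSession` named `spark`); the open question above is whether the implicits should fall back to something like it automatically.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// supply a Kryo-backed encoder explicitly instead of relying on the implicit ExpressionEncoder
val optSetEncoder: Encoder[Option[Set[Int]]] = Encoders.kryo[Option[Set[Int]]]

val ds = spark.createDataset(Seq(Some(Set(1, 2)), None))(optSetEncoder)
```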
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 if they chain like that then i think i know how to do the optimization. but do they? look for example at dataset.groupByKey(...).mapValues(...) Dataset[T].groupByKey[K] uses function T => K and creates KeyValueGroupedDataset[K, T] KeyValueGroupedDataset[K, T].mapValues[W] uses function T => W and creates KeyValueGroupedDataset[K, W] so i have T => K and then T => W On Thu, Oct 20, 2016 at 8:26 PM, Wenchen Fan <notificati...@github.com> wrote: > 2 chained AppendColumns will have 2 functions: T => U and U => W, so we > can combine them this way: > convert UnsafeRow to T > apply func to T to generate U > apply func to U to generate W > convert W to UnsafeRow > append the new UnsafeRow to the original one > > â > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/13526#issuecomment-255263458>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AAyIJL19KTgysYd5dRAsDtIseY0jwlm3ks5q2AbLgaJpZM4IvEGP> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 @cloud-fan that makes sense to me, but its definitely not a quick win to create that optimization. let me think about it some more --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 @cloud-fan i can try to optimize ```grouped.mapValues(...).mapValues(...)``` but its a bit of an anti-pattern (there should be no need to do mapValues twice) so i dont think there is much gain in optimizing this. what do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 @rxin i can give it a try (the optimizer rule) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.d...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/15382#discussion_r83921525 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -741,7 +741,7 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging { def variableSubstituteDepth: Int = getConf(VARIABLE_SUBSTITUTE_DEPTH) - def warehousePath: String = new Path(getConf(WAREHOUSE_PATH)).toString + def warehousePath: String = Utils.resolveURI(getConf(WAREHOUSE_PATH)).toString --- End diff -- i agree with @vanzin about dislike for resolveURI. i expect paths without schemes to be on my default filesystem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.dir is r...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15382 i don't think there is such a thing as a HDFS working directory, but that probably means it just uses the home dir on hdfs (/user/) for any relative paths --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.dir is r...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15382 i think working dir makes more sense than home dir. but could this catch people by surprise because we now expect write permission in the working dir? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13868: [SPARK-15899] [SQL] Fix the construction of the f...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13868#discussion_r82216818 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -55,7 +56,7 @@ object SQLConf { val WAREHOUSE_PATH = SQLConfigBuilder("spark.sql.warehouse.dir") .doc("The default location for managed databases and tables.") .stringConf -.createWithDefault("file:${system:user.dir}/spark-warehouse") +.createWithDefault("${system:user.dir}/spark-warehouse") --- End diff -- or use FileSystem.getHomeDirectory? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan i thought about this a little more, and my suggested changes to the Aggregator api do not allow one to use a different encoder when applying a typed operation on a Dataset. so i do not think it is dangerous as such. it does enable usage within the untyped grouping, which is where type conversions are already customary anyhow. it's not more dangerous than, say, using a udaf in a DataFrame. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14576: [SPARK-16391][SQL] Support partial aggregation fo...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r75186632 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.expressions + +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder + +/** + * An aggregator that uses a single associative and commutative reduce function. This reduce + * function can be used to go through all input values and reduces them to a single value. + * If there is no input, a null value is returned. + * + * @since 2.1.0 + */ +private[sql] class ReduceAggregator[T: Encoder](func: (T, T) => T) + extends Aggregator[T, (Boolean, T), T] { + + private val encoder = implicitly[Encoder[T]] + + override def zero: (Boolean, T) = (false, null.asInstanceOf[T]) + + override def bufferEncoder: Encoder[(Boolean, T)] = +ExpressionEncoder.tuple( + ExpressionEncoder[Boolean](), + encoder.asInstanceOf[ExpressionEncoder[T]]) + + override def outputEncoder: Encoder[T] = encoder + + override def reduce(b: (Boolean, T), a: T): (Boolean, T) = { +if (b._1) { + (true, func(b._2, a)) +} else { + (true, a) +} + } + + override def merge(b1: (Boolean, T), b2: (Boolean, T)): (Boolean, T) = { +if (!b1._1) { + b2 +} else if (!b2._1) { + b1 +} else { + (true, func(b1._2, b2._2)) +} + } + + override def finish(reduction: (Boolean, T)): T = reduction._2 --- End diff -- this can not happen since ReduceAggregator is private? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14576: [SPARK-16391][SQL] Support partial aggregation fo...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r75152186 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.expressions + +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder + +/** + * An aggregator that uses a single associative and commutative reduce function. This reduce + * function can be used to go through all input values and reduces them to a single value. + * If there is no input, a null value is returned. + * + * @since 2.1.0 + */ +private[sql] class ReduceAggregator[T: Encoder](func: (T, T) => T) + extends Aggregator[T, (Boolean, T), T] { + + private val encoder = implicitly[Encoder[T]] + + override def zero: (Boolean, T) = (false, null.asInstanceOf[T]) + + override def bufferEncoder: Encoder[(Boolean, T)] = +ExpressionEncoder.tuple( + ExpressionEncoder[Boolean](), + encoder.asInstanceOf[ExpressionEncoder[T]]) + + override def outputEncoder: Encoder[T] = encoder + + override def reduce(b: (Boolean, T), a: T): (Boolean, T) = { +if (b._1) { + (true, func(b._2, a)) +} else { + (true, a) +} + } + + override def merge(b1: (Boolean, T), b2: (Boolean, T)): (Boolean, T) = { +if (!b1._1) { + b2 +} else if (!b2._1) { + b1 +} else { + (true, func(b1._2, b2._2)) +} + } + + override def finish(reduction: (Boolean, T)): T = reduction._2 --- End diff -- since it should never happen, how about an assertion? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
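In the context of the `ReduceAggregator` shown in this diff, the suggested assertion could look roughly like this (a sketch of the idea, not the code that was merged):

```scala
override def finish(reduction: (Boolean, T)): T = {
  // the flag is only false when no input rows were seen, which callers
  // such as reduceGroups should never let reach finish
  assert(reduction._1, "ReduceAggregator requires at least one input row")
  reduction._2
}
```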
[GitHub] spark pull request #14576: [SPARK-16391][SQL] ReduceAggregator and partial a...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r74361702 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.expressions + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder + +/** + * :: Experimental :: + * An aggregator that uses a single associative and commutative reduce function. This reduce + * function can be used to go through all input values and reduces them to a single value. + * If there is no input, a null value is returned. + * + * @since 2.1.0 + */ +@Experimental +abstract class ReduceAggregator[T] extends Aggregator[T, (Boolean, T), T] { --- End diff -- i mean, i like it to be public but that changes what is important in the design --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14576: [SPARK-16391][SQL] ReduceAggregator and partial a...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r74316735 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.expressions + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder + +/** + * :: Experimental :: + * An aggregator that uses a single associative and commutative reduce function. This reduce + * function can be used to go through all input values and reduces them to a single value. + * If there is no input, a null value is returned. + * + * @since 2.1.0 + */ +@Experimental +abstract class ReduceAggregator[T] extends Aggregator[T, (Boolean, T), T] { + + // Question 1: Should func and encoder be parameters rather than abstract methods? + // rxin: abstract method has better java compatibility and forces naming the concrete impl, + // whereas parameter has better type inference (infer encoders via context bounds). + // Question 2: Should finish throw an exception or return null if there is no input? + // rxin: null might be more "SQL" like, whereas exception is more Scala like. --- End diff -- i prefer null over an exception ideally this would return an option. the operation really should be: ```def reduceGroups(f: (V, V) => V): Dataset[(K, Option[V])]``` and the class: ```abstract class ReduceAggregator[T] extends Aggregator[T, (Boolean, T), Option[T]]``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14576: [SPARK-16391][SQL] ReduceAggregator and partial a...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r74314375 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.expressions + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder + +/** + * :: Experimental :: + * An aggregator that uses a single associative and commutative reduce function. This reduce + * function can be used to go through all input values and reduces them to a single value. + * If there is no input, a null value is returned. + * + * @since 2.1.0 + */ +@Experimental +abstract class ReduceAggregator[T] extends Aggregator[T, (Boolean, T), T] { + + // Question 1: Should func and encoder be parameters rather than abstract methods? + // rxin: abstract method has better java compatibility and forces naming the concrete impl, + // whereas parameter has better type inference (infer encoders via context bounds). --- End diff -- i like the parameters better --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14576: [SPARK-16391][SQL] ReduceAggregator and partial a...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r74313912 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.expressions + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder + +/** + * :: Experimental :: + * An aggregator that uses a single associative and commutative reduce function. This reduce + * function can be used to go through all input values and reduces them to a single value. + * If there is no input, a null value is returned. + * + * @since 2.1.0 + */ +@Experimental +abstract class ReduceAggregator[T] extends Aggregator[T, (Boolean, T), T] { --- End diff -- i like factoring this class out of the reduce operation instead of making it an anonymous class, but do we expect people to use this directly? does it need to be public? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14222: [SPARK-16391][SQL] KeyValueGroupedDataset.reduceGroups s...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/14222 this `ReduceAggregator` is useful beyond `.reduceGroups`. basically you can take any aggregation without a zero (i.e. just a reduce function) and turn it into a valid Aggregator, with the caveat that the result is nullable and will be null if no inputs are provided to the Aggregator. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
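`ReduceAggregator` itself is `private[sql]`, but the idea is easy to sketch for a concrete type: wrap a zero-less reduce function into a full `Aggregator`, with `null` as the result when there was no input. Illustrative code only, not the library class:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// buffer is (seenAnyInput, runningValue); finish yields null when nothing was seen
class ReduceStringAggregator(func: (String, String) => String)
    extends Aggregator[String, (Boolean, String), String] {

  override def zero: (Boolean, String) = (false, null)

  override def reduce(b: (Boolean, String), a: String): (Boolean, String) =
    if (b._1) (true, func(b._2, a)) else (true, a)

  override def merge(b1: (Boolean, String), b2: (Boolean, String)): (Boolean, String) =
    if (!b1._1) b2 else if (!b2._1) b1 else (true, func(b1._2, b2._2))

  override def finish(reduction: (Boolean, String)): String = reduction._2

  override def bufferEncoder: Encoder[(Boolean, String)] =
    Encoders.tuple(Encoders.scalaBoolean, Encoders.STRING)

  override def outputEncoder: Encoder[String] = Encoders.STRING
}
```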
[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13526#discussion_r71042267 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -312,6 +312,17 @@ class DatasetSuite extends QueryTest with SharedSQLContext { "a", "30", "b", "3", "c", "1") } + test("groupBy function, mapValues, flatMap") { +val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS() --- End diff -- it seems the other tests all use ```toDS()``` so i will stick to that convention --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13526#discussion_r71041725 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala --- @@ -65,6 +65,46 @@ class KeyValueGroupedDataset[K, V] private[sql]( groupingAttributes) /** + * Returns a new [[KeyValueGroupedDataset]] where the given function has been applied to the + * data. The grouping key is unchanged by this. + * + * {{{ + * // Create values grouped by key from a Dataset[(K, V)] + * ds.groupByKey(_._1).mapValues(_._2) // Scala + * }}} + * @since 2.0.0 + */ + def mapValues[W: Encoder](func: V => W): KeyValueGroupedDataset[K, W] = { --- End diff -- i think the convention is spaces before, i will fix --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13532: [SPARK-15204][SQL] improve nullability inference ...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13532#discussion_r69397207 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala --- @@ -305,4 +305,13 @@ class DatasetAggregatorSuite extends QueryTest with SharedSQLContext { val ds = Seq(1, 2, 3).toDS() checkDataset(ds.select(MapTypeBufferAgg.toColumn), 1) } + + test("spark-15204 improve nullability inference for Aggregator") { +val ds1 = Seq(1, 3, 2, 5).toDS() +assert(ds1.select(typed.sum((i: Int) => i)).schema.head.nullable === false) +val ds2 = Seq(AggData(1, "a"), AggData(2, "a")).toDS() +assert(ds2.groupByKey(_.b).agg(SeqAgg.toColumn).schema(1).nullable === true) --- End diff -- the last assert with NameAgg tests String as output of the Aggregator. is that good enough? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13933: [SPARK-16236] [SQL] Add Path Option back to Load API in ...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13933 For parquet, json etc. path not being put in options is not an issue since they don't retrieve it from the options. On Jun 29, 2016 2:31 AM, "Xiao Li" <notificati...@github.com> wrote: > @zsxwing <https://github.com/zsxwing> If we just provide one path in the > function input, it will not put path into the options. The API parquet(path: > String) still calls load(paths : _*), instead of load(path). Thus, we will > introduce inconsistent behavior, compared with load(path: String). > > Could you review the new PR I just submitted? Let me know if anything is > not appropriate. #13965 <https://github.com/apache/spark/pull/13965>. > Thanks! > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/13933#issuecomment-229268367>, or mute > the thread > <https://github.com/notifications/unsubscribe/AAyIJPT3agmWQyqBUqrzFW5F2eE095Wtks5qQhFPgaJpZM4I_nrj> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13727: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harm...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13727#discussion_r68672691 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(path: String): DataFrame = { -option("path", path).load() +load(Seq(path): _*) // force invocation of `load(...varargs...)` --- End diff -- it will also break users code in an upgrade --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13727: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harm...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13727#discussion_r68645998 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(path: String): DataFrame = { -option("path", path).load() +load(Seq(path): _*) // force invocation of `load(...varargs...)` --- End diff -- i believe that works as expected (i am running into some other issues now, but they seem unrelated). however from a DSL perspective this is not very pretty? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13727: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harm...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13727#discussion_r68624316 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(path: String): DataFrame = { -option("path", path).load() +load(Seq(path): _*) // force invocation of `load(...varargs...)` --- End diff -- with this change path is no longer available in the options. this makes it hard (impossible?) for non-file based DataSources (not implementing FileFormat) to use load(...) For example for elasticsearch we use: ``` sqlContext.read.format("org.elasticsearch.spark.sql").load(resource) ``` i do not think this can be implemented anymore now? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
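For context, a bare-bones sketch of the kind of non-file source meant here (the class name and schema are hypothetical placeholders); the only point is that such a source reads its resource from the `path` entry of the options map, which is what `option("path", path).load()` used to populate:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// hypothetical connector: not a FileFormat, so it never sees load()'s paths
// argument and must rely on the "path" option instead
class HypotheticalExternalSource extends RelationProvider {
  override def createRelation(
      ctx: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val resource = parameters.getOrElse("path",
      sys.error("resource must be supplied via the 'path' option"))
    new BaseRelation {
      override def sqlContext: SQLContext = ctx
      override def schema: StructType =
        StructType(StructField("value", StringType) :: Nil)
      // a real connector would also mix in TableScan and actually read `resource`
    }
  }
}
```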
[GitHub] spark issue #8416: [SPARK-10185] [SQL] Feat sql comma separated paths
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/8416 this patch should not have broken reading files that include a comma in the path. i also added a unit test for this: https://github.com/apache/spark/pull/8416/files#diff-5d2ebf4e9ca5a990136b276859769289R896 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 could we "rewind"/undo the append for the key and change it to a map that inserts new values and key? so remove one append and replace it with another operation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 the tricky part with that is that (ds: Dataset[(K, V)]).groupBy(_._1).mapValues(_._2) should return a KeyValueGroupedDataset[K, V]. On Tue, Jun 7, 2016 at 8:22 PM, Wenchen Fan <notificati...@github.com> wrote: > A possible approach maybe just keep the function given by mapValues, and > apply it before calling the function given by mapGroups. By doing this, > we at least won't make the performance worse, as the underlying plan > doesn't change. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/13526#issuecomment-224453064>, or mute > the thread > <https://github.com/notifications/unsubscribe/AAyIJCOywIHS-XfwsPytJXcYZf7AHkqFks5qJgtWgaJpZM4IvEGP> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 ``` scala> val x = Seq(("a", 1), ("b", 2)).toDS x: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int] scala> x.groupByKey(_._1).mapValues(_._2).reduceGroups(_ + _).explain == Physical Plan == *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2, true]._1, true) AS value#36, input[0, scala.Tuple2, true]._2 AS value#37] +- MapGroups , value#32.toString, value#34: int, [value#32], [value#34], obj#35: scala.Tuple2 +- *Sort [value#32 ASC], false, 0 +- Exchange hashpartitioning(value#32, 200) +- *Project [value#34, value#32] +- AppendColumns , newInstance(class scala.Tuple2), [input[0, int, true] AS value#34] +- AppendColumns , newInstance(class scala.Tuple2), [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#32] +- LocalTableScan [_1#28, _2#29] ``` it seems the two AppendColumns are not collapsed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 ok i will study the physical plans for both and try to understand why one would be slower --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 can you explain a bit what is inefficient and would need an optimizer rule? is it mapValues being called twice? once for the key and then for the new values? thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 see this conversation: https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3ccaaswr-7kqfmxd_cpr-_wdygafh+rarecm9olm5jkxfk14fc...@mail.gmail.com%3E mapGroups is not a very interesting API, since without support for secondary sort, and hence no fold operations, pushing all the values into the reducer never really makes sense. so the interesting APIs are reduce (when it's fixed to be efficient and not use mapGroups) and agg. how do you transform the values before they go into reduce? you cannot do this currently, which is why we need something like mapValues. with Aggregators you can indeed do something similar inside the Aggregator (since the input type is not equal to the buffer type), but this leads to all Aggregators currently taking in some kind of input transform function, which hints at a suboptimal API and a pattern that should be generalized and extracted. i am curious to know why appending a column is inefficient? i am open to different designs. about this being a rare case: i would argue the opposite. i expect to see a lot of key-value datasets (```Dataset[(K, V)]```) in our codebase, and on those a lot of operations like ```ds.groupByKey(_._1).mapValues(_._2).reduce(...)```, since this is the most natural translation of many RDD algos. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
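A small usage sketch of the pattern this comment argues for, assuming a local `SparkSession` and the `mapValues` added by this PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("kv-reduce").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 10), ("a", 20), ("b", 1)).toDS()

// transform the values first, then reduce per key: the Dataset analogue
// of the classic rdd.map(...).reduceByKey(...) pipeline
val sums = ds.groupByKey(_._1).mapValues(_._2).reduceGroups(_ + _)
sums.show()
```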
[GitHub] spark pull request #13532: [SPARK-15204][SQL] improve nullability inference ...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13532#discussion_r65986613 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala --- @@ -51,7 +52,8 @@ object TypedAggregateExpression { bufferDeserializer, outputEncoder.serializer, outputEncoder.deserializer.dataType, - outputType) + outputType, + outputNullable) --- End diff -- ok --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13526#discussion_r65972115 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala --- @@ -65,6 +65,44 @@ class KeyValueGroupedDataset[K, V] private[sql]( groupingAttributes) /** + * Returns a new [[KeyValueGroupedDataset]] where the given function has been applied to the + * data. The grouping key is unchanged by this. + * + * {{{ + * // Create values grouped by key from a Dataset[(K, V)] + * ds.groupBy(_._1).mapValues(_._2) + * }}} + * @since 2.0.0 + */ + def mapValues[W: Encoder](func: V => W): KeyValueGroupedDataset[K, W] = { +val withNewData = AppendColumns(func, dataAttributes, logicalPlan) +val projected = Project(withNewData.newColumns ++ groupingAttributes, withNewData) +val executed = sparkSession.sessionState.executePlan(projected) + +new KeyValueGroupedDataset( + encoderFor[K], + encoderFor[W], + executed, + withNewData.newColumns, + groupingAttributes) + } + + /** + * Returns a new [[KeyValueGroupedDataset]] where the given function has been applied to the + * data. The grouping key is unchanged by this. + * + * {{{ + * // Create values grouped by key from a Dataset[(K, V)] + * ds.groupBy(_._1).mapValues(_._2) --- End diff -- oh.. it should be groupByKey, not groupBy. woops. i will also comment that it's scala. i guess you want a java 8 lambda example? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13532: [SPARK-15204][SQL] improve nullability inference ...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/13532 [SPARK-15204][SQL] improve nullability inference for Aggregator ## What changes were proposed in this pull request? TypedAggregateExpression sets nullable based on the schema of the outputEncoder ## How was this patch tested? Add test in DatasetAggregatorSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-aggregator-nullable Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13532.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13532 commit 32cfcb7dcf0e42331a9ef29a2f0c611538cc4063 Author: Koert Kuipers <ko...@tresata.com> Date: 2016-06-06T20:34:39Z improve nullability inference for Aggregator --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 for example with this branch you can do: ``` val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i", "j", "k") df3.groupBy("i").agg( ComplexResultAgg.apply("i", "k"), SumAgg.apply("j"), AverageAgg.apply("j") ) ``` so these are multiple Aggregators applied in an untyped groupBy, each on selected columns. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 well that was sort of what i was trying to achieve. the unit tests i added were for using an Aggregator for untyped grouping (```groupBy```). and i think for it to be useful within that context one should also be able to select the columns that the Aggregator operates on. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13512: [SPARK-15769][SQL] Add Encoder for input type to ...
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/13512 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 If Aggregator is designed for typed Dataset only then that is a bit of a shame, because it's an elegant and generic api that should be useful for DataFrame too. this causes fragmentation (Aggregator versus UserDefinedAggregateFunction). I am not sure what the better way to do this is, but i would like a single high level Aggregator-like api that i can use in Dataset and DataFrame. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/13526 [SPARK-15780][SQL] Support mapValues on KeyValueGroupedDataset ## What changes were proposed in this pull request? Add mapValues to KeyValueGroupedDataset ## How was this patch tested? New test in DatasetSuite for groupBy function, mapValues, flatMap You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-keyvaluegroupeddataset-mapvalues Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13526.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13526 commit 3494ec5b913ef6c1314a4b96279096a18a7fe5a1 Author: Koert Kuipers <ko...@tresata.com> Date: 2016-06-06T16:09:35Z add mapValues to KeyValueGroupedDataset --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan i am running into some trouble updating my branch to the latest master. i get errors in tests due to Analyzer.validateTopLevelTupleFields. the issue seems to be that in KeyValueGroupedDataset[K, T] the Aggregators are supposed to operate on T, but the logicalPlan at this point already has K appended to T because AppendColumns(func, inputPlan) is applied to the plan before it's passed into KeyValueGroupedDataset. so validateTopLevelTupleFields also sees the column for the key in the inputs and believes the deserializer for T is missing a field. any suggestions on how to get around this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan from the (added) unit tests: ``` val df2 = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDF("i", "j") checkAnswer(df2.groupBy("i").agg(ComplexResultAgg.toColumn), Row("a", Row(2, 4)) :: Row("b", Row(1, 3)) :: Nil) ``` this shows how the underlying type is Row (with a schema consisting of Strings and Ints), and it gets converted to the input type of the Aggregator which is (String, Long), so this involves both conversion and upcast. and: ``` val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i", "j", "k") checkAnswer(df3.groupBy("i").agg(ComplexResultAgg("i", "k")), Row("a", Row(2, 4)) :: Row("b", Row(1, 3)) :: Nil) ``` this is similar to the previous example but i also select the columns i want the Aggregator to operate on (namely columns "i" and "k") --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 **[Test build #5 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/consoleFull)** for PR 13512 at commit [`077f782`](https://github.com/apache/spark/commit/077f782cbf1e64439b8d5bb738819faebbf5914b). * This patch **fails MiMa tests**. * This patch **does not merge cleanly**. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/5/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 Build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 **[Test build #5 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/consoleFull)** for PR 13512 at commit [`077f782`](https://github.com/apache/spark/commit/077f782cbf1e64439b8d5bb738819faebbf5914b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13512: [SPARK-15769][SQL] Add Encoder for input type to ...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/13512 [SPARK-15769][SQL] Add Encoder for input type to Aggregator ## What changes were proposed in this pull request? Aggregator also has an Encoder for the input type ## How was this patch tested? Add tests to DatasetAggregatorSuite to upcast aggregator input types and apply Aggregator only to selected columns You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-aggregator-input-encoder Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13512.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13512 commit 6da0aef2885170ff4ecc6394463de6aedadbf867 Author: Koert Kuipers <ko...@tresata.com> Date: 2016-06-03T04:13:14Z add input encoder to aggregator commit 31f7b68ef9db3261a378272f714086afbd0e3208 Author: Koert Kuipers <ko...@tresata.com> Date: 2016-06-03T04:32:38Z make inputDeserializer mandatory in TypedAggregateExpression commit 76b8878a84b11b37e50b30c87822ac34546393f2 Author: Koert Kuipers <ko...@tresata.com> Date: 2016-06-03T04:53:27Z show aggregator can now be used in DataFrame without needing to handle row objects directly commit 077f782cbf1e64439b8d5bb738819faebbf5914b Author: Koert Kuipers <ko...@tresata.com> Date: 2016-06-03T05:31:44Z allow aggregator to operate on subset of columns --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14139 Dataset loses nullability in opera...
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/11980 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15204][SQL] Nullable is not correct for...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/13012#issuecomment-218053856 blackbox transformations infer nullable=false when you return a primitive. for example: ``` scala> sc.parallelize(List(1,2,3)).toDS.map(i => i * 2).schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(value,IntegerType,false)) ``` why would aggregator be any different? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
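The corresponding check for an `Aggregator` is the `DatasetAggregatorSuite` assertion quoted earlier in this digest; roughly, assuming `spark.implicits._` is in scope:

```scala
import org.apache.spark.sql.expressions.scalalang.typed

val ds1 = Seq(1, 3, 2, 5).toDS()
// with nullability inferred from the output encoder, the primitive result
// of the typed sum is reported as non-nullable, just like the map example above
assert(!ds1.select(typed.sum((i: Int) => i)).schema.head.nullable)
```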
[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216678197 yup needs to be transient, will fix On Tue, May 3, 2016 at 5:58 PM, andrewor14 <notificati...@github.com> wrote: > I think it's OK for it to be lazy; just wanted to understand why. But it > should be transient though since sparkSession is also transient. > > — > You are receiving this because you authored the thread. > Reply to this email directly or view it on GitHub > <https://github.com/apache/spark/pull/12877#issuecomment-216677161> > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216675245 if a SparkSession sits inside a Dataset does that mean _wrapped is always already initialized (because you cannot have a Dataset without a SparkContext)? if so, i should probably make it a val instead of lazy val. On Tue, May 3, 2016 at 5:31 PM, Koert Kuipers <ko...@tresata.com> wrote: > i made it lazy val since SparkSession.wrapped is effectively lazy too: > protected[sql] def wrapped: SQLContext = { > if (_wrapped == null) { > _wrapped = new SQLContext(self, isRootContext = false) > } > _wrapped > } > > > On Tue, May 3, 2016 at 5:29 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> oh since since sparkSession is just a normal val i guess it can also be >> >> On Tue, May 3, 2016 at 5:25 PM, andrewor14 <notificati...@github.com> >>> wrote: >>> >>> Looks good otherwise. >>> >>> — >>> You are receiving this because you authored the thread. >>> Reply to this email directly or view it on GitHub >>> <https://github.com/apache/spark/pull/12877#issuecomment-216668953> >>> >> >> > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216670925 i made it lazy val since SparkSession.wrapped is effectively lazy too: protected[sql] def wrapped: SQLContext = { if (_wrapped == null) { _wrapped = new SQLContext(self, isRootContext = false) } _wrapped } On Tue, May 3, 2016 at 5:29 PM, Koert Kuipers <ko...@tresata.com> wrote: > oh since since sparkSession is just a normal val i guess it can also be > > On Tue, May 3, 2016 at 5:25 PM, andrewor14 <notificati...@github.com> > wrote: > >> Looks good otherwise. >> >> — >> You are receiving this because you authored the thread. >> Reply to this email directly or view it on GitHub >> <https://github.com/apache/spark/pull/12877#issuecomment-216668953> >> > > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216670423 oh, since sparkSession is just a normal val i guess it can also be. On Tue, May 3, 2016 at 5:25 PM, andrewor14 <notificati...@github.com> wrote: > Looks good otherwise. > > — > You are receiving this because you authored the thread. > Reply to this email directly or view it on GitHub > <https://github.com/apache/spark/pull/12877#issuecomment-216668953> > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/12877 [SPARK-15097][SQL] make Dataset.sqlContext a stable identifier for imports ## What changes were proposed in this pull request? Make Dataset.sqlContext a lazy val so that it's a stable identifier and can be used for imports. Now this works again: import someDataset.sqlContext.implicits._ ## How was this patch tested? Add a unit test to DatasetSuite that uses the import shown above. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-sqlcontext-stable-import Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12877.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12877 commit 3804b53d849ede69aea74b4dfe309bf76d0b2cda Author: Koert Kuipers <ko...@tresata.com> Date: 2016-05-03T20:36:09Z make Dataset.sqlContext a lazy val so that its a stable identifier and can be used for imports (e.g. import someDataset.sqlContext.implicits._) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
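A usage sketch of what the stable identifier enables (the helper below is hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

def withLengths(words: Dataset[String]): DataFrame = {
  // this import only compiles if sqlContext is a (lazy) val, i.e. a stable identifier
  import words.sqlContext.implicits._
  words.map(w => (w, w.length)).toDF("word", "length")
}
```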