[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238077135

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---

@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
   private var univocityGenerator: Option[UnivocityGenerator] = None

-  override def write(row: InternalRow): Unit = {
-    val gen = univocityGenerator.getOrElse {
-      val charset = Charset.forName(params.charset)
-      val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
-      val newGen = new UnivocityGenerator(dataSchema, os, params)
-      univocityGenerator = Some(newGen)
-      newGen
-    }
+  if (params.headerFlag) {
+    val gen = getGen()
+    gen.writeHeaders()
+  }
+
+  private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+    val charset = Charset.forName(params.charset)
+    val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+    val newGen = new UnivocityGenerator(dataSchema, os, params)
+    univocityGenerator = Some(newGen)
+    newGen
+  }
+
+  override def write(row: InternalRow): Unit = {
+    val gen = getGen()

--- End diff --

i will revert this change to lazy val for now since it doesnt have anything to do with this pullreq or jira: the Option approach was created in another pullreq.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r238068538

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala ---

@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
   private var univocityGenerator: Option[UnivocityGenerator] = None

-  override def write(row: InternalRow): Unit = {
-    val gen = univocityGenerator.getOrElse {
-      val charset = Charset.forName(params.charset)
-      val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
-      val newGen = new UnivocityGenerator(dataSchema, os, params)
-      univocityGenerator = Some(newGen)
-      newGen
-    }
+  if (params.headerFlag) {
+    val gen = getGen()
+    gen.writeHeaders()
+  }
+
+  private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+    val charset = Charset.forName(params.charset)
+    val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
+    val newGen = new UnivocityGenerator(dataSchema, os, params)
+    univocityGenerator = Some(newGen)
+    newGen
+  }
+
+  override def write(row: InternalRow): Unit = {
+    val gen = getGen()

--- End diff --

ok i changed it to lazy val and flag

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
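For readers following the diff above, here is a minimal sketch of the "lazy val plus header flag" shape being discussed. It is simplified and hypothetical, not the actual Spark source: stub types stand in for the real `UnivocityGenerator` and CSV options, and only the initialization order matters.

```scala
// Stub generator standing in for the real univocity-backed one.
class UnivocityGenerator {
  def writeHeaders(): Unit = println("x,y")                      // pretend header line
  def write(row: Seq[String]): Unit = println(row.mkString(","))
  def close(): Unit = ()
}

class CsvOutputWriter(headerFlag: Boolean) {
  // the generator is created lazily, on first use, so a task that writes
  // nothing does not open a stream by accident
  private lazy val gen: UnivocityGenerator = new UnivocityGenerator

  // if headers are requested, force the generator up front and emit them,
  // which is what makes an otherwise empty output file still carry a header
  if (headerFlag) gen.writeHeaders()

  def write(row: Seq[String]): Unit = gen.write(row)
  def close(): Unit = gen.close()
}

new CsvOutputWriter(headerFlag = true).close()                   // emits only the header line
```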
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237716913

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala ---

@@ -57,6 +57,9 @@ abstract class OutputWriterFactory extends Serializable {
  * executor side. This instance is used to persist rows to this single output file.
  */
 abstract class OutputWriter {
+  /** Initializes before writing any rows. Invoked on executor side. */
+  def init(): Unit = {}

--- End diff --

do the init logic in the constructor for CsvOutputWriter instead?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237687091 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") --- End diff -- i can do that, but i think when i write it out and read it back in it will come back in as 1 partition (one part file with header) because of SPARK-23271. is that worth checking for? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237663865 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1987,6 +1987,18 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te assert(errMsg2.contains("'lineSep' can contain only 1 character")) } + test("SPARK-26208: write and read empty data to csv file with header") { +withTempPath { path => + val df1 = Seq.empty[(String, String)].toDF("x", "y") --- End diff -- that doesnt seem to be what is happening. if i do a .repartition(4) on empty dataframe it still only writes one part file with header if i do a .repartition(4) on a dataframe with 2 elements then it writes 2 part files with headers so it seems empty partitions get pruned, except when all partitions are empty then it writes a single partition thanks to SPARK-23271 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
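A rough sketch of the round-trip check being discussed, as hypothetical test code: it assumes the usual CSVSuite helpers (`withTempPath`, `spark`, `spark.implicits._`) are in scope, and the assertions are illustrative rather than copied from the final patch.

```scala
test("SPARK-26208: write and read empty data to csv file with header") {
  withTempPath { path =>
    val df1 = Seq.empty[(String, String)].toDF("x", "y")
    df1.write.option("header", true).csv(path.getAbsolutePath)

    val df2 = spark.read.option("header", true).csv(path.getAbsolutePath)
    assert(df2.schema.fieldNames.toSeq === Seq("x", "y"))  // header survives the round trip
    assert(df2.count() === 0)
    // per the comment above, all-empty input collapses to a single part file
    // (SPARK-23271), so asserting on the partition count adds little value
  }
}
```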
[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/23052 it is pretty common for us to write empty dataframe to parquet and later read it back in same for writing to csv with header and reading it back in (with type inference disabled, we assume all strings) would this break those behaviors? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
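Spelling out the round-trip pattern described here (illustrative code; the paths and column names are made up, and a `SparkSession` named `spark` with its implicits is assumed to be in scope):

```scala
import spark.implicits._

// write an empty dataframe, then read it back expecting the same schema
val empty = Seq.empty[(String, Int)].toDF("name", "count")
empty.write.mode("overwrite").parquet("/tmp/empty-parquet")
val back = spark.read.parquet("/tmp/empty-parquet")  // relies on some file with schema metadata existing

// same idea for csv with a header, with inference disabled so columns come back as strings
empty.write.option("header", true).mode("overwrite").csv("/tmp/empty-csv")
val backCsv = spark.read.option("header", true).option("inferSchema", false).csv("/tmp/empty-csv")
```

The concern raised is that if empty partitions produce no files at all, neither read-back has anything left to recover a schema from.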
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/23173#discussion_r237579324

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala ---

@@ -57,6 +57,9 @@ abstract class OutputWriterFactory extends Serializable {
  * executor side. This instance is used to persist rows to this single output file.
  */
 abstract class OutputWriter {
+  /** Initializes before writing any rows. Invoked on executor side. */
+  def init(): Unit

--- End diff --

yeah makes sense, i will make that change

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/23173 i was not aware of SPARK-15473. thanks. let me look at @HyukjinKwon pullreq and mark my jira as a duplicate. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/23173 [SPARK-26208][SQL] add headers to empty csv files when header=true ## What changes were proposed in this pull request? Add headers to empty csv files when header=true, because otherwise these files are invalid when reading. ## How was this patch tested? Added test for roundtrip of empty dataframe to csv file with headers and back in CSVSuite Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata-opensource/spark feat-empty-csv-with-header Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23173.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23173 commit 3192656ad91d326824360f4d4890dc1f6c3f6393 Author: Koert Kuipers Date: 2018-11-28T19:03:20Z write headers to empty csv files when header=true commit aad5f09710d4b6d4aafa810307b3cae9c965babf Author: Koert Kuipers Date: 2018-11-28T20:00:22Z Merge branch 'master' into feat-empty-csv-with-header --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 it would provide a workaround i think, yes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 @HyukjinKwon see https://github.com/apache/spark/pull/22312 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22312: [SPARK-17916][SQL] Fix new behavior when quote is...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/22312 [SPARK-17916][SQL] Fix new behavior when quote is set and fix old behavior when quote is unset ## What changes were proposed in this pull request? 1) Set nullValue to quoted empty string respecting quote value 2) Fall back to old behavior of unquoted null if quote is not set ## How was this patch tested? Two new tests that will fail without these fixes Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata-opensource/spark feat-csv-null-unquoted Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22312.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22312 commit ad3a11d5c6ead4133195e81f59852db669de5b56 Author: Koert Kuipers Date: 2018-09-01T18:59:35Z fix new behavior when quote is changed and fix old behavior when quote is unset --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 i would suggest at least that when the quote character is changed, the empty value should change accordingly. an empty value of ```""``` makes no sense if the quote character is not ```"```. also if we could agree on a quote character that means no quotes at all, then i would suggest to change the empty value back to null if that particular quote character is set, because a quoted empty string makes no sense if the user is trying to write out unquoted values. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
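To make the suggestion concrete, here is a hypothetical user-side example. The `quote` option is a standard CSV write option; exactly how the empty value is rendered in each case is the behaviour under debate in this thread, so the comments describe the proposal rather than settled behaviour.

```scala
import spark.implicits._

val df = Seq(("a", ""), ("b", "x")).toDF("k", "v")

// with the default quote character " the empty value discussed above is rendered as ""
df.write.option("header", true).csv("/tmp/out-default-quote")

// with a different quote character the suggestion is that the empty value should
// follow it (i.e. '' rather than ""), and with quoting effectively disabled it
// should fall back to writing nothing at all
df.write.option("header", true).option("quote", "'").csv("/tmp/out-single-quote")
```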
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r211309642 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1603,6 +1603,39 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema"))) } + test("SPARK-25134: check header on parsing of dataset with projection and column pruning") { --- End diff -- it seems enforceSchema always kind of "works" because it simply means it ignores the headers. what do we expect to verify in the test? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21345: [SPARK-24159] [SS] Enable no-data micro batches for stre...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21345 we are testing spark 2.4 internally and had some unit tests break because of this change i believe. i am not suggesting this should be changed or undone, just wanted to point out that it might have minor implications for people upgrading. so this is just an FYI. it seems that our unit tests for logic that uses flatMapGroupsWithState with GroupStateTimeout.ProcessingTimeTimeout now will hang if query.processAllAvailable() is called. so i am looking for an alternative way to test now that does not involve usage of processAllAvailable. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
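A rough sketch of the kind of query such tests exercise (hypothetical code; it assumes an existing `SparkSession` named `spark`, and the state type, timeout duration, and sink are made up for illustration):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

import spark.implicits._
implicit val sqlCtx: SQLContext = spark.sqlContext

val input = MemoryStream[(String, Int)]

val counts = input.toDS()
  .groupByKey(_._1)
  .flatMapGroupsWithState[Int, (String, Int)](
    OutputMode.Update(), GroupStateTimeout.ProcessingTimeTimeout()) {
    (key: String, values: Iterator[(String, Int)], state: GroupState[Int]) =>
      val total = state.getOption.getOrElse(0) + values.map(_._2).sum
      state.update(total)
      state.setTimeoutDuration("10 seconds")
      Iterator((key, total))
  }

val query = counts.writeStream
  .outputMode("update")
  .format("memory")
  .queryName("counts")
  .start()

input.addData(("a", 1), ("a", 2))
// With SPARK-24159, no-data micro-batches may keep running to service the processing-time
// timeout; that is the interaction the comment above describes as making a blanket
// processAllAvailable() problematic in these tests.
query.processAllAvailable()
query.stop()
```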
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 @HyukjinKwon see the jira for the example code that reproduces the issue. let me know if you need anything else. best, koert --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21273: [SPARK-17916][SQL] Fix empty string being parsed as null...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21273 to summarize my findings from jira: this breaks any usage without quoting. for example we remove all characters from our values that need to be quoted (delimiters, newlines) so we know we will always write unquoted csv, but now we suddenly find these empty quoted strings in our output. the systems we write to cannot handle these quoted values. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/22123 ``` Test Result (1 failure / +1) org.apache.spark.sql.streaming.FlatMapGroupsWithStateSuite.flatMapGroupsWithState - streaming with processing time timeout - state format version 1 ``` failure seems unrelated to this pullreq --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r210801081 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala --- @@ -1603,6 +1603,44 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema"))) } + test("SPARK-25134: check header on parsing of dataset with projection and column pruning") { +withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") { + withTempPath { path => +val dir = path.getAbsolutePath +Seq(("a", "b")).toDF("columnA", "columnB").write + .format("csv") + .option("header", true) + .save(dir) +checkAnswer(spark.read + .format("csv") + .option("header", true) + .option("enforceSchema", false) + .load(dir) + .select("columnA"), + Row("a")) + } +} + } + + test("SPARK-25134: check header on parsing of dataset with projection and no column pruning") { +withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "false") { --- End diff -- ok will remove --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/22123 [SPARK-25134][SQL] Csv column pruning with checking of headers throws incorrect error ## What changes were proposed in this pull request? When column pruning is turned on, the checking of headers in the csv should only be for the fields in the requiredSchema, not the dataSchema, because column pruning means only requiredSchema is read. ## How was this patch tested? Added 2 unit tests where column pruning is turned on/off and csv headers are checked against the schema Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata-opensource/spark feat-csv-column-pruning-and-check-header Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22123.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22123 commit dcd9ac45673af31e59dcfb633a2b87f76f2bee03 Author: Koert Kuipers Date: 2018-08-16T15:35:16Z if csv column-pruning is turned on header should be checked with requiredSchema not dataSchema commit c4179a9f0a85b412178323e6cb881385fa644051 Author: Koert Kuipers Date: 2018-08-16T15:52:02Z update jira reference in unit test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21296: [SPARK-24244][SQL] Passing only required columns to the ...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/21296 if i do not specify a schema (and i use inferSchema), and i do a select for only a few columns, does this push down the column selection into the reading of data (for schema inference and for the actual data read)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
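For reference, the usage pattern being asked about probably looks something like this (illustrative path and column name):

```scala
// no user-supplied schema: the schema comes from inference, then only one column is selected;
// the question is whether the select narrows both the inference pass and the actual data read
val df = spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("/data/wide.csv")
  .select("columnA")
```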
[GitHub] spark issue #18714: [SPARK-20236][SQL] dynamic partition overwrite
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/18714 @cloud-fan i created [SPARK-24860](https://issues.apache.org/jira/browse/SPARK-24860) for this --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21818: [SPARK-24860][SQL] Support setting of partitionOv...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/21818 [SPARK-24860][SQL] Support setting of partitionOverWriteMode in output options for writing DataFrame ## What changes were proposed in this pull request? Besides spark setting spark.sql.sources.partitionOverwriteMode also allow setting partitionOverWriteMode per write ## How was this patch tested? Added unit test in InsertSuite Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata-opensource/spark feat-partition-overwrite-mode-per-write Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21818.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21818 commit 7dd2eabfd4f5ca18354df85f5bf5285e3e23359d Author: Koert Kuipers Date: 2018-07-19T17:42:15Z support setting of partitionOverWriteMode in output options for writing DataFrame --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
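From the user side, the proposed per-write setting would look roughly like this (a sketch based on the PR description, assuming an existing DataFrame `df` with a `date` partition column; the option key mirrors the session conf `spark.sql.sources.partitionOverwriteMode`):

```scala
df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")  // per-write override of the session-level setting
  .partitionBy("date")
  .parquet("/warehouse/events")
```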
[GitHub] spark issue #18714: [SPARK-20236][SQL] dynamic partition overwrite
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/18714 @cloud-fan OK, that works just as well --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18714: [SPARK-20236][SQL] dynamic partition overwrite
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/18714 should this be exposed per write instead of as a global variable? e.g. dataframe.write.csv.partitionMode(Dynamic).partitionBy(...).save(...) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #609: SPARK-1691: Support quoted arguments inside of spark-submi...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/609 ```OPTS+=" --driver-java-options \"-Da=b -Dx=y\""``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #609: SPARK-1691: Support quoted arguments inside of spark-submi...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/609 @ganeshm25 it seems to work in newer spark versions. i havent tried in spark 1.4.2. however its still very tricky to get it right and i would prefer a simpler solution. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17660: [SPARK-20359][SQL] catch NPE in EliminateOuterJoin optim...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/17660 @cloud-fan switching to lazy vals to avoid these predicates being evaluated when they are not used seems to work. so i think this is a better (more targeted) solution for now, and i removed my try/catch logic. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
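This is not the actual optimizer code, just a self-contained illustration of why making the two predicates lazy helps: the side that is not consulted for a given join type is never evaluated, so a null-intolerant function never sees the all-null "empty row".

```scala
// stand-in for a user function that throws a NullPointerException on null input
val nullIntolerant: String => Boolean = s => s.length > 0

def hasNonNullPredicateForNeededSide(isLeftOuter: Boolean,
                                     leftValue: String,
                                     rightValue: String): Boolean = {
  // as eager vals both lines would run and the null side would throw;
  // as lazy vals only the one that is actually used gets evaluated
  lazy val leftHasNonNullPredicate = nullIntolerant(leftValue)
  lazy val rightHasNonNullPredicate = nullIntolerant(rightValue)
  if (isLeftOuter) rightHasNonNullPredicate else leftHasNonNullPredicate
}

hasNonNullPredicateForNeededSide(isLeftOuter = true, leftValue = null, rightValue = "x")  // true, no NPE
```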
[GitHub] spark issue #17660: [SPARK-20359][SQL] catch NPE in EliminateOuterJoin optim...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/17660 I see. let me check if making leftHasNonNullPredicate and rightHasNonNullPredicate lazy solves it then On Apr 17, 2017 23:44, "Wenchen Fan" <notificati...@github.com> wrote: > I think the root problem is, in EliminateOuterJoin.buildNewJoinType, we > always build leftHasNonNullPredicate and rightHasNonNullPredicate. If > it's left join, only rightHasNonNullPredicate is used, and when building > leftHasNonNullPredicate, we may pass null values to a UDF that is not > supposed to run on null values. > > â > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/17660#issuecomment-294667709>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AAyIJHOCvy6X0JVDoID_H3cDGYMZKGLUks5rxDG4gaJpZM4M_gRb> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17660: [SPARK-20359][SQL] catch NPE in EliminateOuterJoi...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/17660#discussion_r111842598 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -124,7 +125,15 @@ case class EliminateOuterJoin(conf: SQLConf) extends Rule[LogicalPlan] with Pred val emptyRow = new GenericInternalRow(attributes.length) val boundE = BindReferences.bindReference(e, attributes) if (boundE.find(_.isInstanceOf[Unevaluable]).isDefined) return false -val v = boundE.eval(emptyRow) +val v = try { + boundE.eval(emptyRow) --- End diff -- yeah sure i can do a scan for similar problems --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17660: [SPARK-20359][SQL] catch NPE in EliminateOuterJoi...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/17660 [SPARK-20359][SQL] catch NPE in EliminateOuterJoin optimization catch NPE in EliminateOuterJoin and add test in DataFrameSuite to confirm NPE is no longer thrown ## What changes were proposed in this pull request? catch possible NPE in this line ```val v = boundE.eval(emptyRow)``` and conclude the optimization can not be performed. ## How was this patch tested? Added test in DataFrameSuite that failed before this fix and now succeeds. Note that a test in catalyst project would be better but i am unsure how to do this. Will look into it now. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-catch-npe-in-eliminate-outer-join Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17660.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17660 commit 7a759cca3fb302d55b5758e3e8cb85deca460112 Author: Koert Kuipers <ko...@tresata.com> Date: 2017-04-17T00:11:00Z catch NPE in EliminateOuterJoin and add test in DataFrameSuite to confirm NPE is no longer thrown --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17639: [SPARK-19716][SQL][follow-up] UnresolvedMapObjects shoul...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/17639 @cloud-fan thanks for doing this --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16889: [SPARK-17668][SQL] Use Expressions for conversion...
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/16889 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16889: [SPARK-17668][SQL] Use Expressions for conversions to/fr...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/16889 i am going to close this for now since i dont think this is an optimal solution --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16889: [SPARK-17668][SQL] Use Expressions for conversion...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/16889 [SPARK-17668][SQL] Use Expressions for conversions to/from user types in UDFs ## What changes were proposed in this pull request? do not merge this is a first attempt at trying to address SPARK-17688. but i do no expect it to be sufficient. things that bother me: * i do not use codegen for the encoder expressions. instead i rely on fromRow and toRow in ExpressionEncoder. that seems inefficient. * some unnecessary wrapping in InternalRows. this is probably related to the usage of fromRow and toRow ## How was this patch tested? added TypedUDFSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-typed-udf Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16889.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16889 commit e73c84ff75e66dee9a395c9913109a965d7d68f4 Author: Koert Kuipers <ko...@tresata.com> Date: 2017-02-09T20:04:36Z got something working but not sure how good this is yet commit bd111ae1f1a721ae2664d0e8a5810f018eb4c935 Author: Koert Kuipers <ko...@tresata.com> Date: 2017-02-10T01:16:34Z pattern match to create expr1 commit 1a257c366aa1c0181dd7791fdd6eb05140c48d7e Author: Koert Kuipers <ko...@tresata.com> Date: 2017-02-10T02:04:53Z fix bug with internal row getting re-used and check results in unit tests commit e1c337ae9e747a3dc02b939a87c9bb5c8605b86c Author: Koert Kuipers <ko...@tresata.com> Date: 2017-02-10T03:26:40Z deal with annoying style rules commit ac09ad519437fe8efb071f354cc4387a4a95c206 Author: Koert Kuipers <ko...@tresata.com> Date: 2017-02-10T19:43:19Z Merge branch 'master' into feat-typed-udf --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #9565: [SPARK-11593][SQL] Replace catalyst converter with RowEnc...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/9565 i think this would be very helpful. the difference in behaviour of scala udfs and scala functions used in dataset transformations is a constant source of confusion for my users. for example the lack of support for Option to declare nullable input types, and the need to use untyped Row objects in UDFs for structs are inconsistent with how things are done when Encoders are used. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16479: [SPARK-19085][SQL] cleanup OutputWriterFactory and Outpu...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/16479 i will just copy the conversion code over for now thx --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16479: [SPARK-19085][SQL] cleanup OutputWriterFactory and Outpu...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/16479 how "internal" are these interfaces really? every time a change like this is made spark-avro breaks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16143: [SPARK-18711][SQL] should disable subexpression eliminat...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/16143 thanks for getting this fixed so fast --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15979 admittedly the result looks weird. it really should be: +---++ |key|count(1)| +---++ | null| 1| | [1,1]| 1| +---++ is that a separate bug or related? i remember running into this before, because serializing and then deserializing None comes back out as Some((null, null)), which causes NPE in codegen. i ran into this with Aggregator buffers. On Sun, Dec 4, 2016 at 12:13 PM, Koert Kuipers <ko...@tresata.com> wrote: > spark 2.0.x does not have mapValues. but this works: > > scala> Seq(("a", Some((1, 1))), ("a", None)).toDS.groupByKey(_._2). > count.show > +---++ > |key|count(1)| > +---++ > |[null,null]| 1| > | [1,1]| 1| > +---++ > > > > On Sun, Dec 4, 2016 at 9:59 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> Yes it worked before >> >> On Dec 4, 2016 02:33, "Wenchen Fan" <notificati...@github.com> wrote: >> >>> val x: Dataset[String, Option[(String, String)]] = ... >>> x.groupByKey(_._1).mapValues(_._2).agg(someAgg) >>> >>> Does it work before? >>> >>> Please see the discussion in the JIRA: https://issues.apache.org/jira >>> /browse/SPARK-18251 >>> Ideally we have a map between type T and catalyst schema, and Option[T] >>> maps to the same catalyst schema with T, with additional null handling. >>> We shouldn't change this mapping, which means we can't use a single field >>> struct type to represent Option[T]. >>> >>> It's still possible to support Option[T] completely(without breaking >>> backward compatibility), but that may need a lof of hacky code and special >>> handling, I don't think it worth, as we can easy work around it, by >>> Tuple1. >>> >>> â >>> You are receiving this because you commented. >>> Reply to this email directly, view it on GitHub >>> <https://github.com/apache/spark/pull/15979#issuecomment-264689198>, or mute >>> the thread >>> <https://github.com/notifications/unsubscribe-auth/AAyIJD-_dmJODKn5_k8MHRFaJkHvL9uRks5rEmzCgaJpZM4K5fEL> >>> . >>> >> > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15979 spark 2.0.x does not have mapValues. but this works: scala> Seq(("a", Some((1, 1))), ("a", None)).toDS.groupByKey(_._2).count.show +---++ |key|count(1)| +---++ |[null,null]| 1| | [1,1]| 1| +---++ On Sun, Dec 4, 2016 at 9:59 AM, Koert Kuipers <ko...@tresata.com> wrote: > Yes it worked before > > On Dec 4, 2016 02:33, "Wenchen Fan" <notificati...@github.com> wrote: > >> val x: Dataset[String, Option[(String, String)]] = ... >> x.groupByKey(_._1).mapValues(_._2).agg(someAgg) >> >> Does it work before? >> >> Please see the discussion in the JIRA: https://issues.apache.org/jira >> /browse/SPARK-18251 >> Ideally we have a map between type T and catalyst schema, and Option[T] >> maps to the same catalyst schema with T, with additional null handling. >> We shouldn't change this mapping, which means we can't use a single field >> struct type to represent Option[T]. >> >> It's still possible to support Option[T] completely(without breaking >> backward compatibility), but that may need a lof of hacky code and special >> handling, I don't think it worth, as we can easy work around it, by >> Tuple1. >> >> â >> You are receiving this because you commented. >> Reply to this email directly, view it on GitHub >> <https://github.com/apache/spark/pull/15979#issuecomment-264689198>, or mute >> the thread >> <https://github.com/notifications/unsubscribe-auth/AAyIJD-_dmJODKn5_k8MHRFaJkHvL9uRks5rEmzCgaJpZM4K5fEL> >> . >> > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15979 Yes it worked before On Dec 4, 2016 02:33, "Wenchen Fan" <notificati...@github.com> wrote: > val x: Dataset[String, Option[(String, String)]] = ... > x.groupByKey(_._1).mapValues(_._2).agg(someAgg) > > Does it work before? > > Please see the discussion in the JIRA: https://issues.apache.org/ > jira/browse/SPARK-18251 > Ideally we have a map between type T and catalyst schema, and Option[T] > maps to the same catalyst schema with T, with additional null handling. > We shouldn't change this mapping, which means we can't use a single field > struct type to represent Option[T]. > > It's still possible to support Option[T] completely(without breaking > backward compatibility), but that may need a lof of hacky code and special > handling, I don't think it worth, as we can easy work around it, by Tuple1 > . > > â > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/15979#issuecomment-264689198>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AAyIJD-_dmJODKn5_k8MHRFaJkHvL9uRks5rEmzCgaJpZM4K5fEL> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15979 this means anything that uses an encoder can no longer use Option[_ <: Product]. encoders are not just used for the top level Dataset creation. Dataset.groupByKey[K] requires an encoder for K. KeyValueGroupedDataset.mapValues[W] requires an encoder for W. Aggregator[A, B, C] requires encoders for B and C. none of these always create top level row objects (for which this pullreq creates the restriction that they cannot be null). for an aggregator it is sometimes the case. ```dataset.select(aggregator)``` does create a top level row object, but ```dataset.groupByKey(...).agg(aggregator)``` does not. so i am not sure it makes sense to put this restriction on the encoder. it seems to belong on the dataset. another example of something that won't work anymore: ``` val x: Dataset[(String, Option[(String, String)])] = ... x.groupByKey(_._1).mapValues(_._2).agg(someAgg) ``` in this case the mapValues requires ```Encoder[Option[(String, String)]]``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15979: [SPARK-18251][SQL] the type of Dataset can't be O...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/15979#discussion_r90770855 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala --- @@ -47,16 +47,26 @@ object ExpressionEncoder { // We convert the not-serializable TypeTag into StructType and ClassTag. val mirror = typeTag[T].mirror val tpe = typeTag[T].tpe + +if (ScalaReflection.optionOfProductType(tpe)) { + throw new UnsupportedOperationException( +"Cannot create encoder for Option of Product type, because Product type is represented " + --- End diff -- this strikes me more as a limitation on Dataset[X] than on Encoder[X] --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15979: [SPARK-18251][SQL] the type of Dataset can't be O...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/15979#discussion_r90770824 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala --- @@ -47,16 +47,26 @@ object ExpressionEncoder { // We convert the not-serializable TypeTag into StructType and ClassTag. val mirror = typeTag[T].mirror val tpe = typeTag[T].tpe + +if (ScalaReflection.optionOfProductType(tpe)) { + throw new UnsupportedOperationException( +"Cannot create encoder for Option of Product type, because Product type is represented " + --- End diff -- this also means an Aggregator cannot use an Option of Product Type for its intermediate type. e.g. Aggregator[Int, Option[(Int, Int)], Int] is now invalid. but i see no good reason why such an Aggregator wouldnt exist? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15918: [SPARK-18122][SQL][WIP]Fallback to Kryo for unsupported ...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15918 It can be done with shapeless (which perhaps uses macros under hood, I don't know). On Dec 1, 2016 19:56, "Michael Armbrust" <notificati...@github.com> wrote: I don't think you can limit the implicit. What type would pick up case classes, but not case classes that contain invalid things? I think you would need a macros for this kind of introspection. (I'd be happy to be proven wrong with a PR.) I'd recommend you only import the implicits you need rather than using the wildcard. â You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://github.com/apache/spark/pull/15918#issuecomment-264342773>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAyIJLwL-MWdzQGb6Ioe2fr_GgP0rP05ks5rD2zNgaJpZM4K1D7U> . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15918: [SPARK-18122][SQL][WIP]Fallback to Kryo for unsupported ...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15918 if we do a flag i would also prefer it if the current implicits are more narrow if the flag is not set, if possible. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15918: [SPARK-18122][SQL][WIP]Fallback to Kryo for unsupported ...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15918 @srowen and @rxin what is the default behavior that is changed here? i see a current situation where an implicit encoder is provided that simply cannot handle the task at hand and this leads to failure. either the implicits for ExpressionEncoder need to be more narrow so that they do not claim types they cannot handle (and then other implicit encoders can be used), or they need to be able to handle these types, for example by falling back to kryo as is suggested in this JIRA. currrently ```implicitly[Encoder[Option[Set[Int``` gives you an ExpressionEncoder that cannot handle it. that is undesired and makes it difficult to provide an alternative implicit by the user. i proposed making the ExpressionEncoders more narrow (that seemed the easier fix to me at first) but @marmbrus preferred the approach of falling back to kryo and broadening it. see: http://apache-spark-developers-list.1001551.n3.nabble.com/getting-encoder-implicits-to-be-more-accurate-td19561.html --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
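For readers of this thread, the explicit workaround that exists today looks roughly like this (assuming an existing `SparkSession` named `spark`); the open question above is whether the implicits should fall back to something like it automatically.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// supply a Kryo-backed encoder explicitly instead of relying on the implicit ExpressionEncoder
val optSetEncoder: Encoder[Option[Set[Int]]] = Encoders.kryo[Option[Set[Int]]]

val ds = spark.createDataset(Seq(Some(Set(1, 2)), None))(optSetEncoder)
```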
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 if they chain like that then i think i know how to do the optimization. but do they? look for example at dataset.groupByKey(...).mapValues(...) Dataset[T].groupByKey[K] uses function T => K and creates KeyValueGroupedDataset[K, T] KeyValueGroupedDataset[K, T].mapValues[W] uses function T => W and creates KeyValueGroupedDataset[K, W] so i have T => K and then T => W On Thu, Oct 20, 2016 at 8:26 PM, Wenchen Fan <notificati...@github.com> wrote: > 2 chained AppendColumns will have 2 functions: T => U and U => W, so we > can combine them this way: > convert UnsafeRow to T > apply func to T to generate U > apply func to U to generate W > convert W to UnsafeRow > append the new UnsafeRow to the original one > > â > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/13526#issuecomment-255263458>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/AAyIJL19KTgysYd5dRAsDtIseY0jwlm3ks5q2AbLgaJpZM4IvEGP> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 @cloud-fan that makes sense to me, but its definitely not a quick win to create that optimization. let me think about it some more --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 @cloud-fan i can try to optimize ```grouped.mapValues(...).mapValues(...)``` but its a bit of an anti-pattern (there should be no need to do mapValues twice) so i dont think there is much gain in optimizing this. what do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 @rxin i can give it a try (the optimizer rule) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.d...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/15382#discussion_r83921525 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -741,7 +741,7 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging { def variableSubstituteDepth: Int = getConf(VARIABLE_SUBSTITUTE_DEPTH) - def warehousePath: String = new Path(getConf(WAREHOUSE_PATH)).toString + def warehousePath: String = Utils.resolveURI(getConf(WAREHOUSE_PATH)).toString --- End diff -- i agree with @vanzin about dislike for resolveURI. i expect paths without schemes to be on my default filesystem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.dir is r...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15382 i don't think there is such a thing as a HDFS working directory, but that probably means it just uses the home dir on hdfs (/user/) for any relative paths --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15382: [SPARK-17810] [SQL] Default spark.sql.warehouse.dir is r...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15382 i think working dir makes more sense than home dir. but could this catch people by surprise because we now expect write permission in the working dir? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13868: [SPARK-15899] [SQL] Fix the construction of the f...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13868#discussion_r82216818 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -55,7 +56,7 @@ object SQLConf { val WAREHOUSE_PATH = SQLConfigBuilder("spark.sql.warehouse.dir") .doc("The default location for managed databases and tables.") .stringConf -.createWithDefault("file:${system:user.dir}/spark-warehouse") +.createWithDefault("${system:user.dir}/spark-warehouse") --- End diff -- or use FileSystem.getHomeDirectory? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan i thought about this a little more, and my suggested changes to the Aggregator api do not allow one to use a different encoder when applying a typed operation on a Dataset. so i do not think it is dangerous as such. it does enable usage within the untyped grouping, which is where type conversions are already customary anyhow. it's not more dangerous than, say, using a udaf in a DataFrame. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14576: [SPARK-16391][SQL] Support partial aggregation fo...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r75186632 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.expressions + +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder + +/** + * An aggregator that uses a single associative and commutative reduce function. This reduce + * function can be used to go through all input values and reduces them to a single value. + * If there is no input, a null value is returned. + * + * @since 2.1.0 + */ +private[sql] class ReduceAggregator[T: Encoder](func: (T, T) => T) + extends Aggregator[T, (Boolean, T), T] { + + private val encoder = implicitly[Encoder[T]] + + override def zero: (Boolean, T) = (false, null.asInstanceOf[T]) + + override def bufferEncoder: Encoder[(Boolean, T)] = +ExpressionEncoder.tuple( + ExpressionEncoder[Boolean](), + encoder.asInstanceOf[ExpressionEncoder[T]]) + + override def outputEncoder: Encoder[T] = encoder + + override def reduce(b: (Boolean, T), a: T): (Boolean, T) = { +if (b._1) { + (true, func(b._2, a)) +} else { + (true, a) +} + } + + override def merge(b1: (Boolean, T), b2: (Boolean, T)): (Boolean, T) = { +if (!b1._1) { + b2 +} else if (!b2._1) { + b1 +} else { + (true, func(b1._2, b2._2)) +} + } + + override def finish(reduction: (Boolean, T)): T = reduction._2 --- End diff -- this can not happen since ReduceAggregator is private? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14576: [SPARK-16391][SQL] Support partial aggregation fo...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r75152186 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.expressions + +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder + +/** + * An aggregator that uses a single associative and commutative reduce function. This reduce + * function can be used to go through all input values and reduces them to a single value. + * If there is no input, a null value is returned. + * + * @since 2.1.0 + */ +private[sql] class ReduceAggregator[T: Encoder](func: (T, T) => T) + extends Aggregator[T, (Boolean, T), T] { + + private val encoder = implicitly[Encoder[T]] + + override def zero: (Boolean, T) = (false, null.asInstanceOf[T]) + + override def bufferEncoder: Encoder[(Boolean, T)] = +ExpressionEncoder.tuple( + ExpressionEncoder[Boolean](), + encoder.asInstanceOf[ExpressionEncoder[T]]) + + override def outputEncoder: Encoder[T] = encoder + + override def reduce(b: (Boolean, T), a: T): (Boolean, T) = { +if (b._1) { + (true, func(b._2, a)) +} else { + (true, a) +} + } + + override def merge(b1: (Boolean, T), b2: (Boolean, T)): (Boolean, T) = { +if (!b1._1) { + b2 +} else if (!b2._1) { + b1 +} else { + (true, func(b1._2, b2._2)) +} + } + + override def finish(reduction: (Boolean, T)): T = reduction._2 --- End diff -- since it should never happen, how about an assertion? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
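In the context of the `ReduceAggregator` shown in this diff, the suggested assertion could look roughly like this (a sketch of the idea, not the code that was merged):

```scala
override def finish(reduction: (Boolean, T)): T = {
  // the flag is only false when no input rows were seen, which callers
  // such as reduceGroups should never let reach finish
  assert(reduction._1, "ReduceAggregator requires at least one input row")
  reduction._2
}
```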
[GitHub] spark pull request #14576: [SPARK-16391][SQL] ReduceAggregator and partial a...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r74361702 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.expressions + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder + +/** + * :: Experimental :: + * An aggregator that uses a single associative and commutative reduce function. This reduce + * function can be used to go through all input values and reduces them to a single value. + * If there is no input, a null value is returned. + * + * @since 2.1.0 + */ +@Experimental +abstract class ReduceAggregator[T] extends Aggregator[T, (Boolean, T), T] { --- End diff -- i mean, i like it to be public but that changes what is important in the design --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14576: [SPARK-16391][SQL] ReduceAggregator and partial a...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r74316735 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.expressions + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder + +/** + * :: Experimental :: + * An aggregator that uses a single associative and commutative reduce function. This reduce + * function can be used to go through all input values and reduces them to a single value. + * If there is no input, a null value is returned. + * + * @since 2.1.0 + */ +@Experimental +abstract class ReduceAggregator[T] extends Aggregator[T, (Boolean, T), T] { + + // Question 1: Should func and encoder be parameters rather than abstract methods? + // rxin: abstract method has better java compatibility and forces naming the concrete impl, + // whereas parameter has better type inference (infer encoders via context bounds). + // Question 2: Should finish throw an exception or return null if there is no input? + // rxin: null might be more "SQL" like, whereas exception is more Scala like. --- End diff -- i prefer null over an exception ideally this would return an option. the operation really should be: ```def reduceGroups(f: (V, V) => V): Dataset[(K, Option[V])]``` and the class: ```abstract class ReduceAggregator[T] extends Aggregator[T, (Boolean, T), Option[T]]``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14576: [SPARK-16391][SQL] ReduceAggregator and partial a...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r74314375 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.expressions + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder + +/** + * :: Experimental :: + * An aggregator that uses a single associative and commutative reduce function. This reduce + * function can be used to go through all input values and reduces them to a single value. + * If there is no input, a null value is returned. + * + * @since 2.1.0 + */ +@Experimental +abstract class ReduceAggregator[T] extends Aggregator[T, (Boolean, T), T] { + + // Question 1: Should func and encoder be parameters rather than abstract methods? + // rxin: abstract method has better java compatibility and forces naming the concrete impl, + // whereas parameter has better type inference (infer encoders via context bounds). --- End diff -- i like the parameters better --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14576: [SPARK-16391][SQL] ReduceAggregator and partial a...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/14576#discussion_r74313912 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/ReduceAggregator.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.expressions + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder + +/** + * :: Experimental :: + * An aggregator that uses a single associative and commutative reduce function. This reduce + * function can be used to go through all input values and reduces them to a single value. + * If there is no input, a null value is returned. + * + * @since 2.1.0 + */ +@Experimental +abstract class ReduceAggregator[T] extends Aggregator[T, (Boolean, T), T] { --- End diff -- i like factoring this class out of the reduce operation instead of making it an anonymous class, but do we expect people to use this directly? does it need to be public? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14222: [SPARK-16391][SQL] KeyValueGroupedDataset.reduceGroups s...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/14222 this `ReduceAggregator` is useful beyond `.reduceGroups`. basically you can take any aggregation without a zero (i.e. just a reduce function) and turn it into a valid Aggregator, with the caveat that the result is nullable and will be null if no inputs are provided to the Aggregator. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
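`ReduceAggregator` itself is `private[sql]`, but the idea is easy to sketch for a concrete type: wrap a zero-less reduce function into a full `Aggregator`, with `null` as the result when there was no input. Illustrative code only, not the library class:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// buffer is (seenAnyInput, runningValue); finish yields null when nothing was seen
class ReduceStringAggregator(func: (String, String) => String)
    extends Aggregator[String, (Boolean, String), String] {

  override def zero: (Boolean, String) = (false, null)

  override def reduce(b: (Boolean, String), a: String): (Boolean, String) =
    if (b._1) (true, func(b._2, a)) else (true, a)

  override def merge(b1: (Boolean, String), b2: (Boolean, String)): (Boolean, String) =
    if (!b1._1) b2 else if (!b2._1) b1 else (true, func(b1._2, b2._2))

  override def finish(reduction: (Boolean, String)): String = reduction._2

  override def bufferEncoder: Encoder[(Boolean, String)] =
    Encoders.tuple(Encoders.scalaBoolean, Encoders.STRING)

  override def outputEncoder: Encoder[String] = Encoders.STRING
}
```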
[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13526#discussion_r71042267 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala --- @@ -312,6 +312,17 @@ class DatasetSuite extends QueryTest with SharedSQLContext { "a", "30", "b", "3", "c", "1") } + test("groupBy function, mapValues, flatMap") { +val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS() --- End diff -- it seems the other tests all use ```toDS()``` so i will stick to that convention --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13526#discussion_r71041725 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala --- @@ -65,6 +65,46 @@ class KeyValueGroupedDataset[K, V] private[sql]( groupingAttributes) /** + * Returns a new [[KeyValueGroupedDataset]] where the given function has been applied to the + * data. The grouping key is unchanged by this. + * + * {{{ + * // Create values grouped by key from a Dataset[(K, V)] + * ds.groupByKey(_._1).mapValues(_._2) // Scala + * }}} + * @since 2.0.0 + */ + def mapValues[W: Encoder](func: V => W): KeyValueGroupedDataset[K, W] = { --- End diff -- i think the convention is spaces before, i will fix --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13532: [SPARK-15204][SQL] improve nullability inference ...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13532#discussion_r69397207 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala --- @@ -305,4 +305,13 @@ class DatasetAggregatorSuite extends QueryTest with SharedSQLContext { val ds = Seq(1, 2, 3).toDS() checkDataset(ds.select(MapTypeBufferAgg.toColumn), 1) } + + test("spark-15204 improve nullability inference for Aggregator") { +val ds1 = Seq(1, 3, 2, 5).toDS() +assert(ds1.select(typed.sum((i: Int) => i)).schema.head.nullable === false) +val ds2 = Seq(AggData(1, "a"), AggData(2, "a")).toDS() +assert(ds2.groupByKey(_.b).agg(SeqAgg.toColumn).schema(1).nullable === true) --- End diff -- the last assert with NameAgg tests String as output of the Aggregator. is that good enough? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13933: [SPARK-16236] [SQL] Add Path Option back to Load API in ...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13933 For parquet, json etc. path not being put in options is not an issue since they don't retrieve it from the options. On Jun 29, 2016 2:31 AM, "Xiao Li" <notificati...@github.com> wrote: > @zsxwing <https://github.com/zsxwing> If we just provide one path in the > function input, it will not put path into the options. The API parquet(path: > String) still calls load(paths : _*), instead of load(path). Thus, we will > introduce inconsistent behavior, compared with load(path: String). > > Could you review the new PR I just submitted? Let me know if anything is > not appropriate. #13965 <https://github.com/apache/spark/pull/13965>. > Thanks! > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/13933#issuecomment-229268367>, or mute > the thread > <https://github.com/notifications/unsubscribe/AAyIJPT3agmWQyqBUqrzFW5F2eE095Wtks5qQhFPgaJpZM4I_nrj> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13727: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harm...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13727#discussion_r68672691 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(path: String): DataFrame = { -option("path", path).load() +load(Seq(path): _*) // force invocation of `load(...varargs...)` --- End diff -- it will also break users code in an upgrade --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13727: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harm...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13727#discussion_r68645998 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(path: String): DataFrame = { -option("path", path).load() +load(Seq(path): _*) // force invocation of `load(...varargs...)` --- End diff -- i believe that works as expected (i am running into some other issues now, but they seem unrelated). however from a DSL perspective this is not very pretty? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13727: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harm...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13727#discussion_r68624316 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def load(path: String): DataFrame = { -option("path", path).load() +load(Seq(path): _*) // force invocation of `load(...varargs...)` --- End diff -- with this change path is no longer available in the options. this makes it hard (impossible?) for non-file based DataSources (not implementing FileFormat) to use load(...) For example for elasticsearch we use: ``` sqlContext.read.format("org.elasticsearch.spark.sql").load(resource) ``` i do not think this can be implemented anymore now? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
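For context, a bare-bones sketch of the kind of non-file source meant here (the class name and schema are hypothetical placeholders); the only point is that such a source reads its resource from the `path` entry of the options map, which is what `option("path", path).load()` used to populate:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// hypothetical connector: not a FileFormat, so it never sees load()'s paths
// argument and must rely on the "path" option instead
class HypotheticalExternalSource extends RelationProvider {
  override def createRelation(
      ctx: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val resource = parameters.getOrElse("path",
      sys.error("resource must be supplied via the 'path' option"))
    new BaseRelation {
      override def sqlContext: SQLContext = ctx
      override def schema: StructType =
        StructType(StructField("value", StringType) :: Nil)
      // a real connector would also mix in TableScan and actually read `resource`
    }
  }
}
```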
[GitHub] spark issue #8416: [SPARK-10185] [SQL] Feat sql comma separated paths
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/8416 this patch should not have broken reading files that include a comma in the path. i also added a unit test for this: https://github.com/apache/spark/pull/8416/files#diff-5d2ebf4e9ca5a990136b276859769289R896 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 could we "rewind"/undo the append for the key and change it to a map that inserts new values and key? so remove one append and replace it with another operation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 the tricky part with that is that (ds: Dataset[(K, V)]).groupBy(_._1).mapValues(_._2) should return a KeyValueGroupedDataset[K, V]. On Tue, Jun 7, 2016 at 8:22 PM, Wenchen Fan <notificati...@github.com> wrote: > A possible approach maybe just keep the function given by mapValues, and > apply it before calling the function given by mapGroups. By doing this, > we at least won't make the performance worse, as the underlying plan > doesn't change. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <https://github.com/apache/spark/pull/13526#issuecomment-224453064>, or mute > the thread > <https://github.com/notifications/unsubscribe/AAyIJCOywIHS-XfwsPytJXcYZf7AHkqFks5qJgtWgaJpZM4IvEGP> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 ``` scala> val x = Seq(("a", 1), ("b", 2)).toDS x: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int] scala> x.groupByKey(_._1).mapValues(_._2).reduceGroups(_ + _).explain == Physical Plan == *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2, true]._1, true) AS value#36, input[0, scala.Tuple2, true]._2 AS value#37] +- MapGroups , value#32.toString, value#34: int, [value#32], [value#34], obj#35: scala.Tuple2 +- *Sort [value#32 ASC], false, 0 +- Exchange hashpartitioning(value#32, 200) +- *Project [value#34, value#32] +- AppendColumns , newInstance(class scala.Tuple2), [input[0, int, true] AS value#34] +- AppendColumns , newInstance(class scala.Tuple2), [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#32] +- LocalTableScan [_1#28, _2#29] ``` it seems the two AppendColumns are not collapsed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 ok i will study the physical plans for both and try to understand why one would be slower --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 can you explain a bit what is inefficient and would need an optimizer rule? is it mapValues being called twice? once for the key and then for the new values? thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 see this conversation: https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3ccaaswr-7kqfmxd_cpr-_wdygafh+rarecm9olm5jkxfk14fc...@mail.gmail.com%3E mapGroups is not a very interesting API, since without support for secondary sort, and hence no fold operations, pushing all the values into the reducer never really makes sense. so the interesting APIs are reduce (when it's fixed to be efficient and not use mapGroups) and agg. how do you transform the values before they go into reduce? you cannot do this currently, which is why we need something like mapValues. with Aggregators you can indeed do something similar inside the Aggregator (since the input type is not equal to the buffer type), but this leads to all Aggregators currently taking in some kind of input transform function, which hints at a suboptimal API and a pattern that should be generalized and extracted. i am curious to know why appending a column is inefficient? i am open to different designs. about this being a rare case: i would argue the opposite. i expect to see a lot of key-value datasets (```Dataset[(K, V)]```) in our codebase, and on those a lot of operations like ```ds.groupByKey(_._1).mapValues(_._2).reduce(...)```, since this is the most natural translation of many RDD algos. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
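A small usage sketch of the pattern this comment argues for, assuming a local `SparkSession` and the `mapValues` added by this PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("kv-reduce").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 10), ("a", 20), ("b", 1)).toDS()

// transform the values first, then reduce per key: the Dataset analogue
// of the classic rdd.map(...).reduceByKey(...) pipeline
val sums = ds.groupByKey(_._1).mapValues(_._2).reduceGroups(_ + _)
sums.show()
```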
[GitHub] spark pull request #13532: [SPARK-15204][SQL] improve nullability inference ...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13532#discussion_r65986613 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala --- @@ -51,7 +52,8 @@ object TypedAggregateExpression { bufferDeserializer, outputEncoder.serializer, outputEncoder.deserializer.dataType, - outputType) + outputType, + outputNullable) --- End diff -- ok --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13526#discussion_r65972115 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala --- @@ -65,6 +65,44 @@ class KeyValueGroupedDataset[K, V] private[sql]( groupingAttributes) /** + * Returns a new [[KeyValueGroupedDataset]] where the given function has been applied to the + * data. The grouping key is unchanged by this. + * + * {{{ + * // Create values grouped by key from a Dataset[(K, V)] + * ds.groupBy(_._1).mapValues(_._2) + * }}} + * @since 2.0.0 + */ + def mapValues[W: Encoder](func: V => W): KeyValueGroupedDataset[K, W] = { +val withNewData = AppendColumns(func, dataAttributes, logicalPlan) +val projected = Project(withNewData.newColumns ++ groupingAttributes, withNewData) +val executed = sparkSession.sessionState.executePlan(projected) + +new KeyValueGroupedDataset( + encoderFor[K], + encoderFor[W], + executed, + withNewData.newColumns, + groupingAttributes) + } + + /** + * Returns a new [[KeyValueGroupedDataset]] where the given function has been applied to the + * data. The grouping key is unchanged by this. + * + * {{{ + * // Create values grouped by key from a Dataset[(K, V)] + * ds.groupBy(_._1).mapValues(_._2) --- End diff -- oh.. it should be groupByKey, not groupBy. woops. i will also comment that it's scala. i guess you want a java 8 lambda example? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13532: [SPARK-15204][SQL] improve nullability inference ...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/13532 [SPARK-15204][SQL] improve nullability inference for Aggregator ## What changes were proposed in this pull request? TypedAggregateExpression sets nullable based on the schema of the outputEncoder ## How was this patch tested? Add test in DatasetAggregatorSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-aggregator-nullable Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13532.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13532 commit 32cfcb7dcf0e42331a9ef29a2f0c611538cc4063 Author: Koert Kuipers <ko...@tresata.com> Date: 2016-06-06T20:34:39Z improve nullability inference for Aggregator --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 for example with this branch you can do: ``` val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i", "j", "k") df3.groupBy("i").agg( ComplexResultAgg.apply("i", "k"), SumAgg.apply("j"), AverageAgg.apply("j") ) ``` so these are multiple Aggregators applied in an untyped groupBy, each on selected columns. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 well that was sort of what i was trying to achieve. the unit tests i added were for using an Aggregator for untyped grouping (```groupBy```). and i think for it to be useful within that context one should also be able to select the columns that the Aggregator operates on. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13512: [SPARK-15769][SQL] Add Encoder for input type to ...
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/13512 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 If Aggregator is designed for typed Dataset only then that is a bit of a shame, because it's an elegant and generic api that should be useful for DataFrame too. this causes fragmentation (Aggregator versus UserDefinedAggregateFunction). I am not sure what the better way to do this is, but i would like a single high level Aggregator-like api that i can use in Dataset and DataFrame. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/13526 [SPARK-15780][SQL] Support mapValues on KeyValueGroupedDataset ## What changes were proposed in this pull request? Add mapValues to KeyValueGroupedDataset ## How was this patch tested? New test in DatasetSuite for groupBy function, mapValues, flatMap You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-keyvaluegroupeddataset-mapvalues Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13526.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13526 commit 3494ec5b913ef6c1314a4b96279096a18a7fe5a1 Author: Koert Kuipers <ko...@tresata.com> Date: 2016-06-06T16:09:35Z add mapValues to KeyValueGroupedDataset --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan i am running into some trouble updating my branch to the latest master. i get errors in tests due to Analyzer.validateTopLevelTupleFields. the issue seems to be that in KeyValueGroupedDataset[K, T] the Aggregators are supposed to operate on T, but the logicalPlan at this point already has K appended to T because AppendColumns(func, inputPlan) is applied to the plan before it's passed into KeyValueGroupedDataset. so validateTopLevelTupleFields also sees the column for the key in the inputs and believes the deserializer for T is missing a field. any suggestions on how to get around this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan from the (added) unit tests: ``` val df2 = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDF("i", "j") checkAnswer(df2.groupBy("i").agg(ComplexResultAgg.toColumn), Row("a", Row(2, 4)) :: Row("b", Row(1, 3)) :: Nil) ``` this shows how the underlying type is Row (with a schema consisting of Strings and Ints), and it gets converted to the input type of the Aggregator which is (String, Long), so this involves both conversion and upcast. and: ``` val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i", "j", "k") checkAnswer(df3.groupBy("i").agg(ComplexResultAgg("i", "k")), Row("a", Row(2, 4)) :: Row("b", Row(1, 3)) :: Nil) ``` this is similar to the previous example but i also select the columns i want the Aggregator to operate on (namely columns "i" and "k") --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 **[Test build #5 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/consoleFull)** for PR 13512 at commit [`077f782`](https://github.com/apache/spark/commit/077f782cbf1e64439b8d5bb738819faebbf5914b). * This patch **fails MiMa tests**. * This patch **does not merge cleanly**. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/5/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 Build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 **[Test build #5 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/consoleFull)** for PR 13512 at commit [`077f782`](https://github.com/apache/spark/commit/077f782cbf1e64439b8d5bb738819faebbf5914b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13512: [SPARK-15769][SQL] Add Encoder for input type to ...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/13512 [SPARK-15769][SQL] Add Encoder for input type to Aggregator ## What changes were proposed in this pull request? Aggregator also has an Encoder for the input type ## How was this patch tested? Add tests to DatasetAggregatorSuite to upcast aggregator input types and apply Aggregator only to selected columns You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-aggregator-input-encoder Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13512.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13512 commit 6da0aef2885170ff4ecc6394463de6aedadbf867 Author: Koert Kuipers <ko...@tresata.com> Date: 2016-06-03T04:13:14Z add input encoder to aggregator commit 31f7b68ef9db3261a378272f714086afbd0e3208 Author: Koert Kuipers <ko...@tresata.com> Date: 2016-06-03T04:32:38Z make inputDeserializer mandatory in TypedAggregateExpression commit 76b8878a84b11b37e50b30c87822ac34546393f2 Author: Koert Kuipers <ko...@tresata.com> Date: 2016-06-03T04:53:27Z show aggregator can now be used in DataFrame without needing to handle row objects directly commit 077f782cbf1e64439b8d5bb738819faebbf5914b Author: Koert Kuipers <ko...@tresata.com> Date: 2016-06-03T05:31:44Z allow aggregator to operate on subset of columns --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14139 Dataset loses nullability in opera...
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/11980 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15204][SQL] Nullable is not correct for...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/13012#issuecomment-218053856 blackbox transformations infer nullable=false when you return a primitive. for example: ``` scala> sc.parallelize(List(1,2,3)).toDS.map(i => i * 2).schema res0: org.apache.spark.sql.types.StructType = StructType(StructField(value,IntegerType,false)) ``` why would aggregator be any different? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
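The corresponding check for an `Aggregator` is the `DatasetAggregatorSuite` assertion quoted earlier in this digest; roughly, assuming `spark.implicits._` is in scope:

```scala
import org.apache.spark.sql.expressions.scalalang.typed

val ds1 = Seq(1, 3, 2, 5).toDS()
// with nullability inferred from the output encoder, the primitive result
// of the typed sum is reported as non-nullable, just like the map example above
assert(!ds1.select(typed.sum((i: Int) => i)).schema.head.nullable)
```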
[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216678197 yup needs to be transient, will fix On Tue, May 3, 2016 at 5:58 PM, andrewor14 <notificati...@github.com> wrote: > I think it's OK for it to be lazy; just wanted to understand why. But it > should be transient though since sparkSession is also transient. > > — > You are receiving this because you authored the thread. > Reply to this email directly or view it on GitHub > <https://github.com/apache/spark/pull/12877#issuecomment-216677161> > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216675245 if a SparkSession sits inside a Dataset does that mean _wrapped is always already initialized (because you cannot have a Dataset without a SparkContext)? if so, i should probably make it a val instead of lazy val. On Tue, May 3, 2016 at 5:31 PM, Koert Kuipers <ko...@tresata.com> wrote: > i made it lazy val since SparkSession.wrapped is effectively lazy too: > protected[sql] def wrapped: SQLContext = { > if (_wrapped == null) { > _wrapped = new SQLContext(self, isRootContext = false) > } > _wrapped > } > > > On Tue, May 3, 2016 at 5:29 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> oh since since sparkSession is just a normal val i guess it can also be >> >> On Tue, May 3, 2016 at 5:25 PM, andrewor14 <notificati...@github.com> >>> wrote: >>> >>> Looks good otherwise. >>> >>> — >>> You are receiving this because you authored the thread. >>> Reply to this email directly or view it on GitHub >>> <https://github.com/apache/spark/pull/12877#issuecomment-216668953> >>> >> >> > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216670925 i made it lazy val since SparkSession.wrapped is effectively lazy too: protected[sql] def wrapped: SQLContext = { if (_wrapped == null) { _wrapped = new SQLContext(self, isRootContext = false) } _wrapped } On Tue, May 3, 2016 at 5:29 PM, Koert Kuipers <ko...@tresata.com> wrote: > oh since since sparkSession is just a normal val i guess it can also be > > On Tue, May 3, 2016 at 5:25 PM, andrewor14 <notificati...@github.com> > wrote: > >> Looks good otherwise. >> >> — >> You are receiving this because you authored the thread. >> Reply to this email directly or view it on GitHub >> <https://github.com/apache/spark/pull/12877#issuecomment-216668953> >> > > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216670423 oh, since sparkSession is just a normal val i guess it can also be. On Tue, May 3, 2016 at 5:25 PM, andrewor14 <notificati...@github.com> wrote: > Looks good otherwise. > > — > You are receiving this because you authored the thread. > Reply to this email directly or view it on GitHub > <https://github.com/apache/spark/pull/12877#issuecomment-216668953> > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/12877 [SPARK-15097][SQL] make Dataset.sqlContext a stable identifier for imports ## What changes were proposed in this pull request? Make Dataset.sqlContext a lazy val so that it's a stable identifier and can be used for imports. Now this works again: import someDataset.sqlContext.implicits._ ## How was this patch tested? Add a unit test to DatasetSuite that uses the import shown above. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-sqlcontext-stable-import Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12877.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12877 commit 3804b53d849ede69aea74b4dfe309bf76d0b2cda Author: Koert Kuipers <ko...@tresata.com> Date: 2016-05-03T20:36:09Z make Dataset.sqlContext a lazy val so that its a stable identifier and can be used for imports (e.g. import someDataset.sqlContext.implicits._) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
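A usage sketch of what the stable identifier enables (the helper below is hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

def withLengths(words: Dataset[String]): DataFrame = {
  // this import only compiles if sqlContext is a (lazy) val, i.e. a stable identifier
  import words.sqlContext.implicits._
  words.map(w => (w, w.length)).toDF("word", "length")
}
```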