[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

2014-08-01 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50890613 reduceByKey being the same as reduce, and cartesian being the same as broadcast, is the whole point, the difference being that reduceByKey and cartesian

[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

2014-08-01 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50896323 why do you use treeReduce + broadcast? the data per partition is small no? only a few aggregates per partition --- If your project is set up for it, you can reply

[GitHub] spark pull request: Add normalizeByCol method to mllib.util.MLUtil...

2014-08-01 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/1698#issuecomment-50900320 i can see your point of 10M columns. would be really nice if we had a lazy and efficient allReduce(RDD[T], (T, T) => T): RDD[T] as an RDD transform
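
The allReduce transform suggested above can be illustrated locally with plain Scala collections. This is a sketch only: the function name and signature come from the comment, partitions are modeled as a `Seq` of `Seq`s, and no Spark is involved.

```scala
// Plain-Scala sketch of the proposed allReduce(RDD[T], (T, T) => T): RDD[T]
// semantics: after the transform, every partition holds the reduction of
// *all* elements.
def allReduce[T](partitions: Seq[Seq[T]])(f: (T, T) => T): Seq[Seq[T]] = {
  // reduce within each non-empty partition first (cheap, local) ...
  val perPartition = partitions.filter(_.nonEmpty).map(_.reduce(f))
  // ... then across partitions (the expensive, cross-node step)
  val total = perPartition.reduce(f)
  // finally every partition receives the global result (the "broadcast" step)
  partitions.map(_ => Seq(total))
}

val parts = Seq(Seq(1, 2), Seq(3), Seq(4, 5))
allReduce(parts)(_ + _)  // Seq(Seq(15), Seq(15), Seq(15))
```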

[GitHub] spark pull request: implement secondary sort: sorting by values in...

2014-10-27 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/2962 implement secondary sort: sorting by values in addition to keys see: https://issues.apache.org/jira/browse/SPARK-3655 this is the first of 2 competing pullreqs that try to address

[GitHub] spark pull request: add foldLeftByKey to PairRDDFunctions for redu...

2014-10-27 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/2963 add foldLeftByKey to PairRDDFunctions for reduce algorithms that by key need to process values in a particular order see: https://issues.apache.org/jira/browse/SPARK-3655
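
The intended foldLeftByKey semantics can be sketched with plain Scala collections. The signature below is illustrative, not the one from the patch; concatenation is used as the fold because it is order-sensitive and so demonstrates the ordering guarantee.

```scala
// Plain-Scala sketch of foldLeftByKey: for each key, fold over the values
// in a well-defined order (here: sorted by an explicit Ordering).
def foldLeftByKey[K, V, B](kvs: Seq[(K, V)], zero: B)(ord: Ordering[V])(f: (B, V) => B): Map[K, B] =
  kvs.groupBy(_._1).map { case (k, pairs) =>
    k -> pairs.map(_._2).sorted(ord).foldLeft(zero)(f)
  }

val kvs = Seq(("a", 2), ("b", 1), ("a", 1), ("a", 3))
foldLeftByKey(kvs, "")(Ordering.Int)((acc, v) => acc + v)
// Map("a" -> "123", "b" -> "1")
```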

[GitHub] spark pull request: Feat kryo max buffersize

2014-06-20 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/735#issuecomment-46716428 hey sorry, somehow missed this conversation thread. sure, will update defaults and docs On Wed, Jun 4, 2014 at 1:48 AM, Patrick Wendell notificati

[GitHub] spark pull request: Feat kryo max buffersize

2014-06-23 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/735#issuecomment-46865956 not sure if i am supposed to deal with these failures? On Sat, Jun 21, 2014 at 1:52 PM, UCB AMPLab notificati...@github.com wrote: Refer

[GitHub] spark pull request: Feat kryo max buffersize

2014-07-09 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/735#issuecomment-48548156 i updated docs and defaults as requested. currently waiting for feedback or a merge On Wed, Jul 9, 2014 at 6:46 PM, mingyukim notificati

[GitHub] spark pull request: Feat kryo max buffersize

2014-07-16 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/735#issuecomment-49251684 https://issues.apache.org/jira/browse/SPARK-2543 On Wed, Jul 16, 2014 at 9:53 PM, Apache Spark QA notificati...@github.com wrote: QA tests

[GitHub] spark pull request: SPARK-1691: Support quoted arguments inside of...

2014-07-20 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/609#issuecomment-49559153 on the command line i can get this to work now, but it's still way beyond my bash skills to use exec spark-submit inside a script with multiple java options

[GitHub] spark pull request: SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225.

2014-03-19 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/113#issuecomment-38056370 thanks On Tue, Mar 18, 2014 at 2:55 AM, Reynold Xin notificati...@github.comwrote: We are reverting this pull request in #167https

[GitHub] spark pull request: Feat kryo max buffersize

2014-05-11 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/735 Feat kryo max buffersize You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-kryo-max-buffersize Alternatively you can

[GitHub] spark pull request: Feat kryo max buffersize

2014-05-12 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/735#issuecomment-42896061 hey matei, i think they always had this feature in kryo, at least in 2.x. created jira here: https://issues.apache.org/jira/browse/SPARK-1811
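
For reference, in current Spark the buffer ceiling this PR introduced is set via properties along these lines in `spark-defaults.conf`. Property names are taken from recent Spark documentation; the original patch may have used the older `spark.kryoserializer.buffer.max.mb` spelling.

```
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer      64k
spark.kryoserializer.buffer.max  128m
```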

[GitHub] spark pull request: SPARK-1801. expose InterruptibleIterator and T...

2014-05-13 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/764 SPARK-1801. expose InterruptibleIterator and TaskKilledException in developer api You can merge this pull request into a Git repository by running: $ git pull https://github.com

[GitHub] spark pull request: add foldLeftByKey to PairRDDFunctions for redu...

2014-12-05 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/2963#discussion_r21387829 --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala --- @@ -460,6 +461,63 @@ class PairRDDFunctions[K, V](self: RDD[(K, V

[GitHub] spark pull request: add foldLeftByKey to PairRDDFunctions for redu...

2014-12-05 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/2963#issuecomment-65828969 Hey @zsxwing, In Scala Seq the order in which the values get processed in foldLeft is well defined. But can we make any assumptions at all about

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-07 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/3632 SPARK-3655 GroupByKeyAndSortValues See https://issues.apache.org/jira/browse/SPARK-3655 This pullreq is based on the approach that uses repartitionAndSortWithinPartition, but only
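
The contract this PR aims for can be stated with a plain-Scala sketch (no Spark; the function name mirrors the PR title, the body is illustrative only): values for each key come back sorted, which is what repartitionAndSortWithinPartitions over a composite sort key provides in the distributed case.

```scala
// Plain-Scala sketch of the groupByKeyAndSortValues contract: group by key,
// with the values of each key returned in sorted order.
def groupByKeyAndSortValues[K, V](kvs: Seq[(K, V)])(implicit ord: Ordering[V]): Map[K, Seq[V]] =
  kvs.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2).sorted }

groupByKeyAndSortValues(Seq(("a", 3), ("b", 2), ("a", 1)))
// Map("a" -> Seq(1, 3), "b" -> Seq(2))
```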

[GitHub] spark pull request: implement secondary sort: sorting by values in...

2014-12-20 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/2962#issuecomment-67754336 i am going to close this pullreq. i get the impression there is no interest in changing spark internal sort routines to support sorting by (key, value) pairs

[GitHub] spark pull request: implement secondary sort: sorting by values in...

2014-12-20 Thread koertkuipers
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/2962

[GitHub] spark pull request: add foldLeftByKey to PairRDDFunctions for redu...

2014-12-20 Thread koertkuipers
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/2963

[GitHub] spark pull request: add foldLeftByKey to PairRDDFunctions for redu...

2014-12-20 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/2963#issuecomment-67754464 i am going to close this pullreq. i hope to pick up foldLeft later again (together with a proper java version), but for SPARK-3655 the focus for now

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-23 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-68011681 hey @markhamstra i assume you are referring to the one method groupByKeyAndSortValues that has an implicit Ordering[V] parameter, since the other

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-24 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-68081488 mhhh i don't really agree with you. i find OrderedRDD confusing because: 1) you kind of have to know that there is an implicit conversion to OrderedRDD somewhere

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-24 Thread koertkuipers
GitHub user koertkuipers reopened a pull request: https://github.com/apache/spark/pull/3632 SPARK-3655 GroupByKeyAndSortValues See https://issues.apache.org/jira/browse/SPARK-3655 This pullreq is based on the approach that uses repartitionAndSortWithinPartition, but only

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-24 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-68081763 i will work on an updated version early January

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-24 Thread koertkuipers
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/3632

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-24 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-68081928 whoops, sorry, i hit the wrong button there. didn't mean to close this pullreq. @markhamstra i will try to update this pullreq sometime in first few weeks

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-26 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-68150977 @markhamstra take a look now. i ignored the situation of K and V having same type, since i think it can be dealt with by using a simple wrapper (value) class

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22428452 --- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22423736 --- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22423573 --- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-11 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22770867 --- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala --- @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request: SPARK-4644 blockjoin

2015-06-18 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/6883 SPARK-4644 blockjoin Although the discussion (and design doc) under SPARK-4644 seems focused on other aspects of skew (OOM mostly) than this pullreq (which focuses on avoiding a single
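
The skew-avoidance idea behind a block join can be sketched locally with plain Scala (the names `blockTagLeft`/`blockTagRight` and the helper `join` are made up for illustration; this is not the PR's code): salt one side deterministically into `n` buckets, replicate the other side into every bucket, then join on the composite (key, bucket) key so a single hot key is spread over `n` reducers.

```scala
// Salt each left row into one of n buckets (round-robin per row).
def blockTagLeft[K, V](kvs: Seq[(K, V)], n: Int): Seq[((K, Int), V)] =
  kvs.zipWithIndex.map { case ((k, v), i) => ((k, i % n), v) }

// Replicate each right row into all n buckets so every salted left row
// finds its match.
def blockTagRight[K, W](kvs: Seq[(K, W)], n: Int): Seq[((K, Int), W)] =
  kvs.flatMap { case (k, w) => (0 until n).map(b => ((k, b), w)) }

// Naive local equi-join, just to check the tagged join is equivalent.
def join[A, B, C](l: Seq[(A, B)], r: Seq[(A, C)]): Seq[(A, (B, C))] =
  for ((ka, b) <- l; (kb, c) <- r if ka == kb) yield (ka, (b, c))
```

Joining the tagged sides on (key, bucket) and dropping the bucket yields exactly the pairs of a plain join on key, but the work for a hot key is split across `n` buckets.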

[GitHub] spark pull request: SPARK-8398 hadoop input/output format advanced...

2015-06-20 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6848#issuecomment-113789091 i see MiMa failed. what binary compatibility promise does spark make? are all minor versions binary compatible?

[GitHub] spark pull request: SPARK-8398 hadoop input/output format advanced...

2015-06-18 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6848#issuecomment-113242162 ok i will look into JavaSparkContext and a few simple regression tests. will probably need some help with python. On Wed, Jun 17, 2015 at 12:34 AM

[GitHub] spark pull request: SPARK-8398 hadoop input/output format advanced...

2015-06-22 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6848#issuecomment-114350614 i see: [info] spark-core: found 2 potential binary incompatibilities (filtered 488) [error] * method saveAsTextFile(java.lang.String

[GitHub] spark pull request: SPARK 8398 hadoop input/output format advanced...

2015-06-16 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/6848 SPARK 8398 hadoop input/output format advanced control You can merge this pull request into a Git repository by running: $ git pull https://github.com/tresata/spark feat-hadoop-input

[GitHub] spark pull request: [SPARK-7708] [Core] [WIP] Fixes for Kryo closu...

2015-06-29 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6361#issuecomment-116739363 @JoshRosen one issue i see with publishing a modified chill package: we read files in spark that were written by scalding using chill/kryo for serialization

[GitHub] spark pull request: [SPARK-7708] [Core] [WIP] Fixes for Kryo closu...

2015-07-05 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6361#issuecomment-118637079 that's fair enough. however keep in mind that kryo is a transitive dependency of spark, and one that does not upgrade well and has not been shaded, so you

[GitHub] spark pull request: [SPARK-7708] [Core] [WIP] Fixes for Kryo closu...

2015-06-28 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6361#issuecomment-116299142 i am not so sure it is safe to bump the kryo version like that. chill 0.5.0 doesn't compile against kryo 2.24.0, so what guarantees do you have that chill

[GitHub] spark pull request: SPARK-4644 blockjoin

2015-08-22 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-133778896 i put this in a spark package together with skewjoin in case anyone wants to use it. see here: http://spark-packages.org/package/tresata/spark-skewjoin

[GitHub] spark pull request: SPARK-8398 hadoop input/output format advanced...

2015-07-14 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6848#issuecomment-121468577 @andrewor14 @JoshRosen anything i need to do, besides fixing trivial conflicts?

[GitHub] spark pull request: [SPARK-10185] [SQL] Feat sql comma separated p...

2015-10-29 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/8416#issuecomment-152095666 You could create a dataframe per path and then union them. On Oct 28, 2015 19:14, "Jon Edvald" <notificati...@github.com> wrote: >

[GitHub] spark pull request: [SPARK-10185] [SQL] Feat sql comma separated p...

2015-10-15 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/8416#issuecomment-148589922 i believe this is done

[GitHub] spark pull request: [SPARK-10185] [SQL] Feat sql comma separated p...

2015-10-17 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/8416#discussion_r42308755 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -123,6 +124,24 @@ class DataFrameReader private[sql](sqlContext

[GitHub] spark pull request: SPARK-4644 blockjoin

2015-10-17 Thread koertkuipers
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/6883

[GitHub] spark pull request: SPARK-4644 blockjoin

2015-10-17 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6883#issuecomment-148917448 sure

[GitHub] spark pull request: [SPARK-10185] [SQL] Feat sql comma separated p...

2015-10-06 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/8416#issuecomment-145902167 goals (copied over from SPARK-5741 comments by @marmbrus ): It was originally just parquet that would support more than one file, but now all HadoopFSRelations

[GitHub] spark pull request: SPARK-8398 hadoop input/output format advanced...

2015-07-10 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6848#issuecomment-120564648 i will resolve conflicts when someone says this is good to go. otherwise i keep merging from master every few days.

[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-07-10 Thread koertkuipers
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/3632

[GitHub] spark pull request: [SPARK-10185] [SQL] Feat sql comma separated p...

2015-08-27 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/8416#issuecomment-135486744 Can you point me to the jira where that decision was made? Hadoop globbing only covers a small subset of all use cases. For example for timeseries analysis

[GitHub] spark pull request: [SPARK-10185] [SQL] Feat sql comma separated p...

2015-08-27 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/8416#issuecomment-135488186 I am not sure Union is a good idea at all, since i would have to union DataFrames for hundreds of partitions and the Union logical operator only takes left
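
The concern about unioning hundreds of DataFrames can be made concrete with a toy model of plan-tree depth (plain Scala, no Spark; `Scan`/`Union2` are invented stand-ins for leaf relations and a two-child Union operator): folding a binary union left-to-right over n inputs builds a plan of depth n, while pairing halves keeps the depth logarithmic.

```scala
// Toy model of a two-child Union operator's plan tree.
sealed trait Plan { def depth: Int }
case object Scan extends Plan { val depth = 1 }
case class Union2(left: Plan, right: Plan) extends Plan {
  def depth: Int = 1 + (left.depth max right.depth)
}

// reduceLeft over n leaves: left-leaning tree, depth grows linearly
def leftFold(n: Int): Plan =
  Seq.fill(n)(Scan: Plan).reduceLeft(Union2(_, _))

// pairing halves: balanced tree, depth grows logarithmically
def balanced(ps: Seq[Plan]): Plan = ps match {
  case Seq(p) => p
  case _      => Union2(balanced(ps.take(ps.size / 2)), balanced(ps.drop(ps.size / 2)))
}

leftFold(256).depth                        // 256
balanced(Seq.fill(256)(Scan: Plan)).depth  // 9
```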

[GitHub] spark pull request: [SPARK-10185] [SQL] Feat sql comma separated p...

2015-08-29 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/8416#issuecomment-136034683 i updated this pullreq based on the conversation at https://issues.apache.org/jira/browse/SPARK-5741

[GitHub] spark pull request: [SPARK-1061] assumePartitioned

2015-09-08 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/4449#issuecomment-138716589 i would like to have something like this in core On Fri, Sep 4, 2015 at 6:22 AM, rapen <notificati...@github.com> wrote: > @danielhav

[GitHub] spark pull request: [SPARK-10185] [SQL] Feat sql comma separated p...

2015-08-25 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/8416 [SPARK-10185] [SQL] Feat sql comma separated paths Make sure comma-separated paths get processed correctly in ResolvedDataSource for a HadoopFsRelationProvider You can merge this pull request

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-04 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 **[Test build #5 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/consoleFull)** for PR 13512 at commit [`077f782`](https://github.com/apache/spark

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-04 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 Build finished. Test FAILed.

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-04 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 **[Test build #5 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/consoleFull)** for PR 13512 at commit [`077f782`](https://github.com/apache/spark

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-04 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/5/ Test FAILed

[GitHub] spark pull request #13512: [SPARK-15769][SQL] Add Encoder for input type to ...

2016-06-04 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/13512 [SPARK-15769][SQL] Add Encoder for input type to Aggregator ## What changes were proposed in this pull request? Aggregator also has an Encoder for the input type ## How
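
Spark's `org.apache.spark.sql.expressions.Aggregator` has a zero/reduce/merge/finish shape plus encoders for the buffer and output types; the PR's point is that the input type also warrants an Encoder. A stripped-down, Spark-free sketch of that shape (`SimpleAgg` is a hypothetical simplification, not Spark's class):

```scala
// Spark-free sketch of the Aggregator[-IN, BUF, OUT] shape. The real class
// also carries bufferEncoder and outputEncoder; the PR proposes an encoder
// for IN as well.
trait SimpleAgg[IN, BUF, OUT] {
  def zero: BUF                     // initial buffer
  def reduce(b: BUF, a: IN): BUF    // fold one input into the buffer
  def merge(b1: BUF, b2: BUF): BUF  // combine partial buffers
  def finish(b: BUF): OUT           // produce the final result
}

object SumAgg extends SimpleAgg[Int, Long, Long] {
  def zero = 0L
  def reduce(b: Long, a: Int) = b + a
  def merge(b1: Long, b2: Long) = b1 + b2
  def finish(b: Long) = b
}

// running it by hand over two "partitions"
val out = SumAgg.finish(SumAgg.merge(
  Seq(1, 2).foldLeft(SumAgg.zero)(SumAgg.reduce),
  Seq(3, 4).foldLeft(SumAgg.zero)(SumAgg.reduce)))  // 10L
```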

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 can you explain a bit what is inefficient and would need an optimizer rule? is it mapValues being called twice? once for the key and then for the new values? thanks!

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 see this conversation: https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3ccaaswr-7kqfmxd_cpr-_wdygafh+rarecm9olm5jkxfk14fc...@mail.gmail.com%3E mapGroups

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 could we "rewind"/undo the append for the key and change it to a map that inserts new values and key? so remove one append and replace it with another operation?

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 the tricky part with that is that (ds: Dataset[(K, V)]).groupBy(_._1).mapValues(_._2) should return a KeyValueGroupedDataset[K, V] On Tue, Jun 7, 2016 at 8:22 PM, Wenchen Fan

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 ``` scala> val x = Seq(("a", 1), ("b", 2)).toDS x: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int] scala> x.groupByKey(_._1).ma

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 If Aggregator is designed for typed Dataset only then that is a bit of a shame, because it's an elegant and generic api that should be useful for DataFrame too. this causes fragmentation

[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...

2016-06-06 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/13526 [SPARK-15780][SQL] Support mapValues on KeyValueGroupedDataset ## What changes were proposed in this pull request? Add mapValues to KeyValueGroupedDataset ## How
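
The contract being added can be stated with a plain-Scala analogue of a grouped dataset (a `Map` stands in for `KeyValueGroupedDataset`; `mapValuesGrouped` is an illustrative name): transform the values of an existing grouping without re-grouping or touching the keys.

```scala
// Plain-Scala analogue of KeyValueGroupedDataset.mapValues: keys untouched,
// values rewritten in place within each group.
def mapValuesGrouped[K, V, W](grouped: Map[K, Seq[V]])(f: V => W): Map[K, Seq[W]] =
  grouped.map { case (k, vs) => k -> vs.map(f) }

val grouped = Map("a" -> Seq(1, 2), "b" -> Seq(3))
mapValuesGrouped(grouped)(_ * 10)  // Map("a" -> Seq(10, 20), "b" -> Seq(30))
```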

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 ok i will study the physical plans for both and try to understand why one would be slower

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 well that was sort of what i was trying to achieve. the unit tests i added were for using Aggregator for untyped grouping (```groupBy```). and i think for it to be useful within

[GitHub] spark pull request #13512: [SPARK-15769][SQL] Add Encoder for input type to ...

2016-06-06 Thread koertkuipers
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/13512

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 for example with this branch you can do: ``` val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i"

[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13526#discussion_r65972115 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala --- @@ -65,6 +65,44 @@ class KeyValueGroupedDataset[K, V] private

[GitHub] spark pull request #13532: [SPARK-15204][SQL] improve nullability inference ...

2016-06-06 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/13532 [SPARK-15204][SQL] improve nullability inference for Aggregator ## What changes were proposed in this pull request? TypedAggregateExpression sets nullable based on the schema

[GitHub] spark issue #8416: [SPARK-10185] [SQL] Feat sql comma separated paths

2016-06-11 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/8416 this patch should not have broken reading files that include a comma. i also added a unit test for this: https://github.com/apache/spark/pull/8416/files#diff

[GitHub] spark pull request #13532: [SPARK-15204][SQL] improve nullability inference ...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13532#discussion_r65986613 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala --- @@ -51,7 +52,8 @@ object

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-05 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan from the (added) unit tests: ``` val df2 = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDF("i", "j") checkAnswer(df2.grou

[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-05 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan i am running into some trouble updating my branch to the latest master. i get errors in tests due to Analyzer.validateTopLevelTupleFields the issue seems

[GitHub] spark pull request #13727: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harm...

2016-06-27 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13727#discussion_r68672691 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession

[GitHub] spark pull request #13727: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harm...

2016-06-27 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13727#discussion_r68624316 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession

[GitHub] spark pull request #13727: [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harm...

2016-06-27 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/13727#discussion_r68645998 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -135,7 +129,7 @@ class DataFrameReader private[sql](sparkSession

[GitHub] spark pull request: SPARK-14139 Dataset loses nullability in opera...

2016-03-28 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11980#issuecomment-202588020 @cloud-fan i tried to do that, but i don't think i am familiar enough with the code gen, because it breaks other unit tests. it seems to me i am messing up

[GitHub] spark pull request: SPARK-14139 Dataset loses nullability in opera...

2016-03-26 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/11980 SPARK-14139 Dataset loses nullability in operations with RowEncoder ## What changes were proposed in this pull request? RowEncoder now respects nullability for struct fields when

[GitHub] spark pull request: [SPARK-13531] [SQL] Avoid call defaultSize of ...

2016-03-27 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11508#issuecomment-202111907 it might seem easiest to put a defaultSize on ObjectType, but i think that is masking the real problem, which is that the optimizer replaces the real types

[GitHub] spark pull request: [SPARK-13531] [SQL] Avoid call defaultSize of ...

2016-03-27 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11508#issuecomment-202136461 would it be possible to have a variation of ObjectType that can take in info like defaultSize which it takes from the real type? --- If your project is set up

[GitHub] spark pull request: [SPARK-13665][SQL] Separate the concerns of Ha...

2016-03-08 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11509#issuecomment-193877543 i believe the need to pass all files along (e.g. inputFiles: Array[FileStatus]) instead of just the input paths came from the need to cache it so that stuff

[GitHub] spark pull request: [SPARK-13665][SQL] Separate the concerns of Ha...

2016-03-08 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11509#issuecomment-193921325 if it did then it was not always in the apis i think? i remember the apis having paths: Seq[String] instead of files: Seq[FileStatus]. by explicitly

[GitHub] spark pull request: SPARK-14139 Dataset loses nullability in opera...

2016-04-03 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/11980#discussion_r58319194 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala --- @@ -120,17 +120,19 @@ object RowEncoder

[GitHub] spark pull request: SPARK-14139 Dataset loses nullability in opera...

2016-03-29 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11980#issuecomment-203070056 @cloud-fan i pushed an attempt at this, but i am having trouble with RowEncoderSuite encode/decode: Product this test uses a Product value with a StructType

[GitHub] spark pull request: SPARK-14139 Dataset loses nullability in opera...

2016-03-29 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/11980#discussion_r57829829 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala --- @@ -120,17 +120,19 @@ object RowEncoder

[GitHub] spark pull request: SPARK-14139 Dataset loses nullability in opera...

2016-03-29 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/11980#discussion_r57829840 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects.scala --- @@ -680,3 +680,54 @@ case class AssertNotNull(child

[GitHub] spark pull request: [SPARK-13363][SQL] support Aggregator in Relat...

2016-04-13 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12359#issuecomment-209535957 great, thanks for this

[GitHub] spark pull request: [SPARK-8398][CORE] Hadoop input/output format ...

2016-04-22 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/6848#issuecomment-213661475 @holdenk ok i tried to make it look all pretty

[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...

2016-04-27 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215192562 hello! why is there no stringNullValue? basically i want for a column with type string to read in all empty strings as nulls. this is what the old option

[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...

2016-04-27 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215194241 do these settings roundtrip correctly? say i set doubleNaNValue to "XY", and i create a dataframe with a Double.NaN in it, does it get written out corre

[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...

2016-04-27 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215196735 i personally would have been happy with a single simple value for nulls for all datatypes. and the usage of that single value should be consistent across

[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...

2016-04-30 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215979899 please also provide a way for strings to be converted to null upon reading
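The round-trip concern raised in this thread can be sketched in plain Scala (a toy model, not Spark's CSV code): a writer that emits nulls as a sentinel string and a reader that maps that sentinel back to null compose cleanly for true nulls, but the sentinel cannot distinguish an empty string from a null, which is exactly the ambiguity behind the request for a separate string null value.

```scala
// Toy model of the CSV nullValue round trip -- hypothetical names, not Spark's API.
object NullRoundTrip {
  val nullValue = ""  // assumed sentinel, mirroring a CSV nullValue-style option

  def write(v: Option[String]): String = v.getOrElse(nullValue)
  def read(s: String): Option[String] = if (s == nullValue) None else Some(s)

  def main(args: Array[String]): Unit = {
    val values = Seq(Some("a"), None, Some(""))
    val roundTripped = values.map(write).map(read)
    // Some("") collapses to None: the sentinel cannot tell an empty
    // string apart from a null, so the round trip is lossy.
    println(roundTripped)  // List(Some(a), None, None)
  }
}
```

Choosing a non-empty sentinel (say "\N") avoids the collision with empty strings, at the cost of that sentinel itself becoming unrepresentable as data.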

[GitHub] spark pull request: SPARK-14139 Dataset loses nullability in opera...

2016-05-19 Thread koertkuipers
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/11980

[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...

2016-05-03 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216670925 i made it lazy val since SparkSession.wrapped is effectively lazy too: protected[sql] def wrapped: SQLContext = { if (_wrapped == null

[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...

2016-05-03 Thread koertkuipers
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/12877 [SPARK-15097][SQL] make Dataset.sqlContext a stable identifier for imports ## What changes were proposed in this pull request? Make Dataset.sqlContext a lazy val so that it's a stable

[GitHub] spark pull request: [SPARK-15097][SQL] make Dataset.sqlContext a s...

2016-05-03 Thread koertkuipers
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/12877#issuecomment-216675245 if a SparkSession sits inside a Dataset does that mean _wrapped is always already initialized (because you cannot have a Dataset without a SparkContext
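The reason a lazy val helps here is a Scala language rule: you can only import members through a stable identifier (a val or object), never through a def. A minimal sketch of that rule, with illustrative names rather than Spark's actual classes:

```scala
// Sketch of Scala's "stable identifier required" rule for imports.
object StableId {
  class Session { val value = 42 }

  class Wrapper {
    def sessionDef: Session = new Session // a def is NOT a stable identifier
    val sessionVal: Session = new Session // a val IS a stable identifier
  }

  def main(args: Array[String]): Unit = {
    val w = new Wrapper
    // import w.sessionDef._  // would not compile: "stable identifier required"
    import w.sessionVal._     // fine: w.sessionVal is a stable path
    println(value)            // prints 42
  }
}
```

This is why `import ds.sqlContext.implicits._`-style usage needs `sqlContext` to be a (lazy) val on Dataset rather than a def.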
