[jira] [Resolved] (SPARK-11117) PhysicalRDD.outputsUnsafeRows should return true when the underlying data source produces UnsafeRows
[ https://issues.apache.org/jira/browse/SPARK-11117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-11117.
Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9305
[https://github.com/apache/spark/pull/9305]

> PhysicalRDD.outputsUnsafeRows should return true when the underlying data
> source produces UnsafeRows
>
> Key: SPARK-11117
> URL: https://issues.apache.org/jira/browse/SPARK-11117
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Fix For: 1.6.0
>
> {{PhysicalRDD}} doesn't override {{SparkPlan.outputsUnsafeRows}}, and thus
> can't avoid {{ConvertToUnsafe}} when upper level operators only support
> {{UnsafeRow}} even if the underlying data source produces {{UnsafeRow}}.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
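The planning decision described in the issue can be illustrated with a toy model. This is a plain-Python sketch, not Spark's code: the `Operator` class and the `ensure_unsafe` helper are hypothetical, and only the names `PhysicalRDD`, `ConvertToUnsafe`, and the `outputsUnsafeRows` flag come from the issue text.

```python
class Operator:
    """Toy physical operator; `outputs_unsafe_rows` mirrors the flag named in the issue."""

    def __init__(self, name, outputs_unsafe_rows=False, child=None):
        self.name = name
        self.outputs_unsafe_rows = outputs_unsafe_rows
        self.child = child


def ensure_unsafe(child):
    """Wrap `child` in a conversion step only if it does not already emit unsafe rows."""
    if child.outputs_unsafe_rows:
        return child  # no conversion needed
    return Operator("ConvertToUnsafe", outputs_unsafe_rows=True, child=child)


# A scan that always reports False gets a conversion it may not need;
# one that reports True is passed through untouched.
always_false = Operator("PhysicalRDD", outputs_unsafe_rows=False)
reports_true = Operator("PhysicalRDD", outputs_unsafe_rows=True)
print(ensure_unsafe(always_false).name)
print(ensure_unsafe(reports_true).name)
```

In this model, a scan whose flag is stuck at false always pays for the extra conversion step, which is the situation the issue describes.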
[jira] [Resolved] (SPARK-11345) Make HadoopFsRelation always outputs UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-11345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-11345.
Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9305
[https://github.com/apache/spark/pull/9305]

> Make HadoopFsRelation always outputs UnsafeRow
> --
>
> Key: SPARK-11345
> URL: https://issues.apache.org/jira/browse/SPARK-11345
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0, 1.5.1
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Fix For: 1.6.0
[jira] [Comment Edited] (SPARK-10158) ALS should print better errors when given Long IDs
[ https://issues.apache.org/jira/browse/SPARK-10158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983644#comment-14983644 ]

Bryan Cutler edited comment on SPARK-10158 at 10/31/15 7:05 AM:

I think the best way to handle this from the PySpark side is to add something like the following to {{ALS._prepare}} ([link|https://github.com/apache/spark/blob/master/python/pyspark/mllib/recommendation.py#L215]), which is called before training:

{noformat}
MAX_ID_VALUE = ratings.ctx._gateway.jvm.Integer.MAX_VALUE
if ratings.filter(lambda x: x.user > MAX_ID_VALUE or x.product > MAX_ID_VALUE).count() > 0:
    raise ValueError("Rating IDs must not exceed max Java int %s." % str(MAX_ID_VALUE))
{noformat}

But any operations on the data are probably not worth the hit for this issue.

Edit: I meant the above as an alternative to checking values for 2^31 explicitly, which could be done in the Ratings constructor but seems like too much of a hack to me.

was (Author: bryanc):
The only way I can see handling this from the PySpark side is to add something like the following to {{ALS._prepare}} ([link|https://github.com/apache/spark/blob/master/python/pyspark/mllib/recommendation.py#L215]), which is called before training:

{noformat}
MAX_ID_VALUE = ratings.ctx._gateway.jvm.Integer.MAX_VALUE
if ratings.filter(lambda x: x.user > MAX_ID_VALUE or x.product > MAX_ID_VALUE).count() > 0:
    raise ValueError("Rating IDs must not exceed max Java int %s." % str(MAX_ID_VALUE))
{noformat}

But any operations on the data are probably not worth the hit for this issue.

> ALS should print better errors when given Long IDs
> --
>
> Key: SPARK-10158
> URL: https://issues.apache.org/jira/browse/SPARK-10158
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib, PySpark
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> See [SPARK-10115] for the very confusing messages you get when you try to use
> ALS with Long IDs. We should catch and identify these errors and print
> meaningful error messages.
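The range check discussed in the comment can be tried out without a cluster. The following is a minimal sketch in plain Python, not the PySpark change itself: the `Rating` namedtuple is a stand-in for `pyspark.mllib.recommendation.Rating`, `check_rating_ids` is a hypothetical helper name, and the lower-bound check is an addition that the comment does not propose.

```python
from collections import namedtuple

JAVA_INT_MAX = 2**31 - 1  # value of java.lang.Integer.MAX_VALUE
JAVA_INT_MIN = -2**31     # value of java.lang.Integer.MIN_VALUE

# Stand-in for pyspark.mllib.recommendation.Rating (assumption for this sketch).
Rating = namedtuple("Rating", ["user", "product", "rating"])


def check_rating_ids(ratings):
    """Raise ValueError if any user or product ID does not fit in a Java int."""
    for r in ratings:
        for field, value in (("user", r.user), ("product", r.product)):
            if not JAVA_INT_MIN <= value <= JAVA_INT_MAX:
                raise ValueError(
                    "Rating %s ID %d is outside the Java int range [%d, %d]."
                    % (field, value, JAVA_INT_MIN, JAVA_INT_MAX))
    return ratings


check_rating_ids([Rating(1, 2, 5.0)])  # passes silently
```

On an RDD the same predicate would run inside a `filter`, which is the extra pass over the data that the comment is wary of.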
[jira] [Assigned] (SPARK-11436) we should rebind right encoder when join 2 datasets
[ https://issues.apache.org/jira/browse/SPARK-11436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11436:
Assignee: Apache Spark

> we should rebind right encoder when join 2 datasets
> ---
>
> Key: SPARK-11436
> URL: https://issues.apache.org/jira/browse/SPARK-11436
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Wenchen Fan
> Assignee: Apache Spark
[jira] [Commented] (SPARK-11436) we should rebind right encoder when join 2 datasets
[ https://issues.apache.org/jira/browse/SPARK-11436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983965#comment-14983965 ]

Apache Spark commented on SPARK-11436:
--
User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9391

> we should rebind right encoder when join 2 datasets
> ---
>
> Key: SPARK-11436
> URL: https://issues.apache.org/jira/browse/SPARK-11436
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Wenchen Fan
[jira] [Assigned] (SPARK-11436) we should rebind right encoder when join 2 datasets
[ https://issues.apache.org/jira/browse/SPARK-11436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11436:
Assignee: (was: Apache Spark)

> we should rebind right encoder when join 2 datasets
> ---
>
> Key: SPARK-11436
> URL: https://issues.apache.org/jira/browse/SPARK-11436
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Wenchen Fan
[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983980#comment-14983980 ]

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:51 PM:

Hah, I actually missed that, so my bad, thanks!
What I'm doing now then is along the lines of the following (example contrived from [here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html), only just realized this was you):

```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled by the convention of having to use `explode` as part of an embedding `select` statement though; as an unwitting user, I'd feel `df.explode($"col")` should do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd wonder 'if I wanted to also select just a subset of columns, I could just manually add a `select` to do so myself'. Obviously, changing user APIs is bad, and not everyone will have identical expectations, but I'm just kind of curious. Was this an artifact of performance considerations, or a deliberate part of a larger philosophy of having the syntax be as explicit as possible?

Then again, aside from keeping existing columns, to me the `drop` on the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes least processing definitely seems a sane choice.

Something else I was thinking about though would be an `explodeZipped` type of function, to explode multiple equally-sized-array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still sort of looking into that, but... at this point I'd wonder if perhaps I've overlooked existing functionality for that as well. :)

> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Tycho Grouwstra
> Labels: features
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to
> explode an array of structs (as are common in JSON) to their own rows so I
> could start analyzing the data using GraphX. I believe many others might have
> use for this as well, since most web data is in JSON format.
> This feature would
[jira] [Assigned] (SPARK-10500) sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable
[ https://issues.apache.org/jira/browse/SPARK-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10500:
Assignee: Apache Spark

> sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable
> ---
>
> Key: SPARK-10500
> URL: https://issues.apache.org/jira/browse/SPARK-10500
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 1.5.0
> Reporter: Jonathan Kelly
> Assignee: Apache Spark
>
> As of SPARK-6797, sparkr.zip is re-created each time spark-submit is run with
> an R application, which fails if Spark has been installed into a directory to
> which the current user doesn't have write permissions. (e.g., on EMR's
> emr-4.0.0 release, Spark is installed at /usr/lib/spark, which is only
> writable by root.)
> Would it be possible to skip creating sparkr.zip if it already exists? That
> would enable sparkr.zip to be pre-created by the root user and then reused
> each time spark-submit is run, which I believe is similar to how pyspark
> works.
> Another option would be to make the location configurable, as it's currently
> hardcoded to $SPARK_HOME/R/lib/sparkr.zip. Allowing it to be configured to
> something like the user's home directory or a random path in /tmp would get
> around the permissions issue.
> By the way, why does spark-submit even need to re-create sparkr.zip every
> time a new R application is launched? This seems unnecessary and inefficient,
> unless you are actively developing the SparkR libraries and expect the
> contents of sparkr.zip to change.
[jira] [Created] (SPARK-11436) we should rebind right encoder when join 2 datasets
Wenchen Fan created SPARK-11436:
---

Summary: we should rebind right encoder when join 2 datasets
Key: SPARK-11436
URL: https://issues.apache.org/jira/browse/SPARK-11436
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Wenchen Fan
[jira] [Resolved] (SPARK-11226) Empty line in json file should be skipped
[ https://issues.apache.org/jira/browse/SPARK-11226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-11226.
---
Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9211
[https://github.com/apache/spark/pull/9211]

> Empty line in json file should be skipped
> -
>
> Key: SPARK-11226
> URL: https://issues.apache.org/jira/browse/SPARK-11226
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Jeff Zhang
> Priority: Minor
> Fix For: 1.6.0
>
> Currently the empty line in json file will be parsed into Row with all null
> field values.
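The requested behaviour is easy to state outside of Spark. Below is a minimal sketch in plain Python of reading one JSON record per line while skipping blank lines; `parse_json_lines` is a hypothetical helper for illustration and says nothing about how pull request 9211 actually implements the fix.

```python
import json


def parse_json_lines(lines):
    """Parse one JSON record per line, skipping empty or whitespace-only lines."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # a blank line carries no record, so emit nothing for it
        records.append(json.loads(line))
    return records


sample = ['{"name": "a"}', "", "   ", '{"name": "b"}']
print(parse_json_lines(sample))
```

Skipping at the line level means no placeholder record with all-null fields is produced for a blank line.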
[jira] [Updated] (SPARK-11226) Empty line in json file should be skipped
[ https://issues.apache.org/jira/browse/SPARK-11226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-11226:
--
Assignee: Jeff Zhang

> Empty line in json file should be skipped
> -
>
> Key: SPARK-11226
> URL: https://issues.apache.org/jira/browse/SPARK-11226
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Priority: Minor
> Fix For: 1.6.0
>
> Currently the empty line in json file will be parsed into Row with all null
> field values.
[jira] [Commented] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983945#comment-14983945 ]

Michael Armbrust commented on SPARK-11431:
--
Have you looked at the explode that works on a column?

{code}
import org.apache.spark.sql.functions._
df.select(explode($"arrayOfStructs"))
{code}

> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Tycho Grouwstra
> Labels: features
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to
> explode an array of structs (as are common in JSON) to their own rows so I
> could start analyzing the data using GraphX. I believe many others might have
> use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to
> DataFrames by [~marmbrus], which currently errors when you call it on such
> arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor
> function to infer column types -- this approach is insufficient in the case
> of Rows, since their type does not contain the required info. The alternative
> here would be to instead grab the schema info from the existing schema for
> such cases.
> I'm trying to implement a patch that might add this functionality, so stay
> tuned until I've figured that out. I'm new here though so I'll probably have
> use for some feedback...
[jira] [Assigned] (SPARK-10500) sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable
[ https://issues.apache.org/jira/browse/SPARK-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10500:
Assignee: (was: Apache Spark)

> sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable
> ---
>
> Key: SPARK-10500
> URL: https://issues.apache.org/jira/browse/SPARK-10500
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 1.5.0
> Reporter: Jonathan Kelly
>
> As of SPARK-6797, sparkr.zip is re-created each time spark-submit is run with
> an R application, which fails if Spark has been installed into a directory to
> which the current user doesn't have write permissions. (e.g., on EMR's
> emr-4.0.0 release, Spark is installed at /usr/lib/spark, which is only
> writable by root.)
> Would it be possible to skip creating sparkr.zip if it already exists? That
> would enable sparkr.zip to be pre-created by the root user and then reused
> each time spark-submit is run, which I believe is similar to how pyspark
> works.
> Another option would be to make the location configurable, as it's currently
> hardcoded to $SPARK_HOME/R/lib/sparkr.zip. Allowing it to be configured to
> something like the user's home directory or a random path in /tmp would get
> around the permissions issue.
> By the way, why does spark-submit even need to re-create sparkr.zip every
> time a new R application is launched? This seems unnecessary and inefficient,
> unless you are actively developing the SparkR libraries and expect the
> contents of sparkr.zip to change.
[jira] [Commented] (SPARK-10500) sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable
[ https://issues.apache.org/jira/browse/SPARK-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983925#comment-14983925 ]

Apache Spark commented on SPARK-10500:
--
User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/9390

> sparkr.zip cannot be created if $SPARK_HOME/R/lib is unwritable
> ---
>
> Key: SPARK-10500
> URL: https://issues.apache.org/jira/browse/SPARK-10500
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 1.5.0
> Reporter: Jonathan Kelly
>
> As of SPARK-6797, sparkr.zip is re-created each time spark-submit is run with
> an R application, which fails if Spark has been installed into a directory to
> which the current user doesn't have write permissions. (e.g., on EMR's
> emr-4.0.0 release, Spark is installed at /usr/lib/spark, which is only
> writable by root.)
> Would it be possible to skip creating sparkr.zip if it already exists? That
> would enable sparkr.zip to be pre-created by the root user and then reused
> each time spark-submit is run, which I believe is similar to how pyspark
> works.
> Another option would be to make the location configurable, as it's currently
> hardcoded to $SPARK_HOME/R/lib/sparkr.zip. Allowing it to be configured to
> something like the user's home directory or a random path in /tmp would get
> around the permissions issue.
> By the way, why does spark-submit even need to re-create sparkr.zip every
> time a new R application is launched? This seems unnecessary and inefficient,
> unless you are actively developing the SparkR libraries and expect the
> contents of sparkr.zip to change.
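The two options raised in the description (reuse an existing archive, or write it somewhere the user can write) can be combined in a few lines. This is a plain-Python sketch under stated assumptions, not the change made in pull request 9390: `ensure_archive` and its parameters are hypothetical names, and the fallback to a temporary file is only one way to pick a writable location.

```python
import os
import tempfile
import zipfile


def ensure_archive(source_dir, preferred_zip):
    """Return the path of a zip of source_dir, reusing preferred_zip if it exists."""
    if os.path.isfile(preferred_zip):
        return preferred_zip  # pre-created (for example by root): reuse as-is

    target = preferred_zip
    parent = os.path.dirname(preferred_zip) or "."
    if not os.access(parent, os.W_OK):
        # Preferred location is not writable: fall back to a temporary file.
        fd, target = tempfile.mkstemp(suffix=".zip")
        os.close(fd)

    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to source_dir inside the archive.
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return target
```

The sketch assumes `preferred_zip` lives outside `source_dir`; otherwise the archive would try to include itself while it is being written.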
[jira] [Commented] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983980#comment-14983980 ]

Tycho Grouwstra commented on SPARK-11431:
-
Hah, I actually missed that, so my bad, thanks!
What I'm doing now then is along the lines of the following (example contrived from [here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html), only just realized this was you):

```
val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
import org.apache.spark.sql.functions._
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show
```

I must say I've felt slightly puzzled by the convention of having to use `explode` as part of an embedding `select` statement though; as an unwitting user, I'd feel `df.explode($"col")` should do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd wonder 'if I wanted to also select just a subset of columns, I could just manually add a `select` to do so myself'. Obviously, changing user APIs is bad, and not everyone will have identical expectations, but I'm just kind of curious. Was this an artifact of performance considerations, or a deliberate part of a larger philosophy of having the syntax be as explicit as possible?

Then again, aside from keeping existing columns, to me the `drop` on the pre-`explode` column would often seem a sensible default as well, so point taken that expectations may differ, in which case defaulting to whatever takes least processing definitely seems a sane choice.

Something else I was thinking about though would be an `explodeZipped` type of function, to explode multiple equally-sized-array columns together, as opposed to chaining separate explodes to form a Cartesian product. I was still sort of looking into that, but... at this point I'd wonder if perhaps I've overlooked existing functionality for that as well. :)

> Allow exploding arrays of structs in DataFrames
> ---
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Tycho Grouwstra
> Labels: features
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON
> data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to
> explode an array of structs (as are common in JSON) to their own rows so I
> could start analyzing the data using GraphX. I believe many others might have
> use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to
> DataFrames by [~marmbrus], which currently errors when you call it on such
> arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor
> function to infer column types -- this approach is insufficient in the case
> of Rows, since their type does not contain the required info. The alternative
> here would be to instead grab the schema info from the existing schema for
> such cases.
> I'm trying to implement a patch that might add this functionality, so stay
> tuned until I've figured that out. I'm new here though so I'll probably have
> use for some feedback...
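The difference between chaining explodes and the proposed `explodeZipped` is easiest to see on plain data. The sketch below models rows as Python dicts; `explode` here mimics the `select($"*", explode(...)).drop(...)` combination from the comment, and `explode_zipped` is a hypothetical illustration of the requested semantics, not an existing Spark function.

```python
def explode(rows, column, alias):
    """One output row per element of row[column]; other columns kept, `column` dropped."""
    out = []
    for row in rows:
        for element in row[column]:
            new_row = {k: v for k, v in row.items() if k != column}
            new_row[alias] = element
            out.append(new_row)
    return out


def explode_zipped(rows, columns):
    """Explode several equal-length array columns in lockstep (no Cartesian product)."""
    out = []
    for row in rows:
        lengths = {len(row[c]) for c in columns}
        if len(lengths) > 1:
            raise ValueError("array columns differ in length: %r" % sorted(lengths))
        for values in zip(*(row[c] for c in columns)):
            new_row = dict(row)
            new_row.update(zip(columns, values))  # replace each array by its element
            out.append(new_row)
    return out


people = [{"name": "Michael",
           "schools": [{"sname": "stanford", "year": 2010},
                       {"sname": "berkeley", "year": 2012}]}]
print(explode(people, "schools", "school"))
print(explode_zipped([{"id": 1, "a": [1, 2], "b": ["x", "y"]}], ["a", "b"]))
```

With two arrays of length 2, chaining two separate explodes would yield 4 rows per input row, whereas the zipped variant yields 2.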
[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983980#comment-14983980 ]

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:53 PM:

Hah, I actually missed that, so my bad, thanks! What I'm doing now is along the lines of the following (example adapted from [here|http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html]; I only just realized this was you):

{code}
// In spark-shell, `sc` and `sqlContext` already exist and sqlContext.implicits._
// (which provides the $"col" syntax) is already imported.
import org.apache.spark.sql.functions._

val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
// One output row per element of `schools`, then drop the original array column.
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show()
{code}

I must say I've been slightly puzzled by the convention of having to use `explode` inside an enclosing `select`, though. As an unwitting user, I'd expect `df.explode($"col")` to be functionally equivalent to the current `df.select($"*", explode($"col"))`, without having to type that out. In my naivety, I'd think: 'if I wanted to select just a subset of columns as well, I could add a `select` myself'. Obviously, changing user APIs is bad, and not everyone will have identical expectations, but I'm curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping the existing columns, dropping the pre-`explode` column would often seem a sensible default to me as well, so point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.

Something else I was thinking about is an `explodeZipped` type of function, to explode multiple equal-length array columns together, element by element, as opposed to chaining separate explodes, which forms a Cartesian product. I was still looking into that, but at this point I wonder whether I've overlooked existing functionality for that as well. :)

> Allow exploding arrays of structs in DataFrames
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Tycho Grouwstra
> Labels: features
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to explode an array of structs (as are common in JSON) to their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.
> This
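To make the difference between chaining explodes and the `explodeZipped` idea concrete, here is a minimal sketch using plain Scala collections rather than the DataFrame API. `Record`, `schools` and `years` are hypothetical names chosen for illustration, and `explodeZipped` is only the idea floated in the comment, not an existing Spark function.

```scala
// Plain-Scala model of one row holding two equal-length array columns.
// Record, schools and years are hypothetical; no Spark API is used here.
case class Record(name: String, schools: Seq[String], years: Seq[Int])

val row = Record("Michael", Seq("stanford", "berkeley"), Seq(2010, 2012))

// Chaining two separate explodes pairs every school with every year
// (a Cartesian product of the two arrays).
val chained: Seq[(String, String, Int)] =
  for (s <- row.schools; y <- row.years) yield (row.name, s, y)

// A zipped explode would instead pair elements by position.
val zipped: Seq[(String, String, Int)] =
  row.schools.zip(row.years).map { case (s, y) => (row.name, s, y) }

println(chained.size) // 4 rows: 2 schools x 2 years
println(zipped.size)  // 2 rows: one per array position
zipped.foreach(println)
```

With two arrays of length 2 the chained form yields 4 rows and the zipped form 2, so the gap grows quickly as arrays get longer or more columns are involved.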
[jira] [Updated] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tycho Grouwstra updated SPARK-11431:

Description:

I am creating DataFrames from some [JSON data|http://www.kayak.com/h/explore/api?airport=AMS], and would like to explode an array of structs (as are common in JSON) to their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.

This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the schemaFor function to infer column types -- this approach is insufficient in the case of Rows, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.

I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...

> Allow exploding arrays of structs in DataFrames
>
> Key: SPARK-11431
> URL: https://issues.apache.org/jira/browse/SPARK-11431
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Tycho Grouwstra
> Labels: features
> Original Estimate: 24h
> Remaining Estimate: 24h
[jira] [Created] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
Jason White created SPARK-11437:

Summary: createDataFrame shouldn't .take() when provided schema
Key: SPARK-11437
URL: https://issues.apache.org/jira/browse/SPARK-11437
Project: Spark
Issue Type: Improvement
Components: PySpark
Reporter: Jason White

When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify the first 10 rows of the RDD match the provided schema. Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.

Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.

https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325
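The trade-off described in SPARK-11437 can be illustrated with a small model. This is only a sketch in plain Scala, not PySpark's implementation: a lazily evaluated `Iterator` stands in for an RDD, and `source`, `matchesSchema` and the `(String, Int)` schema are hypothetical names made up for the example.

```scala
// `computed` counts how many rows have actually been produced so far.
var computed = 0

// Stand-in for a lazily computed RDD; the third row violates the schema.
def source(): Iterator[Seq[Any]] =
  Iterator(Seq[Any]("a", 1), Seq[Any]("b", 2), Seq[Any]("c", "oops"))
    .map { r => computed += 1; r }

// Hypothetical schema check: each row must be (String, Int).
def matchesSchema(row: Seq[Any]): Boolean = row match {
  case Seq(_: String, _: Int) => true
  case _                      => false
}

// Eager prefix check (like take(n)): forces computation immediately
// and still misses the bad third row.
val prefixOk = source().take(2).forall(matchesSchema)
val computedByEagerCheck = computed

// Lazy check on every row: nothing is computed until the data is consumed.
computed = 0
val verified = source().map { r =>
  require(matchesSchema(r), s"row does not match schema: $r")
  r
}
val computedBeforeAction = computed

// Consuming the data (the "action") surfaces the schema violation.
val failedOnConsume = scala.util.Try(verified.toList).isFailure

println(s"prefixOk=$prefixOk eager=$computedByEagerCheck lazy=$computedBeforeAction failed=$failedOnConsume")
```

The prefix check reports success while doing work up front, whereas the lazy check does no work until an action runs and then fails on the offending row.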
[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService
[ https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984140#comment-14984140 ]

Jeffrey Turpin commented on SPARK-6373:

[~jlewandowski], would you be willing to review my changes? I can create a pull request if you would like.

> Add SSL/TLS for the Netty based BlockTransferService
>
> Key: SPARK-6373
> URL: https://issues.apache.org/jira/browse/SPARK-6373
> Project: Spark
> Issue Type: New Feature
> Components: Block Manager, Shuffle
> Affects Versions: 1.2.1
> Reporter: Jeffrey Turpin
>
> Add the ability to allow for secure communications (SSL/TLS) for the Netty based BlockTransferService and the ExternalShuffleClient. This ticket will hopefully start the conversation around potential designs... Below is a reference to a WIP prototype which implements this functionality (prototype)... I have attempted to disrupt as little code as possible and tried to follow the current code structure (for the most part) in the areas I modified. I also studied how Hadoop achieves encrypted shuffle (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html)
> https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c
[jira] [Assigned] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11437: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984166#comment-14984166 ] Apache Spark commented on SPARK-11437: User 'JasonMWhite' has created a pull request for this issue: https://github.com/apache/spark/pull/9392
[jira] [Assigned] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11437: Assignee: Apache Spark
[jira] [Updated] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation
[ https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-10978:

Priority: Critical (was: Minor)

> Allow PrunedFilterScan to eliminate predicates from further evaluation
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 1.3.0, 1.4.0, 1.5.0
> Reporter: Russell Alexander Spitzer
> Priority: Critical
>
> Currently PrunedFilterScan allows implementors to push down predicates to an underlying datasource. This is done solely as an optimization, as the predicate will be reapplied on the Spark side as well. This allows for bloom-filter-like operations, but ends up doing a redundant scan for those sources which can do accurate pushdowns.
> In addition, it makes it difficult for underlying sources to accept queries which reference non-existent columns to provide ancillary functions. In our case, we allow a Solr query to be passed in via a non-existent solr_query column. Since this column is not returned, nothing passes when Spark filters on "solr_query".
> Suggestion on the ML from [~marmbrus]:
> {quote}
> We have to try and maintain binary compatibility here, so probably the easiest thing to do here would be to add a method to the class. Perhaps something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, but specific implementations could override it. There is still a chance that this would conflict with existing methods, but hopefully that would not be a problem in practice.
> {quote}
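The `unhandledFilters` suggestion quoted in SPARK-10978 can be sketched as follows. This is a minimal model and not Spark's actual interface: `Filter`, `EqualTo` and `GreaterThan` are local stand-ins named after the classes in `org.apache.spark.sql.sources`, while `FilterPushdown` and `SolrRelation` are hypothetical names introduced for illustration.

```scala
// Local stand-ins for Spark's data source filter classes (assumption: simplified).
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

// Hypothetical trait carrying the proposed method.
trait FilterPushdown {
  // Default: report every filter as unhandled, so the engine re-applies all
  // of them and existing implementations keep their current behavior.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
}

// Hypothetical source that fully evaluates the synthetic `solr_query` predicate
// itself, so the engine must not re-apply it to the returned rows.
object SolrRelation extends FilterPushdown {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("solr_query", _) => true
      case _                        => false
    }
}

val pushed: Array[Filter] = Array(EqualTo("solr_query", "title:spark"), GreaterThan("year", 2010))
val remaining = SolrRelation.unhandledFilters(pushed)
remaining.foreach(println) // only the filter the engine still has to evaluate
```

Because the default implementation returns its input unchanged, adding such a method would not alter the behavior of sources that do not override it, which is the compatibility point made in the quote.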
[jira] [Updated] (SPARK-11024) Optimize NULL in by folding it to Literal(null)
[ https://issues.apache.org/jira/browse/SPARK-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-11024:

Assignee: Dilip Biswal

> Optimize NULL in by folding it to Literal(null)
>
> Key: SPARK-11024
> URL: https://issues.apache.org/jira/browse/SPARK-11024
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Dilip Biswal
> Assignee: Dilip Biswal
> Priority: Minor
> Fix For: 1.6.0
>
> Add an optimizer rule that converts NULL [NOT] IN (expr1,...,expr2) to Literal(null).
> This is a follow-up to SPARK-8654, suggested by Wenchen Fan.
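The folding rule described in SPARK-11024 can be modeled on a toy expression tree. This is a sketch under stated assumptions, not Catalyst's implementation: `Expr`, `Literal`, `Attr`, `In`, `Not` and `foldNullIn` are simplified, hypothetical definitions that only borrow Catalyst-like names. It relies on SQL's three-valued logic, where `NULL IN (...)` evaluates to NULL for a non-empty list and `NOT NULL` is again NULL.

```scala
// Toy expression tree (hypothetical; not Catalyst's classes).
sealed trait Expr
case class Literal(value: Any) extends Expr
case class Attr(name: String) extends Expr
case class In(value: Expr, list: Seq[Expr]) extends Expr
case class Not(child: Expr) extends Expr

// Fold `NULL IN (...)` and `NOT (NULL IN (...))` to Literal(null);
// leave every other expression untouched.
def foldNullIn(e: Expr): Expr = e match {
  case In(Literal(null), list) if list.nonEmpty => Literal(null)
  case Not(child) =>
    foldNullIn(child) match {
      case Literal(null) => Literal(null)
      case other         => Not(other)
    }
  case other => other
}

val list = Seq(Literal(1), Attr("x"))
println(foldNullIn(In(Literal(null), list)))      // folded to a null literal
println(foldNullIn(Not(In(Literal(null), list)))) // NOT IN folds the same way
println(foldNullIn(In(Attr("y"), list)))          // unchanged
```

Folding at planning time means the IN list never has to be evaluated per row when the tested value is a NULL literal.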
[jira] [Resolved] (SPARK-11024) Optimize NULL in by folding it to Literal(null)
[ https://issues.apache.org/jira/browse/SPARK-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-11024. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9348 [https://github.com/apache/spark/pull/9348]
[jira] [Commented] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984009#comment-14984009 ] Jason White commented on SPARK-11437: [~marmbrus] We briefly discussed this at SparkSummitEU this week.
[jira] [Resolved] (SPARK-11431) Allow exploding arrays of structs in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tycho Grouwstra resolved SPARK-11431. Resolution: Implemented
[jira] [Created] (SPARK-11439) Optimization of creating sparse feature without dense one
Kai Sasaki created SPARK-11439:

Summary: Optimization of creating sparse feature without dense one
Key: SPARK-11439
URL: https://issues.apache.org/jira/browse/SPARK-11439
Project: Spark
Issue Type: Improvement
Components: ML
Reporter: Kai Sasaki
Priority: Minor

Currently, generating sparse features in {{LinearDataGenerator}} requires creating dense vectors first. It would be more cost-efficient to generate sparse features without creating dense vectors.
[jira] [Updated] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kai Sasaki updated SPARK-11439:

Description: Currently, generating sparse features in {{LinearDataGenerator}} requires creating dense vectors first. It would be more cost-efficient to avoid generating dense features when creating sparse features. (was: Currently, sparse feature generated in {{LinearDataGenerator}} needs to create dense vectors once. It is cost efficient to prevent generating sparse feature without generating dense vectors.)
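The optimization proposed in SPARK-11439 can be illustrated with a small sketch. It is not the code of `LinearDataGenerator`: `sparseViaDense`, `sparseDirect` and the `sparsity` parameter are hypothetical, and a sparse vector is modeled simply as a pair of index and value arrays.

```scala
import scala.util.Random

// Approach 1 (what the issue wants to avoid): materialize a dense array of
// length n first, then keep only the non-zero entries.
def sparseViaDense(n: Int, sparsity: Double, rnd: Random): (Array[Int], Array[Double]) = {
  val dense = Array.fill(n)(if (rnd.nextDouble() < sparsity) 0.0 else rnd.nextDouble())
  val nonZero = dense.zipWithIndex.filter(_._1 != 0.0)
  (nonZero.map(_._2), nonZero.map(_._1))
}

// Approach 2: choose the active indices first and generate values only for
// those, so no length-n array of doubles is ever allocated.
def sparseDirect(n: Int, sparsity: Double, rnd: Random): (Array[Int], Array[Double]) = {
  val indices = (0 until n).filter(_ => rnd.nextDouble() >= sparsity).toArray
  val values = indices.map(_ => rnd.nextDouble())
  (indices, values)
}

val (idx, vals) = sparseDirect(n = 1000, sparsity = 0.9, rnd = new Random(42))
println(s"${idx.length} active features out of 1000")
```

The two functions draw random numbers in a different order, so they do not produce identical vectors for the same seed; the point is only that the second never builds the dense intermediate.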
[jira] [Resolved] (SPARK-11427) DataFrame's intersect method does not work, returns 1
[ https://issues.apache.org/jira/browse/SPARK-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ram Kandasamy resolved SPARK-11427.

Resolution: Duplicate

> DataFrame's intersect method does not work, returns 1
>
> Key: SPARK-11427
> URL: https://issues.apache.org/jira/browse/SPARK-11427
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Ram Kandasamy
>
> Hello,
> I was working with dataframes and I found the intersect() method seems to always return '1'. The RDD's intersection() method does work properly.
> Consider this example:
> scala> val firstFile = sqlContext.read.parquet("/Users/ramkandasamy/sparkData/2015-07-25/*").select("id").distinct
> firstFile: org.apache.spark.sql.DataFrame = [id: string]
> scala> firstFile.count
> res4: Long = 1072046
> scala> firstFile.intersect(firstFile).count
> res5: Long = 1
> scala> firstFile.rdd.intersection(firstFile.rdd).count
> res6: Long = 1072046
> I have tried various different cases, and for some reason, the dataframe's intersect method always returns 1.
[jira] [Commented] (SPARK-11427) DataFrame's intersect method does not work, returns 1
[ https://issues.apache.org/jira/browse/SPARK-11427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984195#comment-14984195 ] Ram Kandasamy commented on SPARK-11427: It looks like this issue has been resolved in Spark 1.5.1. I will mark this as a duplicate, since it was fixed in https://issues.apache.org/jira/browse/SPARK-10539.
[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService
[ https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984219#comment-14984219 ] Jeffrey Turpin commented on SPARK-6373: Any comments/feedback would be appreciated... https://github.com/turp1twin/spark/commit/fd2980ab8cc1fc5b4626bb7a0d1e94128ca3874d
[jira] [Created] (SPARK-11438) Allow users to define nondeterministic UDFs
Yin Huai created SPARK-11438:

Summary: Allow users to define nondeterministic UDFs
Key: SPARK-11438
URL: https://issues.apache.org/jira/browse/SPARK-11438
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Yin Huai

Right now, all UDFs are treated as deterministic. It would be great if we allowed users to define nondeterministic UDFs.
[jira] [Commented] (SPARK-11438) Allow users to define nondeterministic UDFs
[ https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984223#comment-14984223 ] Apache Spark commented on SPARK-11438: User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/9393
[jira] [Assigned] (SPARK-11438) Allow users to define nondeterministic UDFs
[ https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11438: Assignee: Yin Huai (was: Apache Spark)
[jira] [Assigned] (SPARK-11438) Allow users to define nondeterministic UDFs
[ https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yin Huai reassigned SPARK-11438:
Assignee: Yin Huai
> Allow users to define nondeterministic UDFs
>
> Key: SPARK-11438
> URL: https://issues.apache.org/jira/browse/SPARK-11438
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Yin Huai
> Assignee: Yin Huai
>
> Right now, all UDFs are deterministic. It will be great if we allow users to define nondeterministic UDFs.
[jira] [Assigned] (SPARK-11438) Allow users to define nondeterministic UDFs
[ https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-11438:
Assignee: Apache Spark (was: Yin Huai)
> Allow users to define nondeterministic UDFs
>
> Key: SPARK-11438
> URL: https://issues.apache.org/jira/browse/SPARK-11438
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Yin Huai
> Assignee: Apache Spark
>
> Right now, all UDFs are deterministic. It will be great if we allow users to define nondeterministic UDFs.
[jira] [Resolved] (SPARK-11265) YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster
[ https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcelo Vanzin resolved SPARK-11265.
Resolution: Fixed
Assignee: Steve Loughran
Fix Version/s: 1.6.0
> YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Fix For: 1.6.0
>
> As reported on the dev list, trying to run a YARN client which wants to talk to Hive in a Kerberized hadoop cluster fails.