[jira] [Updated] (SPARK-26879) Inconsistency in default column names for functions like inline and stack
[ https://issues.apache.org/jira/browse/SPARK-26879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jash Gala updated SPARK-26879:
------------------------------
    Description:
In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e. 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed columns).

Ex:
{code|borderStyle=solid}
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
| col0 | col1 |
|1 | 2|
|3 | null |

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
| col1 | col2 |
|1 | a|
|2 | b|
{/code}

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing) for these and other similar functions.

  was:
In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e. 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed columns).

Ex:
```
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
| col0 | col1 |
|1 | 2|
|3 | null |

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
| col1 | col2 |
|1 | a|
|2 | b|
```

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing) for these and other similar functions.

> Inconsistency in default column names for functions like inline and stack
> -------------------------------------------------------------------------
>
>                 Key: SPARK-26879
>                 URL: https://issues.apache.org/jira/browse/SPARK-26879
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Jash Gala
>            Priority: Minor
>
> In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e.
> 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed
> columns).
> Ex:
> {code|borderStyle=solid}
> scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
> | col0 | col1 |
> |1 | 2|
> |3 | null |
> scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2,
> 'b')))").show
> | col1 | col2 |
> |1 | a|
> |2 | b|
> {/code}
> This feels like an issue with consistency. As discussed on [PR
> #23748|https://github.com/apache/spark/pull/23748], it might be a good idea
> to standardize this to something specific (like zero-based indexing) for
> these and other similar functions.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26879) Inconsistency in default column names for functions like inline and stack
[ https://issues.apache.org/jira/browse/SPARK-26879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jash Gala updated SPARK-26879:
------------------------------
    Description:
In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e. 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed columns).

{code:title=spark-shell|borderStyle=solid}
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
+----+----+
|col0|col1|
+----+----+
|   1|   2|
|   3|null|
+----+----+

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
+----+----+
|col1|col2|
+----+----+
|   1|   a|
|   2|   b|
+----+----+
{code}

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing) for these and other similar functions.

  was:
In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e. 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed columns).

{code:title=spark-shell|borderStyle=solid}
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
| col0 | col1 |
|1 | 2|
|3 | null |

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
| col1 | col2 |
|1 | a|
|2 | b|
{code}

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing) for these and other similar functions.

> Inconsistency in default column names for functions like inline and stack
> -------------------------------------------------------------------------
>
>                 Key: SPARK-26879
>                 URL: https://issues.apache.org/jira/browse/SPARK-26879
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Jash Gala
>            Priority: Minor
>
> In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e.
> 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed
> columns).
> {code:title=spark-shell|borderStyle=solid}
> scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
> +----+----+
> |col0|col1|
> +----+----+
> |   1|   2|
> |   3|null|
> +----+----+
> scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2,
> 'b')))").show
> +----+----+
> |col1|col2|
> +----+----+
> |   1|   a|
> |   2|   b|
> +----+----+
> {code}
> This feels like an issue with consistency. As discussed on [PR
> #23748|https://github.com/apache/spark/pull/23748], it might be a good idea
> to standardize this to something specific (like zero-based indexing) for
> these and other similar functions.
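Until the default names are standardized, one way to avoid depending on them is to name the generated columns explicitly. The following is only a sketch under stated assumptions: the aliases `num_a`, `num_b`, `id` and `label` are made up for illustration, and the local `SparkSession` stands in for the `spark` value that spark-shell already provides.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; inside spark-shell,
// getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("generator-aliases")
  .getOrCreate()

// Name stack's output columns in SQL instead of relying on col0, col1.
// num_a and num_b are hypothetical names chosen for this example.
val stacked = spark.sql("SELECT stack(2, 1, 2, 3) AS (num_a, num_b)")

// Rename inline_outer's output columns positionally instead of relying
// on col1, col2. id and label are hypothetical names as well.
val inlined = spark
  .sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))")
  .toDF("id", "label")

println(stacked.columns.mkString(", "))
println(inlined.columns.mkString(", "))
```

With explicit names in place, downstream code refers to `num_a` or `id` and is unaffected by whether a given function numbers its default columns from 0 or from 1.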
[jira] [Updated] (SPARK-26879) Inconsistency in default column names for functions like inline and stack
[ https://issues.apache.org/jira/browse/SPARK-26879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jash Gala updated SPARK-26879:
------------------------------
    Description:
In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e. 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed columns).

Ex:
```
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
| col0 | col1 |
|1 | 2|
|3 | null |

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
| col1 | col2 |
|1 | a|
|2 | b|
```

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing).

  was:
While looking at the default column names used by inline and stack, I found that inline uses col1, col2, etc. (i.e. 1-indexed columns), while stack uses col0, col1, col2, etc. (i.e. 0-indexed columns).

Ex:
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
| col0 | col1 |
|1 | 2|
|3 | null |

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
| col1 | col2 |
|1 | a|
|2 | b|

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing).

> Inconsistency in default column names for functions like inline and stack
> -------------------------------------------------------------------------
>
>                 Key: SPARK-26879
>                 URL: https://issues.apache.org/jira/browse/SPARK-26879
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Jash Gala
>            Priority: Minor
>
> In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e.
> 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed
> columns).
> Ex:
> ```
> scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
> | col0 | col1 |
> |1 | 2|
> |3 | null |
> scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2,
> 'b')))").show
> | col1 | col2 |
> |1 | a|
> |2 | b|
> ```
> This feels like an issue with consistency. As discussed on [PR
> #23748|https://github.com/apache/spark/pull/23748], it might be a good idea
> to standardize this to something specific (like zero-based indexing).
[jira] [Updated] (SPARK-26879) Inconsistency in default column names for functions like inline and stack
[ https://issues.apache.org/jira/browse/SPARK-26879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jash Gala updated SPARK-26879:
------------------------------
    Description:
In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e. 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed columns).

{code:title=spark-shell|borderStyle=solid}
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
| col0 | col1 |
|1 | 2|
|3 | null |

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
| col1 | col2 |
|1 | a|
|2 | b|
{code}

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing) for these and other similar functions.

  was:
In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e. 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed columns).

Ex:
{code|borderStyle=solid}
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
| col0 | col1 |
|1 | 2|
|3 | null |

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
| col1 | col2 |
|1 | a|
|2 | b|
{/code}

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing) for these and other similar functions.

> Inconsistency in default column names for functions like inline and stack
> -------------------------------------------------------------------------
>
>                 Key: SPARK-26879
>                 URL: https://issues.apache.org/jira/browse/SPARK-26879
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Jash Gala
>            Priority: Minor
>
> In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e.
> 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed
> columns).
> {code:title=spark-shell|borderStyle=solid}
> scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
> | col0 | col1 |
> |1 | 2|
> |3 | null |
> scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2,
> 'b')))").show
> | col1 | col2 |
> |1 | a|
> |2 | b|
> {code}
> This feels like an issue with consistency. As discussed on [PR
> #23748|https://github.com/apache/spark/pull/23748], it might be a good idea
> to standardize this to something specific (like zero-based indexing) for
> these and other similar functions.
[jira] [Updated] (SPARK-26879) Inconsistency in default column names for functions like inline and stack
[ https://issues.apache.org/jira/browse/SPARK-26879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jash Gala updated SPARK-26879:
------------------------------
    Description:
In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e. 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed columns).

Ex:
```
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
| col0 | col1 |
|1 | 2|
|3 | null |

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
| col1 | col2 |
|1 | a|
|2 | b|
```

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing) for these and other similar functions.

  was:
In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e. 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed columns).

Ex:
```
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
| col0 | col1 |
|1 | 2|
|3 | null |

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
| col1 | col2 |
|1 | a|
|2 | b|
```

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing).

> Inconsistency in default column names for functions like inline and stack
> -------------------------------------------------------------------------
>
>                 Key: SPARK-26879
>                 URL: https://issues.apache.org/jira/browse/SPARK-26879
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Jash Gala
>            Priority: Minor
>
> In the Spark SQL functions definitions, `inline` uses col1, col2, etc. (i.e.
> 1-indexed columns), while `stack` uses col0, col1, col2, etc. (i.e. 0-indexed
> columns).
> Ex:
> ```
> scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
> | col0 | col1 |
> |1 | 2|
> |3 | null |
> scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2,
> 'b')))").show
> | col1 | col2 |
> |1 | a|
> |2 | b|
> ```
> This feels like an issue with consistency. As discussed on [PR
> #23748|https://github.com/apache/spark/pull/23748], it might be a good idea
> to standardize this to something specific (like zero-based indexing) for
> these and other similar functions.
[jira] [Updated] (SPARK-26879) Inconsistency in default column names for functions like inline and stack
[ https://issues.apache.org/jira/browse/SPARK-26879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jash Gala updated SPARK-26879:
------------------------------
    Description:
While looking at the default column names used by inline and stack, I found that inline uses col1, col2, etc. (i.e. 1-indexed columns), while stack uses col0, col1, col2, etc. (i.e. 0-indexed columns).

Ex:
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
| col0 | col1 |
|1 | 2|
|3 | null |

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
| col1 | col2 |
|1 | a|
|2 | b|

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing).

  was:
While looking at the default column names used by inline and stack, I found that inline uses col1, col2, etc. (i.e. 1-indexed columns), while stack uses col0, col1, col2, etc. (i.e. 0-indexed columns).

Ex:
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
|--+--|
| col0 | col1 |
|--+--|
|1 | 2|
|3 | null |
|--+--|

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
|--+--|
| col1 | col2 |
|--+--|
|1 | a|
|2 | b|
|--+--|

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing).

> Inconsistency in default column names for functions like inline and stack
> -------------------------------------------------------------------------
>
>                 Key: SPARK-26879
>                 URL: https://issues.apache.org/jira/browse/SPARK-26879
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Jash Gala
>            Priority: Minor
>
> While looking at the default column names used by inline and stack, I found
> that inline uses col1, col2, etc. (i.e. 1-indexed columns), while stack uses
> col0, col1, col2, etc. (i.e. 0-indexed columns).
> Ex:
> scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
> | col0 | col1 |
> |1 | 2|
> |3 | null |
> scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2,
> 'b')))").show
> | col1 | col2 |
> |1 | a|
> |2 | b|
> This feels like an issue with consistency. As discussed on [PR
> #23748|https://github.com/apache/spark/pull/23748], it might be a good idea
> to standardize this to something specific (like zero-based indexing).
[jira] [Created] (SPARK-26879) Inconsistency in default column names for functions like inline and stack
Jash Gala created SPARK-26879:
---------------------------------

             Summary: Inconsistency in default column names for functions like inline and stack
                 Key: SPARK-26879
                 URL: https://issues.apache.org/jira/browse/SPARK-26879
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Jash Gala

While looking at the default column names used by inline and stack, I found that inline uses col1, col2, etc. (i.e. 1-indexed columns), while stack uses col0, col1, col2, etc. (i.e. 0-indexed columns).

{{
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
|--+--|
| col0 | col1 |
|--+--|
|1 | 2|
|3 | null |
|--+--|

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
|--+--|
| col1 | col2 |
|--+--|
|1 | a|
|2 | b|
|--+--|
}}

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing).
[jira] [Updated] (SPARK-26879) Inconsistency in default column names for functions like inline and stack
[ https://issues.apache.org/jira/browse/SPARK-26879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jash Gala updated SPARK-26879:
------------------------------
    Description:
While looking at the default column names used by inline and stack, I found that inline uses col1, col2, etc. (i.e. 1-indexed columns), while stack uses col0, col1, col2, etc. (i.e. 0-indexed columns).

Ex:
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
|--+--|
| col0 | col1 |
|--+--|
|1 | 2|
|3 | null |
|--+--|

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
|--+--|
| col1 | col2 |
|--+--|
|1 | a|
|2 | b|
|--+--|

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing).

  was:
While looking at the default column names used by inline and stack, I found that inline uses col1, col2, etc. (i.e. 1-indexed columns), while stack uses col0, col1, col2, etc. (i.e. 0-indexed columns).

{{
scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
|--+--|
| col0 | col1 |
|--+--|
|1 | 2|
|3 | null |
|--+--|

scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2, 'b')))").show
|--+--|
| col1 | col2 |
|--+--|
|1 | a|
|2 | b|
|--+--|
}}

This feels like an issue with consistency. As discussed on [PR #23748|https://github.com/apache/spark/pull/23748], it might be a good idea to standardize this to something specific (like zero-based indexing).

> Inconsistency in default column names for functions like inline and stack
> -------------------------------------------------------------------------
>
>                 Key: SPARK-26879
>                 URL: https://issues.apache.org/jira/browse/SPARK-26879
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Jash Gala
>            Priority: Minor
>
> While looking at the default column names used by inline and stack, I found
> that inline uses col1, col2, etc. (i.e. 1-indexed columns), while stack uses
> col0, col1, col2, etc. (i.e. 0-indexed columns).
> Ex:
> scala> spark.sql("SELECT stack(2, 1, 2, 3)").show
> |--+--|
> | col0 | col1 |
> |--+--|
> |1 | 2|
> |3 | null |
> |--+--|
> scala> spark.sql("SELECT inline_outer(array(struct(1, 'a'), struct(2,
> 'b')))").show
> |--+--|
> | col1 | col2 |
> |--+--|
> |1 | a|
> |2 | b|
> |--+--|
> This feels like an issue with consistency. As discussed on [PR
> #23748|https://github.com/apache/spark/pull/23748], it might be a good idea
> to standardize this to something specific (like zero-based indexing).
[jira] [Comment Edited] (SPARK-23619) Document the column names created by explode and posexplode functions
[ https://issues.apache.org/jira/browse/SPARK-23619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763564#comment-16763564 ]

Jash Gala edited comment on SPARK-23619 at 2/8/19 2:14 PM:
-----------------------------------------------------------

I've fixed this and raised a PR: https://github.com/apache/spark/pull/23748

was (Author: jashgala):
I'll fix this and raise a PR

> Document the column names created by explode and posexplode functions
> ---------------------------------------------------------------------
>
>                 Key: SPARK-23619
>                 URL: https://issues.apache.org/jira/browse/SPARK-23619
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.3.0
>            Reporter: Joe Pallas
>            Priority: Minor
>              Labels: documentation
>
> The documentation for {{explode}} and {{posexplode}} neglects to mention the
> default column names for the new columns: {{col}} and {{pos}}.
[jira] [Commented] (SPARK-23619) Document the column names created by explode and posexplode functions
[ https://issues.apache.org/jira/browse/SPARK-23619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763564#comment-16763564 ]

Jash Gala commented on SPARK-23619:
-----------------------------------

I'll fix this and raise a PR

> Document the column names created by explode and posexplode functions
> ---------------------------------------------------------------------
>
>                 Key: SPARK-23619
>                 URL: https://issues.apache.org/jira/browse/SPARK-23619
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.3.0
>            Reporter: Joe Pallas
>            Priority: Minor
>              Labels: documentation
>
> The documentation for {{explode}} and {{posexplode}} neglects to mention the
> default column names for the new columns: {{col}} and {{pos}}.
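The default names mentioned in the issue can be checked directly from the result schema. This is a rough sketch, not taken from the PR: the local `SparkSession` is an assumption standing in for spark-shell's `spark`, and the aliases `idx` and `item` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; inside spark-shell,
// getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("explode-default-names")
  .getOrCreate()

// explode over an array produces a single column with the default name col.
val exploded = spark.sql("SELECT explode(array(10, 20))")

// posexplode over an array adds the element position as pos, before col.
val posExploded = spark.sql("SELECT posexplode(array(10, 20))")

// Explicit aliases replace both defaults; idx and item are hypothetical names.
val named = spark.sql("SELECT posexplode(array(10, 20)) AS (idx, item)")

println(exploded.columns.mkString(", "))
println(posExploded.columns.mkString(", "))
println(named.columns.mkString(", "))
```

Reading `columns` rather than calling `show` keeps the check independent of table formatting.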
[jira] [Commented] (SPARK-10892) Join with Data Frame returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763001#comment-16763001 ]

Jash Gala commented on SPARK-10892:
-----------------------------------

This issue is still reproducible in Spark 2.4.0.

> Join with Data Frame returns wrong results
> ------------------------------------------
>
>                 Key: SPARK-10892
>                 URL: https://issues.apache.org/jira/browse/SPARK-10892
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.0
>            Reporter: Ofer Mendelevitch
>            Priority: Critical
>         Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
> .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> +--------+---+------+-----+-----------+-----+----+
> |date_str|day|metric|month|    station|value|year|
> +--------+---+------+-----+-----------+-----+----+
> |20121001|  1|  PRCP|   10|USW00023272|    0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|    0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|    0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|    0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|    0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|    0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|    0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|    0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|    0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|    0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|    3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|    0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|    0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|    0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|    0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|    0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|    0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|    0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|    0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|    0|2012|
> +--------+---+------+-----+-----------+-----+----+
> +--------+---+------+-----+-----------+-----+----+
> |date_str|day|metric|month|    station|value|year|
> +--------+---+------+-----+-----------+-----+----+
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> +--------+---+------+-----+-----------+-----+----+
> +--------+---+------+-----+-----------+-----+----+
> |date_str|day|metric|month|    station|value|year|
> +--------+---+------+-----+-----------+-----+----+
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  TMAX|   10|USW00023272|  194|2012|
> |20121006|  6|  TMAX|