[jira] [Updated] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-08-19 Thread Michael Armbrust (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6489:

Target Version/s: 1.5.0  (was: 1.6.0)

> Optimize lateral view with explode to not read unnecessary columns
> --
>
> Key: SPARK-6489
> URL: https://issues.apache.org/jira/browse/SPARK-6489
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Konstantin Shaposhnikov
>Assignee: Wenchen Fan
>  Labels: starter
> Fix For: 1.5.0
>
>
> Currently a query with "lateral view explode(...)" results in an execution 
> plan that reads all columns of the underlying RDD.
> E.g. given that the *ppl* table is a DataFrame created from the Person case class:
> {code}
> case class Person(val name: String, val age: Int, val data: Array[Int])
> {code}
> the following SQL:
> {code}
> select name, sum(d) from ppl lateral view explode(data) d as d group by name
> {code}
> executes as follows:
> {noformat}
> == Physical Plan ==
> Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L]
>  Exchange (HashPartitioning [name#0], 200)
>   Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS PartialSum#38L]
>    Project [name#0,d#21]
>     Generate explode(data#2), true, false
>      InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35), Some(ppl))
> {noformat}
> Note that the *age* column is not needed to produce the output, but it is still read from the underlying RDD.
> A sample program to demonstrate the issue:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.hive.HiveContext
>
> case class Person(val name: String, val age: Int, val data: Array[Int])
>
> object ExplodeDemo extends App {
>   val ppl = Array(
>     Person("A", 20, Array(10, 12, 19)),
>     Person("B", 25, Array(7, 8, 4)),
>     Person("C", 19, Array(12, 4, 232)))
>
>   val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
>   val sc = new SparkContext(conf)
>   val sqlCtx = new HiveContext(sc)
>   import sqlCtx.implicits._
>
>   val df = sc.makeRDD(ppl).toDF
>   df.registerTempTable("ppl")
>   // cache the table, otherwise ExistingRDD (which does not support column pruning) is used
>   sqlCtx.cacheTable("ppl")
>
>   val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
>   s.explain(true)
> }
> {code}
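A possible workaround (not part of the original report): pre-project only the columns the query needs in a subquery, so the generator sits above an explicit projection instead of the full table. Whether that inner projection is actually pushed down into the cached scan on 1.3.0 is an assumption that should be verified against the printed plan; "pruned" is just a hypothetical subquery alias, and the snippet reuses the sqlCtx from the sample above.

{code}
// Hedged sketch: rewrite the query so that only name and data reach the lateral view.
val workaround = sqlCtx.sql(
  """select name, sum(d)
    |from (select name, data from ppl) pruned
    |lateral view explode(data) d as d
    |group by name""".stripMargin)
workaround.explain(true) // check whether the scan still lists age#1
{code}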



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-08-03 Thread Michael Armbrust (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6489:

Target Version/s: 1.6.0  (was: 1.5.0)




[jira] [Updated] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-06-22 Thread Michael Armbrust (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6489:

Shepherd:   (was: Michael Armbrust)




[jira] [Updated] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-06-13 Thread Michael Armbrust (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6489:

Shepherd: Michael Armbrust




[jira] [Updated] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-05-27 Thread Yin Huai (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-6489:

Target Version/s: 1.5.0  (was: 1.4.0)




[jira] [Updated] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-03-24 Thread Michael Armbrust (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6489:

Labels: starter  (was: )




[jira] [Updated] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-03-24 Thread Michael Armbrust (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6489:

Target Version/s: 1.4.0




[jira] [Updated] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-03-23 Thread Konstantin Shaposhnikov (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shaposhnikov updated SPARK-6489:
---
Description: 
Currently a query with "lateral view explode(...)" results in an execution plan 
that reads all columns of the underlying RDD.

E.g. given that the *ppl* table is a DataFrame created from the Person case class:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])
{code}
the following SQL:
{code}
select name, sum(d) from ppl lateral view explode(data) d as d group by name
{code}
executes as follows:
{noformat}
== Physical Plan ==
Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L]
 Exchange (HashPartitioning [name#0], 200)
  Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS PartialSum#38L]
   Project [name#0,d#21]
    Generate explode(data#2), true, false
     InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35), Some(ppl))
{noformat}

Note that the *age* column is not needed to produce the output, but it is still read from the underlying RDD.

A sample program to demonstrate the issue:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

case class Person(val name: String, val age: Int, val data: Array[Int])

object ExplodeDemo extends App {
  val ppl = Array(
    Person("A", 20, Array(10, 12, 19)),
    Person("B", 25, Array(7, 8, 4)),
    Person("C", 19, Array(12, 4, 232)))

  val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
  val sc = new SparkContext(conf)
  val sqlCtx = new HiveContext(sc)
  import sqlCtx.implicits._

  val df = sc.makeRDD(ppl).toDF
  df.registerTempTable("ppl")
  // cache the table, otherwise ExistingRDD (which does not support column pruning) is used
  sqlCtx.cacheTable("ppl")

  val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
  s.explain(true)
}
{code}

  was:
Currently a query with "lateral view explode(...)" results in an execution plan 
that reads all columns of the underlying RDD.

E.g. given that the *ppl* table is a DataFrame created from the Person case class:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])
{code}
the following SQL:
{code}
select name, sum(d) from ppl lateral view explode(data) d as d group by name
{code}
executes as follows:
{noformat}
== Physical Plan ==
Aggregate false, [name#0], [name#0,SUM(PartialSum#8L) AS _c1#3L]
 Exchange (HashPartitioning [name#0], 200)
  Aggregate true, [name#0], [name#0,SUM(CAST(d#6, LongType)) AS PartialSum#8L]
   Project [name#0,d#6]
    Generate explode(data#2), true, false
     PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
{noformat}

Note that the *age* column is not needed to produce the output, but it is still read from the underlying RDD.

A sample program to demonstrate the issue:
{code}
case class Person(val name: String, val age: Int, val data: Array[Int])

object ExplodeDemo extends App {
  val ppl = Array(
    Person("A", 20, Array(10, 12, 19)),
    Person("B", 25, Array(7, 8, 4)),
    Person("C", 19, Array(12, 4, 232)))

  val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
  val sc = new SparkContext(conf)
  val sqlCtx = new HiveContext(sc)
  import sqlCtx.implicits._
  val df = sc.makeRDD(ppl).toDF
  df.registerTempTable("ppl")
  val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
  s.explain(true)
}
{code}
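For comparison with the updated description above: the only functional change in the repro is the explicit caching step, which is what makes the scan an InMemoryColumnarTableScan over an InMemoryRelation instead of a plain PhysicalRDD over the existing RDD. A minimal sketch of the added line (same sqlCtx as in the program above):

{code}
// Without this call the plan scans the existing RDD directly (PhysicalRDD),
// which does not go through the in-memory columnar scan shown in the updated plan.
sqlCtx.cacheTable("ppl")
{code}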

