[jira] [Updated] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException

2016-03-19 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-13932:
-
Affects Version/s: 2.0.0

> CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
> --
>
> Key: SPARK-13932
> URL: https://issues.apache.org/jira/browse/SPARK-13932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Tien-Dung LE
>
> A complex aggregate query that uses a condition inside an aggregate function
> together with a GROUP BY ... HAVING clause raises an exception. This issue
> only happens in Spark 1.6.x, not in Spark 1.5.x.
> Here is a typical error message: {code}
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
> b#55, b#124.; line 1 pos 178
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
> {code}
> Here is a code snippet to reproduce the error in a spark-shell session:
> {code}
> import sqlContext.implicits._
> case class Toto(  a: String = f"${(math.random*1e6).toLong}%06.0f",
>   b: Int = (math.random*1e3).toInt,
>   n: Int = (math.random*1e3).toInt,
>   m: Double = (math.random*1e3))
> val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
> val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
> df.registerTempTable( "toto" )
> val sqlSelect1   = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, GROUPING__ID"
> val sqlSelect2   = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, SUM(m) AS k3, GROUPING__ID"
> val sqlGroupBy  = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
> val sqlHaving   = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"
> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> {code}
> And here is the full log
> {code}
> scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
> res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
> bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
> res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
> bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
> b#55, b#124.; line 1 pos 178
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.

[jira] [Commented] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException

2016-03-19 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199683#comment-15199683
 ] 

Tien-Dung LE commented on SPARK-13932:
--

Hi Xiao, not yet! I have only tried with Spark versions 1.6.1 and 1.6.0.

> CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
> --
>
> Key: SPARK-13932
> URL: https://issues.apache.org/jira/browse/SPARK-13932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Tien-Dung LE
>
> A complex aggregate query that uses a condition inside an aggregate function
> together with a GROUP BY ... HAVING clause raises an exception. This issue
> only happens in Spark 1.6.x, not in Spark 1.5.x.
> Here is a typical error message: {code}
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
> b#55, b#124.; line 1 pos 178
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
> {code}
> Here is a code snippet to reproduce the error in a spark-shell session:
> {code}
> import sqlContext.implicits._
> case class Toto(  a: String = f"${(math.random*1e6).toLong}%06.0f",
>   b: Int = (math.random*1e3).toInt,
>   n: Int = (math.random*1e3).toInt,
>   m: Double = (math.random*1e3))
> val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
> val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
> df.registerTempTable( "toto" )
> val sqlSelect1   = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, GROUPING__ID"
> val sqlSelect2   = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, SUM(m) AS k3, GROUPING__ID"
> val sqlGroupBy  = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
> val sqlHaving   = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"
> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> {code}
> And here is the full log
> {code}
> scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
> res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
> bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
> res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
> bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
> b#55, b#124.; line 1 pos 178
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$clas

[jira] [Commented] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException

2016-03-19 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199689#comment-15199689
 ] 

Tien-Dung LE commented on SPARK-13932:
--

The error is still there in the latest Spark code, version 2.0.0-SNAPSHOT.

> CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
> --
>
> Key: SPARK-13932
> URL: https://issues.apache.org/jira/browse/SPARK-13932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Tien-Dung LE
>
> A complex aggregate query that uses a condition inside an aggregate function
> together with a GROUP BY ... HAVING clause raises an exception. This issue
> only happens in Spark 1.6.x, not in Spark 1.5.x.
> Here is a typical error message: {code}
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
> b#55, b#124.; line 1 pos 178
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
> {code}
> Here is a code snippet to reproduce the error in a spark-shell session:
> {code}
> import sqlContext.implicits._
> case class Toto(  a: String = f"${(math.random*1e6).toLong}%06.0f",
>   b: Int = (math.random*1e3).toInt,
>   n: Int = (math.random*1e3).toInt,
>   m: Double = (math.random*1e3))
> val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
> val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
> df.registerTempTable( "toto" )
> val sqlSelect1   = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, GROUPING__ID"
> val sqlSelect2   = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, SUM(m) AS k3, GROUPING__ID"
> val sqlGroupBy  = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
> val sqlHaving   = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"
> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> {code}
> And here is the full log
> {code}
> scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
> res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
> bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
> res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
> bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
> b#55, b#124.; line 1 pos 178
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.Travers

[jira] [Updated] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException

2016-03-16 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-13932:
-
Description: 
A complex aggregate query that uses a condition inside an aggregate function
together with a GROUP BY ... HAVING clause raises an exception. This issue only
happens in Spark 1.6.x, not in Spark 1.5.x.

Here is a typical error message: {code}
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
b#55, b#124.; line 1 pos 178
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
{code}

Here is a code snippet to reproduce the error in a spark-shell session:
{code}
import sqlContext.implicits._

case class Toto(  a: String = f"${(math.random*1e6).toLong}%06.0f",
  b: Int = (math.random*1e3).toInt,
  n: Int = (math.random*1e3).toInt,
  m: Double = (math.random*1e3))

val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )

df.registerTempTable( "toto" )

val sqlSelect1   = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, GROUPING__ID"
val sqlSelect2   = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, SUM(m) AS k3, GROUPING__ID"

val sqlGroupBy  = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlHaving   = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"

sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK

sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
{code}

And here is the full log
{code}
scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
bigint, k3: double, GROUPING__ID: int]

scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
bigint, k3: double, GROUPING__ID: int]

scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
b#55, b#124.; line 1 pos 178
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:316)
at 
org.apache.spark.s

[jira] [Updated] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException

2016-03-16 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-13932:
-
Description: 
A complex aggregate query that uses a condition inside an aggregate function
together with a GROUP BY ... HAVING clause raises an exception. Here is a
typical error message: {code}
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
b#55, b#124.; line 1 pos 178
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
{code}

Here is a code snippet to reproduce the error in a spark-shell session:
{code}
import sqlContext.implicits._

case class Toto(  a: String = f"${(math.random*1e6).toLong}%06.0f",
  b: Int = (math.random*1e3).toInt,
  n: Int = (math.random*1e3).toInt,
  m: Double = (math.random*1e3))

val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )

df.registerTempTable( "toto" )

val sqlSelect1   = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, GROUPING__ID"
val sqlSelect2   = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, SUM(m) AS k3, GROUPING__ID"

val sqlGroupBy  = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlHaving   = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"

sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK

sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
{code}

Here is the full log
{code}
scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
bigint, k3: double, GROUPING__ID: int]

scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
bigint, k3: double, GROUPING__ID: int]

scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
b#55, b#124.; line 1 pos 178
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:316)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
at 
or

[jira] [Created] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException

2016-03-16 Thread Tien-Dung LE (JIRA)
Tien-Dung LE created SPARK-13932:


 Summary: CUBE Query with filter (HAVING) and condition (IF) raises 
an AnalysisException
 Key: SPARK-13932
 URL: https://issues.apache.org/jira/browse/SPARK-13932
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1, 1.6.0
Reporter: Tien-Dung LE


A complex aggregate query that uses a condition inside an aggregate function
together with a GROUP BY ... HAVING clause raises an exception. Here is a
typical error message: {code}
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
b#55, b#124.; line 1 pos 178
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
{code}

Here is a code snippet to reproduce the error in a spark-shell session:
{code}
import sqlContext.implicits._

case class Toto(  a: String = f"${(math.random*1e6).toLong}%06.0f",
  b: Int = (math.random*1e3).toInt,
  n: Int = (math.random*1e3).toInt,
  m: Double = (math.random*1e3))

val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )

df.registerTempTable( "toto" )

val sqlSelect1   = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, GROUPING__ID"
val sqlSelect2   = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, SUM(m) AS k3, GROUPING__ID"

val sqlGroupBy  = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlHaving   = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"

sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK

sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
{code}
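
A possible workaround, as an untested sketch that reuses sqlGroupBy from the snippet above: run the grouping-sets aggregation in a sub-query with GROUPING__ID aliased, and apply the former HAVING predicates as an outer WHERE, so that b and the grouping id are resolved only against the sub-query output.
{code}
// Untested workaround sketch: alias GROUPING__ID and filter in an outer query
// instead of HAVING (assumes the definitions from the snippet above).
val sqlInner = s"SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, SUM(m) AS k3, GROUPING__ID AS gid $sqlGroupBy"
sqlContext.sql( s"SELECT * FROM ( $sqlInner ) t WHERE ((t.gid & 1) = 1) AND (t.b > 500)" )
{code}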







[jira] [Updated] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2016-01-15 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-12837:
-
Description: 
Executing a SQL statement with a large number of partitions requires a large
amount of driver memory, even when there are no requests to collect data back
to the driver.

Here are the steps to reproduce the issue.
1. Start spark-shell with a spark.driver.maxResultSize setting
{code:java}
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
{code}
2. Execute the code 
{code:java}
case class Toto( a: Int, b: Int)
val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF

sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK

sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( "toto2" ) // ERROR
{code}

The error message is 
{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
spark.driver.maxResultSize (1024.0 KB)
{code}


  was:
Executing a SQL statement with a large number of partitions requires a large
amount of driver memory, even when there are no requests to collect data back
to the driver.

Here are the steps to reproduce the issue.
1. Start spark-shell with a spark.driver.maxResultSize setting
{code:shell}
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
{code}
2. Execute the code 
{code:scala}
case class Toto( a: Int, b: Int)
val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF

sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK

sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( "toto2" ) // ERROR
{code}

The error message is 
{code:scala}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
spark.driver.maxResultSize (1024.0 KB)
{code}



> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>
> Executing a SQL statement with a large number of partitions requires a large
> amount of driver memory, even when there are no requests to collect data back
> to the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark-shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}






[jira] [Created] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2016-01-15 Thread Tien-Dung LE (JIRA)
Tien-Dung LE created SPARK-12837:


 Summary: Spark driver requires large memory space for serialized 
results even there are no data collected to the driver
 Key: SPARK-12837
 URL: https://issues.apache.org/jira/browse/SPARK-12837
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 1.6.0, 1.5.2
Reporter: Tien-Dung LE


Executing a SQL statement with a large number of partitions requires a large
amount of driver memory, even when there are no requests to collect data back
to the driver.

Here are the steps to reproduce the issue.
1. Start spark-shell with a spark.driver.maxResultSize setting
{code:shell}
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
{code}
2. Execute the code 
{code:scala}
case class Toto( a: Int, b: Int)
val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF

sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK

sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( "toto2" ) // ERROR
{code}

The error message is 
{code:scala}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
spark.driver.maxResultSize (1024.0 KB)
{code}
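
The limit that triggers this error is configurable. As a stopgap only (it does not address why the serialized task results grow with the partition count), the driver can be launched with a larger value, or 0 for unlimited; an untested sketch based on the launch command above:
{code}
# Sketch only: same launch command as above, but with the result-size limit disabled
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=0
{code}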







[jira] [Commented] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-10-12 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953158#comment-14953158
 ] 

Tien-Dung LE commented on SPARK-9280:
-

[~davies] I have just checked out the latest Spark master code. SQLContext
(instead of HiveContext) now returns the correct number of partitions. Many
thanks for this fix.

> New HiveContext object unexpectedly loads configuration settings from history 
> --
>
> Key: SPARK-9280
> URL: https://issues.apache.org/jira/browse/SPARK-9280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Tien-Dung LE
>
> In a Spark session, stopping a Spark context and creating a new Spark context
> and Hive context does not clean the Spark SQL configuration. More precisely,
> the new Hive context still keeps the previous configuration settings. It
> would be great if someone could let us know how to avoid this situation.
> {code:title=New hive context should not load the configurations from history}
> sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
> sc.stop
> val sparkConf2 = new org.apache.spark.SparkConf()
> val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
> val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
> // got 20 as expected
> sqlContext2.setConf( "foo", "foo") 
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
> // expected 30 but got 10
> {code}






[jira] [Commented] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-27 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642840#comment-14642840
 ] 

Tien-Dung LE commented on SPARK-9280:
-

Thanks [~pborck]. Detaching the Hive session state with
org.apache.hadoop.hive.ql.session.SessionState.detachSession() helps avoid
the issue.
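
For reference, a minimal, untested sketch of that workaround applied to the scenario from the issue description (detachSession() drops the thread-local Hive session state left over from the old context):
{code}
sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
sc.stop

// Workaround sketch: detach the old Hive session state before building the new context
org.apache.hadoop.hive.ql.session.SessionState.detachSession()

val sc2 = new org.apache.spark.SparkContext( new org.apache.spark.SparkConf() )
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )
sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expect the fallback 30, not the stale 10
{code}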

> New HiveContext object unexpectedly loads configuration settings from history 
> --
>
> Key: SPARK-9280
> URL: https://issues.apache.org/jira/browse/SPARK-9280
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Tien-Dung LE
>
> In a Spark session, stopping a Spark context and creating a new Spark context
> and Hive context does not clean the Spark SQL configuration. More precisely,
> the new Hive context still keeps the previous configuration settings. It
> would be great if someone could let us know how to avoid this situation.
> {code:title=New hive context should not load the configurations from history}
> sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
> sc.stop
> val sparkConf2 = new org.apache.spark.SparkConf()
> val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
> val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
> // got 20 as expected
> sqlContext2.setConf( "foo", "foo") 
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
> // expected 30 but got 10
> {code}






[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-27 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-9280:

Description: 
In a Spark session, stopping a Spark context and creating a new Spark context
and Hive context does not clean the Spark SQL configuration. More precisely,
the new Hive context still keeps the previous configuration settings. It would
be great if someone could let us know how to avoid this situation.

{code:title=New hive context should not load the configurations from history}

sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
// got 20 as expected

sqlContext2.setConf( "foo", "foo") 

sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
// expected 30 but got 10

{code}

  was:
In a spark-shell session, stopping a Spark context and creating a new Spark
context and Hive context does not clean the Spark SQL configuration. More
precisely, the new Hive context still keeps the previous configuration
settings. It would be great if someone could let us know how to avoid this
situation.

{code:title=New hive context should not load the configurations from history}

sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
// got 20 as expected

sqlContext2.setConf( "foo", "foo") 

sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
// expected 30 but got 10

{code}


> New HiveContext object unexpectedly loads configuration settings from history 
> --
>
> Key: SPARK-9280
> URL: https://issues.apache.org/jira/browse/SPARK-9280
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Tien-Dung LE
>
> In a Spark session, stopping a Spark context and creating a new Spark context
> and Hive context does not clean the Spark SQL configuration. More precisely,
> the new Hive context still keeps the previous configuration settings. It
> would be great if someone could let us know how to avoid this situation.
> {code:title=New hive context should not load the configurations from history}
> sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
> sc.stop
> val sparkConf2 = new org.apache.spark.SparkConf()
> val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
> val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
> // got 20 as expected
> sqlContext2.setConf( "foo", "foo") 
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
> // expected 30 but got 10
> {code}






[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-27 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-9280:

Description: 
In a spark-shell session, stopping a Spark context and creating a new Spark
context and Hive context does not clean the Spark SQL configuration. More
precisely, the new Hive context still keeps the previous configuration
settings. It would be great if someone could let us know how to avoid this
situation.

{code:title=New hive context should not load the configurations from history}

sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
// got 20 as expected

sqlContext2.setConf( "foo", "foo") 

sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
// expected 30 but got 10

{code}

  was:
In a spark-shell session, stopping a Spark context and creating a new Spark
context and Hive context does not clean the Spark SQL configuration. More
precisely, the new Hive context still keeps the previous configuration
settings. It would be great if someone could let us know how to avoid this
situation.

{code:title=New hive context should not load the configurations from history}
case class Foo ( x: Int = (math.random * 1e3).toInt)
val foo = (1 to 100).map(i => Foo()).toDF
foo.saveAsParquetFile( "foo" )
sqlContext.setConf( "spark.sql.shuffle.partitions", "10")

sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
// got 20 as expected
val foo2 = sqlContext2.parquetFile( "foo" )
sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
// expected 30 but got 10
{code}


> New HiveContext object unexpectedly loads configuration settings from history 
> --
>
> Key: SPARK-9280
> URL: https://issues.apache.org/jira/browse/SPARK-9280
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Tien-Dung LE
>
> In a spark-shell session, stopping a Spark context and creating a new Spark
> context and Hive context does not clean the Spark SQL configuration. More
> precisely, the new Hive context still keeps the previous configuration
> settings. It would be great if someone could let us know how to avoid this
> situation.
> {code:title=New hive context should not load the configurations from history}
> sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
> sc.stop
> val sparkConf2 = new org.apache.spark.SparkConf()
> val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
> val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
> // got 20 as expected
> sqlContext2.setConf( "foo", "foo") 
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
> // expected 30 but got 10
> {code}






[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-24 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-9280:

Affects Version/s: 1.4.1

> New HiveContext object unexpectedly loads configuration settings from history 
> --
>
> Key: SPARK-9280
> URL: https://issues.apache.org/jira/browse/SPARK-9280
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Tien-Dung LE
>
> In a spark-shell session, stopping a Spark context and creating a new Spark
> context and Hive context does not clean the Spark SQL configuration. More
> precisely, the new Hive context still keeps the previous configuration
> settings. It would be great if someone could let us know how to avoid this
> situation.
> {code:title=New hive context should not load the configurations from history}
> case class Foo ( x: Int = (math.random * 1e3).toInt)
> val foo = (1 to 100).map(i => Foo()).toDF
> foo.saveAsParquetFile( "foo" )
> sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
> sc.stop
> val sparkConf2 = new org.apache.spark.SparkConf()
> val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
> val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
> // got 20 as expected
> val foo2 = sqlContext2.parquetFile( "foo" )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
> // expected 30 but got 10
> {code}






[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-23 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-9280:

Description: 
In a spark-shell session, stopping a Spark context and creating a new Spark
context and Hive context does not clean the Spark SQL configuration. More
precisely, the new Hive context still keeps the previous configuration
settings. It would be great if someone could let us know how to avoid this
situation.

{code:title=New hive context should not load the configurations from history}
case class Foo ( x: Int = (math.random * 1e3).toInt)
val foo = (1 to 100).map(i => Foo()).toDF
foo.saveAsParquetFile( "foo" )
sqlContext.setConf( "spark.sql.shuffle.partitions", "10")

sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
// got 20 as expected
val foo2 = sqlContext2.parquetFile( "foo" )
sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
// expected 30 but got 10
{code}

  was:
In a spark-shell session, stopping a Spark context and creating a new Spark
context and Hive context does not clean the Spark SQL configuration. More
precisely, the new Hive context still keeps the previous configuration
settings. It would be great if someone could let us know how to avoid this
situation.

{code:title=New hive context should not load the configurations from history}
case class Foo ( x: Int = (math.random * 1e3).toInt)
val foo = (1 to 100).map(i => Foo()).toDF
foo.saveAsParquetFile( "foo" )
sqlContext.setConf( "spark.sql.shuffle.partitions", "10")

sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
val foo2 = sqlContext2.parquetFile( "foo" )
sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
// expected 30 but got 10
{code}


> New HiveContext object unexpectedly loads configuration settings from history 
> --
>
> Key: SPARK-9280
> URL: https://issues.apache.org/jira/browse/SPARK-9280
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Tien-Dung LE
>
> In a spark-shell session, stopping a Spark context and creating a new Spark
> context and Hive context does not clean the Spark SQL configuration. More
> precisely, the new Hive context still keeps the previous configuration
> settings. It would be great if someone could let us know how to avoid this
> situation.
> {code:title=New hive context should not load the configurations from history}
> case class Foo ( x: Int = (math.random * 1e3).toInt)
> val foo = (1 to 100).map(i => Foo()).toDF
> foo.saveAsParquetFile( "foo" )
> sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
> sc.stop
> val sparkConf2 = new org.apache.spark.SparkConf()
> val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
> val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
> // got 20 as expected
> val foo2 = sqlContext2.parquetFile( "foo" )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
> // expected 30 but got 10
> {code}






[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-23 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-9280:

Description: 
In a spark-shell session, stopping a Spark context and creating a new Spark
context and Hive context does not clean the Spark SQL configuration. More
precisely, the new Hive context still keeps the previous configuration
settings. It would be great if someone could let us know how to avoid this
situation.

{code:title=New hive context should not load the configurations from history}
case class Foo ( x: Int = (math.random * 1e3).toInt)
val foo = (1 to 100).map(i => Foo()).toDF
foo.saveAsParquetFile( "foo" )
sqlContext.setConf( "spark.sql.shuffle.partitions", "10")

sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
val foo2 = sqlContext2.parquetFile( "foo" )
sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
// expected 30 but got 10
{code}

  was:
In a spark-shell session, stopping a Spark context and creating a new Spark
context and Hive context does not clean the Spark SQL configuration. More
precisely, the new Hive context still keeps the previous configuration
settings. Here is a code snippet to show this scenario.

{code:title=New hive context should not load the configurations from history}
case class Foo ( x: Int = (math.random * 1e3).toInt)
val foo = (1 to 100).map(i => Foo()).toDF
foo.saveAsParquetFile( "foo" )
sqlContext.setConf( "spark.sql.shuffle.partitions", "10")

sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
val foo2 = sqlContext2.parquetFile( "foo" )
sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
// expected 30 but got 10
{code}


> New HiveContext object unexpectedly loads configuration settings from history 
> --
>
> Key: SPARK-9280
> URL: https://issues.apache.org/jira/browse/SPARK-9280
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Tien-Dung LE
>
> In a spark-shell session, stopping a Spark context and creating a new Spark
> context and Hive context does not clean the Spark SQL configuration. More
> precisely, the new Hive context still keeps the previous configuration
> settings. It would be great if someone could let us know how to avoid this
> situation.
> {code:title=New hive context should not load the configurations from history}
> case class Foo ( x: Int = (math.random * 1e3).toInt)
> val foo = (1 to 100).map(i => Foo()).toDF
> foo.saveAsParquetFile( "foo" )
> sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
> sc.stop
> val sparkConf2 = new org.apache.spark.SparkConf()
> val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
> val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
> val foo2 = sqlContext2.parquetFile( "foo" )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
> // expected 30 but got 10
> {code}






[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-23 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-9280:

Component/s: SQL

> New HiveContext object unexpectedly loads configuration settings from history 
> --
>
> Key: SPARK-9280
> URL: https://issues.apache.org/jira/browse/SPARK-9280
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Tien-Dung LE
>
> In a spark-shell session, stopping a Spark context and creating a new Spark
> context and Hive context does not clean the Spark SQL configuration. More
> precisely, the new Hive context still keeps the previous configuration
> settings. Here is a code snippet to show this scenario.
> {code:title=New hive context should not load the configurations from history}
> case class Foo ( x: Int = (math.random * 1e3).toInt)
> val foo = (1 to 100).map(i => Foo()).toDF
> foo.saveAsParquetFile( "foo" )
> sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
> sc.stop
> val sparkConf2 = new org.apache.spark.SparkConf()
> val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
> val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
> val foo2 = sqlContext2.parquetFile( "foo" )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
> // expected 30 but got 10
> {code}






[jira] [Created] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-23 Thread Tien-Dung LE (JIRA)
Tien-Dung LE created SPARK-9280:
---

 Summary: New HiveContext object unexpectedly loads configuration 
settings from history 
 Key: SPARK-9280
 URL: https://issues.apache.org/jira/browse/SPARK-9280
 Project: Spark
  Issue Type: Bug
Reporter: Tien-Dung LE


In a spark-shell session, stopping a Spark context and creating a new Spark
context and Hive context does not clean the Spark SQL configuration. More
precisely, the new Hive context still keeps the previous configuration
settings. Here is a code snippet to show this scenario.

{code:title=New hive context should not load the configurations from history}
case class Foo ( x: Int = (math.random * 1e3).toInt)
val foo = (1 to 100).map(i => Foo()).toDF
foo.saveAsParquetFile( "foo" )
sqlContext.setConf( "spark.sql.shuffle.partitions", "10")

sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
val foo2 = sqlContext2.parquetFile( "foo" )
sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
// expected 30 but got 10
{code}






[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-23 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-9280:

Affects Version/s: 1.3.1

> New HiveContext object unexpectedly loads configuration settings from history 
> --
>
> Key: SPARK-9280
> URL: https://issues.apache.org/jira/browse/SPARK-9280
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Tien-Dung LE
>
> In a spark-shell session, stopping a Spark context and creating a new Spark
> context and Hive context does not clean the Spark SQL configuration. More
> precisely, the new Hive context still keeps the previous configuration
> settings. Here is a code snippet to show this scenario.
> {code:title=New hive context should not load the configurations from history}
> case class Foo ( x: Int = (math.random * 1e3).toInt)
> val foo = (1 to 100).map(i => Foo()).toDF
> foo.saveAsParquetFile( "foo" )
> sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
> sc.stop
> val sparkConf2 = new org.apache.spark.SparkConf()
> val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) 
> val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") 
> val foo2 = sqlContext2.parquetFile( "foo" )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
> // expected 30 but got 10
> {code}






[jira] [Closed] (SPARK-9109) Unpersist a graph object does not work properly

2015-07-18 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE closed SPARK-9109.
---
Resolution: Fixed

The change has been merged at https://github.com/apache/spark/pull/7469
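
For releases that do not include that change, a minimal workaround sketch (an illustration, not something proposed in this thread) is to unpersist the graph's underlying vertex and edge RDDs directly:
{code}
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

val graph: Graph[Long, Long] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) => id.toLong ).mapEdges( e => e.attr.toLong)

graph.cache().numEdges

// Workaround sketch: release the cached vertex and edge RDDs explicitly
graph.vertices.unpersist(blocking = false)
graph.edges.unpersist(blocking = false)
{code}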


> Unpersist a graph object does not work properly
> ---
>
> Key: SPARK-9109
> URL: https://issues.apache.org/jira/browse/SPARK-9109
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Tien-Dung LE
>Priority: Minor
>
> Unpersisting a graph object does not work properly.
> Here is the code to reproduce the issue:
> {code}
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.slf4j.LoggerFactory
> import org.apache.spark.graphx.util.GraphGenerators
> val graph: Graph[Long, Long] =
>   GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) 
> => id.toLong ).mapEdges( e => e.attr.toLong)
>   
> graph.cache().numEdges
> graph.unpersist()
> {code}
> There should not be any cached RDDs in storage 
> (http://localhost:4040/storage/).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9109) Unpersist a graph object does not work properly

2015-07-17 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631412#comment-14631412
 ] 

Tien-Dung LE edited comment on SPARK-9109 at 7/17/15 2:38 PM:
--

[~sowen] This is my first PR for Spark so I am not sure how to proceed. Could 
you kindly have a first look?

PS: I did run ./dev/run-tests. It passed the "Scalastyle checks" but it failed on 
the Python part:

{code}
.../spark/dev/lint-python: line 64: syntax error near unexpected token `>'
.../spark/dev/lint-python: line 64: `easy_install -d "$PYLINT_HOME" 
pylint==1.4.4 &>> "$PYLINT_INSTALL_INFO"'
[error] running .../spark/dev/lint-python ; received return code 2
{code}


was (Author: tien-dung.le):
[~sowen] This is my first PR for Spark so I am not sure how to proceed. Could 
you kindly have a first look?

PS: I did run ./dev/run-tests but it failed on the Python part:

{code}
.../spark/dev/lint-python: line 64: syntax error near unexpected token `>'
.../spark/dev/lint-python: line 64: `easy_install -d "$PYLINT_HOME" 
pylint==1.4.4 &>> "$PYLINT_INSTALL_INFO"'
[error] running .../spark/dev/lint-python ; received return code 2
{code}

> Unpersist a graph object does not work properly
> ---
>
> Key: SPARK-9109
> URL: https://issues.apache.org/jira/browse/SPARK-9109
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Tien-Dung LE
>Priority: Minor
>
> Unpersisting a graph object does not work properly.
> Here is the code to reproduce the issue:
> {code}
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.slf4j.LoggerFactory
> import org.apache.spark.graphx.util.GraphGenerators
> val graph: Graph[Long, Long] =
>   GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) 
> => id.toLong ).mapEdges( e => e.attr.toLong)
>   
> graph.cache().numEdges
> graph.unpersist()
> {code}
> There should not be any cached RDDs in storage 
> (http://localhost:4040/storage/).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9109) Unpersist a graph object does not work properly

2015-07-17 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631412#comment-14631412
 ] 

Tien-Dung LE commented on SPARK-9109:
-

[~sowen] This is my first PR for Spark so I am not sure how to proceed. Could 
you kindly have a first look?

PS: I did run ./dev/run-tests but it failed on the Python part:

{code}
.../spark/dev/lint-python: line 64: syntax error near unexpected token `>'
.../spark/dev/lint-python: line 64: `easy_install -d "$PYLINT_HOME" 
pylint==1.4.4 &>> "$PYLINT_INSTALL_INFO"'
[error] running .../spark/dev/lint-python ; received return code 2
{code}

> Unpersist a graph object does not work properly
> ---
>
> Key: SPARK-9109
> URL: https://issues.apache.org/jira/browse/SPARK-9109
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Tien-Dung LE
>Priority: Minor
>
> Unpersisting a graph object does not work properly.
> Here is the code to reproduce the issue:
> {code}
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.slf4j.LoggerFactory
> import org.apache.spark.graphx.util.GraphGenerators
> val graph: Graph[Long, Long] =
>   GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) 
> => id.toLong ).mapEdges( e => e.attr.toLong)
>   
> graph.cache().numEdges
> graph.unpersist()
> {code}
> There should not be any cached RDDs in storage 
> (http://localhost:4040/storage/).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9109) Unpersist a graph object does not work properly

2015-07-17 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631081#comment-14631081
 ] 

Tien-Dung LE commented on SPARK-9109:
-

I think the caching is done on purpose, so we should keep it. The solution is to 
keep track of all cached RDDs and unpersist them later, when graph.unpersist is 
called.

I can propose a change for that, but it would be very kind of you to refer me to 
a document (procedure) on how to make a change and create a PR.
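
Until such a change lands, a rough manual workaround (only a sketch, reusing the 
graph value from the description's snippet, and assuming the graph is the only 
thing meant to stay cached in the application) is to drop every RDD the 
SparkContext still tracks as persistent:

{code}
// Manual workaround sketch: after unpersisting the graph itself, unpersist
// whatever RDDs the SparkContext still registers as persistent.
// Note that this clears *every* cached RDD in the application.
graph.unpersist()
sc.getPersistentRDDs.values.foreach( _.unpersist(blocking = false) )
{code}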

> Unpersist a graph object does not work properly
> ---
>
> Key: SPARK-9109
> URL: https://issues.apache.org/jira/browse/SPARK-9109
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Tien-Dung LE
>Priority: Minor
>
> Unpersisting a graph object does not work properly.
> Here is the code to reproduce the issue:
> {code}
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.slf4j.LoggerFactory
> import org.apache.spark.graphx.util.GraphGenerators
> val graph: Graph[Long, Long] =
>   GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) 
> => id.toLong ).mapEdges( e => e.attr.toLong)
>   
> graph.cache().numEdges
> graph.unpersist()
> {code}
> There should not be any cached RDDs in storage 
> (http://localhost:4040/storage/).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9109) Unpersist a graph object does not work properly

2015-07-17 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630941#comment-14630941
 ] 

Tien-Dung LE commented on SPARK-9109:
-

Thanks [~sowen]. Indeed, there is still a cached edges RDD. I think this RDD is 
left over from the graph construction in GraphImpl.scala, more precisely in the 
fromEdgeRDD() function.

Here is the latest code following your suggestion.
{code}

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.slf4j.LoggerFactory
import org.apache.spark.graphx.util.GraphGenerators

val graph1 = GraphGenerators.logNormalGraph(sc, numVertices = 100)
val graph2 = graph1.mapVertices( (id, _) => id.toLong )
val graph3 = graph2.mapEdges( e => e.attr.toLong)
  
graph3.cache().numEdges
graph3.unpersist()
graph2.unpersist()
graph1.unpersist()
graph2.unpersist()
graph3.unpersist()

sc.getPersistentRDDs.foreach( r => println( r._2.toString))

/*  
GraphImpl.scala

private def fromEdgeRDD[VD: ClassTag, ED: ClassTag](
  edges: EdgeRDDImpl[ED, VD],
  defaultVertexAttr: VD,
  edgeStorageLevel: StorageLevel,
  vertexStorageLevel: StorageLevel): GraphImpl[VD, ED] = {
val edgesCached = edges.withTargetStorageLevel(edgeStorageLevel).cache()
val vertices = VertexRDD.fromEdges(edgesCached, 
edgesCached.partitions.size, defaultVertexAttr)
  .withTargetStorageLevel(vertexStorageLevel)
fromExistingRDDs(vertices, edgesCached)
  }
*/
{code}
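
As a side note, a small inspection sketch (not in the original comment) that 
prints the id and storage level of every RDD still registered with the 
SparkContext can help confirm that the leftover entry is the edges RDD cached 
inside GraphImpl.fromEdgeRDD:

{code}
// Inspection sketch: list every RDD the context still considers persistent,
// together with its storage level.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id: $rdd @ ${rdd.getStorageLevel.description}")
}
{code}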




> Unpersist a graph object does not work properly
> ---
>
> Key: SPARK-9109
> URL: https://issues.apache.org/jira/browse/SPARK-9109
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Tien-Dung LE
>Priority: Minor
>
> Unpersisting a graph object does not work properly.
> Here is the code to reproduce the issue:
> {code}
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.slf4j.LoggerFactory
> import org.apache.spark.graphx.util.GraphGenerators
> val graph: Graph[Long, Long] =
>   GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) 
> => id.toLong ).mapEdges( e => e.attr.toLong)
>   
> graph.cache().numEdges
> graph.unpersist()
> {code}
> There should not be any cached RDDs in storage 
> (http://localhost:4040/storage/).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9109) Unpersist a graph object does not work properly

2015-07-16 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-9109:

Description: 
Unpersisting a graph object does not work properly.

Here is the code to reproduce the issue:

{code}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.slf4j.LoggerFactory
import org.apache.spark.graphx.util.GraphGenerators

val graph: Graph[Long, Long] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) => 
id.toLong ).mapEdges( e => e.attr.toLong)
  
graph.cache().numEdges
graph.unpersist()
{code}

There should not be any cached RDDs in storage (http://localhost:4040/storage/).

  was:
Unpersist a graph object does not work properly.

Here is the code to produce 


import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.slf4j.LoggerFactory
import org.apache.spark.graphx.util.GraphGenerators

val graph: Graph[Long, Long] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) => 
id.toLong ).mapEdges( e => e.attr.toLong)
  
graph.cache().numEdges
graph.unpersist()


There should not be any cached RDDs in storage (http://localhost:4040/storage/).


> Unpersist a graph object does not work properly
> ---
>
> Key: SPARK-9109
> URL: https://issues.apache.org/jira/browse/SPARK-9109
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Tien-Dung LE
>
> Unpersisting a graph object does not work properly.
> Here is the code to reproduce the issue:
> {code}
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
> import org.slf4j.LoggerFactory
> import org.apache.spark.graphx.util.GraphGenerators
> val graph: Graph[Long, Long] =
>   GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) 
> => id.toLong ).mapEdges( e => e.attr.toLong)
>   
> graph.cache().numEdges
> graph.unpersist()
> {code}
> There should not be any cached RDDs in storage 
> (http://localhost:4040/storage/).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9109) Unpersist a graph object does not work properly

2015-07-16 Thread Tien-Dung LE (JIRA)
Tien-Dung LE created SPARK-9109:
---

 Summary: Unpersist a graph object does not work properly
 Key: SPARK-9109
 URL: https://issues.apache.org/jira/browse/SPARK-9109
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.4.0, 1.3.1
Reporter: Tien-Dung LE


Unpersisting a graph object does not work properly.

Here is the code to reproduce the issue:


{code}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.slf4j.LoggerFactory
import org.apache.spark.graphx.util.GraphGenerators

val graph: Graph[Long, Long] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) => 
id.toLong ).mapEdges( e => e.attr.toLong)

graph.cache().numEdges
graph.unpersist()
{code}


There should not be any cached RDDs in storage (http://localhost:4040/storage/).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-03-11 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE closed SPARK-5499.
---

The checkpoint mechanism can solve the issue.
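
For reference, a minimal sketch of the working approach (the checkpoint directory 
path and the interval of 100 iterations are arbitrary example choices):

{code}
import org.apache.spark.rdd.RDD

// A checkpoint directory must be set before checkpoint() has any effect.
sc.setCheckpointDir("/tmp/spark-checkpoints")

var pair: RDD[(Long, Long)] = sc.parallelize(Array((1L, 2L)))
for (i <- 1 to 1000) {
  pair = pair.map(_.swap)
  if (i % 100 == 0) {
    pair.cache()       // avoid recomputing the RDD when it is written out
    pair.checkpoint()  // mark for checkpointing before the next action
    pair.count()       // materialize, so the lineage is actually truncated
  }
}
println("Count = " + pair.count())
{code}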

> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
> {code}
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-03-11 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356699#comment-14356699
 ] 

Tien-Dung LE commented on SPARK-5499:
-

Many thanks for your reply. I followed your suggestion and it worked.

> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
> {code}
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298622#comment-14298622
 ] 

Tien-Dung LE edited comment on SPARK-5499 at 1/30/15 1:47 PM:
--

I tried with checkpoint() but had the same error. Here is the code

{code}
for (i <- 1 to 1000) {
  newPair = pair.map(_.swap).persist()

  pair = newPair
  println("" + i + ": count = " + pair.count())

  if( i % 100 == 0) {
pair.checkpoint()
  }
}
{code}


was (Author: tien-dung.le):
I tried with checkpoint() but same had the same error. Here is the code

{code}
for (i <- 1 to 1000) {
  newPair = pair.map(_.swap).persist()

  pair = newPair
  println("" + i + ": count = " + pair.count())

  if( i % 100 == 0) {
pair.checkpoint()
  }
}
{code}

> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
> {code}
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298622#comment-14298622
 ] 

Tien-Dung LE commented on SPARK-5499:
-

I tried with checkpoint() but had the same error. Here is the code

{code}
for (i <- 1 to 1000) {
  newPair = pair.map(_.swap).persist()

  pair = newPair
  println("" + i + ": count = " + pair.count())

  if( i % 100 == 0) {
pair.checkpoint()
  }
}
{code}

> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
> {code}
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Tien-Dung LE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298608#comment-14298608
 ] 

Tien-Dung LE commented on SPARK-5499:
-

Thanks Sean Owen for your comment.

Calling persist() or cache() does not help. Did you mean to call checkpoint()?

> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
> {code}
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-5499:

Description: 
I got an error "org.apache.spark.SparkException: Job aborted due to stage 
failure: Task serialization failed: java.lang.StackOverflowError" when 
executing an action with 1000 transformations.

Here is a code snippet to re-produce the error:
{code}
  import org.apache.spark.rdd.RDD
  var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
var newPair: RDD[(Long,Long)] = null

for (i <- 1 to 1000) {
  newPair = pair.map(_.swap)
  pair = newPair
}

println("Count = " + pair.count())

{code}

  was:
I got an error "org.apache.spark.SparkException: Job aborted due to stage 
failure: Task serialization failed: java.lang.StackOverflowError" when 
executing an action with 1000 transformations.

Here is a code snippet to re-produce the error:

  import org.apache.spark.rdd.RDD
  var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
var newPair: RDD[(Long,Long)] = null

for (i <- 1 to 1000) {
  newPair = pair.map(_.swap)
  pair = newPair
}

println("Count = " + pair.count())



> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
> {code}
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-5499:

Description: 
I got an error "org.apache.spark.SparkException: Job aborted due to stage 
failure: Task serialization failed: java.lang.StackOverflowError" when 
executing an action with 1000 transformations.

Here is a code snippet to re-produce the error:

  import org.apache.spark.rdd.RDD
  var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
var newPair: RDD[(Long,Long)] = null

for (i <- 1 to 1000) {
  newPair = pair.map(_.swap)
  pair = newPair
}

println("Count = " + pair.count())


  was:
I got an error "org.apache.spark.SparkException: Job aborted due to stage 
failure: Task serialization failed: java.lang.StackOverflowError" when 
executing an action with 1000 transformations cause.

Here is a code snippet to re-produce the error:

  import org.apache.spark.rdd.RDD
  var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
var newPair: RDD[(Long,Long)] = null

for (i <- 1 to 1000) {
  newPair = pair.map(_.swap)
  pair = newPair
}

println("Count = " + pair.count())



> iterative computing with 1000 iterations causes stage failure
> -
>
> Key: SPARK-5499
> URL: https://issues.apache.org/jira/browse/SPARK-5499
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Tien-Dung LE
>
> I got an error "org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task serialization failed: java.lang.StackOverflowError" when 
> executing an action with 1000 transformations.
> Here is a code snippet to re-produce the error:
>   import org.apache.spark.rdd.RDD
>   var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
> var newPair: RDD[(Long,Long)] = null
> for (i <- 1 to 1000) {
>   newPair = pair.map(_.swap)
>   pair = newPair
> }
> println("Count = " + pair.count())



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5499) iterative computing with 1000 iterations causes stage failure

2015-01-30 Thread Tien-Dung LE (JIRA)
Tien-Dung LE created SPARK-5499:
---

 Summary: iterative computing with 1000 iterations causes stage 
failure
 Key: SPARK-5499
 URL: https://issues.apache.org/jira/browse/SPARK-5499
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Tien-Dung LE


I got an error "org.apache.spark.SparkException: Job aborted due to stage 
failure: Task serialization failed: java.lang.StackOverflowError" when 
executing an action with 1000 transformations.

Here is a code snippet to re-produce the error:

{code}
import org.apache.spark.rdd.RDD

var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
var newPair: RDD[(Long,Long)] = null

for (i <- 1 to 1000) {
  newPair = pair.map(_.swap)
  pair = newPair
}

println("Count = " + pair.count())
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org