[jira] [Updated] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien-Dung LE updated SPARK-13932:
---------------------------------
    Affects Version/s: 2.0.0

> CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-13932
>                 URL: https://issues.apache.org/jira/browse/SPARK-13932
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0, 1.6.1, 2.0.0
>            Reporter: Tien-Dung LE
>
> A complex aggregate query that uses a condition inside an aggregate function together with a GROUP BY ... HAVING clause raises an exception. The issue occurs in Spark 1.6.x but not in Spark 1.5.x.
> Here is a typical error message:
> {code}
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: b#55, b#124.; line 1 pos 178
>         at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
> {code}
> Here is a code snippet to reproduce the error in a spark-shell session:
> {code}
> import sqlContext.implicits._
> case class Toto( a: String = f"${(math.random*1e6).toLong}%06.0f",
>                  b: Int = (math.random*1e3).toInt,
>                  n: Int = (math.random*1e3).toInt,
>                  m: Double = (math.random*1e3))
> val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
> val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
> df.registerTempTable( "toto" )
> val sqlSelect1 = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, GROUPING__ID"
> val sqlSelect2 = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, SUM(m) AS k3, GROUPING__ID"
> val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
> val sqlHaving = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"
> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )            // OK
> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> {code}
> And here is the full log:
> {code}
> scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
> res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
> res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: bigint, k3: double, GROUPING__ID: int]
> scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
> org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: b#55, b#124.; line 1 pos 178
>         at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
>         at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
>         at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>         at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
>         at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>         at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
>         at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
>         at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>         at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>         at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>         at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>         at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>         at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
>         at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>         at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>         at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>         at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> {code}
[jira] [Commented] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199683#comment-15199683 ]

Tien-Dung LE commented on SPARK-13932:
--------------------------------------
Hi Xiao, not yet! I only tried with Spark versions 1.6.1 and 1.6.0.
[jira] [Commented] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199689#comment-15199689 ]

Tien-Dung LE commented on SPARK-13932:
--------------------------------------
The error is still there in the latest spark code, version 2.0.0-SNAPSHOT.
[jira] [Updated] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien-Dung LE updated SPARK-13932:
---------------------------------
    Description:

A complex aggregate query that uses a condition inside an aggregate function together with a GROUP BY ... HAVING clause raises an exception. The issue occurs in Spark 1.6.x but not in Spark 1.5.x.

Here is a typical error message:
{code}
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: b#55, b#124.; line 1 pos 178
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
{code}

Here is a code snippet to reproduce the error in a spark-shell session:
{code}
import sqlContext.implicits._
case class Toto( a: String = f"${(math.random*1e6).toLong}%06.0f",
                 b: Int = (math.random*1e3).toInt,
                 n: Int = (math.random*1e3).toInt,
                 m: Double = (math.random*1e3))
val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
df.registerTempTable( "toto" )
val sqlSelect1 = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, GROUPING__ID"
val sqlSelect2 = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, SUM(m) AS k3, GROUPING__ID"
val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlHaving = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"
sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )            // OK
sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
{code}

And here is the full log:
{code}
scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: bigint, k3: double, GROUPING__ID: int]

scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: bigint, k3: double, GROUPING__ID: int]

scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: b#55, b#124.; line 1 pos 178
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
{code}
[jira] [Updated] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien-Dung LE updated SPARK-13932:
---------------------------------
    Description:

A complex aggregate query that uses a condition inside an aggregate function together with a GROUP BY ... HAVING clause raises an exception.

Here is a typical error message:
{code}
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: b#55, b#124.; line 1 pos 178
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
{code}

Here is a code snippet to reproduce the error in a spark-shell session:
{code}
import sqlContext.implicits._
case class Toto( a: String = f"${(math.random*1e6).toLong}%06.0f",
                 b: Int = (math.random*1e3).toInt,
                 n: Int = (math.random*1e3).toInt,
                 m: Double = (math.random*1e3))
val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
df.registerTempTable( "toto" )
val sqlSelect1 = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, GROUPING__ID"
val sqlSelect2 = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, SUM(m) AS k3, GROUPING__ID"
val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlHaving = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"
sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )            // OK
sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
{code}

Here is the full log:
{code}
scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: bigint, k3: double, GROUPING__ID: int]

scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: bigint, k3: double, GROUPING__ID: int]

scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: b#55, b#124.; line 1 pos 178
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
{code}
[jira] [Created] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
Tien-Dung LE created SPARK-13932:
---------------------------------

             Summary: CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException
                 Key: SPARK-13932
                 URL: https://issues.apache.org/jira/browse/SPARK-13932
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.1, 1.6.0
            Reporter: Tien-Dung LE

A complex aggregate query that uses a condition inside an aggregate function together with a GROUP BY ... HAVING clause raises an exception.

Here is a typical error message:
{code}
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: b#55, b#124.; line 1 pos 178
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
{code}

Here is a code snippet to reproduce the error in a spark-shell session:
{code}
import sqlContext.implicits._
case class Toto( a: String = f"${(math.random*1e6).toLong}%06.0f",
                 b: Int = (math.random*1e3).toInt,
                 n: Int = (math.random*1e3).toInt,
                 m: Double = (math.random*1e3))
val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
df.registerTempTable( "toto" )
val sqlSelect1 = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, GROUPING__ID"
val sqlSelect2 = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, SUM(m) AS k3, GROUPING__ID"
val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlHaving = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"
sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )            // OK
sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
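Until the analyzer bug is resolved, one possible workaround (an untested editorial sketch, not proposed in the thread) is to drop the HAVING clause and instead filter the aggregated result from an outer query, so that b and GROUPING__ID are resolved against the aggregate's output attributes only:

{code}
// Hypothetical workaround sketch (not verified against this bug): run the
// aggregate without HAVING, then apply the same predicates in an outer
// SELECT, where only one 'b' attribute is in scope.
val sqlInner = s"$sqlSelect2 $sqlGroupBy"
val sqlOuter = s"SELECT * FROM ($sqlInner) agg " +
               "WHERE ((agg.GROUPING__ID & 1) == 1) AND (agg.b > 500)"
sqlContext.sql( sqlOuter )
{code}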
[jira] [Updated] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien-Dung LE updated SPARK-12837:
---------------------------------
    Description:

Executing a sql statement with a large number of partitions requires a high memory space for the driver even when there are no requests to collect data back to the driver.

Here are the steps to reproduce the issue.

1. Start spark shell with a spark.driver.maxResultSize setting:
{code:java}
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
{code}

2. Execute the code:
{code:java}
case class Toto( a: Int, b: Int)
val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( "toto2" ) // ERROR
{code}

The error message is
{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 393 tasks (1025.9 KB) is bigger than spark.driver.maxResultSize (1024.0 KB)
{code}

  was: (same description text, with {code:shell} and {code:scala} markers instead of {code:java})
[jira] [Created] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
Tien-Dung LE created SPARK-12837:
---------------------------------

             Summary: Spark driver requires large memory space for serialized results even there are no data collected to the driver
                 Key: SPARK-12837
                 URL: https://issues.apache.org/jira/browse/SPARK-12837
             Project: Spark
          Issue Type: Question
          Components: SQL
    Affects Versions: 1.6.0, 1.5.2
            Reporter: Tien-Dung LE

Executing a sql statement with a large number of partitions requires a high memory space for the driver even when there are no requests to collect data back to the driver.

Here are the steps to reproduce the issue.

1. Start spark shell with a spark.driver.maxResultSize setting:
{code:shell}
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
{code}

2. Execute the code:
{code:scala}
case class Toto( a: Int, b: Int)
val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( "toto2" ) // ERROR
{code}

The error message is
{code:scala}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 393 tasks (1025.9 KB) is bigger than spark.driver.maxResultSize (1024.0 KB)
{code}
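The failure above comes from the spark.driver.maxResultSize check rather than from an explicit collect of data to the driver, so one mitigation (a sketch only, not a fix for the underlying accounting) is to raise the limit when starting the shell; per the Spark configuration documentation, a value of 0 disables the check entirely:

{code:shell}
# Mitigation sketch only: relax or disable the serialized-result limit.
bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=0
{code}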
[jira] [Commented] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953158#comment-14953158 ]

Tien-Dung LE commented on SPARK-9280:
-------------------------------------
[~davies] I have just checked out the latest spark master code. SqlContext (instead of HiveContext) returns the correct number of partitions. Many thanks for this fix.

> New HiveContext object unexpectedly loads configuration settings from history
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-9280
>                 URL: https://issues.apache.org/jira/browse/SPARK-9280
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.3.1, 1.4.1
>            Reporter: Tien-Dung LE
>
> In a spark session, stopping a spark context and creating a new spark context and hive context does not clean the spark sql configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone can let us know how to avoid this situation.
> {code:title=New hive context should not load the configurations from history}
> sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
> sc.stop
> val sparkConf2 = new org.apache.spark.SparkConf()
> val sc2 = new org.apache.spark.SparkContext( sparkConf2 )
> val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "20")
> // got 20 as expected
> sqlContext2.setConf( "foo", "foo")
> sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
> // expected 30 but got 10
> {code}
[jira] [Commented] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642840#comment-14642840 ] Tien-Dung LE commented on SPARK-9280: - Thanks [~pborck]. Detaching the Hive session state with org.apache.hadoop.hive.ql.session.SessionState.detachSession() helps to avoid the issue. > New HiveContext object unexpectedly loads configuration settings from history > -- > > Key: SPARK-9280 > URL: https://issues.apache.org/jira/browse/SPARK-9280 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1, 1.4.1 >Reporter: Tien-Dung LE > > In a spark session, stopping a spark context and create a new spark context > and hive context does not clean the spark sql configuration. More precisely, > the new hive context still keeps the previous configuration settings. It > would be great if someone can let us know how to avoid this situation. > {code:title=New hive context should not load the configurations from history} > sqlContext.setConf( "spark.sql.shuffle.partitions", "10") > sc.stop > val sparkConf2 = new org.apache.spark.SparkConf() > val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) > val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") > // got 20 as expected > sqlContext2.setConf( "foo", "foo") > sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") > // expected 30 but got 10 > {code}
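The workaround mentioned in this thread can be sketched as follows: detach the thread-local Hive session state between stopping the old context and creating the new one, so the new HiveContext does not pick up the stale SQL configuration. This is a sketch against the 1.3/1.4-era API; the final getConf default value is illustrative.

{code:scala}
// Stop the old context, then detach the thread-local Hive session state
// so it cannot leak into the next HiveContext.
sc.stop()
org.apache.hadoop.hive.ql.session.SessionState.detachSession()

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 )
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

// With the session detached, this should return the supplied default
// instead of the stale "10" from the previous session.
sqlContext2.getConf( "spark.sql.shuffle.partitions", "30" )
{code}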
[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-9280: Description: In a spark session, stopping a spark context and create a new spark context and hive context does not clean the spark sql configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone can let us know how to avoid this situation. {code:title=New hive context should not load the configurations from history} sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") // got 20 as expected sqlContext2.setConf( "foo", "foo") sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} was: In a spark-shell session, stopping a spark context and create a new spark context and hive context does not clean the spark sql configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone can let us know how to avoid this situation. 
{code:title=New hive context should not load the configurations from history} sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") // got 20 as expected sqlContext2.setConf( "foo", "foo") sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} > New HiveContext object unexpectedly loads configuration settings from history > -- > > Key: SPARK-9280 > URL: https://issues.apache.org/jira/browse/SPARK-9280 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1, 1.4.1 >Reporter: Tien-Dung LE > > In a spark session, stopping a spark context and create a new spark context > and hive context does not clean the spark sql configuration. More precisely, > the new hive context still keeps the previous configuration settings. It > would be great if someone can let us know how to avoid this situation. > {code:title=New hive context should not load the configurations from history} > sqlContext.setConf( "spark.sql.shuffle.partitions", "10") > sc.stop > val sparkConf2 = new org.apache.spark.SparkConf() > val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) > val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") > // got 20 as expected > sqlContext2.setConf( "foo", "foo") > sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") > // expected 30 but got 10 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-9280: Description: In a spark-shell session, stopping a spark context and create a new spark context and hive context does not clean the spark sql configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone can let us know how to avoid this situation. {code:title=New hive context should not load the configurations from history} sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") // got 20 as expected sqlContext2.setConf( "foo", "foo") sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} was: In a spark-shell session, stopping a spark context and create a new spark context and hive context does not clean the spark sql configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone can let us know how to avoid this situation. 
{code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") // got 20 as expected val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} > New HiveContext object unexpectedly loads configuration settings from history > -- > > Key: SPARK-9280 > URL: https://issues.apache.org/jira/browse/SPARK-9280 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1, 1.4.1 >Reporter: Tien-Dung LE > > In a spark-shell session, stopping a spark context and create a new spark > context and hive context does not clean the spark sql configuration. More > precisely, the new hive context still keeps the previous configuration > settings. It would be great if someone can let us know how to avoid this > situation. > {code:title=New hive context should not load the configurations from history} > sqlContext.setConf( "spark.sql.shuffle.partitions", "10") > sc.stop > val sparkConf2 = new org.apache.spark.SparkConf() > val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) > val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") > // got 20 as expected > sqlContext2.setConf( "foo", "foo") > sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") > // expected 30 but got 10 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-9280: Affects Version/s: 1.4.1 > New HiveContext object unexpectedly loads configuration settings from history > -- > > Key: SPARK-9280 > URL: https://issues.apache.org/jira/browse/SPARK-9280 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1, 1.4.1 >Reporter: Tien-Dung LE > > In a spark-shell session, stopping a spark context and create a new spark > context and hive context does not clean the spark sql configuration. More > precisely, the new hive context still keeps the previous configuration > settings. It would be great if someone can let us know how to avoid this > situation. > {code:title=New hive context should not load the configurations from history} > case class Foo ( x: Int = (math.random * 1e3).toInt) > val foo = (1 to 100).map(i => Foo()).toDF > foo.saveAsParquetFile( "foo" ) > sqlContext.setConf( "spark.sql.shuffle.partitions", "10") > sc.stop > val sparkConf2 = new org.apache.spark.SparkConf() > val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) > val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") > // got 20 as expected > val foo2 = sqlContext2.parquetFile( "foo" ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") > // expected 30 but got 10 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-9280: Description: In a spark-shell session, stopping a spark context and create a new spark context and hive context does not clean the spark sql configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone can let us know how to avoid this situation. {code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") // got 20 as expected val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} was: In a spark-shell session, stopping a spark context and create a new spark context and hive context does not clean the spark sql configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone can let us know how to avoid this situation. 
{code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} > New HiveContext object unexpectedly loads configuration settings from history > -- > > Key: SPARK-9280 > URL: https://issues.apache.org/jira/browse/SPARK-9280 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Tien-Dung LE > > In a spark-shell session, stopping a spark context and create a new spark > context and hive context does not clean the spark sql configuration. More > precisely, the new hive context still keeps the previous configuration > settings. It would be great if someone can let us know how to avoid this > situation. 
> {code:title=New hive context should not load the configurations from history} > case class Foo ( x: Int = (math.random * 1e3).toInt) > val foo = (1 to 100).map(i => Foo()).toDF > foo.saveAsParquetFile( "foo" ) > sqlContext.setConf( "spark.sql.shuffle.partitions", "10") > sc.stop > val sparkConf2 = new org.apache.spark.SparkConf() > val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) > val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") > // got 20 as expected > val foo2 = sqlContext2.parquetFile( "foo" ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") > // expected 30 but got 10 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-9280: Description: In a spark-shell session, stopping a spark context and create a new spark context and hive context does not clean the spark sql configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone can let us know how to avoid this situation. {code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} was: In a spark-shell session, stopping a spark context and create a new spark context and hive context does not clean the spark sql configuration. More precisely, the new hive context still keeps the previous configuration settings. Here is a code to show this scenario. 
{code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} > New HiveContext object unexpectedly loads configuration settings from history > -- > > Key: SPARK-9280 > URL: https://issues.apache.org/jira/browse/SPARK-9280 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Tien-Dung LE > > In a spark-shell session, stopping a spark context and create a new spark > context and hive context does not clean the spark sql configuration. More > precisely, the new hive context still keeps the previous configuration > settings. It would be great if someone can let us know how to avoid this > situation. 
> {code:title=New hive context should not load the configurations from history} > case class Foo ( x: Int = (math.random * 1e3).toInt) > val foo = (1 to 100).map(i => Foo()).toDF > foo.saveAsParquetFile( "foo" ) > sqlContext.setConf( "spark.sql.shuffle.partitions", "10") > sc.stop > val sparkConf2 = new org.apache.spark.SparkConf() > val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) > val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") > val foo2 = sqlContext2.parquetFile( "foo" ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") > // expected 30 but got 10 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-9280: Component/s: SQL > New HiveContext object unexpectedly loads configuration settings from history > -- > > Key: SPARK-9280 > URL: https://issues.apache.org/jira/browse/SPARK-9280 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: Tien-Dung LE > > In a spark-shell session, stopping a spark context and create a new spark > context and hive context does not clean the spark sql configuration. More > precisely, the new hive context still keeps the previous configuration > settings. Here is a code to show this scenario. > {code:title=New hive context should not load the configurations from history} > case class Foo ( x: Int = (math.random * 1e3).toInt) > val foo = (1 to 100).map(i => Foo()).toDF > foo.saveAsParquetFile( "foo" ) > sqlContext.setConf( "spark.sql.shuffle.partitions", "10") > sc.stop > val sparkConf2 = new org.apache.spark.SparkConf() > val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) > val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") > val foo2 = sqlContext2.parquetFile( "foo" ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") > // expected 30 but got 10 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
Tien-Dung LE created SPARK-9280: --- Summary: New HiveContext object unexpectedly loads configuration settings from history Key: SPARK-9280 URL: https://issues.apache.org/jira/browse/SPARK-9280 Project: Spark Issue Type: Bug Reporter: Tien-Dung LE In a spark-shell session, stopping a Spark context and creating a new Spark context and Hive context does not clear the Spark SQL configuration. More precisely, the new Hive context still keeps the previous configuration settings. Here is a code snippet that shows this scenario. {code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code}
[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-9280: Affects Version/s: 1.3.1 > New HiveContext object unexpectedly loads configuration settings from history > -- > > Key: SPARK-9280 > URL: https://issues.apache.org/jira/browse/SPARK-9280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.1 >Reporter: Tien-Dung LE > > In a spark-shell session, stopping a spark context and create a new spark > context and hive context does not clean the spark sql configuration. More > precisely, the new hive context still keeps the previous configuration > settings. Here is a code to show this scenario. > {code:title=New hive context should not load the configurations from history} > case class Foo ( x: Int = (math.random * 1e3).toInt) > val foo = (1 to 100).map(i => Foo()).toDF > foo.saveAsParquetFile( "foo" ) > sqlContext.setConf( "spark.sql.shuffle.partitions", "10") > sc.stop > val sparkConf2 = new org.apache.spark.SparkConf() > val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) > val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") > val foo2 = sqlContext2.parquetFile( "foo" ) > sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") > // expected 30 but got 10 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-9109) Unpersist a graph object does not work properly
[ https://issues.apache.org/jira/browse/SPARK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE closed SPARK-9109. --- Resolution: Fixed The change has been merged at https://github.com/apache/spark/pull/7469 > Unpersist a graph object does not work properly > --- > > Key: SPARK-9109 > URL: https://issues.apache.org/jira/browse/SPARK-9109 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.3.1, 1.4.0 >Reporter: Tien-Dung LE >Priority: Minor > > Unpersist a graph object does not work properly. > Here is the code to produce > {code} > import org.apache.spark.graphx._ > import org.apache.spark.rdd.RDD > import org.slf4j.LoggerFactory > import org.apache.spark.graphx.util.GraphGenerators > val graph: Graph[Long, Long] = > GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) > => id.toLong ).mapEdges( e => e.attr.toLong) > > graph.cache().numEdges > graph.unpersist() > {code} > There should not be any cached RDDs in storage > (http://localhost:4040/storage/). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9109) Unpersist a graph object does not work properly
[ https://issues.apache.org/jira/browse/SPARK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631412#comment-14631412 ] Tien-Dung LE edited comment on SPARK-9109 at 7/17/15 2:38 PM: -- [~sowen] This is my first PR for Spark so I am not sure how to proceed. Could you kindly have a first look? PS: I did run ./dev/run-tests. It passed "Scalastyle checks" but it failed on the Python part {code} .../spark/dev/lint-python: line 64: syntax error near unexpected token `>' .../spark/dev/lint-python: line 64: `easy_install -d "$PYLINT_HOME" pylint==1.4.4 &>> "$PYLINT_INSTALL_INFO"' [error] running .../spark/dev/lint-python ; received return code 2 {code} was (Author: tien-dung.le): [~sowen] This is my first PR for Spark so I am not sure how to proceed. Could you kindly have a first look? PS: I did run ./dev/run-tests but it failed on the Python part {code} .../spark/dev/lint-python: line 64: syntax error near unexpected token `>' .../spark/dev/lint-python: line 64: `easy_install -d "$PYLINT_HOME" pylint==1.4.4 &>> "$PYLINT_INSTALL_INFO"' [error] running .../spark/dev/lint-python ; received return code 2 {code} > Unpersist a graph object does not work properly > --- > > Key: SPARK-9109 > URL: https://issues.apache.org/jira/browse/SPARK-9109 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.3.1, 1.4.0 >Reporter: Tien-Dung LE >Priority: Minor > > Unpersist a graph object does not work properly. > Here is the code to produce > {code} > import org.apache.spark.graphx._ > import org.apache.spark.rdd.RDD > import org.slf4j.LoggerFactory > import org.apache.spark.graphx.util.GraphGenerators > val graph: Graph[Long, Long] = > GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) > => id.toLong ).mapEdges( e => e.attr.toLong) > > graph.cache().numEdges > graph.unpersist() > {code} > There should not be any cached RDDs in storage > (http://localhost:4040/storage/). 
[jira] [Commented] (SPARK-9109) Unpersist a graph object does not work properly
[ https://issues.apache.org/jira/browse/SPARK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631412#comment-14631412 ] Tien-Dung LE commented on SPARK-9109: - [~sowen] This is my first PR for Spark so I am not sure how to proceed. Could you kindly have a first look? PS: I did run ./dev/run-tests but it failed on the Python part {code} .../spark/dev/lint-python: line 64: syntax error near unexpected token `>' .../spark/dev/lint-python: line 64: `easy_install -d "$PYLINT_HOME" pylint==1.4.4 &>> "$PYLINT_INSTALL_INFO"' [error] running .../spark/dev/lint-python ; received return code 2 {code} > Unpersist a graph object does not work properly > --- > > Key: SPARK-9109 > URL: https://issues.apache.org/jira/browse/SPARK-9109 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.3.1, 1.4.0 >Reporter: Tien-Dung LE >Priority: Minor > > Unpersist a graph object does not work properly. > Here is the code to produce > {code} > import org.apache.spark.graphx._ > import org.apache.spark.rdd.RDD > import org.slf4j.LoggerFactory > import org.apache.spark.graphx.util.GraphGenerators > val graph: Graph[Long, Long] = > GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) > => id.toLong ).mapEdges( e => e.attr.toLong) > > graph.cache().numEdges > graph.unpersist() > {code} > There should not be any cached RDDs in storage > (http://localhost:4040/storage/).
[jira] [Commented] (SPARK-9109) Unpersist a graph object does not work properly
[ https://issues.apache.org/jira/browse/SPARK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631081#comment-14631081 ] Tien-Dung LE commented on SPARK-9109: - I think the caching is done on purpose, so we should keep it. The solution is to track all cached RDDs and unpersist them later (when graph.unpersist is called). I can propose a change for that, but it would be very kind of you to refer me to a document describing the procedure for making a change and creating a PR. > Unpersist a graph object does not work properly > --- > > Key: SPARK-9109 > URL: https://issues.apache.org/jira/browse/SPARK-9109 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.3.1, 1.4.0 >Reporter: Tien-Dung LE >Priority: Minor > > Unpersist a graph object does not work properly. > Here is the code to produce > {code} > import org.apache.spark.graphx._ > import org.apache.spark.rdd.RDD > import org.slf4j.LoggerFactory > import org.apache.spark.graphx.util.GraphGenerators > val graph: Graph[Long, Long] = > GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) > => id.toLong ).mapEdges( e => e.attr.toLong) > > graph.cache().numEdges > graph.unpersist() > {code} > There should not be any cached RDDs in storage > (http://localhost:4040/storage/).
[jira] [Commented] (SPARK-9109) Unpersist a graph object does not work properly
[ https://issues.apache.org/jira/browse/SPARK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630941#comment-14630941 ] Tien-Dung LE commented on SPARK-9109: - Thanks [~sowen]. Indeed, there is still a cached edges RDD. I think this RDD is left over from the graph construction in GraphImpl.scala, more precisely in the fromEdgeRDD() function. Here is the latest code following your suggestion. {code} import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD import org.slf4j.LoggerFactory import org.apache.spark.graphx.util.GraphGenerators val graph1 = GraphGenerators.logNormalGraph(sc, numVertices = 100) val graph2 = graph1.mapVertices( (id, _) => id.toLong ) val graph3 = graph2.mapEdges( e => e.attr.toLong) graph3.cache().numEdges graph3.unpersist() graph2.unpersist() graph1.unpersist() graph2.unpersist() graph3.unpersist() sc.getPersistentRDDs.foreach( r => println( r._2.toString)) /* GraphImpl.scala private def fromEdgeRDD[VD: ClassTag, ED: ClassTag]( edges: EdgeRDDImpl[ED, VD], defaultVertexAttr: VD, edgeStorageLevel: StorageLevel, vertexStorageLevel: StorageLevel): GraphImpl[VD, ED] = { val edgesCached = edges.withTargetStorageLevel(edgeStorageLevel).cache() val vertices = VertexRDD.fromEdges(edgesCached, edgesCached.partitions.size, defaultVertexAttr) .withTargetStorageLevel(vertexStorageLevel) fromExistingRDDs(vertices, edgesCached) } */ {code} > Unpersist a graph object does not work properly > --- > > Key: SPARK-9109 > URL: https://issues.apache.org/jira/browse/SPARK-9109 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.3.1, 1.4.0 >Reporter: Tien-Dung LE >Priority: Minor > > Unpersist a graph object does not work properly. 
> Here is the code to produce > {code} > import org.apache.spark.graphx._ > import org.apache.spark.rdd.RDD > import org.slf4j.LoggerFactory > import org.apache.spark.graphx.util.GraphGenerators > val graph: Graph[Long, Long] = > GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) > => id.toLong ).mapEdges( e => e.attr.toLong) > > graph.cache().numEdges > graph.unpersist() > {code} > There should not be any cached RDDs in storage > (http://localhost:4040/storage/). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
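Based on the fromEdgeRDD analysis in this thread, a partial manual cleanup is possible by unpersisting the graph's vertex and edge RDDs explicitly rather than only calling graph.unpersist(). This is a sketch against the 1.3/1.4-era GraphX API; it may still not release RDDs cached internally during intermediate graph construction, which is the bug being reported.

{code:scala}
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

val graph: Graph[Long, Long] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100)
    .mapVertices( (id, _) => id.toLong )
    .mapEdges( e => e.attr.toLong )

graph.cache().numEdges

// Sketch of a fuller cleanup: release the vertex and edge RDDs explicitly
// in addition to the graph itself.
graph.unpersistVertices(blocking = false)
graph.edges.unpersist(blocking = false)
graph.unpersist()

// Inspect what is still cached (same information as http://localhost:4040/storage/).
sc.getPersistentRDDs.foreach( r => println( r._2.toString) )
{code}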
[jira] [Updated] (SPARK-9109) Unpersist a graph object does not work properly
[ https://issues.apache.org/jira/browse/SPARK-9109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-9109: Description: Unpersist a graph object does not work properly. Here is the code to produce {code} import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD import org.slf4j.LoggerFactory import org.apache.spark.graphx.util.GraphGenerators val graph: Graph[Long, Long] = GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) => id.toLong ).mapEdges( e => e.attr.toLong) graph.cache().numEdges graph.unpersist() {code} There should not be any cached RDDs in storage (http://localhost:4040/storage/). was: Unpersist a graph object does not work properly. Here is the code to produce import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD import org.slf4j.LoggerFactory import org.apache.spark.graphx.util.GraphGenerators val graph: Graph[Long, Long] = GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) => id.toLong ).mapEdges( e => e.attr.toLong) graph.cache().numEdges graph.unpersist() There should not be any cached RDDs in storage (http://localhost:4040/storage/). > Unpersist a graph object does not work properly > --- > > Key: SPARK-9109 > URL: https://issues.apache.org/jira/browse/SPARK-9109 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.1, 1.4.0 >Reporter: Tien-Dung LE > > Unpersist a graph object does not work properly. > Here is the code to produce > {code} > import org.apache.spark.graphx._ > import org.apache.spark.rdd.RDD > import org.slf4j.LoggerFactory > import org.apache.spark.graphx.util.GraphGenerators > val graph: Graph[Long, Long] = > GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) > => id.toLong ).mapEdges( e => e.attr.toLong) > > graph.cache().numEdges > graph.unpersist() > {code} > There should not be any cached RDDs in storage > (http://localhost:4040/storage/). 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9109) Unpersist a graph object does not work properly
Tien-Dung LE created SPARK-9109: --- Summary: Unpersist a graph object does not work properly Key: SPARK-9109 URL: https://issues.apache.org/jira/browse/SPARK-9109 Project: Spark Issue Type: Bug Affects Versions: 1.4.0, 1.3.1 Reporter: Tien-Dung LE Unpersist a graph object does not work properly. Here is the code to produce import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD import org.slf4j.LoggerFactory import org.apache.spark.graphx.util.GraphGenerators val graph: Graph[Long, Long] = GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) => id.toLong ).mapEdges( e => e.attr.toLong) graph.cache().numEdges graph.unpersist() There should not be any cached RDDs in storage (http://localhost:4040/storage/).
[jira] [Closed] (SPARK-5499) iterative computing with 1000 iterations causes stage failure
[ https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE closed SPARK-5499. --- checkpoint mechanism can solve the issue. > iterative computing with 1000 iterations causes stage failure > - > > Key: SPARK-5499 > URL: https://issues.apache.org/jira/browse/SPARK-5499 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Tien-Dung LE > > I got an error "org.apache.spark.SparkException: Job aborted due to stage > failure: Task serialization failed: java.lang.StackOverflowError" when > executing an action with 1000 transformations. > Here is a code snippet to re-produce the error: > {code} > import org.apache.spark.rdd.RDD > var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L))) > var newPair: RDD[(Long,Long)] = null > for (i <- 1 to 1000) { > newPair = pair.map(_.swap) > pair = newPair > } > println("Count = " + pair.count()) > {code}
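The resolution above ("checkpoint mechanism can solve the issue") can be sketched as follows. This is an illustrative adaptation of the reporter's snippet, assuming a spark-shell session and a writable checkpoint directory (the path is hypothetical). The key points are to set a checkpoint directory first, and to call {{checkpoint()}} before the action that materializes the RDD, so the lineage is actually truncated:
{code}
import org.apache.spark.rdd.RDD

sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path

var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L)))
for (i <- 1 to 1000) {
  pair = pair.map(_.swap)
  if (i % 100 == 0) {
    pair.checkpoint() // mark BEFORE an action; written when the next job runs
    pair.count()      // force materialization, truncating the lineage here
  }
}
println("Count = " + pair.count())
{code}
Without the periodic checkpoint, each serialized task carries the full 1000-step lineage, which is what overflows the stack.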
[jira] [Commented] (SPARK-5499) iterative computing with 1000 iterations causes stage failure
[ https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356699#comment-14356699 ] Tien-Dung LE commented on SPARK-5499: - Many thanks for your reply. I did the same as your suggestion and it worked. > iterative computing with 1000 iterations causes stage failure > - > > Key: SPARK-5499 > URL: https://issues.apache.org/jira/browse/SPARK-5499 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Tien-Dung LE > > I got an error "org.apache.spark.SparkException: Job aborted due to stage > failure: Task serialization failed: java.lang.StackOverflowError" when > executing an action with 1000 transformations. > Here is a code snippet to re-produce the error: > {code} > import org.apache.spark.rdd.RDD > var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L))) > var newPair: RDD[(Long,Long)] = null > for (i <- 1 to 1000) { > newPair = pair.map(_.swap) > pair = newPair > } > println("Count = " + pair.count()) > {code}
[jira] [Comment Edited] (SPARK-5499) iterative computing with 1000 iterations causes stage failure
[ https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298622#comment-14298622 ] Tien-Dung LE edited comment on SPARK-5499 at 1/30/15 1:47 PM: -- I tried with checkpoint() but had the same error. Here is the code {code} for (i <- 1 to 1000) { newPair = pair.map(_.swap).persist() pair = newPair println("" + i + ": count = " + pair.count()) if( i % 100 == 0) { pair.checkpoint() } } {code} was (Author: tien-dung.le): I tried with checkpoint() but same had the same error. Here is the code {code} for (i <- 1 to 1000) { newPair = pair.map(_.swap).persist() pair = newPair println("" + i + ": count = " + pair.count()) if( i % 100 == 0) { pair.checkpoint() } } {code} > iterative computing with 1000 iterations causes stage failure > - > > Key: SPARK-5499 > URL: https://issues.apache.org/jira/browse/SPARK-5499 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Tien-Dung LE > > I got an error "org.apache.spark.SparkException: Job aborted due to stage > failure: Task serialization failed: java.lang.StackOverflowError" when > executing an action with 1000 transformations. > Here is a code snippet to re-produce the error: > {code} > import org.apache.spark.rdd.RDD > var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L))) > var newPair: RDD[(Long,Long)] = null > for (i <- 1 to 1000) { > newPair = pair.map(_.swap) > pair = newPair > } > println("Count = " + pair.count()) > {code}
[jira] [Commented] (SPARK-5499) iterative computing with 1000 iterations causes stage failure
[ https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298622#comment-14298622 ] Tien-Dung LE commented on SPARK-5499: - I tried with checkpoint() but still had the same error. Here is the code {code} for (i <- 1 to 1000) { newPair = pair.map(_.swap).persist() pair = newPair println("" + i + ": count = " + pair.count()) if( i % 100 == 0) { pair.checkpoint() } } {code} > iterative computing with 1000 iterations causes stage failure > - > > Key: SPARK-5499 > URL: https://issues.apache.org/jira/browse/SPARK-5499 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Tien-Dung LE > > I got an error "org.apache.spark.SparkException: Job aborted due to stage > failure: Task serialization failed: java.lang.StackOverflowError" when > executing an action with 1000 transformations. > Here is a code snippet to re-produce the error: > {code} > import org.apache.spark.rdd.RDD > var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L))) > var newPair: RDD[(Long,Long)] = null > for (i <- 1 to 1000) { > newPair = pair.map(_.swap) > pair = newPair > } > println("Count = " + pair.count()) > {code}
[jira] [Commented] (SPARK-5499) iterative computing with 1000 iterations causes stage failure
[ https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298608#comment-14298608 ] Tien-Dung LE commented on SPARK-5499: - Thanks Sean Owen for your comment. Calling persist() or cache() does not help. Did you mean to call checkpoint() ? > iterative computing with 1000 iterations causes stage failure > - > > Key: SPARK-5499 > URL: https://issues.apache.org/jira/browse/SPARK-5499 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Tien-Dung LE > > I got an error "org.apache.spark.SparkException: Job aborted due to stage > failure: Task serialization failed: java.lang.StackOverflowError" when > executing an action with 1000 transformations. > Here is a code snippet to re-produce the error: > {code} > import org.apache.spark.rdd.RDD > var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L))) > var newPair: RDD[(Long,Long)] = null > for (i <- 1 to 1000) { > newPair = pair.map(_.swap) > pair = newPair > } > println("Count = " + pair.count()) > {code}
[jira] [Updated] (SPARK-5499) iterative computing with 1000 iterations causes stage failure
[ https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-5499: Description: I got an error "org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError" when executing an action with 1000 transformations. Here is a code snippet to re-produce the error: {code} import org.apache.spark.rdd.RDD var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L))) var newPair: RDD[(Long,Long)] = null for (i <- 1 to 1000) { newPair = pair.map(_.swap) pair = newPair } println("Count = " + pair.count()) {code} was: I got an error "org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError" when executing an action with 1000 transformations. Here is a code snippet to re-produce the error: import org.apache.spark.rdd.RDD var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L))) var newPair: RDD[(Long,Long)] = null for (i <- 1 to 1000) { newPair = pair.map(_.swap) pair = newPair } println("Count = " + pair.count()) > iterative computing with 1000 iterations causes stage failure > - > > Key: SPARK-5499 > URL: https://issues.apache.org/jira/browse/SPARK-5499 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Tien-Dung LE > > I got an error "org.apache.spark.SparkException: Job aborted due to stage > failure: Task serialization failed: java.lang.StackOverflowError" when > executing an action with 1000 transformations. 
> Here is a code snippet to re-produce the error: > {code} > import org.apache.spark.rdd.RDD > var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L))) > var newPair: RDD[(Long,Long)] = null > for (i <- 1 to 1000) { > newPair = pair.map(_.swap) > pair = newPair > } > println("Count = " + pair.count()) > {code}
[jira] [Updated] (SPARK-5499) iterative computing with 1000 iterations causes stage failure
[ https://issues.apache.org/jira/browse/SPARK-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-5499: Description: I got an error "org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError" when executing an action with 1000 transformations. Here is a code snippet to re-produce the error: import org.apache.spark.rdd.RDD var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L))) var newPair: RDD[(Long,Long)] = null for (i <- 1 to 1000) { newPair = pair.map(_.swap) pair = newPair } println("Count = " + pair.count()) was: I got an error "org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError" when executing an action with 1000 transformations cause. Here is a code snippet to re-produce the error: import org.apache.spark.rdd.RDD var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L))) var newPair: RDD[(Long,Long)] = null for (i <- 1 to 1000) { newPair = pair.map(_.swap) pair = newPair } println("Count = " + pair.count()) > iterative computing with 1000 iterations causes stage failure > - > > Key: SPARK-5499 > URL: https://issues.apache.org/jira/browse/SPARK-5499 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Tien-Dung LE > > I got an error "org.apache.spark.SparkException: Job aborted due to stage > failure: Task serialization failed: java.lang.StackOverflowError" when > executing an action with 1000 transformations. 
> Here is a code snippet to re-produce the error: > import org.apache.spark.rdd.RDD > var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L))) > var newPair: RDD[(Long,Long)] = null > for (i <- 1 to 1000) { > newPair = pair.map(_.swap) > pair = newPair > } > println("Count = " + pair.count())
[jira] [Created] (SPARK-5499) iterative computing with 1000 iterations causes stage failure
Tien-Dung LE created SPARK-5499: --- Summary: iterative computing with 1000 iterations causes stage failure Key: SPARK-5499 URL: https://issues.apache.org/jira/browse/SPARK-5499 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Tien-Dung LE I got an error "org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError" when executing an action with 1000 transformations. Here is a code snippet to re-produce the error: import org.apache.spark.rdd.RDD var pair: RDD[(Long,Long)] = sc.parallelize(Array((1L,2L))) var newPair: RDD[(Long,Long)] = null for (i <- 1 to 1000) { newPair = pair.map(_.swap) pair = newPair } println("Count = " + pair.count())