[jira] [Commented] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-09-01 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454841#comment-15454841 ]

Apache Spark commented on SPARK-17356:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14915

> Out of memory when calling TreeNode.toJSON
> --
>
> Key: SPARK-17356
> URL: https://issues.apache.org/jira/browse/SPARK-17356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>  Reporter: Sean Zhong
> Attachments: jmap.txt, jstack.txt, queryplan.txt
>
>
> When using MLlib, calling toJSON on a plan with many levels of
> sub-queries may cause an out-of-memory exception, with a stack trace like this:
> {code}
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at scala.collection.mutable.AbstractSeq.<init>(Seq.scala:47)
>   at scala.collection.mutable.AbstractBuffer.<init>(Buffer.scala:48)
>   at scala.collection.mutable.ListBuffer.<init>(ListBuffer.scala:46)
>   at scala.collection.immutable.List$.newBuilder(List.scala:396)
>   at scala.collection.generic.GenericTraversableTemplate$class.newBuilder(GenericTraversableTemplate.scala:64)
>   at scala.collection.AbstractTraversable.newBuilder(Traversable.scala:105)
>   at scala.collection.TraversableLike$class.filter(TraversableLike.scala:262)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
>   at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:274)
>   at scala.collection.AbstractTraversable.filterNot(Traversable.scala:105)
>   at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:25)
>   at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:20)
>   at org.json4s.jackson.JValueSerializer.serialize(JValueSerializer.scala:7)
>   at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
>   at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
>   at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
>   at org.json4s.jackson.JsonMethods$class.compact(JsonMethods.scala:34)
>   at org.json4s.jackson.JsonMethods$.compact(JsonMethods.scala:50)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:566)
> {code}
> The query plan, stack trace, and jmap histogram are attached.




[jira] [Commented] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-09-01 Thread Sean Zhong (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454553#comment-15454553 ]

Sean Zhong commented on SPARK-17356:


Reproducer:

{code}
# Trigger OOM
scala> :paste -raw
// Entering paste mode (ctrl-D to finish)

package org.apache.spark.ml.attribute

import org.apache.spark.ml.attribute._
import org.apache.spark.sql.catalyst.expressions.{Alias, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
import org.apache.spark.sql.catalyst.dsl.plans._

object Test {
  def main(args: Array[String]): Unit = {
    val rand = new java.util.Random()
    val attr: Attribute = new BinaryAttribute(Some("a"),
      Some(rand.nextInt(10)), Some(Array("value1", "value2")))
    val attributeGroup = new AttributeGroup("group", Array.fill(100)(attr))
    val alias = Alias(Literal(0), "alias")(explicitMetadata =
      Some(attributeGroup.toMetadata()))
    val testRelation = LocalRelation()
    val query = testRelation.select((0 to 100).toSeq.map(_ => alias): _*)
    System.out.print(query.toJSON.length)
  }
}

// Exiting paste mode, now interpreting.

scala> org.apache.spark.ml.attribute.Test.main(null)
{code}


[jira] [Commented] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-09-01 Thread Sean Zhong (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454495#comment-15454495 ]

Sean Zhong commented on SPARK-17356:


Root cause:

MLlib heavily leverages Metadata to store attribute information; in the case here, the metadata may contain tens of thousands of Attribute entries. That metadata can be stored in an Alias expression like this:
{code}
case class Alias(child: Expression, name: String)(
    val exprId: ExprId = NamedExpression.newExprId,
    val qualifier: Option[String] = None,
    val explicitMetadata: Option[Metadata] = None,
    override val isGenerated: java.lang.Boolean = false)
{code}

If we serialize this metadata to JSON, it takes a huge amount of memory, because every expression that carries the metadata serializes its own full copy.
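
To see the multiplicative growth without any Spark involvement, here is a minimal json4s sketch (hypothetical: the object name and field layout are made up for illustration, and it assumes only json4s-jackson on the classpath). Each of the 101 select-list entries in the reproducer carries its own full copy of a 100-attribute metadata object, so the serialized size scales roughly as 101 x 100 x (per-attribute JSON):

{code}
import org.json4s._
import org.json4s.jackson.JsonMethods.compact

object MetadataBlowup {
  def main(args: Array[String]): Unit = {
    // One attribute's worth of metadata, loosely mirroring ML attribute JSON.
    val attr = JObject(
      "name"   -> JString("a"),
      "index"  -> JInt(0),
      "values" -> JArray(List(JString("value1"), JString("value2"))))
    // A 100-attribute group, as in the reproducer above.
    val group = JObject("attributes" -> JArray(List.fill(100)(attr)))
    // 101 select-list entries, each holding its own full copy of the group.
    val plan = JArray(List.fill(101)(group))
    // Both the intermediate JValue tree and the output string grow multiplicatively.
    println(compact(plan).length)
  }
}
{code}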



[jira] [Commented] (SPARK-17356) Out of memory when calling TreeNode.toJSON

2016-09-01 Thread Sean Zhong (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454488#comment-15454488 ]

Sean Zhong commented on SPARK-17356:


*Analysis*

After looking at the jmap histogram, there is a suspicious line:
{code}
  20:             1        8388624  [Lorg.apache.spark.ml.attribute.Attribute;
{code}

This means a single Attribute array takes 8,388,624 bytes; at 8 bytes per reference, that is roughly one million Attribute references (8388624 / 8 = 1048578).

That array is probably the one held by AttributeGroup, whose signature is:
{code}
class AttributeGroup private (
    val name: String,
    val numAttributes: Option[Int],
    attrs: Option[Array[Attribute]])
{code}

AttributeGroup also defines a toMetadata method that converts the Attribute array to Metadata:
{code}
  def toMetadata(): Metadata = toMetadata(Metadata.empty)
{code}
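
A rough sizing probe (a hypothetical spark-shell snippet, not taken from the attachments) shows how large the converted metadata gets for a group shaped like the one in the reproducer:

{code}
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}

// Build a 100-attribute group and print the length of its metadata rendered
// as JSON; Metadata.json returns the underlying JSON string.
val attrs = Array.fill[Attribute](100)(NumericAttribute.defaultAttr)
val group = new AttributeGroup("group", attrs)
println(group.toMetadata().json.length)
{code}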

Finally, the metadata is saved onto the expression's output attribute.

For example, in org.apache.spark.ml.feature.Interaction's transform function, the metadata is set on the Alias expression when aliasing the UDF output, like this:

{code}
  override def transform(dataset: Dataset[_]): DataFrame = {
    ...
    // !NOTE!: This is an attribute group.
    val featureAttrs = getFeatureAttrs(inputFeatures)

    def interactFunc = udf { row: Row =>
      ...
    }

    val featureCols = inputFeatures.map { f =>
      f.dataType match {
        case DoubleType => dataset(f.name)
        case _: VectorUDT => dataset(f.name)
        case _: NumericType | BooleanType => dataset(f.name).cast(DoubleType)
      }
    }

    // !NOTE!: The metadata is stored in the Alias expression by the call
    // .as(..., featureAttrs.toMetadata()).
    dataset.select(
      col("*"),
      interactFunc(struct(featureCols: _*)).as($(outputCol), featureAttrs.toMetadata()))
  }
{code}

And when toJSON is called on the plan, all of that metadata is converted to JSON.
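
A minimal end-to-end sketch of that mechanism (a hypothetical demo, assuming a local SparkSession; the object and column names are made up): metadata attached through Column.as ends up on the Alias in the analyzed plan, so TreeNode.toJSON serializes it along with everything else:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

object MetadataInPlanJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("md-demo").getOrCreate()
    // Attach a deliberately large metadata blob to an alias, as Interaction does.
    val md = new MetadataBuilder().putStringArray("values", Array.fill(1000)("v")).build()
    val df = spark.range(1).select(col("id").as("out", md))
    // The alias's explicit metadata is serialized on every toJSON call over the plan.
    println(df.queryExecution.analyzed.toJSON.length)
    spark.stop()
  }
}
{code}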
