[jira] [Commented] (SPARK-27069) Spark(2.3.2) LDA transformation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232

2019-03-07 Thread TAESUK KIM (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786519#comment-16786519
 ] 

TAESUK KIM commented on SPARK-27069:


I'm sorry for that. My mistake.


[jira] [Updated] (SPARK-27069) Spark(2.3.2) LDA transformation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232

2019-03-07 Thread TAESUK KIM (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TAESUK KIM updated SPARK-27069:
---
Summary: Spark(2.3.2) LDA transformation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232
(was: Spark(2.3.1) LDA transformation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232)


[jira] [Updated] (SPARK-27069) Spark(2.3.1) LDA transformation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232

2019-03-06 Thread TAESUK KIM (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TAESUK KIM updated SPARK-27069:
---
Summary: Spark(2.3.1) LDA transformation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232
(was: Spark(2.3.1) LDA transformation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123))


[jira] [Created] (SPARK-27069) Spark(2.3.1) LDA transformation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)

2019-03-05 Thread TAESUK KIM (JIRA)
TAESUK KIM created SPARK-27069:
--

 Summary: Spark(2.3.1) LDA transformation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
 Key: SPARK-27069
 URL: https://issues.apache.org/jira/browse/SPARK-27069
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.2
 Environment: Below is my environment.

DataSet
 # Documents: about 100,000,000 --> 10,000,000 --> 1,000,000 (all fail)
 # Words: about 3,553,918 (cannot be changed)

Spark environment
 # executor-memory, driver-memory: 18g --> 32g --> 64g --> 128g (all fail)
 # executor-cores, driver-cores: 3
 # spark.serializer: default and org.apache.spark.serializer.KryoSerializer (both fail)
 # spark.executor.memoryOverhead: 18g --> 36g (both fail)

Java version: 1.8.0_191 (Oracle Corporation)
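
For reference, a sketch of how settings like these are typically passed to spark-submit (illustrative; values varied across the attempts above, and the class and jar names are placeholders, not the real ones):

{code}
spark-submit \
  --deploy-mode cluster \
  --executor-memory 18g \
  --driver-memory 18g \
  --executor-cores 3 \
  --driver-cores 3 \
  --conf spark.executor.memoryOverhead=18g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --class com.example.LdaTransformJob \
  lda-transform-job.jar
{code}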

 
Reporter: TAESUK KIM


I trained an LDA model (feature dimension: 100, iterations: 100 or 50, distributed version, ml) using Spark 2.3.2 (emr-5.18.0).
After that I want to transform a new DataSet with that model, but whenever I transform new data I get a memory-related error.
I reduced the data size to 0.1x and then 0.01x of the original, but I always get the memory error (java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)).
 
That hugeCapacity error (overflow) happens when the size of an array exceeds Integer.MAX_VALUE - 8. But I already reduced the data to a small size, so I can't find why this error happens.
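
A back-of-the-envelope estimate (my assumption: the topics matrix of the LDA model is vocabSize x k doubles and gets Java-serialized into a single byte array) suggests the model itself, rather than the input data, may be what exceeds the array limit:

{code}
// Rough size estimate; the single-byte-array assumption is mine, not confirmed from the Spark source.
val vocabSize = 3553918L          // vocabulary size of the CountVectorizer model
val k         = 100L              // LDA topic (feature) dimension
val bytes     = vocabSize * k * 8 // = 2,843,134,400 bytes, about 2.8 GB
// Integer.MAX_VALUE - 8 = 2,147,483,639, so a single array of this size overflows
// no matter how small the transformed DataSet is.
{code}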

I also want to change the serializer to KryoSerializer, but I found that org.apache.spark.util.ClosureCleaner$.ensureSerializable always calls org.apache.spark.serializer.JavaSerializationStream even though I register Kryo classes.
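
For reference, a minimal sketch of the kind of Kryo registration I mean (the registered classes below are illustrative, not the exact list):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.{DenseVector, SparseVector}

// Sketch only: enable Kryo for data serialization and register a few classes.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[DenseVector], classOf[SparseVector]))

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}

As far as I understand, spark.serializer only affects data serialization (shuffle, caching, broadcast); closures cleaned by ClosureCleaner are always serialized with the Java serializer, which would explain why JavaSerializationStream still appears in the stack trace.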
 

Is there anything I can do?

 
Below is the code:

{code}
import org.apache.spark.ml.clustering.DistributedLDAModel
import org.apache.spark.ml.feature.CountVectorizerModel

val countvModel = CountVectorizerModel.load("s3://~/")
val ldaModel = DistributedLDAModel.load("s3://~/")

val transformeddata = countvModel.transform(inputData)
  .select("productid", "itemid", "ptkString", "features")

var featureldaDF = ldaModel.transform(transformeddata)
  .select("productid", "itemid", "topicDistribution", "ptkString")
  .toDF("productid", "itemid", "features", "ptkString")

featureldaDF = featureldaDF.persist // this is line 328
{code}
 

Other testing
 # Java GC options: UseParallelGC, UseG1GC (both fail; passed as JVM options, see the sketch below)
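
For reference, a sketch of one way such GC flags can be passed (illustrative, not the exact command):

{code}
--conf spark.executor.extraJavaOptions=-XX:+UseG1GC \
--conf spark.driver.extraJavaOptions=-XX:+UseG1GC
{code}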

Below is the log:

{code}
19/03/05 20:59:03 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:608)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107)
    at
{code}