[jira] [Commented] (SPARK-27069) Spark(2.3.2) LDA transfomation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232
[ https://issues.apache.org/jira/browse/SPARK-27069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786519#comment-16786519 ]

TAESUK KIM commented on SPARK-27069:
------------------------------------

I'm sorry for that. My mistake.
[jira] [Updated] (SPARK-27069) Spark(2.3.2) LDA transfomation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232
[ https://issues.apache.org/jira/browse/SPARK-27069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

TAESUK KIM updated SPARK-27069:
-------------------------------
    Summary: Spark(2.3.2) LDA transfomation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232
        (was: Spark(2.3.1) LDA transfomation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232)
[jira] [Updated] (SPARK-27069) Spark(2.3.1) LDA transfomation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232
[ https://issues.apache.org/jira/browse/SPARK-27069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

TAESUK KIM updated SPARK-27069:
-------------------------------
    Summary: Spark(2.3.1) LDA transfomation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:1232
        (was: Spark(2.3.1) LDA transfomation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123))
[jira] [Created] (SPARK-27069) Spark(2.3.1) LDA transfomation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
TAESUK KIM created SPARK-27069:
-------------------------------

             Summary: Spark(2.3.1) LDA transfomation memory error(java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
                 Key: SPARK-27069
                 URL: https://issues.apache.org/jira/browse/SPARK-27069
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.2
         Environment: DataSet
                      # Documents: about 100,000,000 --> 10,000,000 --> 1,000,000 (all fail)
                      # Words: about 3,553,918 (can't change)
                      Spark environment
                      # executor-memory, driver-memory: 18g --> 32g --> 64g --> 128g (all fail)
                      # executor-cores, driver-cores: 3
                      # spark.serializer: default and org.apache.spark.serializer.KryoSerializer (both fail)
                      # spark.executor.memoryOverhead: 18G --> 36G (fail)
                      Java version: 1.8.0_191 (Oracle Corporation)
            Reporter: TAESUK KIM


I trained a distributed LDA model (feature dimension: 100, iterations: 100 or 50, ml package) using Spark 2.3.2 (emr-5.18.0). After that I want to transform a new DataSet with that model, but the transform always fails with a memory-related error.

I reduced the data size to 0.1x, then 0.01x of the original, but I still get {{java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)}}.

The hugeCapacity error (overflow) is thrown when the requested array size exceeds Integer.MAX_VALUE - 8. I already shrank the data to a small size, so I can't see why this error still happens.

I also tried switching the serializer to {{KryoSerializer}}, but {{org.apache.spark.util.ClosureCleaner$.ensureSerializable}} always calls {{org.apache.spark.serializer.JavaSerializationStream}}, even though I register Kryo classes.

Is there anything I can do?

Below is the code:
{code}
val countvModel = CountVectorizerModel.load("s3://~/")
val ldaModel = DistributedLDAModel.load("s3://~/")
val transformeddata = countvModel.transform(inputData).select("productid", "itemid", "ptkString", "features")
var featureldaDF = ldaModel.transform(transformeddata).select("productid", "itemid", "topicDistribution", "ptkString").toDF("productid", "itemid", "features", "ptkString")
featureldaDF = featureldaDF.persist // this is line 328
{code}

Other testing:
# Java GC options: UseParallelGC, UseG1GC (all fail)
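A back-of-the-envelope check (a sketch based only on the figures reported above, not a confirmed diagnosis): the distributed LDA model's topics matrix is a dense vocabSize x k matrix of doubles, and Java serialization buffers whatever a task closure captures into a single byte array capped at Integer.MAX_VALUE - 8 bytes. With the vocabulary size and feature dimension given here, that matrix alone already exceeds the cap, which would trigger the hugeCapacity overflow no matter how small the input DataSet is shrunk:
{code}
// Sketch: only vocabSize and k come from this report; the rest is
// illustrative arithmetic.
object TopicsMatrixSize extends App {
  val vocabSize   = 3553918L             // "Word : about 3553918 (can't change)"
  val k           = 100L                 // feature dimension: 100
  val matrixBytes = vocabSize * k * 8L   // dense matrix of 8-byte doubles
  val arrayCap    = Int.MaxValue - 8L    // max size of a single JVM byte array
  println(f"topics matrix ~ ${matrixBytes / 1e9}%.2f GB, byte-array cap ~ ${arrayCap / 1e9}%.2f GB")
  // prints roughly: topics matrix ~ 2.84 GB, byte-array cap ~ 2.15 GB
}
{code}
If that reading is right, reducing the document count cannot help; only a smaller vocabulary (CountVectorizer vocabSize) or a smaller topic count k would shrink what gets serialized.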
Below is the log:
{code}
19/03/05 20:59:03 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError
	at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
	at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
	at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:608)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107)
	at ...
{code}
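On the Kryo question above: spark.serializer selects the serializer for shuffled and cached data, but Spark serializes task closures with a hard-coded JavaSerializer (SparkEnv's closureSerializer), which is why JavaSerializationStream appears in the stack trace even with Kryo classes registered. A minimal sketch of the configuration that was tried, with a placeholder class standing in for the job's own types:
{code}
// Sketch: the Kryo settings below affect data serialization only;
// closure serialization still goes through JavaSerializer.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Placeholder registration for illustration; list the job's real classes here.
  .registerKryoClasses(Array(classOf[org.apache.spark.ml.linalg.DenseVector]))

val spark = SparkSession.builder().config(conf).getOrCreate()
// Any closure shipped to executors is still serialized via the JVM's
// ObjectOutputStream, so its serialized form remains subject to the same
// ~2 GB byte-array cap regardless of these settings.
{code}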