[ https://issues.apache.org/jira/browse/SPARK-18249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15682893#comment-15682893 ]
Damian Momot commented on SPARK-18249:
--------------------------------------

Hi, no, it's a simple case class, exactly this:

{code}
case class ParsedUserAgent(
  user_agent: String,
  parser_version: String,
  client_type: Option[String],
  client_name: Option[String],
  client_version: Option[String],
  operating_system_name: Option[String],
  operating_system_version: Option[String],
  operating_system_platform: Option[String],
  device_type: Option[String],
  device_brand: Option[String],
  device_model: Option[String],
  is_bot: Option[Boolean],
  is_mobile: Option[Boolean],
  is_desktop: Option[Boolean]
)
{code}

The flow itself is a bit more complicated: such structures are read from HDFS, filtered, projected, distinct-collected, inserted into datasets again, unioned with another dataset, then grouped and reduced (roughly the shape of the sketch below). In the meantime there are some repartitions, and caches using StorageLevel.MEMORY_AND_DISK_SER.

When it first happened I tried:
- removing intermediate dataset saves/caches
- removing explicit repartitions
- switching from Kryo serialization to standard Java serialization

None of those solved the problem. I also tried to build an equivalent flow in spark-shell (but on simple structures), and it didn't trigger the bug.
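For illustration, a minimal Scala sketch of a flow with this shape, using the ParsedUserAgent case class above. The input/output paths, the filter condition, the projection, the partition count, and the reduce function are all assumptions standing in for the actual job:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object ParsedUserAgentFlow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParsedUserAgentFlow").getOrCreate()
    import spark.implicits._

    // Hypothetical HDFS paths -- stand-ins for the real inputs.
    val agents = spark.read.parquet("hdfs:///in/parsed_user_agents").as[ParsedUserAgent]
    val other  = spark.read.parquet("hdfs:///in/other_user_agents").as[ParsedUserAgent]

    val combined = agents
      .filter(_.is_bot.contains(false))                    // filtered (condition is made up)
      .map(ua => ua.copy(user_agent = ua.user_agent.trim)) // projected back into a dataset
      .distinct()                                          // distinct rows
      .union(other)                                        // unioned with the other dataset
      .repartition(200)                                    // one of the explicit repartitions
      .persist(StorageLevel.MEMORY_AND_DISK_SER)           // cache level from the report

    // Grouped and reduced: keep one representative row per user_agent.
    val reduced = combined
      .groupByKey(_.user_agent)
      .reduceGroups((a, _) => a)
      .map(_._2)

    // Final write stage -- the point where the reported StackOverflowError surfaces.
    reduced.write.parquet("hdfs:///out/parsed_user_agents")

    spark.stop()
  }
}
{code}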
> StackOverflowError when saving dataset to parquet
> -------------------------------------------------
>
>                 Key: SPARK-18249
>                 URL: https://issues.apache.org/jira/browse/SPARK-18249
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.1, 2.0.2, 2.1.0
>            Reporter: Damian Momot
>
> Not really sure what the exact reproduction path is. It's the first job that was updated from Spark 1.6.2 (DataFrames) to 2.0.1 (Datasets) - both on hadoop.version=2.6.0-cdh5.6.0. It fails on the last stage, which saves data to parquet files (actually small data during the test).
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted.
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:149)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
> at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
> [some user code calls removed]
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.StackOverflowError
> at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getBytes(UnsafeRow.java:593)
> at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeExternal(UnsafeRow.java:661)
> at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1456)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
> [the same four ObjectOutputStream frames repeat for the rest of the trace]
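The repeating defaultWriteFields / writeSerialData / writeOrdinaryObject / writeObject0 frames are how java.io.ObjectOutputStream walks a nested object graph: each nested object costs a fixed set of stack frames, so a graph nested deeply enough overflows the stack regardless of data size. A minimal standalone Scala sketch of that mechanism (the Node class and the depth are illustrative, unrelated to Spark internals):

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Illustrative only: a linked structure nested ~100,000 levels deep.
case class Node(value: Int, next: Option[Node])

object DeepGraphDemo {
  def main(args: Array[String]): Unit = {
    // Build the chain iteratively so construction itself does not recurse.
    val deep = (1 to 100000).foldLeft(Option.empty[Node])((tail, i) => Some(Node(i, tail)))

    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    // Java serialization recurses once per nested object and throws
    // java.lang.StackOverflowError with a default JVM stack size.
    out.writeObject(deep.get)
  }
}
{code}

In the trace above the recursion starts from UnsafeRow.writeExternal, which suggests a similarly deep chain of serialized objects somewhere between the cached rows and the write stage.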