[
https://issues.apache.org/jira/browse/SPARK-17335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15455293#comment-15455293
]
Michal Kielbowicz commented on SPARK-17335:
-------------------------------------------
It turned out I was partly wrong: the problem does not lie in the
StructType.simpleString method itself, since ExpressionEncoder.toString is not
used (simpleString is only called when initializing the schemaString value).
We still do not know why truncation is applied when stringifying the data
structure. Please find our stack trace below:
{code}
java.lang.IllegalArgumentException: Error: : expected at the position 1012 of '[VERY LONG STRUCTTYPE STRING]' but ' ' is found.
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:483)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:484)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:484)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
  at org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
  at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:178)
  at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:220)
  at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:93)
  at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
{code}
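For context, here is a minimal, self-contained sketch of the truncation behaviour we suspect. The object name TruncationDemo, the take-all-but-one logic, and the limit of 25 are our assumptions for illustration, modelled on what a "... N more fields" placeholder implies; this is not Spark's actual code:
{code}
// Hedged sketch, not Spark's actual implementation: when a struct has more
// fields than the configured limit, the extra fields are collapsed into a
// "... N more fields" placeholder. That placeholder is what Hive's
// TypeInfoParser later chokes on.
object TruncationDemo {
  def truncatedString(fields: Seq[String], sep: String, maxFields: Int): String =
    if (fields.length > maxFields) {
      val shown = fields.take(maxFields - 1)
      (shown :+ s"... ${fields.length - shown.length} more fields").mkString(sep)
    } else {
      fields.mkString(sep)
    }

  def main(args: Array[String]): Unit = {
    val fields = (1 to 30).map(i => s"c$i:boolean")
    // 30 fields with a limit of 25: the tail is replaced by "... 6 more fields",
    // producing a type string that is no longer valid input for Hive
    println(s"struct<${truncatedString(fields, ",", 25)}>")
  }
}
{code}
With 30 fields and a limit of 25, the resulting "schema" string ends in `... 6 more fields`, which Hive's TypeInfoParser cannot parse; the space after the dots matches the `' ' is found` error in the stack trace above.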
> Creating Hive table from Spark data
> -----------------------------------
>
> Key: SPARK-17335
> URL: https://issues.apache.org/jira/browse/SPARK-17335
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.0
> Reporter: Michal Kielbowicz
>
> Recently my team started using Spark for analysis of huge JSON objects. Spark
> itself handles it well. The problem starts when we try to create a Hive table
> from it using steps from this part of doc:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
> After running the command `spark.sql("CREATE TABLE x AS (SELECT * FROM y)")` we
> get the following exception (sorry for obfuscating, confidential data):
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.IllegalArgumentException: Error: : expected at the position 993 of
> 'string:struct<a:boolean,b:array<string>,c:boolean,d:struct<e:boolean,f:boolean,[...(few
> others)],z:boolean,... 4 more fields>,[...(rest of valid struct string)]>'
> but ' ' is found.;
> {code}
> It turned out that the exception was raised because of the `... 4 more fields`
> part, which is not a valid representation of the data structure.
> An easy workaround is to set `spark.debug.maxToStringFields` to some large
> value. Nevertheless, it shouldn't be required, and the stringifying process
> should use methods that produce a data-structure string valid for Hive.
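> The workaround above can be sketched as follows (hedged: the exact way the
> property is picked up may vary by Spark version, and the value 100000 is an
> arbitrary illustrative choice):
> {code}
> # Raise the truncation threshold when launching Spark
> spark-shell --conf spark.debug.maxToStringFields=100000
> {code}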
> In my opinion the root problem is here:
> https://github.com/apache/spark/blob/9d7a47406ed538f0005cdc7a62bc6e6f20634815/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L318
> where the `simpleString` method is called instead of `catalogString`. However,
> this class is used in many places and I don't feel experienced enough with
> Spark to submit a PR myself.
> We believe this issue is indirectly caused by this PR:
> https://github.com/apache/spark/pull/13537
> There has been almost the same issue in the past. You can find it here:
> https://issues.apache.org/jira/browse/SPARK-16415
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)