[ https://issues.apache.org/jira/browse/SPARK-23225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340813#comment-16340813 ]
Marco Gaido commented on SPARK-23225:
-------------------------------------

I am not able to reproduce this on master. Could you provide sample data to reproduce the problem? It should have been fixed by SPARK-18877, and 2.1.1 should already contain that fix.

> Spark is inferring decimal values with wrong precision
> ------------------------------------------------------
>
>                 Key: SPARK-23225
>                 URL: https://issues.apache.org/jira/browse/SPARK-23225
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1
>            Reporter: Nacho García Fernández
>            Priority: Major
>
> Hi there.
> I'm reading a CSV file with data exported from DB2. The file contains about 1.6M records. Most of the values are Decimal(1,0), but about 200 of them are Decimal(2,0).
> The file looks like:
> {code:java}
> +---------+
> |MY_COLUMN|
> +---------+
> |   +0001.|
> |   +0010.|
> |   +0011.|
> |   +0002.|
> .........
> {code}
> Everything is OK when I read the input file with the following line (I'm not actually calling any Spark action yet):
> {code:java}
> val test = spark.read.option("delimiter", ";").option("inferSchema", "true").option("header", "true").csv("testfile")
> {code}
> After calling a simple action like *test.distinct* or *test.count*, Spark throws the following exception:
> {code:java}
> 2018-01-24 11:01:27 ERROR org.apache.spark.executor.Executor:91 - Exception in task 1.0 in stage 58.0 (TID 6614)
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 2 exceeds max precision 1
>     at scala.Predef$.require(Predef.scala:224)
>     at org.apache.spark.sql.types.Decimal.set(Decimal.scala:113)
>     at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:426)
>     at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:273)
>     at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
>     at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
>     at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
>     at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
>     at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:99)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> The same stack trace is then logged again as a WARN by the TaskSetManager (Lost task 1.0 in stage 58.0, TID 6614, localhost, executor driver), and tasks 2.0 and 3.0 in stage 58.0 (TID 6615 and 6616) fail with the identical trace but a different message: {{requirement failed: Decimal precision 3 exceeds max precision 2}}.
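> Checking the inferred schema before triggering any action already shows the bad precision (a reconstruction of what printSchema reported to me; the exact output text may differ slightly):
> {code:java}
> test.printSchema()
> // root
> //  |-- MY_COLUMN: decimal(1,0) (nullable = true)
> {code}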
> So the issue is that Spark is inferring this column as *DecimalType(1,0)*, and the job fails whenever I try to filter on it or carry out any Spark action.
> I cannot understand why Spark fails to infer this column correctly. Maybe schema inference is based on a sample of the data? That would make no sense, though, if Spark actually reads the input file twice when inferSchema is true.
> Any help is welcome.
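> For now I can work around it by skipping inference and passing an explicit schema (a minimal sketch; MY_COLUMN is the column from the sample above, and DecimalType(2, 0) is an assumption that covers the widest values I have seen):
> {code:java}
> import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
>
> // Declare the decimal wide enough for the 2-digit values up front,
> // instead of letting inference pick precision 1 from the majority.
> val schema = StructType(Seq(StructField("MY_COLUMN", DecimalType(2, 0))))
>
> val test = spark.read
>   .option("delimiter", ";")
>   .option("header", "true")
>   .schema(schema) // with an explicit schema, inferSchema is not needed
>   .csv("testfile")
> {code}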
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org