Nihal Jain created PHOENIX-7267:
-----------------------------------
Summary: CsvBulkLoadTool fails for a bad record with "(startline 1) EOF reached before encapsulated token finished"
Key: PHOENIX-7267
URL: https://issues.apache.org/jira/browse/PHOENIX-7267
Project: Phoenix
Issue Type: Bug
Affects Versions: 5.1.3, 5.2.0, 5.3.0
Reporter: Nihal Jain
Assignee: Nihal Jain
We are trying to load data where some files contain a few bad records. These
cause the mappers to fail, and hence the entire job fails, with the following error:
{code:java}
Error: java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: (startline 1) EOF reached before encapsulated token finished
    at org.apache.phoenix.mapreduce.FormatToBytesWritableMapper.map(FormatToBytesWritableMapper.java:206)
    at org.apache.phoenix.mapreduce.FormatToBytesWritableMapper.map(FormatToBytesWritableMapper.java:77)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
Caused by: java.lang.RuntimeException: java.io.IOException: (startline 1) EOF reached before encapsulated token finished
    at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:398)
    at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:407)
    at org.apache.phoenix.thirdparty.com.google.common.collect.Iterators.getNext(Iterators.java:895)
    at org.apache.phoenix.thirdparty.com.google.common.collect.Iterables.getFirst(Iterables.java:827)
    at org.apache.phoenix.mapreduce.CsvToKeyValueMapper$CsvLineParser.parse(CsvToKeyValueMapper.java:109)
    at org.apache.phoenix.mapreduce.CsvToKeyValueMapper$CsvLineParser.parse(CsvToKeyValueMapper.java:91)
    at org.apache.phoenix.mapreduce.FormatToBytesWritableMapper.map(FormatToBytesWritableMapper.java:164)
    ... 9 more
Caused by: java.io.IOException: (startline 1) EOF reached before encapsulated token finished
    at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:282)
    at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
    at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:450)
    at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:395)
    ... 15 more
{code}
I have figured out that commons-csv throws a RuntimeException when it fails to
parse a record, and Phoenix does not handle it, as we only catch IOException.
See
[https://github.com/apache/commons-csv/blob/rel/commons-csv-1.0/src/main/java/org/apache/commons/csv/CSVParser.java#L398]
Also see
[https://github.com/apache/phoenix/blob/master/phoenix-core-server/src/main/java/org/apache/phoenix/mapreduce/FormatToBytesWritableMapper.java#L167]
This is undesirable: in the worst case, the job should skip just the failed
record rather than fail as a whole. Note that we are passing --ignore-errors.
This bug is to fix this behavior.
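
To illustrate the failure mode and the desired behavior, here is a minimal, self-contained sketch. It is not the actual Phoenix or commons-csv code: the class BadRecordSkipSketch, the methods parseStrict, parseLine and load, and the ignoreErrors flag are all hypothetical stand-ins. The toy parser only mimics the wrapping seen in the stack trace (an IOException rethrown inside a RuntimeException), and the loop shows one possible way to skip such a record when errors are to be ignored.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names exist in Phoenix or commons-csv.
class BadRecordSkipSketch {

    // Toy parser: rejects a line with an unbalanced double quote,
    // reusing the error message from the stack trace above.
    static String[] parseStrict(String line) throws IOException {
        long quotes = line.chars().filter(c -> c == '"').count();
        if (quotes % 2 != 0) {
            throw new IOException(
                "(startline 1) EOF reached before encapsulated token finished");
        }
        return line.split(","); // not real CSV parsing, just enough for the demo
    }

    // Mimics the wrapping seen in the trace: the checked IOException
    // surfaces as an unchecked RuntimeException, so a caller that only
    // has catch (IOException e) never sees it.
    static String[] parseLine(String line) {
        try {
            return parseStrict(line);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Assumed desired behavior: with ignoreErrors set, a record whose
    // failure is caused by an IOException is skipped instead of
    // aborting the whole load.
    static List<String[]> load(List<String> lines, boolean ignoreErrors) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            try {
                rows.add(parseLine(line));
            } catch (RuntimeException e) {
                if (ignoreErrors && e.getCause() instanceof IOException) {
                    continue; // skip only the bad record
                }
                throw e; // otherwise keep failing as before
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a,1", "\"broken,2", "c,3");
        System.out.println("rows loaded: " + load(lines, true).size());
    }
}
```

Checking the cause before skipping keeps unrelated runtime failures visible. Whether the real fix should unwrap the exception inside the line parser or catch it in the mapper is left open here.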