[
https://issues.apache.org/jira/browse/ORC-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655228#comment-17655228
]
Penglei Shi commented on ORC-1343:
----------------------------------
[~deshanxiao] , I set `orc.create.index=false` when writing, left
`orc.row.index.stride` at its default of 10000, and verified with orc-tools
that there is no row index in the ORC file. Without a filter, Spark can read
this file successfully, but when I add a filter, it fails with:
{code:java}
23/01/06 11:04:18 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times;
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID
0) (10.107.103.132 executor driver): java.lang.AssertionError: Index is not
populated for 1
at
org.apache.orc.impl.RecordReaderImpl$SargApplier.pickRowGroups(RecordReaderImpl.java:1128)
at
org.apache.orc.impl.RecordReaderImpl.pickRowGroups(RecordReaderImpl.java:1219)
at
org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1239)
at
org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1291)
at
org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1334)
at
org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
at
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:130)
at
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:185)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
{code}
When filters are pushed down, RecordReaderImpl.pickRowGroups uses the row
index, and if no row index is present it throws:
{code:java}
if (indexes[columnIx] == null) {
  throw new AssertionError("Index is not populated for " + columnIx);
}
{code}
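To illustrate the behavior being described, here is a minimal standalone sketch (not the actual ORC patch; the class name and simplified types are invented for illustration) of a null-tolerant variant of this pruning step, where a missing index means "cannot prune, read every row group" instead of an AssertionError:

{code:java}
import java.util.Arrays;

// Illustrative sketch only: OrcIndexSketch and this pickRowGroups signature
// are not ORC API; they model the null-index case in RecordReaderImpl.
public class OrcIndexSketch {

  // In RecordReaderImpl, a null result conventionally means "read all row groups".
  static boolean[] pickRowGroups(int[][] indexes, int rowGroupCount) {
    for (int[] columnIndex : indexes) {
      if (columnIndex == null) {
        // Index stream absent (e.g. written with orc.create.index=false):
        // fall back to reading everything instead of throwing.
        return null;
      }
    }
    boolean[] include = new boolean[rowGroupCount];
    Arrays.fill(include, true); // real code would evaluate the SearchArgument per group
    return include;
  }

  public static void main(String[] args) {
    int[][] withIndex = { { 0, 1 }, { 2, 3 } };
    int[][] missingIndex = { { 0, 1 }, null };
    System.out.println(pickRowGroups(withIndex, 2) != null);    // index present: prune normally
    System.out.println(pickRowGroups(missingIndex, 2) == null); // index missing: read all
  }
}
{code}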
> Reading ORC files without an index fails with the latest Spark
> ----------------------------------------------------------------
>
> Key: ORC-1343
> URL: https://issues.apache.org/jira/browse/ORC-1343
> Project: ORC
> Issue Type: Bug
> Reporter: Penglei Shi
> Priority: Major
>
> https://issues.apache.org/jira/browse/ORC-1283 fixed the problem that
> ENABLE_INDEXES did not take effect. But without an index, filter pushdown
> fails, which seems to be caused by the code below in RecordReaderImpl.java
> {code:java}
> if (indexes[columnIx] == null) {
>   throw new AssertionError("Index is not populated for " + columnIx);
> }
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)