[
https://issues.apache.org/jira/browse/ORC-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655228#comment-17655228
]
Penglei Shi commented on ORC-1343:
----------------------------------
[~deshanxiao] , I set `orc.create.index=false` when writing, left
`orc.row.index.stride` at its default of 10000, and verified with orc-tools
that there is no row index in the ORC file. Without a filter, Spark can read
this file successfully, but when I add a filter, it fails with:
{code:java}
23/01/06 11:04:18 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times;
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID
0) (10.107.103.132 executor driver): java.lang.AssertionError: Index is not
populated for 1
at
org.apache.orc.impl.RecordReaderImpl$SargApplier.pickRowGroups(RecordReaderImpl.java:1128)
at
org.apache.orc.impl.RecordReaderImpl.pickRowGroups(RecordReaderImpl.java:1219)
at
org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1239)
at
org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1291)
at
org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1334)
at
org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
at
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:130)
at
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:185)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
{code}
When filters are pushed down, RecordReaderImpl.pickRowGroups uses the row
index, and if no row index is present it throws:
{code:java}
if (indexes[columnIx] == null) {
  throw new AssertionError("Index is not populated for " + columnIx);
}
{code}
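To illustrate the behavior being described, here is a minimal standalone sketch (not the actual ORC patch; the class name and simplified types are invented for illustration) of a null-tolerant variant of this pruning step, where a missing index means "cannot prune, read every row group" instead of an AssertionError:

{code:java}
import java.util.Arrays;

// Illustrative sketch only: OrcIndexSketch and this pickRowGroups signature
// are not ORC API; they model the null-index case in RecordReaderImpl.
public class OrcIndexSketch {

  // In RecordReaderImpl, a null result conventionally means "read all row groups".
  static boolean[] pickRowGroups(int[][] indexes, int rowGroupCount) {
    for (int[] columnIndex : indexes) {
      if (columnIndex == null) {
        // Index stream absent (e.g. written with orc.create.index=false):
        // fall back to reading everything instead of throwing.
        return null;
      }
    }
    boolean[] include = new boolean[rowGroupCount];
    Arrays.fill(include, true); // real code would evaluate the SearchArgument per group
    return include;
  }

  public static void main(String[] args) {
    int[][] withIndex = { { 0, 1 }, { 2, 3 } };
    int[][] missingIndex = { { 0, 1 }, null };
    System.out.println(pickRowGroups(withIndex, 2) != null);    // index present: prune normally
    System.out.println(pickRowGroups(missingIndex, 2) == null); // index missing: read all
  }
}
{code}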
> Reading ORC files without an index fails with the latest Spark
> ----------------------------------------------------------------
>
> Key: ORC-1343
> URL: https://issues.apache.org/jira/browse/ORC-1343
> Project: ORC
> Issue Type: Bug
> Reporter: Penglei Shi
> Priority: Major
>
> https://issues.apache.org/jira/browse/ORC-1283 fixed the problem that
> ENABLE_INDEXES did not take effect. But without an index, filter pushdown
> fails, which seems to be caused by the code below in RecordReaderImpl.java
> {code:java}
> if (indexes[columnIx] == null) {
>   throw new AssertionError("Index is not populated for " + columnIx);
> }
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)