sam created SPARK-40048:
---------------------------

             Summary: Partitions are traversed multiple times invalidating Accumulator consistency
                 Key: SPARK-40048
                 URL: https://issues.apache.org/jira/browse/SPARK-40048
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4
            Reporter: sam
We are trying to use Accumulators to count the records in our RDDs without having to force a `.count()` on them, for efficiency reasons. We are aware that tasks can fail and be re-run, which would invalidate the accumulator's value, so we also count the number of times each partition has been traversed in order to detect this.

The problem is that partitions are being traversed multiple times even though:

 - We cache the RDD in memory.
 - No tasks are failing and no executors are dying.
 - There is plenty of memory (no RDD eviction).

The code we use:

```
val count: LongAccumulator                         // registered via sc.longAccumulator
val partitionTraverseCounts: List[LongAccumulator] // one accumulator per partition

def increment(): Unit = count.add(1)

def incrementTimesCalled(partitionIndex: Int): Unit =
  partitionTraverseCounts(partitionIndex).add(1)

// Wrapped around each partition (e.g. via mapPartitionsWithIndex)
def incrementForPartition[T](index: Int, it: Iterator[T]): Iterator[T] = {
  incrementTimesCalled(index)
  it.map { x =>
    increment()
    x
  }
}
```

We have a 50-partition RDD, and we frequently see odd traverse counts:

```
traverseCounts: List(2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2)
```

As you can see, some partitions are traversed twice while others are traversed only once.

To confirm that no tasks are failing:

```
cat job.log | grep -i task | grep -i fail
```

To confirm there are no memory issues:

```
cat job.log | grep -i memory
```

Every matching log line shows multiple GB of free memory. We also don't see any errors or exceptions.

Questions:

1. Why is Spark traversing a cached RDD multiple times?
2. Is there any way to disable this?

Many thanks,

Sam
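For reference, below is a minimal, self-contained sketch of the setup described above, runnable in local mode. The `SparkSession` construction, the accumulator names, the `parallelize` input, the use of `mapPartitionsWithIndex`, and the single `foreach` action are assumptions added purely for illustration; only the accumulator layout and the `incrementForPartition` wrapper come from the report.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.LongAccumulator

object TraverseCountSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode purely for illustration.
    val spark = SparkSession.builder()
      .appName("accumulator-traverse-counts")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    val numPartitions = 50

    // One accumulator for the record count, plus one accumulator per
    // partition recording how many times that partition was traversed.
    val count: LongAccumulator = sc.longAccumulator("recordCount")
    val partitionTraverseCounts: List[LongAccumulator] =
      (0 until numPartitions).map(i => sc.longAccumulator(s"traverse-$i")).toList

    // Equivalent of the incrementForPartition wrapper from the report,
    // written as a function value so it can be passed to mapPartitionsWithIndex.
    val incrementForPartition: (Int, Iterator[Int]) => Iterator[Int] =
      (index, it) => {
        partitionTraverseCounts(index).add(1) // this partition's iterator was opened
        it.map { x =>
          count.add(1)                        // one record seen
          x
        }
      }

    // Wrap every partition so each traversal is recorded, then cache.
    val data    = sc.parallelize(1 to 1000000, numPartitions)
    val counted = data.mapPartitionsWithIndex(incrementForPartition).cache()

    // A single action; ideally each partition is traversed exactly once.
    counted.foreach(_ => ())

    println(s"record count:   ${count.value}")
    println(s"traverseCounts: ${partitionTraverseCounts.map(_.value)}")

    spark.stop()
  }
}
```

The per-partition counters are what let the job detect re-execution of individual partitions without comparing the total against a separately computed `.count()`.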