The imbalance was caused by the stuck partition; after tens of hours the
receiving rate went down. But the second ERROR log I mentioned in the first
mail now occurs on most tasks (I didn't count, but it keeps flooding my
terminal) and jeopardizes the job: every batch now takes about 2 minutes of
execution time (2-15 seconds before) and the scheduling delay grows to 2+ hours.

The daunting ERROR log:

15/09/23 14:20:24 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414899 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:24 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414896 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:24 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414898 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:24 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414893 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:24 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414900 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:24 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414895 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:24 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414897 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:24 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414894 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
[Stage 4:> (0 + 3) / 3][Stage 29809:> (0 + 176) / 180]
15/09/23 14:20:54 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414899 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:54 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414896 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:54 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414898 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:54 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414895 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:54 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414893 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:54 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414900 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:54 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414897 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)
15/09/23 14:20:54 ERROR TaskSchedulerImpl: Ignoring update with state
FINISHED for TID 1414894 because its task set is gone (this is likely
the result of receiving duplicate task finished status updates)

The source code where the errors are thrown is statusUpdate(tid: Long,
state: TaskState, serializedData: ByteBuffer)
<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L355>.
Any suggestions on where I should dig in?
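The gist of that branch, as a self-contained toy model (an illustration of
when the error fires, not Spark's actual implementation): the scheduler keeps
a map from task id to its task set; when an update arrives for a TID that is
no longer in the map (a duplicate FINISHED update, or the task set has already
been cleaned up), the update is dropped and the error above is logged.

    import scala.collection.mutable

    // Toy model of the branch in TaskSchedulerImpl.statusUpdate that produces the error.
    object ToyScheduler {
      private val taskIdToTaskSet = mutable.Map[Long, String]()   // tid -> task set id

      def registerTask(tid: Long, taskSet: String): Unit = taskIdToTaskSet(tid) = taskSet

      def statusUpdate(tid: Long, state: String): Unit = taskIdToTaskSet.get(tid) match {
        case Some(taskSet) =>
          if (state == "FINISHED") taskIdToTaskSet.remove(tid)   // first FINISHED update wins
          println(s"task $tid of $taskSet moved to $state")
        case None =>
          // A second FINISHED update for the same tid lands here, as does any update
          // that arrives after the whole task set has been cleaned up.
          println(s"ERROR Ignoring update with state $state for TID $tid because its task set is gone")
      }
    }

    // ToyScheduler.registerTask(1414899L, "29809.0")
    // ToyScheduler.statusUpdate(1414899L, "FINISHED")  // handled normally
    // ToyScheduler.statusUpdate(1414899L, "FINISHED")  // duplicate -> ignored, error logged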

BR,
Todd Leo

On Wed, Sep 23, 2015 at 1:53 PM Tathagata Das <t...@databricks.com> wrote:

> Also, you could switch to the Direct Kafka API, which was first released as
> experimental in 1.3. In 1.5 we graduated it from experimental, but it's
> quite usable in Spark 1.3.1.
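A minimal sketch of what that switch might look like on Spark 1.3+; the broker
list and topic name are placeholders of mine, not values from this thread:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("direct-kafka-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // No receivers: each Kafka partition maps to one partition of each batch's RDD,
    // so the receiver.maxRate / blockInterval tuning discussed below no longer applies.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // placeholder brokers
    val topics      = Set("my-topic")                                             // placeholder topic

    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    directStream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()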
>
> TD
>
> On Tue, Sep 22, 2015 at 7:45 PM, SLiZn Liu <sliznmail...@gmail.com> wrote:
>
>> Cool, we are still sticking with 1.3.1, will upgrade to 1.5 ASAP. Thanks
>> for the tips, Tathagata!
>>
>> On Wed, Sep 23, 2015 at 10:40 AM Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> A lot of these imbalances were solved in Spark 1.5. Could you give that
>>> a spin?
>>>
>>> https://issues.apache.org/jira/browse/SPARK-8882
>>>
>>> On Tue, Sep 22, 2015 at 12:17 AM, SLiZn Liu <sliznmail...@gmail.com>
>>> wrote:
>>>
>>>> Hi spark users,
>>>>
>>>> In our Spark Streaming app with Kafka integration on Mesos, we initialized
>>>> 3 receivers to receive 3 Kafka partitions, but an imbalance in the record
>>>> receiving rate has been observed: with spark.streaming.receiver.maxRate set
>>>> to 120, one receiver sometimes receives very close to the limit while the
>>>> other two receive only roughly fifty records per second.
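For reference, a rough sketch of the kind of receiver-based setup being
described here (including the union mentioned further down); the ZooKeeper
quorum, consumer group and topic name are placeholders, not values from this
thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("receiver-kafka-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second batches, as in this mail

    // One receiver per Kafka partition. Each receiver runs as a long-lived task on
    // some executor, and its blocks are stored on that executor first, which is one
    // reason downstream tasks can concentrate on a single node.
    val numReceivers = 3
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181,zk2:2181", "my-group", Map("my-topic" -> 1),
        StorageLevel.MEMORY_AND_DISK_SER)   // placeholder ZK quorum / group / topic
    }

    // Union the three streams so downstream processing sees a single DStream.
    val unioned = ssc.union(kafkaStreams)
    unioned.count().print()

    ssc.start()
    ssc.awaitTermination()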
>>>>
>>>> This may be caused by an earlier receiver failure, where one of the
>>>> receivers' receiving rate dropped to 0. We restarted the Spark Streaming app,
>>>> and the imbalance began. We suspect that the partition read by the failing
>>>> receiver got jammed, and the other two receivers cannot take over its data.
>>>>
>>>> The 3-node cluster tends to run slowly: nearly all tasks are scheduled on
>>>> the node with the earlier receiver failure (I used union to combine the 3
>>>> receivers' DStreams, so I expected the combined DStream to be well
>>>> distributed across all nodes), batches cannot be guaranteed to finish within
>>>> a single batch interval, stages pile up, and the digested log shows the
>>>> following:
>>>>
>>>> ...
>>>> 5728.399: [GC (Allocation Failure) [PSYoungGen: 
>>>> 6954678K->17088K(6961152K)] 7074614K->138108K(20942336K), 0.0203877 secs] 
>>>> [Times: user=0.20 sys=0.00, real=0.02 secs]
>>>>
>>>> ...
>>>> 15/09/22 13:33:35 ERROR TaskSchedulerImpl: Ignoring update with state 
>>>> FINISHED for TID 77219 because its task set is gone (this is likely the 
>>>> result of receiving duplicate task finished status updates)
>>>>
>>>> ...
>>>>
>>>> These two types of log messages were printed during the execution of some (but not all) stages.
>>>>
>>>> My configurations:
>>>> # of cores on each node: 64
>>>> # of nodes: 3
>>>> batch time is set to 10 seconds
>>>>
>>>> spark.streaming.receiver.maxRate        120
>>>> spark.streaming.blockInterval           160  // set so that 10 seconds 
>>>> divided by the block interval roughly equals the total cores, which is 64, 
>>>> to max out all the nodes: 10s * 1000 / 64 ≈ 160ms
>>>> spark.storage.memoryFraction            0.1  // this one doesn't seem to 
>>>> work, since the young gen / old gen ratio is nearly 0.3 instead of 0.1
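A quick back-of-envelope check of that blockInterval reasoning, using only the
figures quoted above (whether the 64 should be the per-node or total core count
is left as stated in the mail):

    // Plain Scala arithmetic behind the blockInterval choice; no Spark needed.
    val batchMs         = 10 * 1000        // 10-second batch interval
    val coresToFill     = 64               // figure used in the mail above
    val blockIntervalMs = batchMs / coresToFill   // 156 ms, rounded up to 160 in the config
    val blocksPerBatch  = batchMs / 160           // 62 blocks (map tasks) per receiver per batch
    val tasksPerBatch   = 3 * blocksPerBatch      // 186 tasks across the 3 unioned receivers,
                                                  // in the same ballpark as the ~180-task stages above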
>>>>
>>>> Anyone got an idea? I appreciate your patience.
>>>>
>>>> BR,
>>>> Todd Leo
>>>>
>>>
>>>
