[ https://issues.apache.org/jira/browse/SPARK-17386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frederick Reiss updated SPARK-17386:
------------------------------------
    Description: 
The default trigger interval for a Structured Streaming query is 
{{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
When the trigger is left at this default value, the scheduler in 
{{StreamExecution}} sits in a loop, calling {{getOffset()}} on every {{Source}} 
every 10 ms (the default value of {{STREAMING_POLLING_DELAY}}) until new data 
arrives.
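
To make the behaviour concrete, here is a rough sketch of what the scheduler 
effectively does under the default trigger. This is NOT the actual 
{{StreamExecution}} code; the {{SimpleSource}} trait, the method name 
{{awaitNewData}}, and the hard-coded 10 ms constant are stand-ins for the real 
internals:

{code:scala}
// Rough stand-in for the scheduler's behaviour under the default trigger;
// not the real StreamExecution code, just an illustration of the problem.
object PollingSketch {
  trait SimpleSource {
    def getOffset: Option[Long] // None = "no new data yet"
  }

  def awaitNewData(sources: Seq[SimpleSource], pollingDelayMs: Long = 10L): Unit = {
    var newData = false
    while (!newData) {
      // With ProcessingTime(0) there is no batch interval to wait out,
      // so every source is polled on every pass through this loop.
      newData = sources.exists(_.getOffset.isDefined)
      // 10 ms delay => up to ~100 getOffset() calls per second, per source.
      if (!newData) Thread.sleep(pollingDelayMs)
    }
  }
}
{code}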

In test cases, where most of the sources are {{MemoryStream}} or 
{{TextSocketSource}}, this rapid polling leads to excessive CPU usage.
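
For example, a typical test query looks roughly like the sketch below (the 
names and the trivial pipeline are arbitrary); while the test waits for data or 
runs assertions, the default trigger keeps the scheduler polling the 
{{MemoryStream}}:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().master("local[2]").appName("polling-test").getOrCreate()
import spark.implicits._
implicit val sqlContext = spark.sqlContext

// Typical test setup: an in-memory source feeding a memory sink.
val input = MemoryStream[Int]
val query = input.toDS().toDF("value")
  .writeStream
  .format("memory")
  .queryName("results")
  .start()                      // no trigger specified => ProcessingTime(0)

// While the test sleeps or waits on assertions, the scheduler keeps calling
// input.getOffset() roughly every 10 ms even though no new data has arrived.
input.addData(1, 2, 3)
query.processAllAvailable()
query.stop()
{code}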

In a production environment, this overhead could disrupt critical 
infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} or 
the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
{{FileStreamSource}} performs a directory listing of an HDFS directory. If no 
data has arrived, Spark will list the directory's contents up to 100 times per 
second. This load could degrade service to other systems using HDFS, 
including Spark itself. A similar situation will exist with the Kafka source, 
whose {{getOffset()}} method will presumably call Kafka's 
{{Consumer.poll()}} method.
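
As things stand, one workaround is to give every query an explicit, non-zero 
trigger interval, e.g. something like the hypothetical query below (the HDFS 
paths are made up); without the {{trigger(...)}} line, the default applies and 
the idle-time directory listings resume:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.ProcessingTime

val spark = SparkSession.builder().appName("file-stream").getOrCreate()

// Hypothetical HDFS landing directory; FileStreamSource.getOffset() lists it.
val lines = spark.readStream.format("text").load("hdfs:///data/incoming")

val query = lines.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/out")
  .option("checkpointLocation", "hdfs:///data/checkpoints")
  // Without this line the default ProcessingTime(0) trigger applies and,
  // when idle, the directory is listed roughly every 10 ms (up to ~100/s).
  .trigger(ProcessingTime("1 minute"))
  .start()
{code}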

  was:
The default trigger interval for a Structured Streaming query is 
{{ProcessingTime(0)}}, i.e. "trigger new microbatches as fast as possible". 
When the trigger is set to this default value, the scheduler in 
{{StreamExecution}} will spin in a tight loop calling {{getOffset()}} every 10 
msec (the default value of STREAMING_POLLING_DELAY) on every {{Source}} until 
new data arrives.

In test cases, where most of the sources are {{MemoryStream}} or 
{{TextSocketSource}}, this spinning leads to excessive CPU usage.

In a production environment, this spinning could take down critical 
infrastructure. Most sources in Spark clusters will be {{FileStreamSource}} or 
the not-yet-written Kafka 0.10 Source. The {{getOffset()}} method of 
{{FileStreamSource}} performs a directory listing of an HDFS directory. If no 
data has arrived, Spark will list the directory's contents up to 100 times per 
second. This overhead could disrupt service to other systems using HDFS, 
including Spark itself. A similar situation will exist with the Kafka source, 
the {{getOffset()}} method of which will presumably call Kafka's 
{{Consumer.poll()}} method.


> Default polling and trigger intervals cause excessive RPC calls
> ---------------------------------------------------------------
>
>                 Key: SPARK-17386
>                 URL: https://issues.apache.org/jira/browse/SPARK-17386
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>            Reporter: Frederick Reiss
>            Priority: Minor
>


