[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674628#comment-15674628 ]

Ofir Manor commented on SPARK-18475:
------------------------------------

I was just wondering whether it actually works, but it seems you found a way to 
hack it (I thought you would need a different consumer group per worker to 
avoid coordination by the broker, but apparently you don't). 
If it does provide a big perf boost in some cases, and it is not enabled by 
default, I personally don't have any objections.
[~c...@koeninger.org] - I didn't understand your objection. An RDD / dataset 
does not have any inherent order guarantees (same as a SQL result set), and the 
Kafka metadata for each message (including topic, partition, and offset) is 
exposed if someone really cares. If you guys have a smart hack that allows you 
to divide a specific partition into ranges and have different workers read 
different ranges of that partition in parallel, and if it does provide a 
significant perf boost, why not have it as an option? 
I don't think it should break correctness, as the boundaries are decided by 
the driver anyway. 
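
As an illustration, here is a minimal sketch (in Scala) of the kind of 
driver-side splitting described above: one TopicPartition's offset range 
divided into N contiguous sub-ranges, one per Spark task. The names here 
(OffsetRange, splitRange) are illustrative assumptions, not Spark's actual API:

    case class OffsetRange(topic: String, partition: Int,
                           fromOffset: Long, untilOffset: Long)

    def splitRange(r: OffsetRange, parts: Int): Seq[OffsetRange] = {
      val total = r.untilOffset - r.fromOffset
      val step  = math.max(1L, (total + parts - 1) / parts)  // ceiling division
      (r.fromOffset until r.untilOffset by step).map { start =>
        r.copy(fromOffset = start,
               untilOffset = math.min(start + step, r.untilOffset))
      }
    }

For example, splitRange(OffsetRange("t", 0, 0L, 1000L), 4) yields the ranges 
[0, 250), [250, 500), [500, 750) and [750, 1000), which the driver can hand to 
four separate tasks.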

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18475
>                 URL: https://issues.apache.org/jira/browse/SPARK-18475
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> This means that we won't be able to use the "CachedKafkaConsumer" for its 
> intended purpose (being cached) in this use case, but the extra overhead is 
> worth it for handling data skew and increasing parallelism, especially in 
> ETL use cases.
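
To make that consumer trade-off concrete, here is a hedged sketch (mine, not 
Spark's CachedKafkaConsumer) of a per-task reader for one such sub-range, using 
the plain Kafka 0.10 consumer API with assign + seek, so there is no consumer 
group coordination by the broker; readSubRange and the config values are 
illustrative assumptions:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition
    import scala.collection.JavaConverters._

    def readSubRange(bootstrap: String, topic: String, partition: Int,
                     fromOffset: Long, untilOffset: Long): Seq[Array[Byte]] = {
      val props = new Properties()
      props.put("bootstrap.servers", bootstrap)
      props.put("key.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer")
      props.put("value.deserializer",
        "org.apache.kafka.common.serialization.ByteArrayDeserializer")
      props.put("enable.auto.commit", "false")  // offsets are decided by the driver

      val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
      try {
        val tp = new TopicPartition(topic, partition)
        // assign, not subscribe: no broker-side group coordination
        consumer.assign(Collections.singletonList(tp))
        consumer.seek(tp, fromOffset)
        val buf = scala.collection.mutable.ArrayBuffer.empty[Array[Byte]]
        var pos = fromOffset
        // Assumes the driver chose untilOffset <= the current log end offset,
        // so the loop terminates.
        while (pos < untilOffset) {
          for (rec <- consumer.poll(512L).records(tp).asScala) {
            if (rec.offset < untilOffset) buf += rec.value
            pos = rec.offset + 1
          }
        }
        buf
      } finally consumer.close()
    }

Each task pays the cost of opening and closing its own short-lived consumer 
instead of reusing a cached one, which is exactly the overhead the description 
above argues is worth it for skewed partitions.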


