Subscribe

2021-11-02 Thread XING JIN





subscribe

2020-09-07 Thread Bowen Li



[jira] [Resolved] (SPARK-22589) Subscribe to multiple roles in Mesos

2019-05-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22589.
--
Resolution: Incomplete




[jira] [Updated] (SPARK-22589) Subscribe to multiple roles in Mesos

2019-05-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22589:
-
Labels: bulk-closed  (was: )




subscribe

2019-04-22 Thread Bowen Li



[jira] [Updated] (SPARK-22589) Subscribe to multiple roles in Mesos

2017-11-23 Thread Fabiano Francesconi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabiano Francesconi updated SPARK-22589:

Description: 
Mesos offers the capability of [subscribing to multiple
roles|http://mesos.apache.org/documentation/latest/roles/]. I believe that
Spark could easily be extended to opt in to this specific capability.

From my understanding, this is the [Spark source
code|https://github.com/apache/spark/blob/fc45c2c88a838b8f46659ebad2a8f3a9923bc95f/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L94]
that regulates the subscription to the role. I wonder whether just passing
a comma-separated list of roles (hence, splitting on that string) would
already be sufficient to leverage this capability.

Is there any side-effect that this change will cause?

  was:
Mesos offers the capability of [subscribing to multiple
roles|http://mesos.apache.org/documentation/latest/roles/]. I believe that
Spark could easily be extended to opt in to this specific capability.

From my understanding, this is the [Spark source
code|https://github.com/apache/spark/blob/fc45c2c88a838b8f46659ebad2a8f3a9923bc95f/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L94]
that regulates the subscription to the role. I wonder whether just passing
a comma-separated list of roles (hence, splitting on that string) would
already be sufficient to leverage this capability.





[jira] [Created] (SPARK-22589) Subscribe to multiple roles in Mesos

2017-11-23 Thread Fabiano Francesconi (JIRA)
Fabiano Francesconi created SPARK-22589:
---

 Summary: Subscribe to multiple roles in Mesos
 Key: SPARK-22589
 URL: https://issues.apache.org/jira/browse/SPARK-22589
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Affects Versions: 2.2.0, 2.1.2
Reporter: Fabiano Francesconi


Mesos offers the capability of [subscribing to multiple
roles|http://mesos.apache.org/documentation/latest/roles/]. I believe that
Spark could easily be extended to opt in to this specific capability.

From my understanding, this is the [Spark source
code|https://github.com/apache/spark/blob/fc45c2c88a838b8f46659ebad2a8f3a9923bc95f/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L94]
that regulates the subscription to the role. I wonder whether just passing
a comma-separated list of roles (hence, splitting on that string) would
already be sufficient to leverage this capability.
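
A minimal sketch of the splitting being suggested, assuming a hypothetical helper
that reads a comma-separated spark.mesos.role value (the real MesosSchedulerUtils
logic and the SUBSCRIBE call are more involved):

```scala
// Hypothetical sketch only: derive several Mesos roles from a comma-separated
// "spark.mesos.role" value instead of a single role. Not the actual Spark code.
object MesosRolesSketch {
  def rolesFromConf(conf: Map[String, String]): Seq[String] =
    conf.get("spark.mesos.role")                        // e.g. Some("analytics,batch")
      .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSeq)
      .getOrElse(Seq("*"))                              // "*" is the default Mesos role

  def main(args: Array[String]): Unit = {
    println(rolesFromConf(Map("spark.mesos.role" -> "analytics, batch")))
    // List(analytics, batch): each role would then go into the framework's SUBSCRIBE call
  }
}
```

Whether the scheduler's SUBSCRIBE call and the downstream offer handling accept
several roles cleanly is the open question the ticket raises.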






subscribe

2017-11-21 Thread Shinichiro Abe



[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-17 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971893#comment-15971893
 ] 

Stephane Maarek commented on SPARK-20287:
-

[~c...@koeninger.org] It makes sense. I didn't realize that in the direct streams
the driver is in charge of assigning metadata to the executors to pull data.
Therefore yes, you're right: it's "incompatible" with the Kafka way of being
"master-free", where each consumer really doesn't know and shouldn't care
how many other consumers there are. I think this ticket can now be closed
(just re-open it if you don't believe so). Maybe it'll be worth opening a KIP
on Kafka to add APIs that would allow Spark to be a bit more "optimized", but it
all seems okay for now. Cheers!




[jira] [Closed] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-17 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek closed SPARK-20287.
---
Resolution: Not A Problem




[jira] [Updated] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-13 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-20287:
---

What you're describing is closer to the receiver-based implementation,
which had a number of issues.  What I tried to achieve with the direct
stream implementation was to have the driver figure out offset ranges
for the next batch, then have executors deterministically consume
exactly those messages with a 1:1 mapping between kafka partition and
spark partition.

If you have a single consumer subscribed to multiple topicpartitions,
you'll get intermingled messages for all of those partitions.  With
the new consumer api subscribed to multiple partitions, there isn't a
way to say "get topicpartition A until offset 1234", which is what we
need.
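
A rough sketch of that executor-side pattern (illustrative only, not Spark's
code; broker, topic, group and offsets are made up): assign one partition, seek
to the batch's starting offset, and poll until the planned end offset, which
subscribe() with group management cannot promise.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Illustrative sketch: read exactly [fromOffset, untilOffset) from one
// topic-partition, the way a direct-stream executor task needs to.
object ExactRangeReadSketch {
  def readRange(tp: TopicPartition, fromOffset: Long, untilOffset: Long): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")          // hypothetical broker
    props.put("group.id", "range-read-sketch")
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("enable.auto.commit", "false")

    val consumer = new KafkaConsumer[String, String](props)
    try {
      consumer.assign(java.util.Arrays.asList(tp))   // manual assignment, no group rebalancing
      consumer.seek(tp, fromOffset)
      var next = fromOffset
      while (next < untilOffset) {
        val records = consumer.poll(1000L).records(tp).asScala
        for (r <- records if r.offset() < untilOffset) {
          // process r.value() deterministically for this batch
          next = r.offset() + 1
        }
      }
    } finally consumer.close()
  }
}
```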




-- 
Cody Koeninger





[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-12 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966973#comment-15966973
 ] 

Stephane Maarek commented on SPARK-20287:
-

[~c...@koeninger.org] 
How about using the subscribe pattern?
https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

```
public void subscribe(Collection<String> topics)
Subscribe to the given list of topics to get dynamically assigned partitions. 
Topic subscriptions are not incremental. This list will replace the current 
assignment (if there is one). It is not possible to combine topic subscription 
with group management with manual partition assignment through 
assign(Collection). If the given list of topics is empty, it is treated the 
same as unsubscribe().
```

Then you let Kafka handle the partition assignments? As all the consumers share
the same group.id, the data would effectively be distributed across all the Spark
instances?

But then I guess you may have already explored that option and it goes against 
the Spark DirectStream API? (not a Spark expert, just trying to understand the 
limitations. I believe you when you say you did it the most straightforward way)
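
For reference, a minimal sketch of that group-managed path (hypothetical broker,
topic and group name; with subscribe(), Kafka assigns partitions to whichever
consumers share the group.id):

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

// Sketch of group-managed consumption: every consumer created with the same
// group.id gets a dynamically assigned share of the topic's partitions.
object GroupSubscribeSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")   // hypothetical broker
  props.put("group.id", "spark-executors")           // hypothetical group
  props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(java.util.Arrays.asList("events"))   // partitions assigned by Kafka
  try {
    val records = consumer.poll(1000L).asScala
    records.foreach(r => println(s"${r.partition()}:${r.offset()} -> ${r.value()}"))
  } finally consumer.close()
}
```

The catch, as discussed in this thread, is that with group management the
partitions a given task ends up reading depend on rebalancing, not on the offset
ranges the driver planned for the batch.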




[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966180#comment-15966180
 ] 

Cody Koeninger commented on SPARK-20287:


The issue here is that the underlying new Kafka consumer api doesn't have a way
for a single consumer to subscribe to multiple partitions while only reading a
particular range of messages from one of them.

The max capacity is just a simple way of dealing with what is basically an LRU
cache - if someone creates topics dynamically and then stops sending messages
to them, you don't want to keep leaking resources.

I'm not claiming there's anything great or elegant about those solutions, but
they were pretty much the most straightforward way to make the direct stream
model work with the new kafka consumer api.
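
As an illustration of that bounded-LRU idea (not Spark's actual
CachedKafkaConsumer; the capacity, keys and eviction hook are hypothetical), an
access-ordered map that cleans up whatever it evicts:

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Illustrative sketch of an LRU cache with a max capacity: entries are kept in
// access order and the eldest is evicted (and cleaned up) once the bound is hit.
class BoundedLruCache[K, V](maxCapacity: Int)(onEvict: V => Unit)
    extends JLinkedHashMap[K, V](16, 0.75f, true) {    // accessOrder = true
  override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean = {
    val evict = size() > maxCapacity
    if (evict) onEvict(eldest.getValue)                // e.g. close an idle consumer
    evict
  }
}

object BoundedLruCacheDemo extends App {
  val cache = new BoundedLruCache[String, String](2)(v => println(s"evicted $v"))
  cache.put("topicA-0", "consumerA")
  cache.put("topicB-0", "consumerB")
  cache.put("topicC-0", "consumerC")                   // evicts the least recently used entry
  println(cache.keySet())                              // [topicB-0, topicC-0]
}
```

The real consumer cache differs in detail; this only shows the
capacity-and-eviction idea behind the maxCapacity setting.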




[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-11 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963939#comment-15963939
 ] 

Stephane Maarek commented on SPARK-20287:
-

The other issue I can see is the coordinator work needed to re-coordinate XX
Kafka consumers should one go down. That's more expensive if you have
100 consumers versus a few. But as you said, it should be driven by actual
performance limitations; right now that'd be speculation.




[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-11 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963938#comment-15963938
 ] 

Stephane Maarek commented on SPARK-20287:
-

[~srowen] Those are good points. In the case of 100 separate machines running 100
tasks, I agree you have 100 Kafka consumers no matter what. I guess, as you
said, my optimisation would come when you have tasks on the same machine that
could share a Kafka consumer.
My concern is, as you said, the number of connections opened to Kafka, which might
be high even when not needed. I understand that one Kafka consumer distributing to
multiple tasks may bind them together on the receive side, and I'm not a Spark
expert so I can't measure the implications of that on performance.

My concern then is with the spark.streaming.kafka.consumer.cache.maxCapacity
parameter. Is that truly needed?
Say one executor consumes from 100 partitions: do we really need a
maxCapacity parameter? Shouldn't the executor just spin up as many consumers as
it needs?
Likewise, in a distributed context, can't the individual executors figure out how
many Kafka consumers they need?

Thanks for the discussion, I appreciate it.




[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963920#comment-15963920
 ] 

Sean Owen commented on SPARK-20287:
---

Spark has a different execution model, though, where it does want to distribute
processing of partitions into logically separate (sometimes physically
separate) tasks. It makes sense to consume one Kafka partition as one Spark
partition. If you have 100 workers consuming 100 partitions but on 100
different machines, there's no way to share those, right?

There might be some scope to use a single consumer to consume n Kafka
partitions on behalf of n Spark tasks when they happen to be in one executor.
Does that solve a problem though? You say you think it might be a big overhead,
but can it be? The overhead sounds like more connections than might be needed
otherwise. I could see that being a problem at thousands of tasks.

The flip side is that sharing has its own complexity and, I presume, bottlenecks
that now bind tasks together. This could be problematic, but I haven't thought
through the details.

I think you'd have to make more of a case that it's a problem, and then propose a
solution?




[jira] [Created] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-10 Thread Stephane Maarek (JIRA)
Stephane Maarek created SPARK-20287:
---

 Summary: Kafka Consumer should be able to subscribe to more than 
one topic partition
 Key: SPARK-20287
 URL: https://issues.apache.org/jira/browse/SPARK-20287
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Stephane Maarek


As I understand it, and as it stands, one Kafka consumer is created for each topic
partition in the source Kafka topics, and they're cached.

cf 
https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48

In my opinion, that makes the design an anti-pattern for Kafka and highly
inefficient:
- Each Kafka consumer creates a connection to Kafka
- Spark doesn't leverage the power of the Kafka consumers, which is that Kafka
automatically assigns and balances partitions amongst all the consumers that
share the same group.id
- You can still cache your Kafka consumer even if it has multiple partitions.

I'm not sure how that translates to Spark's underlying RDD
architecture, but from a Kafka standpoint, I believe creating one consumer per
partition is a big overhead, and a risk, as the user may have to increase the
spark.streaming.kafka.consumer.cache.maxCapacity parameter.

Happy to discuss to understand the rationale.
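
For context, a sketch of where that parameter sits when building a direct stream
with the kafka-0-10 integration (broker, topic, group and the chosen capacity
are hypothetical):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Sketch only: raise the consumer-cache bound and create a direct stream.
object DirectStreamCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("direct-stream-sketch")
      .setMaster("local[2]")
      .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128") // hypothetical bound
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",   // hypothetical broker
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "group.id" -> "spark-sketch")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The parameter only matters once an executor caches more consumers than the
bound; raising it is the workaround the description above refers to.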






subscribe to spark issues

2017-03-21 Thread Yash Sharma
subscribe to spark issues


[jira] [Resolved] (SPARK-9556) Make all BlockGenerators subscribe to rate limit updates

2015-08-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9556.
--
   Resolution: Fixed
Fix Version/s: 1.5.0




[jira] [Assigned] (SPARK-9556) Make all BlockGenerators subscribe to rate limit updates

2015-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9556:
---

Assignee: Tathagata Das  (was: Apache Spark)




[jira] [Assigned] (SPARK-9556) Make all BlockGenerators subscribe to rate limit updates

2015-08-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9556:
---

Assignee: Apache Spark  (was: Tathagata Das)




[jira] [Commented] (SPARK-9556) Make all BlockGenerators subscribe to rate limit updates

2015-08-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652875#comment-14652875
 ] 

Apache Spark commented on SPARK-9556:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/7913




[jira] [Created] (SPARK-9556) Make all BlockGenerators subscribe to rate limit updates

2015-08-03 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-9556:


 Summary: Make all BlockGenerators subscribe to rate limit updates
 Key: SPARK-9556
 URL: https://issues.apache.org/jira/browse/SPARK-9556
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das









subscribe

2015-05-01 Thread Jianfeng (Jeff) Zhang

Best Regards,
Jeff Zhang



Subscribe

2014-07-14 Thread Mubarak Seyed