[jira] [Updated] (KAFKA-2360) The kafka-consumer-perf-test.sh script help information prints useless parameters.

2015-07-27 Thread Bo Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Wang updated KAFKA-2360:
---
Description: 
Run kafka-consumer-perf-test.sh --help to show the help information, but found 3
useless parameters:
--batch-size and --compression-codec
Those are producer parameters.

bin]# ./kafka-consumer-perf-test.sh --help
Missing required argument "[topic]"
Option                         Description
------                         -----------
--batch-size                   Number of messages to write in a
                                 single batch. (default: 200)
--compression-codec            <supported codec: NoCompressionCodec
                                 as 0, GZIPCompressionCodec as 1,
                                 SnappyCompressionCodec as 2,
                                 LZ4CompressionCodec as 3> (default: 0)
--date-format                  The date format to use for formatting
                                 the time field. See java.text.
                                 SimpleDateFormat for options.
                                 (default: yyyy-MM-dd HH:mm:ss:SSS)
--fetch-size                   The amount of data to fetch in a
                                 single request. (default: 1048576)

  was:
Run kafka-consumer-perf-test.sh --help to show the help information, but found two
useless parameters:
--batch-size and --compression-codec
Those are producer parameters.

bin]# ./kafka-consumer-perf-test.sh --help
Missing required argument "[topic]"
Option                         Description
------                         -----------
--batch-size                   Number of messages to write in a
                                 single batch. (default: 200)
--compression-codec            <supported codec: NoCompressionCodec
                                 as 0, GZIPCompressionCodec as 1,
                                 SnappyCompressionCodec as 2,
                                 LZ4CompressionCodec as 3> (default: 0)
--date-format                  The date format to use for formatting
                                 the time field. See java.text.
                                 SimpleDateFormat for options.
                                 (default: yyyy-MM-dd HH:mm:ss:SSS)
--fetch-size                   The amount of data to fetch in a
                                 single request. (default: 1048576)


> The kafka-consumer-perf-test.sh script help information prints useless 
> parameters.
> -
>
> Key: KAFKA-2360
> URL: https://issues.apache.org/jira/browse/KAFKA-2360
> Project: Kafka
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.8.2.1
> Environment: Linux
>Reporter: Bo Wang
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Run kafka-consumer-perf-test.sh --help to show the help information, but found 3
> useless parameters:
> --batch-size and --compression-codec
> Those are producer parameters.
> bin]# ./kafka-consumer-perf-test.sh --help
> Missing required argument "[topic]"
> Option                         Description
> ------                         -----------
> --batch-size                   Number of messages to write in a
>                                  single batch. (default: 200)
> --compression-codec            <supported codec: NoCompressionCodec
>                                  as 0, GZIPCompressionCodec as 1,
>                                  SnappyCompressionCodec as 2,
>                                  LZ4CompressionCodec as 3> (default: 0)
> --date-format                  The date format to use for formatting
>                                  the time field. See java.text.
>                                  SimpleDateFormat for options.
>                                  (default: yyyy-MM-dd HH:mm:ss:SSS)
> --fetch-size                   The amount of data to fetch in a
>                                  single request. (default: 1048576)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2360) The kafka-consumer-perf-test.sh script help information prints useless parameters.

2015-07-27 Thread Bo Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Wang updated KAFKA-2360:
---
Description: 
Run kafka-consumer-perf-test.sh --help to show the help information, but found 3
useless parameters:
--batch-size, --compression-codec and --messages
Those are producer parameters.

bin]# ./kafka-consumer-perf-test.sh --help
Missing required argument "[topic]"
Option                         Description
------                         -----------
--batch-size                   Number of messages to write in a
                                 single batch. (default: 200)
--compression-codec            <supported codec: NoCompressionCodec
                                 as 0, GZIPCompressionCodec as 1,
                                 SnappyCompressionCodec as 2,
                                 LZ4CompressionCodec as 3> (default: 0)
--date-format                  The date format to use for formatting
                                 the time field. See java.text.
                                 SimpleDateFormat for options.
                                 (default: yyyy-MM-dd HH:mm:ss:SSS)
--fetch-size                   The amount of data to fetch in a
                                 single request. (default: 1048576)
--messages                     The number of messages to send or
                                 consume. (default: 9223372036854775807)

  was:
Run kafka-consumer-perf-test.sh --help to show the help information, but found 3
useless parameters:
--batch-size and --compression-codec
Those are producer parameters.

bin]# ./kafka-consumer-perf-test.sh --help
Missing required argument "[topic]"
Option                         Description
------                         -----------
--batch-size                   Number of messages to write in a
                                 single batch. (default: 200)
--compression-codec            <supported codec: NoCompressionCodec
                                 as 0, GZIPCompressionCodec as 1,
                                 SnappyCompressionCodec as 2,
                                 LZ4CompressionCodec as 3> (default: 0)
--date-format                  The date format to use for formatting
                                 the time field. See java.text.
                                 SimpleDateFormat for options.
                                 (default: yyyy-MM-dd HH:mm:ss:SSS)
--fetch-size                   The amount of data to fetch in a
                                 single request. (default: 1048576)


> The kafka-consumer-perf-test.sh script help information prints useless 
> parameters.
> -
>
> Key: KAFKA-2360
> URL: https://issues.apache.org/jira/browse/KAFKA-2360
> Project: Kafka
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.8.2.1
> Environment: Linux
>Reporter: Bo Wang
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Run kafka-consumer-perf-test.sh --help to show the help information, but found 3
> useless parameters:
> --batch-size, --compression-codec and --messages
> Those are producer parameters.
> bin]# ./kafka-consumer-perf-test.sh --help
> Missing required argument "[topic]"
> Option                         Description
> ------                         -----------
> --batch-size                   Number of messages to write in a
>                                  single batch. (default: 200)
> --compression-codec            <supported codec: NoCompressionCodec
>                                  as 0, GZIPCompressionCodec as 1,
>                                  SnappyCompressionCodec as 2,
>                                  LZ4CompressionCodec as 3> (default: 0)
> --date-format                  The date format to use for formatting
>                                  the time field. See java.text.
>                                  SimpleDateFormat for options.
>                                  (default: yyyy-MM-dd HH:mm:ss:SSS)
> --fetch-size                   The amount of data to fetch in a
>                                  single request. (default: 1048576)
> --messages                     The number of messages to send or
>                                  consume. (default: 9223372036854775807)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
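
For context on where these options come from: Kafka's command-line tools
declare their flags with the joptsimple library, and --help prints every
option registered on the parser, whether or not the tool ever reads it. The
sketch below is illustrative Java, not the actual (Scala) ConsumerPerformance
code; option names and defaults are copied from the help output quoted above.

    import java.io.IOException;
    import joptsimple.OptionParser;

    // Illustrative sketch only, not the real kafka.tools.ConsumerPerformance.
    public class ConsumerPerfOptionsSketch {
        public static void main(String[] args) throws IOException {
            OptionParser parser = new OptionParser();
            parser.accepts("topic", "REQUIRED: The topic to consume from.")
                  .withRequiredArg().ofType(String.class);
            parser.accepts("fetch-size", "The amount of data to fetch in a single request.")
                  .withRequiredArg().ofType(Integer.class).defaultsTo(1048576);
            parser.accepts("date-format", "The date format to use for formatting the time field.")
                  .withRequiredArg().ofType(String.class).defaultsTo("yyyy-MM-dd HH:mm:ss:SSS");
            // The fix amounts to not registering producer-only options such as
            // --batch-size and --compression-codec here, so that --help stops
            // advertising parameters the consumer path never reads.
            parser.printHelpOn(System.out);
        }
    }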


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Gwen Shapira
Hi,

Since we will be discussing KIP-28 in the call tomorrow, can you
update the KIP with the feature comparison with existing solutions?
I admit that I do not see a need for single-event-producer-consumer
pair (AKA Flume Interceptor). I've seen tons of people implement such
apps in the last year, and it seemed easy. Now, perhaps we were doing
it all wrong... but I'd like to know how :)

If we are talking about a bigger story (i.e. DSL, real
stream-processing, etc), that's a different discussion. I've seen a
bunch of misconceptions about SparkStreaming in this discussion, and I
have some thoughts in that regard, but I'd rather not go into that if
that's outside the scope of this KIP.

Gwen


On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang  wrote:
> Hi Ewen,
>
> Replies inlined.
>
> On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava 
> wrote:
>
>> Just some notes on the KIP doc itself:
>>
>> * It'd be useful to clarify at what point the plain consumer + custom code
>> + producer breaks down. I think trivial filtering and aggregation on a
>> single stream usually work fine with this model. Anything where you need
>> more complex joins, windowing, etc. are where it breaks down. I think most
>> interesting applications require that functionality, but it's helpful to
>> make this really clear in the motivation -- right now, Kafka only provides
>> the lowest level plumbing for stream processing applications, so most
>> interesting apps require very heavyweight frameworks.
>>
>
> I think for users to efficiently express complex logic like joins,
> windowing, etc., a higher-level programming interface beyond the process()
> interface would definitely be better, but that does not necessarily require
> a "heavyweight" framework, which usually includes more than just the
> high-level functional programming model. I would argue that a better
> alternative for users who want some high-level programming interface, but
> not a heavyweight stream processing framework, would be the processor
> library plus another DSL layer on top of it.
>
>
>
>> * I think the feature comparison of plain producer/consumer, stream
>> processing frameworks, and this new library is a good start, but we might
>> want something more thorough and structured, like a feature matrix. Right
>> now it's hard to figure out exactly how they relate to each other.
>>
>
> Cool, I can do that.
>
>
>> * I'd personally push the library vs. framework story very strongly -- the
>> total buy-in and weak integration story of stream processing frameworks is
>> a big downside and makes a library a really compelling (and currently
>> unavailable, as far as I am aware) alternative.
>>
>
> Are you suggesting there is still some content missing about the
> motivations for adding the proposed library in the wiki page?
>
>
>> * Comment about in-memory storage of other frameworks is interesting -- it
>> is specific to the framework, but is supposed to also give performance
>> benefits. The high-level functional processing interface would allow for
>> combining multiple operations when there's no shuffle, but when there is a
>> shuffle, we'll always be writing to Kafka, right? Spark (and presumably
>> spark streaming) is supposed to get a big win by handling shuffles such
>> that the data just stays in cache and never actually hits disk, or at least
>> hits disk in the background. Will we take a hit because we always write to
>> Kafka?
>>
>
> I agree with Neha's comments here. One more point I want to make is that
> materializing to Kafka is not necessarily much worse than keeping data in
> memory if the downstream consumption is caught up, such that most of the
> reads will be hitting the file cache. I remember Samza has illustrated that
> under such scenarios its throughput is actually quite comparable to Spark
> Streaming / Storm.
>
>
>> * I really struggled with the structure of the KIP template with Copycat
>> because the flow doesn't work well for proposals like this. They aren't as
>> concrete changes as the KIP template was designed for. I'd completely
>> ignore that template in favor of optimizing for clarity if I were you.
>>
>> -Ewen
>>
>> On Thu, Jul 23, 2015 at 5:59 PM, Guozhang Wang  wrote:
>>
>> > Hi all,
>> >
>> > I just posted KIP-28: Add a transform client for data processing
>> > <
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+transform+client+for+data+processing
>> > >
>> > .
>> >
>> > The wiki page does not yet have the full design / implementation details,
>> > and this email is to kick-off the conversation on whether we should add
>> > this new client with the described motivations, and if yes what features
>> /
>> > functionalities should be included.
>> >
>> > Looking forward to your feedback!
>> >
>> > -- Guozhang
>> >
>>
>>
>>
>> --
>> Thanks,
>> Ewen
>>
>
>
>
> --
> -- Guozhang
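
To make the process() interface being discussed here concrete, a minimal
low-level processor client might look like the sketch below. The names are
hypothetical (the KIP's actual interfaces had not been published at this
point in the thread); the point is the shape: the user supplies callbacks,
and consumption, production, and offset management stay behind the context.

    // Hypothetical shape of a processor client; all names are invented here.
    interface ProcessorContext {
        <K, V> void forward(K key, V value); // hand a result downstream
        void commit();                       // commit offsets consumed so far
    }

    interface Processor<K, V> {
        void init(ProcessorContext context); // called once, before any records
        void process(K key, V value);        // called for each consumed record
        void punctuate(long timestampMs);    // periodic hook, e.g. window flush
        void close();                        // release any local state
    }

Trivial filtering is a one-method process() implementation; it is joins,
windowing, and partitioned local state that currently push users toward the
heavyweight frameworks discussed above.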


[jira] [Commented] (KAFKA-2026) Logging of unused options always shows null for the value and is misleading if the option is used by serializers

2015-07-27 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643820#comment-14643820
 ] 

Xuan Gong commented on KAFKA-2026:
--

I think that the warning messages here are used to remind users that they
did not specify some configurations. So, instead of simply changing the
value, we could rephrase the warning messages to say something like "This
configuration is not specified explicitly; the default value will be used", etc.

> Logging of unused options always shows null for the value and is misleading 
> if the option is used by serializers
> 
>
> Key: KAFKA-2026
> URL: https://issues.apache.org/jira/browse/KAFKA-2026
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 0.8.2.1
>Reporter: Ewen Cheslack-Postava
>Assignee: Manikumar Reddy
>Priority: Trivial
> Fix For: 0.8.3
>
> Attachments: KAFKA-2026.patch
>
>
> This is a really simple issue. When AbstractConfig logs unused messages, it 
> gets the value from the parsed configs. Since those are generated from the 
> ConfigDef, the value will not have been parsed or copied over from the 
> original map. This is especially confusing if you've explicitly set an option 
> to pass through to the serializers since you're always going to see these 
> warnings in your log.
> The simplest patch would grab the original value from this.originals. But now 
> I'm not sure logging this makes sense at all anymore since configuring any 
> serializer that has options that aren't in ProducerConfig will create a 
> misleading warning message. Further, using AbstractConfig for your serializer 
> implementation would cause all the producer's config settings to be logged as 
> unused. Since a single set of properties is being used to configure multiple 
> components, trying to log unused keys may not make sense anymore.
> Example of confusion caused by this: 
> http://mail-archives.apache.org/mod_mbox/kafka-users/201503.mbox/%3CCAPAVcJ8nwSVjia3%2BH893V%2B87StST6r0xN4O2ac8Es2bEXjv1OA%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
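
A stripped-down sketch of the bug and of the one-line fix proposed in the
description, with simplified stand-ins for AbstractConfig's internal fields
(the real class keeps more state than this):

    import java.util.Map;
    import java.util.Set;

    class ConfigSketch {
        private final Map<String, Object> originals; // exactly what the user passed in
        private final Map<String, Object> parsed;    // only keys defined in the ConfigDef
        private final Set<String> used;              // keys actually read so far

        ConfigSketch(Map<String, Object> originals, Map<String, Object> parsed,
                     Set<String> used) {
            this.originals = originals;
            this.parsed = parsed;
            this.used = used;
        }

        void logUnused() {
            for (String key : originals.keySet()) {
                if (used.contains(key))
                    continue;
                // Buggy version: parsed.get(key). Unused keys are usually absent
                // from the ConfigDef, so that lookup is null and the warning
                // prints "key = null". The suggested fix reads this.originals:
                System.out.printf(
                    "WARN The configuration %s = %s was supplied but isn't a known config.%n",
                    key, originals.get(key));
            }
        }
    }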


Re: Review Request 36871: Patch for KAFKA-2381

2015-07-27 Thread Ashish Singh

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36871/
---

(Updated July 28, 2015, 4:56 a.m.)


Review request for kafka.


Bugs: KAFKA-2381
https://issues.apache.org/jira/browse/KAFKA-2381


Repository: kafka


Description (updated)
---

Add a more specific unit test


KAFKA-2381: Possible ConcurrentModificationException while unsubscribing from a 
topic in new consumer


Diffs (updated)
-

  
clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java
 4d9a425201115a66b457b58d670992b279091f5a 
  
clients/src/test/java/org/apache/kafka/clients/consumer/internals/SubscriptionStateTest.java
 319751c374ccdc7e7d7d74bcd01bc279b1bdb26e 
  core/src/test/scala/integration/kafka/api/ConsumerTest.scala 
3eb5f95731a3f06f662b334ab2b3d0ad7fa9e1ca 

Diff: https://reviews.apache.org/r/36871/diff/


Testing
---


Thanks,

Ashish Singh



[jira] [Commented] (KAFKA-2381) Possible ConcurrentModificationException while unsubscribing from a topic in new consumer

2015-07-27 Thread Ashish K Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643843#comment-14643843
 ] 

Ashish K Singh commented on KAFKA-2381:
---

Updated reviewboard https://reviews.apache.org/r/36871/
 against branch trunk

> Possible ConcurrentModificationException while unsubscribing from a topic in 
> new consumer
> -
>
> Key: KAFKA-2381
> URL: https://issues.apache.org/jira/browse/KAFKA-2381
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Ashish K Singh
>Assignee: Ashish K Singh
> Attachments: KAFKA-2381.patch, KAFKA-2381_2015-07-27_17:56:00.patch, 
> KAFKA-2381_2015-07-27_21:56:06.patch
>
>
> Possible ConcurrentModificationException while unsubscribing from a topic in 
> new consumer. Attempt is made to modify AssignedPartitions while looping over 
> it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
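
The failure mode is the classic fail-fast iterator problem: removing
elements from a collection inside its own for-each loop. A self-contained
sketch, with the partition type reduced to a plain String for brevity:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.Set;

    public class UnsubscribeSketch {
        // Stand-in for the consumer's assigned partitions, e.g. "topic-0".
        private final Set<String> assignedPartitions = new HashSet<>();

        // Buggy: mutating the set inside its own for-each loop makes the
        // fail-fast iterator throw ConcurrentModificationException.
        void unsubscribeBuggy(String topic) {
            for (String tp : assignedPartitions)
                if (tp.startsWith(topic + "-"))
                    assignedPartitions.remove(tp);
        }

        // Fixed: iterate over a snapshot, then mutate the original set freely.
        void unsubscribeFixed(String topic) {
            for (String tp : new ArrayList<>(assignedPartitions))
                if (tp.startsWith(topic + "-"))
                    assignedPartitions.remove(tp);
        }
    }

Iterator.remove() would work as well; copying first is the smaller change
when the removal happens indirectly, as in unsubscribe.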


[jira] [Updated] (KAFKA-2381) Possible ConcurrentModificationException while unsubscribing from a topic in new consumer

2015-07-27 Thread Ashish K Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish K Singh updated KAFKA-2381:
--
Attachment: KAFKA-2381_2015-07-27_21:56:06.patch

> Possible ConcurrentModificationException while unsubscribing from a topic in 
> new consumer
> -
>
> Key: KAFKA-2381
> URL: https://issues.apache.org/jira/browse/KAFKA-2381
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Ashish K Singh
>Assignee: Ashish K Singh
> Attachments: KAFKA-2381.patch, KAFKA-2381_2015-07-27_17:56:00.patch, 
> KAFKA-2381_2015-07-27_21:56:06.patch
>
>
> Possible ConcurrentModificationException while unsubscribing from a topic in 
> new consumer. Attempt is made to modify AssignedPartitions while looping over 
> it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 36871: Patch for KAFKA-2381

2015-07-27 Thread Ashish Singh

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36871/
---

(Updated July 28, 2015, 4:59 a.m.)


Review request for kafka.


Bugs: KAFKA-2381
https://issues.apache.org/jira/browse/KAFKA-2381


Repository: kafka


Description (updated)
---

Move closing of consumer to finally


Add a more specific unit test


KAFKA-2381: Possible ConcurrentModificationException while unsubscribing from a 
topic in new consumer


Diffs (updated)
-

  
clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java
 4d9a425201115a66b457b58d670992b279091f5a 
  
clients/src/test/java/org/apache/kafka/clients/consumer/internals/SubscriptionStateTest.java
 319751c374ccdc7e7d7d74bcd01bc279b1bdb26e 
  core/src/test/scala/integration/kafka/api/ConsumerTest.scala 
3eb5f95731a3f06f662b334ab2b3d0ad7fa9e1ca 

Diff: https://reviews.apache.org/r/36871/diff/


Testing
---


Thanks,

Ashish Singh



[jira] [Commented] (KAFKA-2381) Possible ConcurrentModificationException while unsubscribing from a topic in new consumer

2015-07-27 Thread Ashish K Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643847#comment-14643847
 ] 

Ashish K Singh commented on KAFKA-2381:
---

Updated reviewboard https://reviews.apache.org/r/36871/
 against branch trunk

> Possible ConcurrentModificationException while unsubscribing from a topic in 
> new consumer
> -
>
> Key: KAFKA-2381
> URL: https://issues.apache.org/jira/browse/KAFKA-2381
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Ashish K Singh
>Assignee: Ashish K Singh
> Attachments: KAFKA-2381.patch, KAFKA-2381_2015-07-27_17:56:00.patch, 
> KAFKA-2381_2015-07-27_21:56:06.patch, KAFKA-2381_2015-07-27_21:59:41.patch
>
>
> Possible ConcurrentModificationException while unsubscribing from a topic in 
> new consumer. Attempt is made to modify AssignedPartitions while looping over 
> it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 36871: Patch for KAFKA-2381

2015-07-27 Thread Ashish Singh


> On July 28, 2015, 1:15 a.m., Aditya Auradkar wrote:
> > core/src/test/scala/integration/kafka/api/ConsumerTest.scala, line 233
> > 
> >
> > consider closing this in a finally. A failing test can cause incorrect 
> > tear down of the test

Done. Thanks for the review.


- Ashish
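
The requested pattern is a plain try/finally around the resource under test,
so that a thrown assertion cannot skip the cleanup. A generic Java sketch
(names are placeholders; the actual change is in the Scala test
core/src/test/scala/integration/kafka/api/ConsumerTest.scala):

    import java.io.Closeable;
    import java.io.IOException;

    class ConsumerTestSketch {
        void runWithCleanup(Closeable consumer) throws IOException {
            try {
                // ... exercise the consumer; any assertion here may throw ...
            } finally {
                // Runs whether the body passed or threw, so a failing test no
                // longer leaks the consumer into subsequent tests.
                consumer.close();
            }
        }
    }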


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36871/#review93215
---


On July 28, 2015, 4:59 a.m., Ashish Singh wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36871/
> ---
> 
> (Updated July 28, 2015, 4:59 a.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-2381
> https://issues.apache.org/jira/browse/KAFKA-2381
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> Move closing of consumer to finally
> 
> 
> Add a more specific unit test
> 
> 
> KAFKA-2381: Possible ConcurrentModificationException while unsubscribing from 
> a topic in new consumer
> 
> 
> Diffs
> -
> 
>   
> clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java
>  4d9a425201115a66b457b58d670992b279091f5a 
>   
> clients/src/test/java/org/apache/kafka/clients/consumer/internals/SubscriptionStateTest.java
>  319751c374ccdc7e7d7d74bcd01bc279b1bdb26e 
>   core/src/test/scala/integration/kafka/api/ConsumerTest.scala 
> 3eb5f95731a3f06f662b334ab2b3d0ad7fa9e1ca 
> 
> Diff: https://reviews.apache.org/r/36871/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Ashish Singh
> 
>



Re: Review Request 36871: Patch for KAFKA-2381

2015-07-27 Thread Ashish Singh


> On July 28, 2015, 1:11 a.m., Jason Gustafson wrote:
> > Ouch. Hard to believe this wasn't caught yet.

It is. Thanks for the review. Addressed your concern.


- Ashish


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36871/#review93213
---


On July 28, 2015, 4:59 a.m., Ashish Singh wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36871/
> ---
> 
> (Updated July 28, 2015, 4:59 a.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-2381
> https://issues.apache.org/jira/browse/KAFKA-2381
> 
> 
> Repository: kafka
> 
> 
> Description
> ---
> 
> Move closing of consumer to finally
> 
> 
> Add a more specific unit test
> 
> 
> KAFKA-2381: Possible ConcurrentModificationException while unsubscribing from 
> a topic in new consumer
> 
> 
> Diffs
> -
> 
>   
> clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java
>  4d9a425201115a66b457b58d670992b279091f5a 
>   
> clients/src/test/java/org/apache/kafka/clients/consumer/internals/SubscriptionStateTest.java
>  319751c374ccdc7e7d7d74bcd01bc279b1bdb26e 
>   core/src/test/scala/integration/kafka/api/ConsumerTest.scala 
> 3eb5f95731a3f06f662b334ab2b3d0ad7fa9e1ca 
> 
> Diff: https://reviews.apache.org/r/36871/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Ashish Singh
> 
>



[jira] [Updated] (KAFKA-2381) Possible ConcurrentModificationException while unsubscribing from a topic in new consumer

2015-07-27 Thread Ashish K Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish K Singh updated KAFKA-2381:
--
Attachment: KAFKA-2381_2015-07-27_21:59:41.patch

> Possible ConcurrentModificationException while unsubscribing from a topic in 
> new consumer
> -
>
> Key: KAFKA-2381
> URL: https://issues.apache.org/jira/browse/KAFKA-2381
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Ashish K Singh
>Assignee: Ashish K Singh
> Attachments: KAFKA-2381.patch, KAFKA-2381_2015-07-27_17:56:00.patch, 
> KAFKA-2381_2015-07-27_21:56:06.patch, KAFKA-2381_2015-07-27_21:59:41.patch
>
>
> Possible ConcurrentModificationException while unsubscribing from a topic in 
> new consumer. Attempt is made to modify AssignedPartitions while looping over 
> it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Aditya Auradkar
+1 on comparison with existing solutions. On a high level, it seems nice to
have a transform library inside Kafka: a lot of the building blocks are
already there to build a stream processing framework. However, the details
are tricky to get right, and I think this discussion will get a lot more
interesting when we have something concrete to look at. I'm +1 for the
general idea.
How far away are we from having a prototype patch to play with?

Couple of observations:
- Since the input source for each processor is always Kafka, you get basic
client-side partition management out of the box if it uses the high-level
consumer.
- The KIP states that cmd line tools will be provided to deploy as a
separate service. Is the proposed scope limited to providing a library
which makes it possible to build stream-processing-as-a-service, or to
provide such a service within Kafka itself?

Aditya

On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira  wrote:

> Hi,
>
> Since we will be discussing KIP-28 in the call tomorrow, can you
> update the KIP with the feature-comparison with  existing solutions?
> I admit that I do not see a need for single-event-producer-consumer
> pair (AKA Flume Interceptor). I've seen tons of people implement such
> apps in the last year, and it seemed easy. Now, perhaps we were doing
> it all wrong... but I'd like to know how :)
>
> If we are talking about a bigger story (i.e. DSL, real
> stream-processing, etc), thats a different discussion. I've seen a
> bunch of misconceptions about SparkStreaming in this discussion, and I
> have some thoughts in that regard, but I'd rather not go into that if
> thats outside the scope of this KIP.
>
> Gwen
>
>
> On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang  wrote:
> > Hi Ewen,
> >
> > Replies inlined.
> >
> > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <
> e...@confluent.io>
> > wrote:
> >
> >> Just some notes on the KIP doc itself:
> >>
> >> * It'd be useful to clarify at what point the plain consumer + custom
> code
> >> + producer breaks down. I think trivial filtering and aggregation on a
> >> single stream usually work fine with this model. Anything where you need
> >> more complex joins, windowing, etc. are where it breaks down. I think
> most
> >> interesting applications require that functionality, but it's helpful to
> >> make this really clear in the motivation -- right now, Kafka only
> provides
> >> the lowest level plumbing for stream processing applications, so most
> >> interesting apps require very heavyweight frameworks.
> >>
> >
> > I think for users to efficiently express complex logic like joins
> > windowing, etc, a higher-level programming interface beyond the process()
> > interface would definitely be better, but that does not necessarily
> require
> > a "heavyweight" frameworks, which usually includes more than just the
> > high-level functional programming model. I would argue that an
> alternative
> > solution would better be provided for users who want some high-level
> > programming interface but not a heavyweight stream processing framework
> > that include the processor library plus another DSL layer on top of it.
> >
> >
> >
> >> * I think the feature comparison of plain producer/consumer, stream
> >> processing frameworks, and this new library is a good start, but we
> might
> >> want something more thorough and structured, like a feature matrix.
> Right
> >> now it's hard to figure out exactly how they relate to each other.
> >>
> >
> > Cool, I can do that.
> >
> >
> >> * I'd personally push the library vs. framework story very strongly --
> the
> >> total buy-in and weak integration story of stream processing frameworks
> is
> >> a big downside and makes a library a really compelling (and currently
> >> unavailable, as far as I am aware) alternative.
> >>
> >
> > Are you suggesting there are still some content missing about the
> > motivations of adding the proposed library in the wiki page?
> >
> >
> >> * Comment about in-memory storage of other frameworks is interesting --
> it
> >> is specific to the framework, but is supposed to also give performance
> >> benefits. The high-level functional processing interface would allow for
> >> combining multiple operations when there's no shuffle, but when there
> is a
> >> shuffle, we'll always be writing to Kafka, right? Spark (and presumably
> >> spark streaming) is supposed to get a big win by handling shuffles such
> >> that the data just stays in cache and never actually hits disk, or at
> least
> >> hits disk in the background. Will we take a hit because we always write
> to
> >> Kafka?
> >>
> >
> > I agree with Neha's comments here. One more point I want to make is
> > materializing to Kafka is not necessarily much worse than keeping data in
> > memory if the downstream consumption is caught up such that most of the
> > reads will be hitting file cache. I remember Samza has illustrated that
> > under such scenarios its throughput is actually quite comparable to Spark

Number of kafka topics/partitions supported per cluster of n nodes

2015-07-27 Thread Prabhjot Bharaj
Hi,

I'm looking for a benchmark which explains how many topics and partitions
in total can be created in a cluster of n nodes, given that the message size
varies between x and y bytes, how the limit varies with heap size, and how
it affects overall system performance.

e.g. the result should look like: t topics with p partitions each can be
supported in a cluster of n nodes with a heap size of h MB, before the
cluster sees things like JVM crashes, high memory usage, or general system
slowdown.

I think such benchmarks must exist so that we can make better decisions on
the ops side. If these details don't exist, I'll run this test myself,
varying the values of the parameters described above, and I'd be happy to
share the numbers with the community.

Thanks,
prabcs


Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Neha Narkhede
Gwen,

We have a compilation of notes from comparison with other systems. They
might be missing details that folks who worked on that system might be able
to point out. We can share that and discuss further on the KIP call.

We do hope to include a DSL since that is the most natural way of
expressing stream processing operations on top of the processor client. The
DSL layer should be equivalent to that provided by Spark streaming or Flink
in terms of expressiveness though there will be differences in
implementation. Our client is intended to be simpler, with minimum external
dependencies since it integrates closely with Kafka. This is really what
most application development is hoping to get - a lightweight library on
top of Kafka that allows them to process streams of data.

Thanks
Neha

On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira  wrote:

> Hi,
>
> Since we will be discussing KIP-28 in the call tomorrow, can you
> update the KIP with the feature-comparison with  existing solutions?
> I admit that I do not see a need for single-event-producer-consumer
> pair (AKA Flume Interceptor). I've seen tons of people implement such
> apps in the last year, and it seemed easy. Now, perhaps we were doing
> it all wrong... but I'd like to know how :)
>
> If we are talking about a bigger story (i.e. DSL, real
> stream-processing, etc), thats a different discussion. I've seen a
> bunch of misconceptions about SparkStreaming in this discussion, and I
> have some thoughts in that regard, but I'd rather not go into that if
> thats outside the scope of this KIP.
>
> Gwen
>
>
> On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang  wrote:
> > Hi Ewen,
> >
> > Replies inlined.
> >
> > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <
> e...@confluent.io>
> > wrote:
> >
> >> Just some notes on the KIP doc itself:
> >>
> >> * It'd be useful to clarify at what point the plain consumer + custom
> code
> >> + producer breaks down. I think trivial filtering and aggregation on a
> >> single stream usually work fine with this model. Anything where you need
> >> more complex joins, windowing, etc. are where it breaks down. I think
> most
> >> interesting applications require that functionality, but it's helpful to
> >> make this really clear in the motivation -- right now, Kafka only
> provides
> >> the lowest level plumbing for stream processing applications, so most
> >> interesting apps require very heavyweight frameworks.
> >>
> >
> > I think for users to efficiently express complex logic like joins
> > windowing, etc, a higher-level programming interface beyond the process()
> > interface would definitely be better, but that does not necessarily
> require
> > a "heavyweight" frameworks, which usually includes more than just the
> > high-level functional programming model. I would argue that an
> alternative
> > solution would better be provided for users who want some high-level
> > programming interface but not a heavyweight stream processing framework
> > that include the processor library plus another DSL layer on top of it.
> >
> >
> >
> >> * I think the feature comparison of plain producer/consumer, stream
> >> processing frameworks, and this new library is a good start, but we
> might
> >> want something more thorough and structured, like a feature matrix.
> Right
> >> now it's hard to figure out exactly how they relate to each other.
> >>
> >
> > Cool, I can do that.
> >
> >
> >> * I'd personally push the library vs. framework story very strongly --
> the
> >> total buy-in and weak integration story of stream processing frameworks
> is
> >> a big downside and makes a library a really compelling (and currently
> >> unavailable, as far as I am aware) alternative.
> >>
> >
> > Are you suggesting there are still some content missing about the
> > motivations of adding the proposed library in the wiki page?
> >
> >
> >> * Comment about in-memory storage of other frameworks is interesting --
> it
> >> is specific to the framework, but is supposed to also give performance
> >> benefits. The high-level functional processing interface would allow for
> >> combining multiple operations when there's no shuffle, but when there
> is a
> >> shuffle, we'll always be writing to Kafka, right? Spark (and presumably
> >> spark streaming) is supposed to get a big win by handling shuffles such
> >> that the data just stays in cache and never actually hits disk, or at
> least
> >> hits disk in the background. Will we take a hit because we always write
> to
> >> Kafka?
> >>
> >
> > I agree with Neha's comments here. One more point I want to make is
> > materializing to Kafka is not necessarily much worse than keeping data in
> > memory if the downstream consumption is caught up such that most of the
> > reads will be hitting file cache. I remember Samza has illustrated that
> > under such scenarios its throughput is actually quite comparable to Spark
> > Streaming / Storm.
> >
> >
> > >> * I really struggled with the structure of the KIP template with
> > >> Copycat because the flow doesn't work well for proposals like this.
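
A sketch may help make the DSL discussion concrete. No such DSL existed when
this thread was written; the interfaces below are invented purely to show
the shape a functional layer over the processor client could take, in the
spirit of the Spark Streaming / Flink comparison above.

    import java.util.function.BiFunction;
    import java.util.function.BiPredicate;

    // Invented names, for illustration only.
    interface KeyValue<K, V> { K key(); V value(); }

    interface StreamDsl<K, V> {
        StreamDsl<K, V> filter(BiPredicate<K, V> predicate);
        <K2, V2> StreamDsl<K2, V2> map(BiFunction<K, V, KeyValue<K2, V2>> mapper);
        // A keyed aggregation would repartition through a Kafka topic
        // internally, which is where the "always write to Kafka on a
        // shuffle" question from earlier in the thread applies.
        StreamDsl<K, Long> countByKey(String storeName);
        void to(String topic); // write the transformed stream back to Kafka
    }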

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Yi Pan
Hi, Aditya,

{quote}
- The KIP states that cmd line tools will be provided to deploy as a
separate service. Is the proposed scope limited to providing a library
which makes it possible to build stream-processing-as-a-service, or to
provide such a service within Kafka itself?
{quote}

There has already been a long discussion in the Samza mailing list which
partly resulted in this KIP proposal. The basic conclusion was that this
KIP is to build a stream processor library that could be used as a library
or as a standalone process. The standalone process may be used as a way to
deploy stream processing in a cluster environment, but that would be
outside the scope of this KIP.

-Yi

On Mon, Jul 27, 2015 at 10:46 PM, Aditya Auradkar <
aaurad...@linkedin.com.invalid> wrote:

> +1 on comparison with existing solutions. On a high level, it seems nice to
> have a transform library inside Kafka: a lot of the building blocks are
> already there to build a stream processing framework. However, the details
> are tricky to get right, and I think this discussion will get a lot more
> interesting when we have something concrete to look at. I'm +1 for the
> general idea.
> How far away are we from having a prototype patch to play with?
>
> Couple of observations:
> - Since the input source for each processor is always Kafka, you get basic
> client-side partition management out of the box if it uses the high-level
> consumer.
> - The KIP states that cmd line tools will be provided to deploy as a
> separate service. Is the proposed scope limited to providing a library
> which makes it possible to build stream-processing-as-a-service, or to
> provide such a service within Kafka itself?
>
> Aditya
>
> On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira 
> wrote:
>
> > Hi,
> >
> > Since we will be discussing KIP-28 in the call tomorrow, can you
> > update the KIP with the feature-comparison with  existing solutions?
> > I admit that I do not see a need for single-event-producer-consumer
> > pair (AKA Flume Interceptor). I've seen tons of people implement such
> > apps in the last year, and it seemed easy. Now, perhaps we were doing
> > it all wrong... but I'd like to know how :)
> >
> > If we are talking about a bigger story (i.e. DSL, real
> > stream-processing, etc), thats a different discussion. I've seen a
> > bunch of misconceptions about SparkStreaming in this discussion, and I
> > have some thoughts in that regard, but I'd rather not go into that if
> > thats outside the scope of this KIP.
> >
> > Gwen
> >
> >
> > On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang 
> wrote:
> > > Hi Ewen,
> > >
> > > Replies inlined.
> > >
> > > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <
> > e...@confluent.io>
> > > wrote:
> > >
> > >> Just some notes on the KIP doc itself:
> > >>
> > >> * It'd be useful to clarify at what point the plain consumer + custom
> > code
> > >> + producer breaks down. I think trivial filtering and aggregation on a
> > >> single stream usually work fine with this model. Anything where you
> need
> > >> more complex joins, windowing, etc. are where it breaks down. I think
> > most
> > >> interesting applications require that functionality, but it's helpful
> to
> > >> make this really clear in the motivation -- right now, Kafka only
> > provides
> > >> the lowest level plumbing for stream processing applications, so most
> > >> interesting apps require very heavyweight frameworks.
> > >>
> > >
> > > I think for users to efficiently express complex logic like joins
> > > windowing, etc, a higher-level programming interface beyond the
> process()
> > > interface would definitely be better, but that does not necessarily
> > require
> > > a "heavyweight" frameworks, which usually includes more than just the
> > > high-level functional programming model. I would argue that an
> > alternative
> > > solution would better be provided for users who want some high-level
> > > programming interface but not a heavyweight stream processing framework
> > > that include the processor library plus another DSL layer on top of it.
> > >
> > >
> > >
> > >> * I think the feature comparison of plain producer/consumer, stream
> > >> processing frameworks, and this new library is a good start, but we
> > might
> > >> want something more thorough and structured, like a feature matrix.
> > Right
> > >> now it's hard to figure out exactly how they relate to each other.
> > >>
> > >
> > > Cool, I can do that.
> > >
> > >
> > >> * I'd personally push the library vs. framework story very strongly --
> > the
> > >> total buy-in and weak integration story of stream processing
> frameworks
> > is
> > >> a big downside and makes a library a really compelling (and currently
> > >> unavailable, as far as I am aware) alternative.
> > >>
> > >
> > > Are you suggesting there are still some content missing about the
> > > motivations of adding the proposed library in the wiki page?
> > >
> > >
> > >> * Comment about in-memory storage of other fram

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Neha Narkhede
Adi,

> How far away are we from having a prototype patch to play with?
>

We are working to share a prototype next week. Though the code will evolve
to match the APIs and design as they shape up, it will be great if
people can take a look and provide feedback.

> Couple of observations:
> - Since the input source for each processor is always Kafka, you get basic
> client-side partition management out of the box if it uses the high-level
> consumer.
>

That's right. The plan is to propose moving the partition assignment in the
consumer to the client-side (proposal coming up soon) and then use that in
Kafka streams and copycat.


> - The KIP states that cmd line tools will be provided to deploy as a
> separate service. Is the proposed scope limited to providing a library
> which makes it possible to build stream-processing-as-a-service, or to
> provide such a service within Kafka itself?


I think the KIP might've been a little misleading on this point. The scope
is to provide a library and have any integrations with other resource
management frameworks (Slider for YARN and Marathon for Mesos) live outside
Kafka. Having said that, in order to just get started with a simple stream
processing example, you still need basic scripts to get going. Those are
not anywhere similar in scope to what you'd expect in order to run this as
a service.

Thanks,
Neha

On Mon, Jul 27, 2015 at 10:57 PM, Neha Narkhede  wrote:

> Gwen,
>
> We have a compilation of notes from comparison with other systems. They
> might be missing details that folks who worked on that system might be able
> to point out. We can share that and discuss further on the KIP call.
>
> We do hope to include a DSL since that is the most natural way of
> expressing stream processing operations on top of the processor client. The
> DSL layer should be equivalent to that provided by Spark streaming or Flink
> in terms of expressiveness though there will be differences in
> implementation. Our client is intended to be simpler, with minimum external
> dependencies since it integrates closely with Kafka. This is really what
> most application development is hoping to get - a lightweight library on
> top of Kafka that allows them to process streams of data.
>
> Thanks
> Neha
>
> On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira 
> wrote:
>
>> Hi,
>>
>> Since we will be discussing KIP-28 in the call tomorrow, can you
>> update the KIP with the feature-comparison with  existing solutions?
>> I admit that I do not see a need for single-event-producer-consumer
>> pair (AKA Flume Interceptor). I've seen tons of people implement such
>> apps in the last year, and it seemed easy. Now, perhaps we were doing
>> it all wrong... but I'd like to know how :)
>>
>> If we are talking about a bigger story (i.e. DSL, real
>> stream-processing, etc), thats a different discussion. I've seen a
>> bunch of misconceptions about SparkStreaming in this discussion, and I
>> have some thoughts in that regard, but I'd rather not go into that if
>> thats outside the scope of this KIP.
>>
>> Gwen
>>
>>
>> On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang 
>> wrote:
>> > Hi Ewen,
>> >
>> > Replies inlined.
>> >
>> > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <
>> e...@confluent.io>
>>
>> > wrote:
>> >
>> >> Just some notes on the KIP doc itself:
>> >>
>> >> * It'd be useful to clarify at what point the plain consumer + custom
>> code
>> >> + producer breaks down. I think trivial filtering and aggregation on a
>> >> single stream usually work fine with this model. Anything where you
>> need
>> >> more complex joins, windowing, etc. are where it breaks down. I think
>> most
>> >> interesting applications require that functionality, but it's helpful
>> to
>> >> make this really clear in the motivation -- right now, Kafka only
>> provides
>> >> the lowest level plumbing for stream processing applications, so most
>> >> interesting apps require very heavyweight frameworks.
>> >>
>> >
>> > I think for users to efficiently express complex logic like joins
>> > windowing, etc, a higher-level programming interface beyond the
>> process()
>> > interface would definitely be better, but that does not necessarily
>> require
>> > a "heavyweight" frameworks, which usually includes more than just the
>> > high-level functional programming model. I would argue that an
>> alternative
>> > solution would better be provided for users who want some high-level
>> > programming interface but not a heavyweight stream processing framework
>> > that include the processor library plus another DSL layer on top of it.
>> >
>> >
>> >
>> >> * I think the feature comparison of plain producer/consumer, stream
>> >> processing frameworks, and this new library is a good start, but we
>> might
>> >> want something more thorough and structured, like a feature matrix.
>> Right
>> >> now it's hard to figure out exactly how they relate to each other.
>> >>
>> >
>> > Cool, I can do that.
>> >
>> >
>> >> * I'd personally p

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Yi Pan
Hi, Neha,

{quote}
We do hope to include a DSL since that is the most natural way of
expressing stream processing operations on top of the processor client. The
DSL layer should be equivalent to that provided by Spark streaming or Flink
in terms of expressiveness though there will be differences in
implementation. Our client is intended to be simpler, with minimum external
dependencies since it integrates closely with Kafka. This is really what
most application development is hoping to get - a lightweight library on
top of Kafka that allows them to process streams of data.
{quote}

I believe that the above is itself worth another KIP. There are already a
lot of system-level APIs (i.e. process callbacks, KV-stores,
producer/consumer integration, the partition manager, multi-cluster use
cases, etc.) that need to be handled in this KIP. Adding a DSL/SQL library
here would bring in a whole set of problems/issues in very different
aspects and de-focus the scope of this KIP.

Just my one quick point.

On Mon, Jul 27, 2015 at 10:57 PM, Neha Narkhede  wrote:

> Gwen,
>
> We have a compilation of notes from comparison with other systems. They
> might be missing details that folks who worked on that system might be able
> to point out. We can share that and discuss further on the KIP call.
>
> We do hope to include a DSL since that is the most natural way of
> expressing stream processing operations on top of the processor client. The
> DSL layer should be equivalent to that provided by Spark streaming or Flink
> in terms of expressiveness though there will be differences in
> implementation. Our client is intended to be simpler, with minimum external
> dependencies since it integrates closely with Kafka. This is really what
> most application development is hoping to get - a lightweight library on
> top of Kafka that allows them to process streams of data.
>
> Thanks
> Neha
>
> On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira 
> wrote:
>
> > Hi,
> >
> > Since we will be discussing KIP-28 in the call tomorrow, can you
> > update the KIP with the feature-comparison with  existing solutions?
> > I admit that I do not see a need for single-event-producer-consumer
> > pair (AKA Flume Interceptor). I've seen tons of people implement such
> > apps in the last year, and it seemed easy. Now, perhaps we were doing
> > it all wrong... but I'd like to know how :)
> >
> > If we are talking about a bigger story (i.e. DSL, real
> > stream-processing, etc), thats a different discussion. I've seen a
> > bunch of misconceptions about SparkStreaming in this discussion, and I
> > have some thoughts in that regard, but I'd rather not go into that if
> > thats outside the scope of this KIP.
> >
> > Gwen
> >
> >
> > On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang 
> wrote:
> > > Hi Ewen,
> > >
> > > Replies inlined.
> > >
> > > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <
> > e...@confluent.io>
> > > wrote:
> > >
> > >> Just some notes on the KIP doc itself:
> > >>
> > >> * It'd be useful to clarify at what point the plain consumer + custom
> > code
> > >> + producer breaks down. I think trivial filtering and aggregation on a
> > >> single stream usually work fine with this model. Anything where you
> need
> > >> more complex joins, windowing, etc. are where it breaks down. I think
> > most
> > >> interesting applications require that functionality, but it's helpful
> to
> > >> make this really clear in the motivation -- right now, Kafka only
> > provides
> > >> the lowest level plumbing for stream processing applications, so most
> > >> interesting apps require very heavyweight frameworks.
> > >>
> > >
> > > I think for users to efficiently express complex logic like joins
> > > windowing, etc, a higher-level programming interface beyond the
> process()
> > > interface would definitely be better, but that does not necessarily
> > require
> > > a "heavyweight" frameworks, which usually includes more than just the
> > > high-level functional programming model. I would argue that an
> > alternative
> > > solution would better be provided for users who want some high-level
> > > programming interface but not a heavyweight stream processing framework
> > > that include the processor library plus another DSL layer on top of it.
> > >
> > >
> > >
> > >> * I think the feature comparison of plain producer/consumer, stream
> > >> processing frameworks, and this new library is a good start, but we
> > might
> > >> want something more thorough and structured, like a feature matrix.
> > Right
> > >> now it's hard to figure out exactly how they relate to each other.
> > >>
> > >
> > > Cool, I can do that.
> > >
> > >
> > >> * I'd personally push the library vs. framework story very strongly --
> > the
> > >> total buy-in and weak integration story of stream processing
> frameworks
> > is
> > >> a big downside and makes a library a really compelling (and currently
> > >> unavailable, as far as I am aware) alternative.
> > >>
> > 

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Guozhang Wang
Hi Adi,

Just to clarify, the cmdline tool would be used, as stated in the wiki
page, to run the client library "as a process", which is still far away
from a "service". It is just like what we have for kafka-console-producer,
kafka-console-consumer, kafka-mirror-maker, etc today.

Guozhang

On Mon, Jul 27, 2015 at 10:46 PM, Aditya Auradkar <
aaurad...@linkedin.com.invalid> wrote:

> +1 on comparison with existing solutions. On a high level, it seems nice to
> have a transform library inside Kafka: a lot of the building blocks are
> already there to build a stream processing framework. However, the details
> are tricky to get right, and I think this discussion will get a lot more
> interesting when we have something concrete to look at. I'm +1 for the
> general idea.
> How far away are we from having a prototype patch to play with?
>
> Couple of observations:
> - Since the input source for each processor is always Kafka, you get basic
> client-side partition management out of the box if it uses the high-level
> consumer.
> - The KIP states that cmd line tools will be provided to deploy as a
> separate service. Is the proposed scope limited to providing a library
> which makes it possible to build stream-processing-as-a-service, or to
> provide such a service within Kafka itself?
>
> Aditya
>
> On Mon, Jul 27, 2015 at 8:20 PM, Gwen Shapira 
> wrote:
>
> > Hi,
> >
> > Since we will be discussing KIP-28 in the call tomorrow, can you
> > update the KIP with the feature-comparison with  existing solutions?
> > I admit that I do not see a need for single-event-producer-consumer
> > pair (AKA Flume Interceptor). I've seen tons of people implement such
> > apps in the last year, and it seemed easy. Now, perhaps we were doing
> > it all wrong... but I'd like to know how :)
> >
> > If we are talking about a bigger story (i.e. DSL, real
> > stream-processing, etc), thats a different discussion. I've seen a
> > bunch of misconceptions about SparkStreaming in this discussion, and I
> > have some thoughts in that regard, but I'd rather not go into that if
> > thats outside the scope of this KIP.
> >
> > Gwen
> >
> >
> > On Fri, Jul 24, 2015 at 9:48 AM, Guozhang Wang 
> wrote:
> > > Hi Ewen,
> > >
> > > Replies inlined.
> > >
> > > On Thu, Jul 23, 2015 at 10:25 PM, Ewen Cheslack-Postava <
> > e...@confluent.io>
> > > wrote:
> > >
> > >> Just some notes on the KIP doc itself:
> > >>
> > >> * It'd be useful to clarify at what point the plain consumer + custom
> > code
> > >> + producer breaks down. I think trivial filtering and aggregation on a
> > >> single stream usually work fine with this model. Anything where you
> need
> > >> more complex joins, windowing, etc. are where it breaks down. I think
> > most
> > >> interesting applications require that functionality, but it's helpful
> to
> > >> make this really clear in the motivation -- right now, Kafka only
> > provides
> > >> the lowest level plumbing for stream processing applications, so most
> > >> interesting apps require very heavyweight frameworks.
> > >>
> > >
> > > I think for users to efficiently express complex logic like joins
> > > windowing, etc, a higher-level programming interface beyond the
> process()
> > > interface would definitely be better, but that does not necessarily
> > require
> > > a "heavyweight" frameworks, which usually includes more than just the
> > > high-level functional programming model. I would argue that an
> > alternative
> > > solution would better be provided for users who want some high-level
> > > programming interface but not a heavyweight stream processing framework
> > > that include the processor library plus another DSL layer on top of it.
> > >
> > >
> > >
> > >> * I think the feature comparison of plain producer/consumer, stream
> > >> processing frameworks, and this new library is a good start, but we
> > might
> > >> want something more thorough and structured, like a feature matrix.
> > Right
> > >> now it's hard to figure out exactly how they relate to each other.
> > >>
> > >
> > > Cool, I can do that.
> > >
> > >
> > >> * I'd personally push the library vs. framework story very strongly --
> > the
> > >> total buy-in and weak integration story of stream processing
> frameworks
> > is
> > >> a big downside and makes a library a really compelling (and currently
> > >> unavailable, as far as I am aware) alternative.
> > >>
> > >
> > > Are you suggesting there are still some content missing about the
> > > motivations of adding the proposed library in the wiki page?
> > >
> > >
> > >> * Comment about in-memory storage of other frameworks is interesting
> --
> > it
> > >> is specific to the framework, but is supposed to also give performance
> > >> benefits. The high-level functional processing interface would allow
> for
> > >> combining multiple operations when there's no shuffle, but when there
> > is a
> > >> shuffle, we'll always be writing to Kafka, right? Spark (and
> presumably
> > >> spark streaming) is suppo

Re: [DISCUSS] KIP-28 - Add a transform client for data processing

2015-07-27 Thread Yi Pan
Hi, Jay,

{quote}
1. Yeah we are going to try to generalize the partition management stuff.
We'll get a wiki/JIRA up for that. I think that gives what you want in
terms of moving partitioning to the client side.
{quote}
Great! I am looking forward to that.

{quote}
I think the key observation is that the whole reason
LinkedIn split data over clusters to begin with was because of the lack of
quotas, which are in any case getting implemented.
{quote}
I am not sure that I followed this point. Is your point that with quotas,
it is possible to host all data in a single cluster?

-Yi

On Mon, Jul 27, 2015 at 8:53 AM, Jay Kreps  wrote:

> Hey Yi,
>
> Great points. I think for some of this the most useful thing would be to
> get a wip prototype out that we could discuss concretely. I think Yasuhiro
> and Guozhang took that prototype I had done, and had some improvements.
> Give us a bit to get that into understandable shape so we can discuss.
>
> To address a few of your other points:
> 1. Yeah we are going to try to generalize the partition management stuff.
> We'll get a wiki/JIRA up for that. I think that gives what you want in
> terms of moving partitioning to the client side.
> 2. I think consuming from a different cluster you produce to will be easy.
> More than that is more complex, though I agree the pluggable partitioning
> makes it theoretically possible. Let's try to get something that works for
> the first case, it sounds like that solves the use case you describe of
> wanting to directly transform from a given cluster but produce back to a
> different cluster. I think the key observation is that the whole reason
> LinkedIn split data over clusters to begin with was because of the lack of
> quotas, which are in any case getting implemented.
>
> -Jay
>
> On Sun, Jul 26, 2015 at 11:31 PM, Yi Pan  wrote:
>
> > Hi, Jay and all,
> >
> > Thanks for all your quick responses. I tried to summarize my thoughts
> here:
> >
> > - ConsumerRecord as stream processor API:
> >
> >* This KafkaProcessor API is targeted to receive the message from
> Kafka.
> > So, to Yasuhiro's join/transformation example, any join/transformation
> > results that are materialized in Kafka should have ConsumerRecord format
> > (i.e. w/ topic and offsets). Any non-materialized join/transformation
> > results should not be processed by this KafkaProcessor API. One example
> is
> > the in-memory operators API in Samza, which is designed to handle the
> > non-materialzied join/transformation results. And yes, in this case, a
> more
> > abstract data model is needed.
> >
> >* Just to support Jay's point of a general
> > ConsumerRecord/ProducerRecord, a general stream processing on more than
> one
> > data sources would need at least the following info: data source
> > description (i.e. which topic/table), and actual data (i.e. key-value
> > pairs). It would make sense to have the data source name as part of the
> > general metadata in stream processing (think about it as the table name
> for
> > records in standard SQL).
> >
> > - SQL/DSL
> >
> >* I think that this topic itself is worthy of another KIP discussion.
> I
> > would prefer to leave it out of scope in KIP-28.
> >
> > - Client-side pluggable partition manager
> >
> >* Given the use cases we have seen with large-scale deployment of
> > Samza/Kafka in LinkedIn, I would argue that we should make it as the
> > first-class citizen in this KIP. The use cases include:
> >
> >   * multi-cluster Kafka
> >
> >   * host-affinity (i.e. local-state associated w/ certain partitions
> on
> > client)
> >
> > - Multi-cluster scenario
> >
> >* Although I originally just brought it up as a use case that requires
> > client-side partition manager, reading Jay’s comments, I realized that I
> > have one fundamental issue w/ the current copycat + transformation model.
> > If I interpret Jay’s comment correctly, the proposed
> copycat+transformation
> > plays out in the following way: i) copycat takes all data from sources
> (no
> > matter it is Kafka or non-Kafka) into *one single Kafka cluster*; ii)
> > transformation is only restricted to take data sources in *this single
> > Kafka cluster* to perform aggregate/join etc. This is different from my
> > original understanding of the copycat. The main issue I have with this
> > model is: huge data-copy between Kafka clusters. In LinkedIn, we used to
> > follow this model that uses MirrorMaker to map topics from tracking
> > clusters to Samza-specific Kafka cluster and only do stream processing in
> > the Samza-specific Kafka cluster. We moved away from this model and
> started
> > allowing users to directly consume from tracking Kafka clusters due to
> the
> > overhead of copying huge amount of traffic between Kafka clusters. I
> agree
> > that the initial design of KIP-28 would probably need a smaller scope of
> > problem to solve, hence, limiting to solving partition management in a
> > single cluster. However, I would really hope ...
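
A sketch of what "moving partitioning to the client side" could mean in
practice. Everything below is hypothetical, since the wiki/JIRA Jay mentions
had not been published yet; the interesting property is that once assignment
runs in the client, a custom implementation can encode host affinity or
cross-cluster placement, which are the use cases raised above.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical client-side assignor: the group hands one member the full
    // membership and partition lists, and that member computes the mapping.
    interface ClientPartitionAssignor {
        Map<String, List<String>> assign(List<String> memberIds, List<String> partitions);
    }

    // Simplest possible strategy (assumes a non-empty member list); a
    // host-affinity strategy would instead consult each member's previously
    // owned partitions before dealing out the remainder.
    class RoundRobinAssignor implements ClientPartitionAssignor {
        @Override
        public Map<String, List<String>> assign(List<String> memberIds,
                                                List<String> partitions) {
            Map<String, List<String>> assignment = new HashMap<>();
            for (String member : memberIds)
                assignment.put(member, new ArrayList<>());
            // Deal partitions out one at a time so load stays even.
            for (int i = 0; i < partitions.size(); i++)
                assignment.get(memberIds.get(i % memberIds.size())).add(partitions.get(i));
            return assignment;
        }
    }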

Re: Number of kafka topics/partitions supported per cluster of n nodes

2015-07-27 Thread Darion Yaphet
Kafka stores its metadata in a ZooKeeper cluster, so evaluating "how many
topics and partitions in total can be created in a cluster" may amount to
testing ZooKeeper's scalability and disk I/O performance.

2015-07-28 13:51 GMT+08:00 Prabhjot Bharaj :

> Hi,
>
> I'm looking forward to a benchmark which can explain how many total number
> of topics and partitions can be created in a cluster of n nodes, given the
> message size varies between x and y bytes and how does it vary with varying
> heap sizes and how it affects the system performance.
>
> e.g. the result should look like: t topics with p partitions each can be
> supported in a cluster of n nodes with a heap size of h MB, before the
> cluster sees things like JVM crashes or high mem usage or system slowdown
> etc.
>
> I think such benchmarks must exist so that we can make better decisions on
> ops side
> If these details dont exist, I'll be doing this test myself on varying the
> values of parameters described above. I would be happy to share the numbers
> with the community
>
> Thanks,
> prabcs
>



-- 

long is the way and hard, that out of Hell leads up to light

