Re: Kafka security

2017-04-11 Thread Christian Csar
Don't hard code it. Martin's suggestion allows it to be read from a
configuration file or injected from another source such as an environment
variable at runtime.
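
As a rough sketch (the environment variable name is a placeholder, and the
property is the one Martin's snippet uses), reading the password at startup
and setting it before any client is created might look like this:

    // Illustrative only: read the keystore password from an environment
    // variable injected at deploy time, then set it as a JVM system property
    // before creating any Kafka producer or consumer.
    public class KeystorePasswordBootstrap {
        public static void main(String[] args) {
            String password = System.getenv("KAFKA_KEYSTORE_PASSWORD"); // hypothetical variable name
            if (password == null) {
                throw new IllegalStateException("KAFKA_KEYSTORE_PASSWORD is not set");
            }
            System.setProperty("zookeeper.ssl.keyStore.password", password);
            // ... now create the producer/consumer as usual
        }
    }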

If neither of these is acceptable under corporate policy, I suggest
asking how this has been handled before at your company.

Christian


On Apr 11, 2017 11:10, "IT Consultant" <0binarybudd...@gmail.com> wrote:

Thanks for your response .

We aren't allowed to hard code passwords in any of our programs.

On Apr 11, 2017 23:39, "Mar Ian"  wrote:

> Since it is a Java property you could set the property (keystore password)
> programmatically,
>
> before you connect to kafka (ie, before creating a consumer or producer)
>
> System.setProperty("zookeeper.ssl.keyStore.password", password);
>
> martin
>
> 
> From: IT Consultant <0binarybudd...@gmail.com>
> Sent: April 11, 2017 2:01 PM
> To: users@kafka.apache.org
> Subject: Kafka security
>
> Hi All
>
> How can I avoid using password for keystore creation ?
>
> Our corporate policies don't allow us to hardcode passwords. We are
> currently passing the keystore password while accessing a TLS-enabled Kafka
> instance.
>
> I would like to use either a passwordless keystore or avoid a password for
> clients accessing Kafka.
>
>
> Please help
>


Re: Encryption at Rest

2016-05-02 Thread Christian Csar
"We need to be capable of changing encryption keys on regular
intervals and in case of expected key compromise." is achievable with
full disk encryption particularly if you are willing to add and remove
Kafka servers so that you replicate the data to new machines/disks
with new keys and take the machines with old keys out of use and wipe
them.

For the second part of it I would suggest reevaluating your threat
model since you are looking at a machine that is compromised but not
compromised enough to be able to read the key from Kafka or to use
Kafka to read the data.

While you could add support to encrypt data on the way into and out of
compression, I believe you would need either substantial work in Kafka
to support rewriting/re-encrypting the log files (with performance
penalties) or to rotate machines in and out as with full disk encryption.
Though I'll let someone with more knowledge of the implementation
comment further on what would be required.
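
For what it's worth, the per-message "encrypting serializer" approach the
thread mentions can be sketched roughly like this (illustrative only; key
management and rotation are deliberately left out, and this encrypts each
message individually, before batch compression, which is exactly the
overhead Bruno describes):

    import java.security.SecureRandom;
    import java.util.Map;
    import javax.crypto.Cipher;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;
    import org.apache.kafka.common.serialization.Serializer;

    public class EncryptingSerializer implements Serializer<byte[]> {
        private final SecretKey key;                  // provided by the application
        private final SecureRandom random = new SecureRandom();

        public EncryptingSerializer(SecretKey key) {
            this.key = key;
        }

        @Override
        public void configure(Map<String, ?> configs, boolean isKey) { }

        @Override
        public byte[] serialize(String topic, byte[] plaintext) {
            try {
                byte[] iv = new byte[12];             // 96-bit nonce for AES-GCM
                random.nextBytes(iv);
                Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
                cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
                byte[] ciphertext = cipher.doFinal(plaintext);
                byte[] out = new byte[iv.length + ciphertext.length];
                System.arraycopy(iv, 0, out, 0, iv.length);          // prepend the IV
                System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
                return out;
            } catch (Exception e) {
                throw new RuntimeException("Encryption failed", e);
            }
        }

        @Override
        public void close() { }
    }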

Christian

On Mon, May 2, 2016 at 9:41 PM, Bruno Rassaerts
 wrote:
> We did try indeed the last scenario you describe as encrypted disks do not 
> fulfil our requirements.
> We need to be capable of changing encryption keys on regular intervals and in 
> case of expected key compromise.
> Also, when a running machine is hacked, disk based or file system based 
> encryption doesn’t offer any protection.
>
> Our goal is indeed to have the content in the broker files encrypted. The 
> problem is that the only way to achieve this is through custom serialisers.
> This works, but the overhead is quite dramatic as the messages are no longer 
> efficiently compressed (in batch).
> Compression in the serialiser, before the encryption, doesn’t really solve 
> the performance problem.
>
> The best thing for us would be able to encrypt after the batch compression 
> offered by kafka.
> The hook to do this is missing in the current implementation.
>
> Bruno
>
>> On 02 May 2016, at 22:46, Tom Brown  wrote:
>>
>> I'm trying to understand your use-case for encrypted data.
>>
>> Does it need to be encrypted only over the wire? This can be accomplished
>> using TLS encryption (v0.9.0.0+). See
>> https://issues.apache.org/jira/browse/KAFKA-1690
>>
>> Does it need to be encrypted only when at rest? This can be accomplished
>> using full disk encryption as others have mentioned.
>>
>> Does it need to be encrypted during both? Use both TLS and full disk
>> encryption.
>>
>> Does it need to be encrypted fully from end-to-end so even Kafka can't read
>> it? Since Kafka shouldn't be able to know the contents, the key should not
>> be known to Kafka. What remains is manually encrypting each message before
>> giving it to the producer (or by implementing an encrypting serializer).
>> Either way, each message is still encrypted individually.
>>
>> Have I left out a scenario?
>>
>> --Tom
>>
>>
>> On Mon, May 2, 2016 at 2:01 PM, Bruno Rassaerts wrote:
>>
>>> Hello,
>>>
>>> We tried encrypting the data before sending it to kafka, however this
>>> makes the compression done by kafka almost impossible.
>>> Also the performance overhead of encrypting the individual messages was
>>> quite significant.
>>>
>>> Ideally, a pluggable “compression” algorithm would be best. Where message
>>> can first be compressed, then encrypted in batch.
>>> However, the current kafka implementation does not allow this.
>>>
>>> Bruno
>>>
 On 26 Apr 2016, at 19:02, Jim Hoagland 
>>> wrote:

 Another option is to encrypt the data before you hand it to Kafka and
>>> have
 the downstream decrypt it.  This takes care of on-disk on on-wire
 encryption.  We did a proof of concept of this:


>>> http://www.symantec.com/connect/blogs/end-end-encryption-though-kafka-our-proof-concept

 ( http://symc.ly/1pC2CEG )

 -- Jim

 On 4/25/16, 11:39 AM, "David Buschman"  wrote:

> Kafka handles messages which are composed of an array of bytes. Kafka
>>> does
> not care what is in those byte arrays.
>
> You could use a custom Serializer and Deserializer to encrypt and
>>> decrypt
> the data from with your application(s) easily enough.
>
> This gives the benefit of having encryption at rest and over the wire.
>>> Two
> birds, one stone.
>
> DaVe.
>
>
>> On Apr 25, 2016, at 2:14 AM, Jens Rantil  wrote:
>>
>> IMHO, I think that responsibility should lie on the file system, not
>> Kafka.
>> Feels like a waste of time and double work to implement that unless
>> there's
>> a really good reason for it. Let's try to keep Kafka a focused product
>> that
>> does one thing well.
>>
>> Cheers,
>> Jens
>>
>> On Fri, Apr 22, 2016 at 3:31 AM Tauzell, Dave
>> 
>> wrote:
>>
>>> I meant encryption of the data at rest.  We utilize filesytem
>>> encryption
>>> for other products; just wondering if anything was on the Kafka
>>> roadmap.
>>>
>>> Dave
>>>
>>

Re: Encryption at Rest

2016-04-21 Thread Christian Csar
From what I know of previous discussions, encryption at rest can be
handled with transparent disk encryption. When that's sufficient it's
nice and easy.

Christian

On Thu, Apr 21, 2016 at 2:31 PM, Tauzell, Dave
 wrote:
> Has there been any discussion or work on at rest encryption for Kafka?
>
> Thanks,
>   Dave
>


Re: Kafka over Satellite links

2016-03-02 Thread Christian Csar
I would not do that. I admit I may be a bit biased due to working for
Buddy Platform (IoT backend stuff including telemetry collection), but
you want to send the data via some protocol (HTTP? MQTT? CoAP?) to the
central hub and then have those servers put the data into Kafka. If you
want to use Kafka, there are various HTTP front ends that will put the
data into Kafka for you without the client needing to deal with the
partition management part. But putting data into Kafka directly over the
satellite link really seems like a bad idea, even if it's a large number
of messages per second per node and even if the security parts work out
for you.

Christian

On Wed, Mar 2, 2016 at 9:52 PM, Jan  wrote:
> Hi folks;
> does anyone know of Kafka's ability to work over satellite links? We have an
> IoT telemetry application that uses satellite communication to send data from
> remote sites to a central hub.
> Any help/input/links/gotchas would be much appreciated.
> Regards, Jan


Re: Kafka behind AWS ELB

2015-05-04 Thread Christian Csar
Dillian,

On Mon, May 4, 2015 at 1:52 PM, Dillian Murphey 
wrote:
>
> I'm interested in this topic as well.  If you put kafka brokers inside an
> autoscaling group, then AWS will automatically add brokers if demand
> increases, and the ELB will automatically round-robin across all of your
> kafka instances.  So in your config files and code, you only need to
> provide a single DNS name (the load balancer). You don't need to specify
> all your kafka brokers inside your config file.  If a broker dies, the ELB
> will only route to healthy nodes.
>
> So you get a lot of robustness, scalability, and fault-tolerance by using
> the AWS services. Kafka Brokers will automatically load balance, but the
> question is whether it is ok to put all your brokers behind an ELB and
> expect the system to work properly.
>
You should not expect it to work properly. Broker nodes are data-bearing,
which means that any operation to scale down would need to be aware of the
data distribution. The client connects to specific nodes to send them data,
so even Layer 4 load balancing wouldn't work.

> What alternatives are there to dynamic/scalable broker clusters?  I don't
> want to have to modify my config files or code if I add more brokers, and I
> want to be able to handle a broker going down. So these are the reasons AWS
> questions like this come up.
>

The clients already give you options for specifying only a subset of
brokers
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example
one of which must be alive to discover the rest of the cluster. The main
clients handle node failures (you'll still have some operational work).
Kafka and other data storage systems do not work the same as an HTTP-driven
web application. While Kafka can certainly be scaled, and automation could be
built to scale it in response to load, it's going to be more complicated. AWS's
off-the-shelf, low-operations offering for some (definitely not all) of Kafka's
use cases is Kinesis; Azure's is Event Hubs. Before using Kafka or any system
in production you'll want to be sure you understand its operational aspects.
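
As a rough sketch of the "subset of brokers" point (host names are
placeholders; the older Scala producer uses metadata.broker.list where the
newer Java producer uses bootstrap.servers):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;

    public class SeedBrokerExample {
        public static Producer<byte[], byte[]> create() {
            Properties props = new Properties();
            // Only a couple of seed brokers are listed; the client discovers the
            // rest of the cluster (and partition leadership) from metadata.
            props.put("bootstrap.servers", "broker1.example.com:9092,broker2.example.com:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            return new KafkaProducer<>(props);
        }
    }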

Christian

>
> Thanks for any comments too. :)
>
>
>
>
> On Mon, May 4, 2015 at 9:03 AM, Mayuresh Gharat <
gharatmayures...@gmail.com>
> wrote:
>
> > Ok. You can deploy Kafka in AWS. You can have brokers on AWS servers.
> > Kafka is not a push system, so you will need someone writing to Kafka and
> > consuming from Kafka. It will work. My suggestion would be to try it out on
> > a smaller instance in AWS and see the effects.
> >
> > As I do not know the actual use case you want to use Kafka for, I
> > cannot comment on whether it will work for your personalized use case.
> >
> > Thanks,
> >
> > Mayuresh
> >
> > On Mon, May 4, 2015 at 8:55 AM, Chandrashekhar Kotekar <
> > shekhar.kote...@gmail.com> wrote:
> >
> > > I am sorry but I cannot reveal those details due to confidentiality
> > issues.
> > > I hope you understand.
> > >
> > >
> > > Regards,
> > > Chandrash3khar Kotekar
> > > Mobile - +91 8600011455
> > >
> > > On Mon, May 4, 2015 at 9:18 PM, Mayuresh Gharat <
> > > gharatmayures...@gmail.com>
> > > wrote:
> > >
> > > > Hi Chandrashekar,
> > > >
> > > > Can you please elaborate the use case for Kafka here, like how you are
> > > > planning to use it.
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Mayuresh
> > > >
> > > > On Sat, May 2, 2015 at 9:08 PM, Chandrashekhar Kotekar <
> > > > shekhar.kote...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am new to Apache Kafka. I have played with it on my laptop.
> > > > >
> > > > > I want to use Kafka in AWS. Currently we have tomcat web servers based
> > > > > REST API. We want to replace REST API with Apache Kafka; web servers are
> > > > > behind ELB.
> > > > >
> > > > > I would like to know if we can keep Kafka brokers behind ELB? Will it
> > > > > work?
> > > > >
> > > > > Regards,
> > > > > Chandrash3khar Kotekar
> > > > > Mobile - +91 8600011455
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -Regards,
> > > > Mayuresh R. Gharat
> > > > (862) 250-7125
> > > >
> > >
> >
> >
> >
> > --
> > -Regards,
> > Mayuresh R. Gharat
> > (862) 250-7125
> >


Re: Kafka Poll: Version You Use?

2015-03-04 Thread Christian Csar
Do you have anything on the number of voters, or the audience breakdown?

Christian

On Wed, Mar 4, 2015 at 8:08 PM, Otis Gospodnetic  wrote:

> Hello hello,
>
> Results of the poll are here!
> Any guesses before looking?
> What % of Kafka users are on 0.8.2.x already?
> What % of people are still on 0.7.x?
>
>
> http://blog.sematext.com/2015/03/04/poll-results-kafka-version-distribution/
>
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Thu, Feb 26, 2015 at 3:32 PM, Otis Gospodnetic <
> otis.gospodne...@gmail.com> wrote:
>
> > Hi,
> >
> > With 0.8.2 out I thought it might be useful for everyone to see which
> > version(s) of Kafka people are using.
> >
> > Here's a quick poll:
> > http://blog.sematext.com/2015/02/23/kafka-poll-version-you-use/
> >
> > We'll publish the results next week.
> >
> > Thanks,
> > Otis
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
>


Re: kafka producer does not distribute messages to partitions evenly?

2015-03-02 Thread Christian Csar
I believe you are seeing the behavior where the random partitioner is
sticky.
http://mail-archives.apache.org/mod_mbox/kafka-users/201309.mbox/%3ccahwhrrxax5ynimqnacsk7jcggnhjc340y4qbqoqcismm43u...@mail.gmail.com%3E
has details. So with the default 10-minute refresh, if your test is only an
hour or two with a single producer, you would not expect to see all
partitions be hit.
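
If that behaviour is a problem, the knobs involved look roughly like this
(old Scala producer; the values and the partitioner class name are
illustrative, and the partitioner only applies to keyed messages, as the
code quoted below shows):

    import java.util.Properties;

    public class StickyPartitionWorkaround {
        public static Properties producerProps() {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1.example.com:9092");
            // default is 600000 (10 minutes); lowering it makes the "sticky"
            // partition chosen for null-keyed messages rotate more often
            props.put("topic.metadata.refresh.interval.ms", "60000");
            // only consulted when messages are keyed
            props.put("partitioner.class", "com.example.MyPartitioner");
            return props;
        }
    }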

Christian

On Mon, Mar 2, 2015 at 4:23 PM, Yang  wrote:

> thanks. just checked the code below. The line that calls
> Random.nextInt() seems to be called only *a few times*, and in all the rest
> of the cases where getPartition() is called, the
> cached sendPartitionPerTopicCache.get(topic) value is used, so
> apparently you won't get an even partition distribution?
>
> the code I got is from commit 7847e9c703f3a0b70519666cdb8a6e4c8e37c3a7
>
>
> "./core/src/main/scala/kafka/producer/async/DefaultEventHandler.scala" 336
> lines --66%--
> 222,4673%
>
>
>   private def getPartition(topic: String, key: Any, topicPartitionList: Seq[PartitionAndLeader]): Int = {
>     val numPartitions = topicPartitionList.size
>     if(numPartitions <= 0)
>       throw new UnknownTopicOrPartitionException("Topic " + topic + " doesn't exist")
>     val partition =
>       if(key == null) {
>         // If the key is null, we don't really need a partitioner
>         // So we look up in the send partition cache for the topic to decide the target partition
>         val id = sendPartitionPerTopicCache.get(topic)
>         id match {
>           case Some(partitionId) =>
>             // directly return the partitionId without checking availability of the leader,
>             // since we want to postpone the failure until the send operation anyways
>             partitionId
>           case None =>
>             val availablePartitions = topicPartitionList.filter(_.leaderBrokerIdOpt.isDefined)
>             if (availablePartitions.isEmpty)
>               throw new LeaderNotAvailableException("No leader for any partition in topic " + topic)
>             val index = Utils.abs(Random.nextInt) % availablePartitions.size
>             val partitionId = availablePartitions(index).partitionId
>             sendPartitionPerTopicCache.put(topic, partitionId)
>             partitionId
>         }
>       } else
>         partitioner.partition(key, numPartitions)
>     if(partition < 0 || partition >= numPartitions)
>       throw new UnknownTopicOrPartitionException("Invalid partition id: " + partition + " for topic " + topic +
>         "; Valid values are in the inclusive range of [0, " + (numPartitions-1) + "]")
>     trace("Assigning message of topic %s and key %s to a selected partition %d".format(topic, if (key == null) "[none]" else key.toString, partition))
>     partition
>   }
>
>
> On Mon, Mar 2, 2015 at 3:58 PM, Mayuresh Gharat <
> gharatmayures...@gmail.com>
> wrote:
>
> > Probably your keys are getting hashed to only those partitions. I don't
> > think anything is wrong here.
> > You can check how the default hashPartitioner is used in the code and try
> > to do the same for your keys before you send them and check which
> > partitions those are going to.
> >
> > The default hashpartitioner does something like this :
> >
> > hash(key) % numPartitions.
> >
> > Thanks,
> >
> > Mayuresh
> >
> > On Mon, Mar 2, 2015 at 3:52 PM, Yang  wrote:
> >
> > > we have 10 partitions for a topic, and omit the explicit partition
> param
> > in
> > > the message creation:
> > >
> > > KeyedMessage data = new KeyedMessage
> > > (mytopic,   myMessageContent);   // partition key need to be polished
> > > producer.send(data);
> > >
> > >
> > >
> > > but on average 3--5 of the partitions are empty.
> > >
> > >
> > >
> > > what went wrong?
> > >
> > > thanks
> > > Yang
> > >
> >
> >
> >
> > --
> > -Regards,
> > Mayuresh R. Gharat
> > (862) 250-7125
> >
>


Re: Tips for working with Kafka and data streams

2015-02-25 Thread Christian Csar
Yeah, we do have scenarios where we use customer specific keys so our
envelopes end up containing key identification information for accessing
our key repository. I'll certainly follow any changes you propose in this
area with interest, but I'd expect that sort of centralized key thing to be
fairly separate from Kafka even if there's a handy optional layer that
integrates with it.

Christian

On Wed, Feb 25, 2015 at 5:34 PM, Julio Castillo <
jcasti...@financialengines.com> wrote:

> Although full disk encryption appears to be an easy solution, in our case
> that may not be sufficient. For cases where the actual payload needs to be
> encrypted, the cost of encryption is paid by the consumer and producers.
> Further complicating the matter would be the handling of encryption keys,
> etc. I think this is the area where enhancements to Kafka may facilitate
> that key exchange between consumers and producers, still leaving it up to
> the clients, but facilitating the key handling.
>
> Julio
>
> On 2/25/15, 4:24 PM, "Christian Csar"  wrote:
>
> >The questions we get from customers typically end up being general so we
> >break out our answer into network level and on disk scenarios.
> >
> >On disk/at rest scenario may just be use full disk encryption at the OS
> >level and Kafka doesn't need to worry about it. But documenting any issues
> >around it would be good. For example what sort of Kafka specific
> >performance impacts does it have, ie budgeting for better processors.
> >
> >The security story right now is to run on a private network, but I believe
> >some of our customers like to be told that within datacenter transmissions
> >are encrypted on the wire. Based on
> >
> >https://cwiki.apache.org/confluence/display/KAFKA/Security that might mean
> >waiting for TLS support, or using a VPN/ssh tunnel for the network
> >connections.
> >
> >Since we're in hosted stream land we can't do either of the above, so we
> >encrypt the messages themselves. For those enterprises that are like our
> >customers but would run Kafka or use Confluent, having a story like the
> >above so they don't give up the benefits of your schema management layers
> >would be good.
> >
> >Since I didn't mention it before I did find your blog posts handy (though
> >I'm already moving us towards stream centric land).
> >
> >Christian
> >
> >On Wed, Feb 25, 2015 at 3:57 PM, Jay Kreps  wrote:
> >
> >> Hey Christian,
> >>
> >> That makes sense. I agree that would be a good area to dive into. Are
> >>you
> >> primarily interested in network level security or encryption on disk?
> >>
> >> -Jay
> >>
> >> On Wed, Feb 25, 2015 at 1:38 PM, Christian Csar 
> >>wrote:
> >>
> >> > I wouldn't say no to some discussion of encryption. We're running on
> >> Azure
> >> > EventHubs (with preparations for Kinesis for EC2, and Kafka for
> >> deployments
> >> > in customer datacenters when needed) so can't just use disk level
> >> > encryption (which would have its own overhead). We're putting all of
> >>our
> >> > messages inside of encrypted envelopes before sending them to the
> >>stream
> >> > which limits our opportunities for schema verification of the
> >>underlying
> >> > messages to the declared type of the message.
> >> >
> >> > Encryption at rest mostly works out to a sales point for customers who
> >> want
> >> > assurances, and in a Kafka focused discussion might be dealt with by
> >> > covering disk encryption and how the conversations between Kafka
> >> instances
> >> > are protected.
> >> >
> >> > Christian
> >> >
> >> >
> >> > On Wed, Feb 25, 2015 at 11:51 AM, Jay Kreps  wrote:
> >> >
> >> > > Hey guys,
> >> > >
> >> > > One thing we tried to do along with the product release was start to
> >> put
> >> > > together a practical guide for using Kafka. I wrote this up here:
> >> > >
> >>
> >> > > http://blog.confluent.io/2015/02/25/stream-data-platform-1/

Re: Tips for working with Kafka and data streams

2015-02-25 Thread Christian Csar
The questions we get from customers typically end up being general so we
break out our answer into network level and on disk scenarios.

On disk/at rest scenario may just be use full disk encryption at the OS
level and Kafka doesn't need to worry about it. But documenting any issues
around it would be good. For example what sort of Kafka specific
performance impacts does it have, ie budgeting for better processors.

The security story right now is to run on a private network, but I believe
some of our customers like to be told that within datacenter transmissions
are encrypted on the wire. Based on
https://cwiki.apache.org/confluence/display/KAFKA/Security that might mean
waiting for TLS support, or using a VPN/ssh tunnel for the network
connections.

Since we're in hosted stream land we can't do either of the above, so we
encrypt the messages themselves. For those enterprises that are like our
customers but would run Kafka or use Confluent, having a story like the
above so they don't give up the benefits of your schema management layers
would be good.

Since I didn't mention it before I did find your blog posts handy (though
I'm already moving us towards stream centric land).

Christian

On Wed, Feb 25, 2015 at 3:57 PM, Jay Kreps  wrote:

> Hey Christian,
>
> That makes sense. I agree that would be a good area to dive into. Are you
> primarily interested in network level security or encryption on disk?
>
> -Jay
>
> On Wed, Feb 25, 2015 at 1:38 PM, Christian Csar  wrote:
>
> > I wouldn't say no to some discussion of encryption. We're running on
> Azure
> > EventHubs (with preparations for Kinesis for EC2, and Kafka for
> deployments
> > in customer datacenters when needed) so can't just use disk level
> > encryption (which would have its own overhead). We're putting all of our
> > messages inside of encrypted envelopes before sending them to the stream
> > which limits our opportunities for schema verification of the underlying
> > messages to the declared type of the message.
> >
> > Encryption at rest mostly works out to a sales point for customers who
> want
> > assurances, and in a Kafka focused discussion might be dealt with by
> > covering disk encryption and how the conversations between Kafka
> instances
> > are protected.
> >
> > Christian
> >
> >
> > On Wed, Feb 25, 2015 at 11:51 AM, Jay Kreps  wrote:
> >
> > > Hey guys,
> > >
> > > One thing we tried to do along with the product release was start to
> put
> > > together a practical guide for using Kafka. I wrote this up here:
> > > http://blog.confluent.io/2015/02/25/stream-data-platform-1/
> > >
> > > I'd like to keep expanding on this as good practices emerge and we
> learn
> > > more stuff. So two questions:
> > > 1. Anything you think other people should know about working with data
> > > streams? What did you wish you knew when you got started?
> > > 2. Anything you don't know about but would like to hear more about?
> > >
> > > -Jay
> > >
> >
>


Re: Tips for working with Kafka and data streams

2015-02-25 Thread Christian Csar
I wouldn't say no to some discussion of encryption. We're running on Azure
EventHubs (with preparations for Kinesis for EC2, and Kafka for deployments
in customer datacenters when needed) so can't just use disk level
encryption (which would have its own overhead). We're putting all of our
messages inside of encrypted envelopes before sending them to the stream
which limits our opportunities for schema verification of the underlying
messages to the declared type of the message.
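
As a very rough illustration of what such an envelope can look like (the
layout, the key-identifier idea, and all names here are made up for the
example; the actual encryption step is omitted):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public final class EncryptedEnvelope {
        // Packs a key identifier, the IV, and the already-encrypted payload into
        // one byte[] that is handed to the producer as the message value, so
        // consumers can look up the right key before decrypting.
        public static byte[] wrap(String keyId, byte[] iv, byte[] ciphertext) {
            byte[] keyIdBytes = keyId.getBytes(StandardCharsets.UTF_8);
            ByteBuffer buf = ByteBuffer.allocate(4 + keyIdBytes.length + 4 + iv.length + ciphertext.length);
            buf.putInt(keyIdBytes.length);
            buf.put(keyIdBytes);
            buf.putInt(iv.length);
            buf.put(iv);
            buf.put(ciphertext);
            return buf.array();
        }
    }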

Encryption at rest mostly works out to a sales point for customers who want
assurances, and in a Kafka focused discussion might be dealt with by
covering disk encryption and how the conversations between Kafka instances
are protected.

Christian


On Wed, Feb 25, 2015 at 11:51 AM, Jay Kreps  wrote:

> Hey guys,
>
> One thing we tried to do along with the product release was start to put
> together a practical guide for using Kafka. I wrote this up here:
> http://blog.confluent.io/2015/02/25/stream-data-platform-1/
>
> I'd like to keep expanding on this as good practices emerge and we learn
> more stuff. So two questions:
> 1. Anything you think other people should know about working with data
> streams? What did you wish you knew when you got started?
> 2. Anything you don't know about but would like to hear more about?
>
> -Jay
>


Re: How to mark a message as needing to retry in Kafka?

2015-01-28 Thread Christian Csar
noodles,
   Without an external mechanism you won't be able to mark individual
messages/offsets as needing to be retried at a later time. Guozhang is
describing a way to get the offset of a message that's been received so
that you can find it later. You would need to save that into a 'failed
messages' store somewhere else and have code that looks in there to make
retries happen (assuming you want the failure/retry to persist beyond the
lifetime of the process).
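
A very rough sketch of that idea (everything here is illustrative; in
practice the queue would be a durable store such as MySQL or another Kafka
topic rather than an in-memory structure):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class FailedMessageTracker {
        public static final class FailedMessage {
            public final String topic;
            public final int partition;
            public final long offset;
            public FailedMessage(String topic, int partition, long offset) {
                this.topic = topic;
                this.partition = partition;
                this.offset = offset;
            }
        }

        private final BlockingQueue<FailedMessage> retryQueue = new LinkedBlockingQueue<FailedMessage>();

        // called from the consumer when handling a message fails
        public void markFailed(String topic, int partition, long offset) {
            retryQueue.add(new FailedMessage(topic, partition, offset));
        }

        // a separate timer/worker drains this and re-fetches the messages by offset
        public FailedMessage nextToRetry() throws InterruptedException {
            return retryQueue.take();
        }
    }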

Christian


On Wed, Jan 28, 2015 at 7:00 PM, Guozhang Wang  wrote:

> I see. If you are using the high-level consumer, once the message is
> returned to the application it is considered "consumed", and current it is
> not supported to "re-wind" to a previously consumed message.
>
> With the new consumer coming in 0.8.3 release, we have an api for you to
> get the offset of each message and do the rewinding based on offsets. For
> example, you can do sth. like
>
> 
>
>   message = // get one message from consumer
>
>   try {
> // process message
>   } catch {
> consumer.seek(message.offset)
>   }
>
> 
>
> Guozhang
>
> On Wed, Jan 28, 2015 at 6:26 PM, noodles  wrote:
>
> > I did not describe my problem clearly. In my case, I got the message from
> > Kafka, but I could not handle this message because of some reason, for
> > example the external server is down. So I want to mark the message as not
> > being consumed directly.
> >
> > 2015-01-28 23:26 GMT+08:00 Guozhang Wang :
> >
> > > Hi,
> > >
> > > Which consumer are you using? If you are using a high level consumer
> then
> > > retry would be automatic upon network exceptions.
> > >
> > > Guozhang
> > >
> > > On Wed, Jan 28, 2015 at 1:32 AM, noodles 
> wrote:
> > >
> > > > Hi group:
> > > >
> > > > I'm working for building a webhook notification service based on
> > Kafka. I
> > > > produce all of the payloads into Kafka, and consumers consume these
> > > > payloads by offset.
> > > >
> > > > Sometimes some payloads cannot be consumed because of network
> exception
> > > or
> > > > http server exception. So I want to mark the failed payloads and
> retry
> > > them
> > > > by timers. But I have no idea if I don't use a storage (like MySQL)
> > > except
> > > > kafka and zookeeper.
> > > >
> > > >
> > > > --
> > > > *noodles!*
> > > >
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
> >
> >
> > --
> > *Yeah, I'm noodles!*
> >
>
>
>
> --
> -- Guozhang
>


Re: Proper Relationship Between Partition and Threads

2015-01-28 Thread Christian Csar
Ricardo,
   The parallelism of each logical consumer (consumer group) is the number
of partitions. So with four partitions it could make sense to have one
logical consumer (application) have two processes on different machines
each with two threads, or one process with four. While with two logical
consumers (two different applications) you would want each to have 4
threads (4*2 = 8 threads total).

There are also considerations depending on which consumer code you are
using (though I'm decidedly not someone with good information on that).
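
For the four-partition example above, a rough sketch with the 0.8 high-level
consumer looks like this (property values and names are placeholders); a
second identical process in the same group would pick up the other two
partitions:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class TwoThreadConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zk1.example.com:2181");
            props.put("group.id", "my-application");

            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            Map<String, Integer> topicThreads = new HashMap<String, Integer>();
            topicThreads.put("my-topic", 2);  // ask for two streams, one per thread
            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(topicThreads);
            // hand each KafkaStream to its own consuming thread
        }
    }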

Christian

On Wed, Jan 28, 2015 at 1:28 PM, Ricardo Ferreira <
jricardoferre...@gmail.com> wrote:

> Hi experts,
>
> I'm a newbie in the Kafka world, so excuse me for such a basic question.
>
> I'm in the process of designing a client for Kafka, and after a few hours of
> study, I was told that to achieve a proper level of parallelism, it is a
> best practice to have one thread for each partition of a topic.
>
> My question is whether this rule of thumb also applies for multiple consumer
> applications. For instance:
>
> Considering a topic with 4 partitions, it is OK to have one consumer
> application with 4 threads, just like would be OK to have two consumer
> applications with 2 threads each. But what about having two consumer
> applications with 4 threads each? It would break any load-balancing made by
> Kafka brokers?
>
> Anyway, I'd like to understand if the proper number of threads that should
> match the number of partitions is per application or if there is some other
> best practice.
>
> Thanks in advance,
>
> Ricardo Ferreira
>


Re: Is kafka suitable for our architecture?

2014-10-10 Thread Christian Csar
Kafka might well be equivalent to a simple
>> linear Storm topology.
>>
> Exactly, that's why we are evaluating if only with Kafka is enough.
> Because if Storm gives us the same benefits as Kafka it's better to stick
> with only one technology to keep everything as simple as possible.
> 

I think it is more a question of whether using Storm will make managing your
various consumers easier. Since I haven't used Storm in a production
environment I can't speak to that. I don't think there is any reason you
*need* to use Storm rather than just Kafka to achieve your needs though.

Christian

> 
>> Christian
>>
> 
> Thanks
> 
> 
> 
>>
>> On Thu, Oct 9, 2014 at 11:57 PM, Albert Vila 
>> wrote:
>>
>>> Hi
>>>
>>> We process data in real time, and we are taking a look at Storm and Spark
>>> streaming too, however our actions are atomic, done at a document level
>> so
>>> I don't know if it fits on something like Storm/Spark.
>>>
>>> Regarding what you Christian said, isn't Kafka used for scenarios like
>> the
>>> one I described? I mean, we do have work queues right now with Gearman,
>> but
>>> with a bunch of workers on each step. I thought we could change that to a
>>> producer and a bunch of consumers (where the message should only reach
>> one
>>> and exact one consumer).
>>>
>>> And what I said about the data locally, it was only an optimization we
>> did
>>> some time ago because we was moving more data back then. Maybe now its
>> not
>>> necessary and we could move messages around the system using Kafka, so it
>>> will allow us to simplify the architecture a little bit. I've seen people
>>> saying they move Tb of data every day using Kafka.
>>>
>>> Just to be clear on the size of each document/message, we are talking
>> about
>>> tweets, blog posts, ... (on 90% of cases the size is less than 50Kb)
>>>
>>> Regards
>>>
>>> On 9 October 2014 20:02, Christian Csar  wrote:
>>>
>>>> Apart from your data locality problem it sounds like what you want is a
>>>> workqueue. Kafka's consumer structure doesn't lend itself too well to
>>>> that use case as a single partition of a topic should only have one
>>>> consumer instance per logical subscriber of the topic, and that
>> consumer
>>>> would not be able to mark jobs as completed except in a strict order
>>>> (while maintaining a processed successfully at least once guarantee).
>>>> This is not to say it cannot be done, but I believe your workqueue
>> would
>>>> end up working a bit strangely if built with Kafka.
>>>>
>>>> Christian
>>>>
>>>> On 10/09/2014 06:13 AM, William Briggs wrote:
>>>>> Manually managing data locality will become difficult to scale. Kafka
>>> is
>>>>> one potential tool you can use to help scale, but by itself, it will
>>> not
>>>>> solve your problem. If you need the data in near-real time, you could
>>>> use a
>>>>> technology like Spark or Storm to stream data from Kafka and perform
>>> your
>>>>> processing. If you can batch the data, you might be better off
>> pulling
>>> it
>>>>> into a distributed filesystem like HDFS, and using MapReduce, Spark
>> or
>>>>> another scalable processing framework to handle your transformations.
>>>> Once
>>>>> you've paid the initial price for moving the document into HDFS, your
>>>>> network traffic should be fairly manageable; most clusters built for
>>> this
>>>>> purpose will schedule work to be run local to the data, and typically
>>>> have
>>>>> separate, high-speed network interfaces and a dedicated switch in
>> order
>>>> to
>>>>> optimize intra-cluster communications when moving data is
>> unavoidable.
>>>>>
>>>>> -Will
>>>>>
>>>>> On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila 
>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I just came across Kafta when I was trying to find solutions to
>> scale
>>>> our
>>>>>> current architecture.
>>>>>>
>>>>>> We are currently downloading and processing 6M documents per day
>> from
>>>>>> online and soc

Re: Is kafka suitable for our architecture?

2014-10-09 Thread Christian Csar
Apart from your data locality problem it sounds like what you want is a
workqueue. Kafka's consumer structure doesn't lend itself too well to
that use case as a single partition of a topic should only have one
consumer instance per logical subscriber of the topic, and that consumer
would not be able to mark jobs as completed except in a strict order
(while maintaining a processed successfully at least once guarantee).
This is not to say it cannot be done, but I believe your workqueue would
end up working a bit strangely if built with Kafka.

Christian

On 10/09/2014 06:13 AM, William Briggs wrote:
> Manually managing data locality will become difficult to scale. Kafka is
> one potential tool you can use to help scale, but by itself, it will not
> solve your problem. If you need the data in near-real time, you could use a
> technology like Spark or Storm to stream data from Kafka and perform your
> processing. If you can batch the data, you might be better off pulling it
> into a distributed filesystem like HDFS, and using MapReduce, Spark or
> another scalable processing framework to handle your transformations. Once
> you've paid the initial price for moving the document into HDFS, your
> network traffic should be fairly manageable; most clusters built for this
> purpose will schedule work to be run local to the data, and typically have
> separate, high-speed network interfaces and a dedicated switch in order to
> optimize intra-cluster communications when moving data is unavoidable.
> 
> -Will
> 
> On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila  wrote:
> 
>> Hi
>>
>> I just came across Kafta when I was trying to find solutions to scale our
>> current architecture.
>>
>> We are currently downloading and processing 6M documents per day from
>> online and social media. We have a different workflow for each type of
>> document, but some of the steps are keyword extraction, language detection,
>> clustering, classification, indexation,  We are using Gearman to
>> dispatch the job to workers and we have some queues on a database.
>>
>> I'm wondering if we could integrate Kafka on the current workflow and if
>> it's feasible. One of our main discussions are if we have to go to a fully
>> distributed architecture or to a semi-distributed one. I mean, distribute
>> everything or process some steps on the same machine (crawling, keyword
>> extraction, language detection, indexation). We don't know which one scales
>> more, each one has pros and cont.
>>
>> Now we have a semi-distributed one as we had network problems taking into
>> account the amount of data we were moving around. So now, all documents
>> crawled on server X, later on are dispatched through Gearman to the same
>> server. What we dispatch on Gearman is only the document id, and the
>> document data remains on the crawling server on a Memcached, so the network
>> traffic is keep at minimum.
>>
>> What do you think?
>> It's feasible to remove all database queues and Gearman and move to Kafka?
>> As Kafka is mainly based on messages I think we should move the messages
>> around, should we take into account the network? We may face the same
>> problems?
>> If so, there is a way to isolate some steps to be processed on the same
>> machine, to avoid network traffic?
>>
>> Any help or comment will be appreciate. And If someone has had a similar
>> problem and has knowledge about the architecture approach will be more than
>> welcomed.
>>
>> Thanks
>>
> 



Re: Use case

2014-09-05 Thread Christian Csar
The thought experiment I did ended up having a set of front end servers
corresponding to a given chunk of the user id space, each of which was a
separate subscriber to the same set of partitions. Then you have one or
more partitions corresponding to that same chunk of users. You want the
chunk/set of partitions to be of a size where each of those front end
servers can process all the messages in it and send out the
chats/notifications/status change notifications perhaps/read receipts,
to those users who happen to be connected to the particular front end node.

You would need to handle some deduplication on the consumers/FE servers
and would need to decide where to produce. Producing from every front
end server to potentially every broker could be expensive in terms of
connections and you might want to first relay the messages to the
corresponding front end cluster, but since we don't use large numbers of
producers it's hard for me to say.

For persistence and offline delivery you can probably accept a delay in
user receipt, so you can use another set of consumers that persist the
messages to a longer-latency datastore on the backend, and then fetch the
last 50 or so messages with a bit of lag when the user first looks at
history (see HipChat and Hangouts lag).

This gives you a smaller number of partitions and avoids the issue of
having to keep too much history on the Kafka brokers. There are
obviously a significant number of complexities to deal with. For example,
if you are using default consumer code that commits offsets into
ZooKeeper, that may be inadvisable at scales you probably don't need
to worry about reaching. And remember I had done this only as a thought
experiment not a proper technical evaluation. I expect Kafka, used
correctly, can make aspects of building such a chat system much much
easier (you can avoid writing your own message replication system) but
it is definitely not plug and play using topics for users.

Christian


On 09/05/2014 09:46 AM, Jonathan Weeks wrote:
> +1
> 
> Topic Deletion with 0.8.1.1 is extremely problematic, and coupled with the 
> fact that rebalance/broker membership changes pay a cost per partition today, 
> whereby excessive partitions extend downtime in the case of a failure; this 
> means fewer topics (e.g. hundreds or thousands) is a best practice in the 
> published version of kafka. 
> 
> There are also secondary impacts on topic count — e.g. useful operational 
> tools such as: http://quantifind.com/KafkaOffsetMonitor/ start to become 
> problematic in terms of UX with a massive number of topics.
> 
> Once topic deletion is a supported feature, the use-case outlined might be 
> more tenable.
> 
> Best Regards,
> 
> -Jonathan
> 
> On Sep 5, 2014, at 4:20 AM, Sharninder  wrote:
> 
>> I'm not really sure about your exact use-case but I don't think having a
>> topic per user is very efficient. Deleting topics in kafka, at the moment,
>> isn't really straightforward. You should rethink your data pipeline a bit.
>>
>> Also, just because kafka has the ability to store messages for a certain
>> time, don't think of it as a data store. Kafka is a streaming system, think
>> of it as a fast queue that gives you the ability to move your pointer back.
>>
>> --
>> Sharninder
>>
>>
>>
>> On Fri, Sep 5, 2014 at 4:27 PM, Aris Alexis 
>> wrote:
>>
>>> Thanks for the reply. If I use it only for activity streams like twitter:
>>>
>>> I would want a topic for each #tag and a topic for each user and maybe
>>> for each city. Would that be too many topics, or does it not matter since most
>>> of them will be deleted in a specified interval.
>>>
>>>
>>>
>>> Best Regards,
>>> Aris Giachnis
>>>
>>>
>>> On Fri, Sep 5, 2014 at 6:57 AM, Sharninder  wrote:
>>>
 Since you want all chats and mail history persisted all the time, I
 personally wouldn't recommend kafka for your requirement. Kafka is more
 suitable as a streaming system where events expire after a certain time.
 Look at something more general purpose like hbase for persisting data
 indefinitely.

 So, for example all activity streams can go into kafka from where
>>> consumers
 will pick up messages to parse and put them to hbase or other clients.

 --
 Sharninder





 On Fri, Sep 5, 2014 at 12:05 AM, Aris Alexis 
 wrote:

> Hello,
>
> I am building a big web application that I want to be massively
>>> scalable
 (I
> am using cassandra and titan as a general db).
>
> I want to implement the following:
>
> real time web chat that is persisted so that user a in the future can
> recall his chat with user b,c,d much like facebook.
> mail like messages in the web application (not sure about this as it is
> somewhat covered by the first one)
> user activity streams
> users subscribing to topics for example florida/musicevents
>
> Could i use kafka for this? can you recommend another technology mayb

Re: Handling send failures with async producer

2014-08-26 Thread Christian Csar
TLDR: I use one Callback per job I send to Kafka and include that sort
of information by reference in the Callback instance.

Our system is currently moving data from beanstalkd to Kafka due to
historical reasons so we use the callback to either delete or release
the message depending on success. The
org.apache.kafka.clients.producer.Callback I give to the send method is
an instance of a class that stores all the additional information I need
to process the callback. Remember that the async callbacks run on the
Kafka producer thread, so they must be fast to avoid constraining the
throughput. My callback ends up putting information about the call to
beanstalk into another executor service for later processing.
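
A rough sketch of the shape of such a callback (BeanstalkClient and the job
id are stand-ins for whatever source system you are draining; only the
Callback interface itself is Kafka's):

    import java.util.concurrent.ExecutorService;
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class JobCompletionCallback implements Callback {
        // hypothetical client for the source system
        public interface BeanstalkClient {
            void delete(long jobId);   // job is done, message is safely in Kafka
            void release(long jobId);  // make the job visible again for retry
        }

        private final long jobId;
        private final BeanstalkClient beanstalk;
        private final ExecutorService followUp;

        public JobCompletionCallback(long jobId, BeanstalkClient beanstalk,
                                     ExecutorService followUp) {
            this.jobId = jobId;
            this.beanstalk = beanstalk;
            this.followUp = followUp;
        }

        @Override
        public void onCompletion(final RecordMetadata metadata, final Exception exception) {
            // runs on the producer's I/O thread: keep it cheap, defer the real work
            followUp.submit(new Runnable() {
                @Override
                public void run() {
                    if (exception == null) {
                        beanstalk.delete(jobId);
                    } else {
                        beanstalk.release(jobId);
                    }
                }
            });
        }
    }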

Christian

On 08/26/2014 12:35 PM, Ryan Persaud wrote:
> Hello,
> 
> I'm looking to insert log lines from log files into kafka, but I'm concerned 
> with handling asynchronous send() failures.  Specifically, if some of the log 
> lines fail to send, I want to be notified of the failure so that I can 
> attempt to resend them.
> 
> Based on previous threads on the mailing list 
> (http://comments.gmane.org/gmane.comp.apache.kafka.user/1322), I know that 
> the trunk version of kafka supports callbacks for dealing with failures.  
> However, the callback function is not passed any metadata that can be used by 
> the producer end to reference the original message.  Including the key of the 
> message in the RecordMetadata seems like it would be really useful for 
> recovery purposes.  Is anyone using the callback functionality to trigger 
> resends of failed messages?  If so, how are they tying the callbacks to 
> messages?  Is anyone using other methods for handling async errors/resending 
> today?  I can’t imagine that I am the only one trying to do this.  I asked 
> this question on the IRC channel today, and it sparked some discussion, but I 
> wanted to hear from a wider audience.
> 
> Thanks for the information,
> -Ryan
> 
> 






Re: EBCDIC support

2014-08-25 Thread Christian Csar
Having been spared any EBCDIC experience whatsoever (i.e. from a position
of thorough ignorance), if you are transmitting text or things with a
designated textual form (presumably), I would recommend that your
conversion be to Unicode rather than ASCII if you don't already have
consumers expecting a given conversion. That way you will avoid losing
information, particularly if you expect any of your conversion tools to
be of more general use.
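
For illustration only (the code page is an assumption; CP037 is one common
EBCDIC variant, and the extended charsets may not ship with every JRE):

    import java.nio.charset.Charset;

    public class EbcdicDecoder {
        private static final Charset EBCDIC = Charset.forName("Cp037");

        // Converts EBCDIC bytes to a Unicode String (Java Strings are UTF-16
        // internally), which can then be re-encoded however the consumers need.
        public static String decode(byte[] ebcdicBytes) {
            return new String(ebcdicBytes, EBCDIC);
        }
    }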

Christian

On 08/25/2014 05:36 PM, Gwen Shapira wrote:
> Personally, I like converting data before writing to Kafka, so I can
> easily support many consumers who don't know about EBCDIC.
> 
> A third option is to have a consumer that reads EBCDIC data from one
> Kafka topic and writes ASCII to another Kafka topic. This has the
> benefits of preserving the raw data in Kafka, in case you need it for
> troubleshooting, and also supporting non-EBCDIC consumers.
> 
> The cost is a more complex architecture, but if you already have a
> stream processing system around (Storm, Samza, Spark), it can be an
> easy addition.
> 
> 
> On Mon, Aug 25, 2014 at 5:28 PM,   wrote:
>> Thanks Gwen! makes sense. So I'll have to weigh the pros and cons of doing 
>> an EBCDIC to ASCII conversion before sending to Kafka Vs. using an ebcdic 
>> library after in the consumer
>>
>> Thanks!
>> S
>>
>> -Original Message-
>> From: Gwen Shapira [mailto:gshap...@cloudera.com]
>> Sent: Monday, August 25, 2014 5:22 PM
>> To: users@kafka.apache.org
>> Subject: Re: EBCDIC support
>>
>> Hi Sonali,
>>
>> Kafka doesn't really care about EBCDIC or any other format -  for Kafka bits 
>> are just bits. So they are all supported.
>>
>> Kafka does not "read" data from a socket though. Well, it does, but the data 
>> has to be sent by a Kafka producer. Most likely you'll need to implement a 
>> producer that will get the data from the socket and send it as a message to 
>> Kafka. The content of the message can be anything, including EBCDIC -.
>>
>> Then  you'll need a consumer to read the data from Kafka and do something 
>> with this - the consumer will need to know what to do with a message that 
>> contains EBCDIC data. Perhaps you have EBCDIC libraries you can reuse there.
>>
>> Hope this helps.
>>
>> Gwen
>>
>> On Mon, Aug 25, 2014 at 5:14 PM,   wrote:
>>> Hey all,
>>>
>>> This might seem like a silly question, but does kafka have support for 
>>> EBCDIC? Say I had to read data from an IBM mainframe via a TCP/IP socket 
>>> where the data resides in EBCDIC format, can Kafka read that directly?
>>>
>>> Thanks,
>>> Sonali
>>>
>>> 
>>>






Re: Which producer to use?

2014-06-23 Thread Christian Csar
I ended up coding against the new one,
org.apache.kafka.clients.producer.Producer, though it is not yet in
production here. It might be slightly more painful to select a partition
since there isn't a place to plug in a partitioner class, but overall it
was quite easy and had the key feature of an async callback.
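
For the partition-selection part, a rough sketch of what I mean (the hashing
scheme is just an example, and signatures in the 0.8.1 clients jar differ
slightly from later releases):

    import java.util.Arrays;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ExplicitPartitionSender {
        // Without a pluggable partitioner, compute the partition yourself and
        // pass it explicitly on the record.
        public static void send(Producer<byte[], byte[]> producer, String topic,
                                int numPartitions, byte[] key, byte[] value) {
            int partition = (Arrays.hashCode(key) & 0x7fffffff) % numPartitions;
            producer.send(new ProducerRecord<>(topic, partition, key, value));
        }
    }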

Christian

On 06/23/2014 04:54 PM, Guozhang Wang wrote:
> Hi Kyle,
> 
> We have not fully completed the test in production yet for the new
> producer, currently some improvement jiras like KAFKA-1498 are still open.
> Once we have it stabilized in production at LinkedIn we plan to update the
> wiki in favor of the new producer.
> 
> Guozhang
> 
> 
> On Mon, Jun 23, 2014 at 3:39 PM, Kyle Banker  wrote:
> 
>> As of today, the latest Kafka docs show kafka.javaapi.producer.Producer in
>> their example of the producer API (
>> https://kafka.apache.org/documentation.html#producerapi).
>>
>> Is there a reason why the latest producer client
>> (org.apache.kafka.clients.producer.Producer)
>> isn't mentioned? Is this client not preferred or production-ready?
>>
> 
> 
> 






Re: 0.8.1 Java Producer API Callbacks

2014-05-01 Thread Christian Csar
On 05/01/2014 07:22 PM, Christian Csar wrote:
> I'm looking at using the java producer api for 0.8.1 and I'm slightly
> confused by this passage from section 4.4 of
> https://kafka.apache.org/documentation.html#theproducer
> "Note that as of Kafka 0.8.1 the async producer does not have a
> callback, which could be used to register handlers to catch send errors.
> Adding such callback functionality is proposed for Kafka 0.9, see
> [Proposed Producer
> API](https://cwiki.apache.org/confluence/display/KAFKA/Client+Rewrite#ClientRewrite-ProposedProducerAPI)."
> 
> org.apache.kafka.clients.producer.KafkaProducer in 0.8.1 appears to have
> public Future send(ProducerRecord record, Callback
> callback) which looks like the mentioned callback.
> 
> How do the callbacks work with the async producer? Is it as described in the
> comment on the send method (see
> https://github.com/apache/kafka/blob/0.8.1/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L151
> for reference)?
> 
> Looking around it seems plausible the language in the documentation
> might refer to a separate sort of callback that existed in 0.7 but not
> 0.8. In our use case we have something useful to do if we can detect
> messages failing to be sent.
> 
> Christian
> 

It appears that I was looking at the Java client rather than the Scala
java api referenced by the documentation
https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/javaapi/producer/Producer.scala

Are both of these currently suited for use from Java and still
supported? Given the support for callbacks in the event of failure, I am
inclined to use the Java one despite the currently limited support for
specifying partitioners (though it supports specifying the partition) or
encoders.

Any guidance on this would be appreciated.

Christian





0.8.1 Java Producer API Callbacks

2014-05-01 Thread Christian Csar
I'm looking at using the java producer api for 0.8.1 and I'm slightly
confused by this passage from section 4.4 of
https://kafka.apache.org/documentation.html#theproducer
"Note that as of Kafka 0.8.1 the async producer does not have a
callback, which could be used to register handlers to catch send errors.
Adding such callback functionality is proposed for Kafka 0.9, see
[Proposed Producer
API](https://cwiki.apache.org/confluence/display/KAFKA/Client+Rewrite#ClientRewrite-ProposedProducerAPI)."

org.apache.kafka.clients.producer.KafkaProducer in 0.8.1 appears to have
public Future send(ProducerRecord record, Callback
callback) which looks like the mentioned callback.

How do the callbacks work with the async producer? Is it as described in the
comment on the send method (see
https://github.com/apache/kafka/blob/0.8.1/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L151
for reference)?

Looking around it seems plausible the language in the documentation
might refer to a separate sort of callback that existed in 0.7 but not
0.8. In our use case we have something useful to do if we can detect
messages failing to be sent.

Christian


