Re: What is the best way to write Kafka data into HDFS?
R P, happy to walk you through https://github.com/DemandCube/Scribengin if you're interested.

On Wed, Feb 10, 2016 at 5:09 PM, R P <hadoo...@outlook.com> wrote:
> Hello All,
> New Kafka user here. What is the best way to write Kafka data into HDFS? I have looked into the following options and found that Flume is the quickest and easiest to set up.
> 1. Flume
> 2. KaBoom
> 3. Kafka Hadoop Loader
> 4. Camus -> Gobblin
> Flume can, however, run into small-file problems when your data is partitioned and some partitions generate only sporadic data.
> What are some best practices and options to write data from Kafka to HDFS?
> Thanks, R P

--
*Steve Morin | Managing Partner - CTO* *Nvent* | smo...@nventdata.com
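For reference, a rough sketch of the Flume route the original poster mentions: a Kafka source feeding an HDFS sink. This is untested; the agent, ZooKeeper host, topic, and path names are made up, and the roll settings (size/time based, count disabled) are one way to mitigate the small-file problem on sparsely producing partitions:

    # Flume 1.6+ agent: Kafka source -> memory channel -> HDFS sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
    a1.sources.r1.zookeeperConnect = zk1:2181
    a1.sources.r1.topic = events
    a1.sources.r1.groupId = flume
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /data/events/%y-%m-%d
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    a1.sinks.k1.hdfs.fileType = DataStream
    # roll by size/time and disable count-based rolls, which helps against
    # the small-file problem on partitions that produce only sporadically
    a1.sinks.k1.hdfs.rollSize = 134217728
    a1.sinks.k1.hdfs.rollInterval = 3600
    a1.sinks.k1.hdfs.rollCount = 0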
Re: Kafka question about starting kafka with same broker id
Why would you want to ever do that?

On Feb 18, 2015, at 15:16, Deepak Dhakal ddha...@salesforce.com wrote:

Hi, my name is Deepak and I work for Salesforce. We are using kafka 8.11 and have a question about starting kafka with the same broker id.

Steps:
Start a Kafka broker with broker id = 1 - it starts fine with external ZK.
Start another Kafka with the same broker id = 1 .. it does not start the second broker, which is expected, but I am seeing the following log and it keeps retrying forever. Is there a way to control how many times a broker tries to start itself with the same broker id?

Thanks, Deepak

[2015-02-18 14:47:20,713] INFO conflict in /controller data: {version:1,brokerid:19471,timestamp:1424299100135} stored data: {version:1,brokerid:19471,timestamp:1424288444314} (kafka.utils.ZkUtils$)
[2015-02-18 14:47:20,716] INFO I wrote this conflicted ephemeral node [{version:1,brokerid:19471,timestamp:1424299100135}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
[2015-02-18 14:47:30,719] INFO conflict in /controller data: {version:1,brokerid:19471,timestamp:1424299100135} stored data: {version:1,brokerid:19471,timestamp:1424288444314} (kafka.utils.ZkUtils$)
[2015-02-18 14:47:30,722] INFO I wrote this conflicted ephemeral node [{version:1,brokerid:19471,timestamp:1424299100135}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
Re: New Producer - ONLY sync mode?
Jay, thanks, I'll look at that more closely.

On Sat, Feb 7, 2015 at 1:23 PM, Jay Kreps jay.kr...@gmail.com wrote:

Steve, in terms of mimicking the sync behavior, I think that is what .get() does, no? We are always returning the offset and error information. The example I gave didn't make use of it, but you definitely can make use of it if you want to. -Jay

On Wed, Feb 4, 2015 at 9:58 AM, Steve Morin st...@stevemorin.com wrote:

Looking at this thread, I would ideally want at least the right recipe to mimic sync behavior like Otis is talking about. In the second case, I would like to be able to know individually if messages have failed, even if they are in separate batches, sort of like what Kinesis does, as Pradeep mentioned. -Steve

On Wed, Feb 4, 2015 at 11:19 AM, Jay Kreps jay.kr...@gmail.com wrote:

Yeah totally. Using a callback is, of course, the Right Thing for this kind of stuff. But I have found that kind of asynchronous thinking can be hard for people. Even if you get past the pre-Java 8 syntactic pain that anonymous inner classes inflict, just dealing with multiple threads of control without creating async spaghetti can be a challenge for complex stuff. That is really the only reason for the futures in the API; they are strictly less powerful than the callbacks, but at least using them you can just call .get() and pretend it is blocking. -Jay

On Wed, Feb 4, 2015 at 7:19 AM, Joe Stein joe.st...@stealth.ly wrote:

Now that 0.8.2.0 is in the wild I look forward to working with it more and seeing what folks start to do with this function https://dist.apache.org/repos/dist/release/kafka/0.8.2.0/javadoc/org/apache/kafka/clients/producer/KafkaProducer.html#send(org.apache.kafka.clients.producer.ProducerRecord, org.apache.kafka.clients.producer.Callback) while keeping it fully non-blocking. One sprint I know of coming up is going to have the new producer as a component in their reactive calls, handling bookkeeping and retries through that type of callback approach. It should work well (haven't tried, but don't see why not) with Akka, ScalaZ, RxJava, Finagle, etc. in functional languages and frameworks. I think as JDK 8 gets out into the wild more (maybe after JDK 7 EOL), the use of .get will be reduced (IMHO) and folks will be thinking more about non-blocking vs blocking and not so much sync vs async, but my crystal ball is just back from the shop so we'll see =8^)

On Tue, Feb 3, 2015 at 10:45 PM, Jay Kreps jay.kr...@gmail.com wrote:

Hey guys, I guess the question is whether it really matters how many underlying network requests occur? It is very hard for an application to depend on this even in the old producer, since it depends on the partition placement (a send to two partitions may go to either one machine or two, and so it will send either one or two requests). So when you send a batch in one call you may feel that is all at once, but that is only actually guaranteed if all messages have the same partition. The challenge is allowing even this in the presence of the bounded request sizes we have in the new producer. The user sends a list of objects and the serialized size that will result is not very apparent to them. If you break it up into multiple requests then that is kind of further ruining the illusion of a single send. If you don't, then you have to just error out, which is equally annoying to have to handle.

But I'm not sure if from your description you are saying you actually care how many physical requests are issued. I think it is more that it is just syntactically annoying to send a batch of data now because it needs a for loop. Currently to do this you would do:

    List responses = new ArrayList();
    for (input : recordBatch)
        responses.add(producer.send(input));
    for (response : responses)
        response.get();

If you don't depend on the offset/error info we could add a flush call so you could instead do:

    for (input : recordBatch)
        producer.send(input);
    producer.flush();

But if you do want the error/offset then you are going to be back to the original case. Thoughts? -Jay

On Mon, Feb 2, 2015 at 1:48 PM, Gwen Shapira gshap...@cloudera.com wrote:

I've been thinking about that too, since both Flume and Sqoop
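To make the blocking pattern above concrete, here is a minimal sketch against the 0.8.2 new producer API; the broker address and topic name are placeholders:

    import java.util.Properties;
    import java.util.concurrent.Future;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class SyncSendExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            KafkaProducer<byte[], byte[]> producer = new KafkaProducer<byte[], byte[]>(props);
            try {
                Future<RecordMetadata> future =
                    producer.send(new ProducerRecord<byte[], byte[]>("my-topic", "hello".getBytes()));
                // get() blocks until the broker acks and rethrows any send error,
                // which is the "sync" behavior discussed in this thread
                RecordMetadata meta = future.get();
                System.out.println("acked at offset " + meta.offset());
            } finally {
                producer.close();
            }
        }
    }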
Re: [ANNOUNCEMENT] Apache Kafka 0.8.2.0 Released
Congrats team, it's a big accomplishment.

On Feb 5, 2015, at 14:22, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Big thanks to Jun and everyone else involved! We're on 0.8.2 as of today. :) Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr Elasticsearch Support * http://sematext.com/

On Tue, Feb 3, 2015 at 8:37 PM, Jun Rao j...@confluent.io wrote:

The Apache Kafka community is pleased to announce the release of Apache Kafka 0.8.2.0. The 0.8.2.0 release introduces many new features, improvements and fixes, including:
- A new Java producer for ease of implementation and enhanced performance.
- Kafka-based offset storage.
- Delete topic support.
- Per-topic configuration of preference for consistency over availability.
- Scala 2.11 support and dropping support for Scala 2.8.
- LZ4 compression.

All of the changes in this release can be found at: https://archive.apache.org/dist/kafka/0.8.2.0/RELEASE_NOTES.html

Apache Kafka is a high-throughput, publish-subscribe messaging system rethought of as a distributed commit log.
** Fast: A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
** Scalable: Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers.
** Durable: Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
** Distributed by Design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.

You can download the release from: http://kafka.apache.org/downloads.html

We welcome your help and feedback. For more information on how to report problems, and to get involved, visit the project website at http://kafka.apache.org/

Thanks, Jun
Re: Client Offset Storage
Suren, like out-of-the-box storage or roll your own? -Steve

On Fri, Dec 12, 2014 at 6:33 AM, Surendranauth Hiraman suren.hira...@velos.io wrote:

My team is using Kafka 0.8.1 and we may not be able to upgrade to 0.8.2 to take advantage of the broker-side commit of client offsets. Is anyone aware of a Java/Scala library for client offset storage outside of ZK?
--
SUREN HIRAMAN, VP TECHNOLOGY
Velos - Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
O: (917) 525-2466 ext. 105 | F: 646.349.4063
E: suren.hira...@velos.io | W: www.velos.io
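For the roll-your-own case on 0.8.1, where consumers manage their own offsets, a small file-backed store is one option. This is a hypothetical sketch, not a known library; a production version would want atomic writes and probably a database instead of a properties file:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.Properties;

    // Hypothetical file-backed offset store keyed by (group, topic, partition).
    public class FileOffsetStore {
        private final File file;
        private final Properties offsets = new Properties();

        public FileOffsetStore(File file) throws IOException {
            this.file = file;
            if (file.exists()) {
                try (InputStream in = new FileInputStream(file)) { offsets.load(in); }
            }
        }

        public long fetch(String group, String topic, int partition) {
            return Long.parseLong(offsets.getProperty(key(group, topic, partition), "0"));
        }

        public synchronized void commit(String group, String topic, int partition, long offset)
                throws IOException {
            offsets.setProperty(key(group, topic, partition), Long.toString(offset));
            try (OutputStream out = new FileOutputStream(file)) { offsets.store(out, "kafka offsets"); }
        }

        private static String key(String g, String t, int p) { return g + "/" + t + "/" + p; }
    }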
Re: Programmatic Kafka version detection/extraction?
That would be great!

On Mon, Nov 10, 2014 at 9:45 PM, Jun Rao jun...@gmail.com wrote:

Otis, we don't have an API for that now. We can probably expose this via JMX as part of KAFKA-1481. Thanks, Jun

On Mon, Nov 10, 2014 at 7:17 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi, is there a way to detect which version of Kafka one is running? Is there an API for that, or a constant with this value, or maybe an MBean or some other way to get to this info? Thanks, Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr Elasticsearch Support * http://sematext.com/
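Until such an MBean exists, one stopgap is reading the jar manifest from code. This is only a sketch and assumes the Kafka jar is on the classpath and its build stamps Implementation-Version into the manifest, which may not hold for every distribution:

    public class KafkaVersionProbe {
        public static void main(String[] args) throws Exception {
            // Assumption: the kafka jar's manifest carries Implementation-Version;
            // many builds leave this unset, in which case this prints null.
            Package p = Class.forName("kafka.Kafka").getPackage();
            if (p == null) {
                System.out.println("no package metadata available");
                return;
            }
            System.out.println("title:   " + p.getImplementationTitle());
            System.out.println("version: " + p.getImplementationVersion());
        }
    }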
Re: migrating log data to new locations
Neha, is it one log volume, or can it be volumes, plural? -Steve

On Tue, Oct 7, 2014 at 6:41 AM, Neha Narkhede neha.narkh...@gmail.com wrote:

> Is it possible to perform this migration without losing the data currently stored in the kafka cluster?

Though I haven't tested this, the way this is designed should allow you to shut down a broker, move some partition directories over to the new log volume, and restart the broker. You will have to do this manually per broker though. Thanks, Neha

On Tue, Oct 7, 2014 at 3:31 AM, Javier Alba m...@fjavieralba.com wrote:

Hi, I have a Kafka 0.8.1.1 cluster consisting of 4 servers with several topics on it. The cluster was initially configured to store kafka log data in a single directory on each server (log.dirs = /tmp/kafka-logs). Now, I have assigned 3 new disks to each server and I would like to use them to store log data instead of the old directory (log.dirs = /srv/data/1,/srv/data/2,/srv/data/3). What would be the recommended way of doing such a migration? Is it possible to perform this migration without losing the data currently stored in the kafka cluster? Would it be possible to achieve that kind of change without having to stop the cluster and losing service? Thanks,
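An untested sketch of the per-broker procedure Neha describes, using the paths from this thread; the topic-partition directory names (mytopic-0, etc.) are made-up examples, and you would do one broker at a time so the remaining replicas keep serving:

    bin/kafka-server-stop.sh                           # 1. stop the broker
    mv /tmp/kafka-logs/mytopic-0 /srv/data/1/          # 2. spread partition dirs over the new volumes
    mv /tmp/kafka-logs/mytopic-1 /srv/data/2/
    mv /tmp/kafka-logs/mytopic-2 /srv/data/3/
    # 3. edit config/server.properties:
    #      log.dirs=/srv/data/1,/srv/data/2,/srv/data/3
    bin/kafka-server-start.sh config/server.properties # 4. restart and let it rejoin the ISR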
Re: Right Tool
What record format are you writing to Kafka with?

On Sep 12, 2014, at 17:45, Patrick Barker patrickbarke...@gmail.com wrote:

Oh, I'm not trying to use it for persistence, I'm wanting to sync 3 databases: sql, mongo, graph. I want to publish to kafka and then have it update the db's. I'm wanting to keep this as efficient as possible.

On Fri, Sep 12, 2014 at 6:39 PM, cac...@gmail.com wrote:

I would say that it depends upon what you mean by persistence. I don't believe Kafka is intended to be your permanent data store, but it would work if you were basically write-once with appropriate query patterns. It would be an odd way to describe it though. Christian

On Fri, Sep 12, 2014 at 4:05 PM, Stephen Boesch java...@gmail.com wrote:

Hi Patrick, Kafka can be used at any scale, including small ones (initially anyway). The issues I ran into personally were various issues with ZooKeeper management and a bug in deleting topics (is that fixed yet?). In any case you might try out Kafka, given its highly performant, scalable, and flexible backbone. After that you will have little worry about scale, given Kafka's use within massive web-scale deployments.

2014-09-12 15:18 GMT-07:00 Patrick Barker patrickbarke...@gmail.com:

Hey, I'm new to kafka and I'm trying to get a handle on how it all works. I want to integrate polyglot persistence into my application. Kafka looks like exactly what I want, just on a smaller scale. I am currently only dealing with about 2,000 users, which may grow, but is kafka a good use case here, or is there another technology that's better suited? Thanks
Re: Right Tool
You would need to make sure they were all persisted down properly to each database. Why are you persisting it to three different databases (sql, mongo, graph)? -Steve

On Fri, Sep 12, 2014 at 7:35 PM, Patrick Barker patrickbarke...@gmail.com wrote:

I'm just getting familiar with kafka. Currently I just save everything to all my db's in a single transaction; if any of them fail I roll them all back. However, this is slowing my app down. So, as I understand it, I could write to kafka, close the transaction, and then it would keep on publishing out to my databases. I'm not sure what format I would write it in yet, I guess JSON.

On Fri, Sep 12, 2014 at 7:00 PM, Steve Morin steve.mo...@gmail.com wrote:

What record format are you writing to Kafka with?
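To make the publish-then-update flow concrete, here is a rough sketch of the consumer side using the 0.8 high-level consumer. The topic name and the three write methods are hypothetical stand-ins, and a real implementation would need idempotent writers since the broker gives at-least-once delivery:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class DbSyncConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181");
            props.put("group.id", "db-sync");
            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("entity-updates", 1));
            ConsumerIterator<byte[], byte[]> it = streams.get("entity-updates").get(0).iterator();
            while (it.hasNext()) {
                byte[] json = it.next().message();
                // fan the same update out to each store, outside the app's transaction
                writeSql(json);
                writeMongo(json);
                writeGraph(json);
            }
        }
        // the three writers are stubs standing in for real database clients
        static void writeSql(byte[] m) { }
        static void writeMongo(byte[] m) { }
        static void writeGraph(byte[] m) { }
    }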
Re: Consume more than produce
You have to remember statsd uses UDP and is possibly lossy, which might account for the errors. -Steve

On Fri, Aug 1, 2014 at 1:28 AM, Guy Doulberg guy.doulb...@perion.com wrote:

Hey, after a year or so of having Kafka as my streaming layer in production, I decided it is time to audit, and to test how many events I lose, if I lose events at all. I discovered something interesting which I can't explain: the producer produces fewer events than the consumer group consumes. It is not much more, it is about 0.1% more events. I use the Consumer API (not the simple consumer API). I was thinking I might have had rebalancing going on in my system, but it doesn't look like that. Has anyone seen such behaviour?

In order to audit, I calculated for each event the minute it arrived and assigned this value to the event. I used statsd to count all events from all my producer cluster and all consumer group cluster. I must say that it is not happening for every minute. Thanks, Guy
Re: kafka support in collectd and syslog-ng
Cool

On Fri, Jul 25, 2014 at 9:25 AM, Joe Stein joe.st...@stealth.ly wrote:

Awesome!

On Fri, Jul 25, 2014 at 12:08 PM, Jay Kreps jay.kr...@gmail.com wrote:

This is great! -Jay

On Fri, Jul 25, 2014 at 8:55 AM, Pierre-Yves Ritschard p...@spootnik.org wrote:

Hi list, just a quick note to let you know that kafka support has now been merged into collectd, which means that system and application metrics can be produced directly to a topic from the collectd daemon. Additionally, syslog-ng will soon ship with a kafka producing module as well; it will be part of the next release of the syslog-ng-incubator module collection: https://github.com/balabit/syslog-ng-incubator. What this means is that there is now a very lightweight way to create an infrastructure event stream on top of kafka, with known tools in the ops world. I relied on the librdkafka library to provide kafka producing, which I can recommend for C needs. - pyr
Re: Kafka on yarn
Kam, give it some time and I think it's getting better as a real possibility for Kafka on YARN. There are new capabilities coming out in YARN/HDFS to allow for node groups/labels that can work with locality, and secondarily new functionality in HDFS that, depending on the use case, can be very interesting with in-memory files. -Steve

On Wed, Jul 23, 2014 at 4:44 PM, Kam Kasravi kamkasr...@yahoo.com.invalid wrote:

Thanks Joe for the input related to Mesos, as well as acknowledging the need for YARN to support this type of cluster allocation: long-running services with node locality priority.

Thanks Jay - that's an interesting fact that I wasn't aware of, though I imagine there could possibly be a long latency for the replica data to be transferred to the new broker (depending on the number/size of partitions). It does open up some possibilities to restart brokers on app master restart using different containers (as well as some complications if an old container with old data were reallocated on restart). I had used zookeeper to store broker locations so the app master on restart would look for this information and attempt to reallocate containers on these nodes. All this said, would this be part of kafka or some other framework? I can see kafka benefitting from this; at the same time kafka's appeal IMO is its simplicity. Spark has chosen to include YARN within its distribution; not sure what the kafka team thinks.

On Wednesday, July 23, 2014 4:19 PM, Jay Kreps jay.kr...@gmail.com wrote:

Hey Kam, it would be nice to have a way to get a failed node back with its original data, but this isn't strictly necessary; it is just a good optimization. As long as you run with replication you can restart a broker elsewhere with no data, and it will restore its state off the other replicas. -Jay

On Wed, Jul 23, 2014 at 3:47 PM, Kam Kasravi kamkasr...@yahoo.com.invalid wrote:

Hi, Kafka-on-YARN requires YARN to consistently allocate a kafka broker at a particular resource since the broker needs to always use its local data. YARN doesn't do this well, unless you provide (override) the default scheduler (CapacityScheduler or FairScheduler). SequenceIQ did something along these lines for a different use case. Unfortunately replacing the scheduler is a global operation which would affect all app masters. Additionally one could argue that the broker should be run as an OS service and auto-restarted on failure if necessary. Slider (incubating) did some of this groundwork, but YARN still has lots of limitations in providing guarantees to consistently allocate a container on a particular node, especially on appmaster restart (e.g. ResourceManager dies). That said, it might be worthwhile to enumerate all of this here with some possible solutions. If there is interest I could certainly list the relevant JIRAs along with some additional JIRAs required IMO. Thanks, Kam

On Wednesday, July 23, 2014 2:37 PM, hsy...@gmail.com wrote:

Hi guys, Kafka is getting more and more popular, and in most cases people run kafka as a long-term service in the cluster. Is there a discussion of running kafka on a YARN cluster so we can utilize the convenient configuration/resource management and HA? I think there is big potential and requirement for that. I found a project https://github.com/kkasravi/kafka-yarn. But is there an official roadmap/plan for this? Thank you very much! Best, Siyuan
Re: Performance/Stress tools
We are working on this YARN-based tool, but it's still in alpha: https://github.com/DemandCube/DemandSpike

On Wed, Jul 16, 2014 at 11:30 AM, Dayo Oliyide dayo.oliy...@gmail.com wrote:

Hi, I'm setting up a Kafka cluster and would like to carry out some performance/stress tests on different configurations. Other than the performance testing scripts that come with Kafka, are there any other tools that anyone would recommend? Regards, Dayo
Re: kafka producer pulling from custom restAPI
The answer is no, it doesn't work that way. You would have to write a process to consume from the API back end and have that back end write to Kafka.

On Jun 27, 2014, at 19:35, sonali.parthasara...@accenture.com wrote:

Hi, I have a quick question. Say I have a custom REST API with data in JSON format. Can the Kafka Producer read directly from the REST API and write to a topic? If so, can you point me to the code base? Thanks, Sonali

Sonali Parthasarathy
RD Developer, Data Insights
Accenture Technology Labs
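A sketch of the kind of bridge process described above, using the 0.8 producer API to poll a REST endpoint and publish each response. The endpoint URL, topic, and poll interval are made-up placeholders, and a real bridge would track cursors or etags to avoid re-reading the same data:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Properties;
    import java.util.Scanner;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class RestToKafkaBridge {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
            while (true) {
                HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://example.com/api/events").openConnection();
                try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8").useDelimiter("\\A")) {
                    String json = s.hasNext() ? s.next() : "";
                    producer.send(new KeyedMessage<String, String>("rest-events", json));
                } finally {
                    conn.disconnect();
                }
                Thread.sleep(5000L); // poll interval
            }
        }
    }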
Re: Building Kafka on Mac OS X
I have seen it hang if you have a write with zero data.

On Jun 16, 2014, at 21:02, Timothy Chen tnac...@gmail.com wrote:

Can you try running it in debug mode? (./gradlew jar -d) Tim

On Mon, Jun 16, 2014 at 8:44 PM, Jorge Marizan jorge.mari...@gmail.com wrote:

It just hangs there without any output at all. Jorge.

On Jun 16, 2014, at 11:27 PM, Timothy Chen tnac...@gmail.com wrote:

What output was it stuck on? Tim

On Mon, Jun 16, 2014 at 6:39 PM, Jorge Marizan jorge.mari...@gmail.com wrote:

Hi team, I’m a newcomer to Kafka, but I’m having some trouble trying to get it to run on OS X. Basically building Kafka on OS X with 'gradlew jar’ gets stuck forever without any progress (indeed I tried to leave it building all night to no avail). Any advice will be greatly appreciated. Thanks in advance.
Re: Use Kafka To Send Files
You also wouldn't have any metadata about the file, so I would avoid doing this.

On Jun 15, 2014, at 20:51, Mark Roberts wiz...@gmail.com wrote:

You would ship the contents of the file across as a message. In general this means that your maximum file size must be smaller than your maximum message size. It would generally be a better choice to put a pointer to the file in some shared location on the queue. -Mark

On Jun 15, 2014, at 19:41, huqiming0...@live.com wrote:

Hi, sometimes we want to use kafka to send files (like text files, XML, ...), but I didn't see this in the documentation. Can kafka be used to transfer files? If it can, how do I do it? Thanks
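A minimal sketch of the pointer-message approach Mark suggests: publish only metadata about the file and let consumers fetch the bytes from shared storage. The broker list, topic, and JSON field names are made up for illustration:

    import java.io.File;
    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class FilePointerProducer {
        public static void main(String[] args) {
            File f = new File(args[0]);
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
            // the message carries only metadata; the file itself lives in shared storage
            String pointer = String.format("{\"path\":\"%s\",\"size\":%d,\"mtime\":%d}",
                f.getAbsolutePath(), f.length(), f.lastModified());
            producer.send(new KeyedMessage<String, String>("file-pointers", pointer));
            producer.close();
        }
    }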
Re: Make kafka storage engine pluggable and provide a HDFS plugin?
Hangjun, would having Kafka in YARN be a big architectural change from where it is now? From what I have seen, in most typical setups you want machines optimized for Kafka, not just Kafka on top of HDFS. -Steve

On Tue, May 20, 2014 at 8:37 PM, Hangjun Ye yehang...@gmail.com wrote:

Thanks Jun and Francois. We used Kafka 0.8.0 previously. We got some weird error when expanding the cluster and it couldn't be finished. Now we use 0.8.1.1; I will have a try at cluster expansion sometime. I read the discussion on that jira issue and I agree with the points raised there. HDFS was also improved a lot since then and many issues have been resolved (e.g. SPOF).

We have a team for building and providing a storage/computing platform for our company and we have already provided a Hadoop cluster. If Kafka had an option to store data on HDFS, we would just need to allocate some space quota for it on our cluster (and increase it on demand) and it might reduce our operational cost a lot.

Another (and maybe more aggressive) thought is about the deployment. Jun has a good point: HDFS only provides data redundancy, but not computational redundancy. If Kafka could be deployed on YARN, it could offload some computational resource management to YARN and we wouldn't have to allocate machines physically. Kafka would still need to take care of load balancing and partition assignment among brokers by itself. Many computational frameworks like spark/samza have such an option and it's a big attractive point for us. Best, Hangjun

2014-05-20 21:00 GMT+08:00 François Langelier f.langel...@gmail.com:

Take a look at Camus https://github.com/linkedin/camus/

François Langelier
Software Engineering student - École de Technologie Supérieure http://www.etsmtl.ca/

On 19 May 2014 05:28, Hangjun Ye yehang...@gmail.com wrote:

Hi there, I recently started to use Kafka for our data analysis pipeline and it works very well. One problem for us so far is expanding our cluster when we need more storage space. Kafka provides some scripts to help do this, but the process wasn't smooth. To make it work perfectly, it seems Kafka needs to do some jobs that a distributed file system has already done. So just wondering if there are any thoughts on making Kafka work on top of HDFS? Maybe make the Kafka storage engine pluggable, with HDFS as one option? The pros might be that HDFS has already handled storage management (replication, corrupted disks/machines, migration, load balancing, etc.) very well and it frees Kafka and its users from that burden; the cons might be performance degradation. As Kafka does very well on performance, possibly even with some degree of degradation it's still competitive for most situations. Best, -- Hangjun Ye
Re: Anouncing Kafka Offset Monitor 0.1
Very nice

On Mar 7, 2014, at 11:55, Pierre Andrews pie...@quantifind.com wrote: Claude, we should join forces ;)

On Fri, Mar 7, 2014 at 4:45 PM, Claude Mamo claude.m...@gmail.com wrote: Awesome!!! ;-) Claude

On Fri, Mar 7, 2014 at 4:03 PM, Pierre Andrews pie...@quantifind.com wrote: Great! Thanks!

On Fri, Mar 7, 2014 at 3:59 PM, Jay Kreps jay.kr...@gmail.com wrote: This is really useful! I added it to the ecosystem page: https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem -Jay

On Fri, Mar 7, 2014 at 10:49 AM, Pierre Andrews pie...@quantifind.com wrote:

Hello everyone, at Quantifind we are big users of Kafka and we like it a lot! In a few use cases, we had to figure out if a queue was growing and how its consumers were behaving. There are a few command-line tools to try to figure out what's going on, but it's not always easy to debug and to see what has happened while everyone was sleeping. To be able to monitor our kafka queues and consumers, we developed a tiny web app that tells us the log size of each topic on the brokers and the offsets of each consumer. That's very similar to what the kafka ConsumerOffsetChecker tool* does, but instead of a one-off snapshot, our app keeps a history and displays a nice graph of what's going on. You can find screenshots and more details here: http://quantifind.github.io/KafkaOffsetMonitor/ The code is on github: https://github.com/quantifind/KafkaOffsetMonitor

If you have a kafka 0.8 setup, it's very easy to use:
1- download the current jar: http://quantifind.github.io/KafkaOffsetMonitor/dist/KafkaOffsetMonitor-assembly-0.1.0-SNAPSHOT.jar
2- run it, pointing at your kafka brokers:
   java -cp KafkaOffsetMonitor-assembly-0.1.0-SNAPSHOT.jar \
     com.quantifind.kafka.offsetapp.OffsetGetterWeb \
     --zk zk-server1,zk-server2 \
     --port 8080 \
     --refresh 10.seconds \
     --retain 2.days
3- open your browser and point it to localhost:8080

You can run it locally or host it on a server if you prefer. It's all open source and we'll be happy to receive issue reports and pull requests. I hope that you like it and that it can find some uses in the rest of the Kafka community. Best, Pierre

PS: we are aware of https://github.com/claudemamo/kafka-web-console but are currently offering slightly different features. Hopefully we can merge the projects in the future.

* here: https://github.com/apache/kafka/blob/0.8/core/src/main/scala/kafka/tools/ConsumerOffsetChecker.scala
Re: Logging in new clients
My vote would be with log4j. I don't have that much experience with log4j2 or a good feel for how much the industry is moving towards it.

On Mon, Feb 3, 2014 at 11:17 AM, Joel Koshy jjkosh...@gmail.com wrote:

We are already using other libraries in various parts of our code (e.g., metrics, zkclient, joptsimple, etc.), some of which pull in these other logging dependencies anyway, i.e., what do we gain by using jul? There may be a good reason why people don't use jul, so I think we should fully understand that before going with jul. So it may be simpler to just stick with log4j for the client rewrites and investigate logging later. log4j2 is becoming more widespread and many users seem to be favorable toward logback. slf4j would cover all of these very easily. From what I understand, jul does not make it very easy to plug in these various options, but I could be wrong. I completely agree on the need to fix our client logging, as that will go a long way in usability for end-users, unless we want to keep getting asked the "Why do I see this ERROR in my logs..?" questions. Joel

On Mon, Feb 03, 2014 at 11:08:39AM -0800, Neha Narkhede wrote:

> Basically my preference would be java.util.logging unless there is some known problem with it, otherwise I guess slf4j, and if not that then log4j.

+1. My preference is to use java.util.logging to avoid adding an external dependency, but I'm not too sure about what's the standard out there, so open to suggestions on picking a different library.

On Mon, Feb 3, 2014 at 11:00 AM, Jay Kreps jay.kr...@gmail.com wrote:

We probably need to add a small amount of logging in the new producer and (soon) consumer clients. I wanted to have a quick discussion on logging libraries before I start adding this in the producer.

Previously we have been pretty verbose loggers and I think we should stop that. For clients you mostly don't need to log: if there is an error you should throw it back, not log it, so you don't need ERROR logging. Likewise I think it is rude to pollute people's logs with the details of client initialization (establishing connections, etc.), so you don't need INFO logging. However perhaps there is an argument to be made for WARN and DEBUG. I think it is perhaps useful to log a WARN when a server breaks a connection or metadata initialization fails. It can sometimes also be useful to be able to enable debug logging to see step-by-step processing in the client, which is the case for DEBUG logging.

Here is my knowledge about the state of Java logging:
1. Most people still use log4j. The server is using log4j.
2. Second runner-up is slf4j. I personally consider slf4j pretty silly but this is perhaps the more flexible choice since people can plug in different stuff.
3. java.util.logging ships with the jdk, but for some reason no one uses it.
4. There is no critical mass around any other logging library.

The context for how to think about this is the following. We are not trying to pick the best logging library. Fundamentally logging is pretty straight-forward, and for our simple use case it is inconceivable that any particular library could be much better than any other in terms of feature set. We want the most standard library. My goal is to minimize the dependencies of the client and make our basic logging cases work for most cases. Is there a reason not to just use java.util.logging? It comes with the jdk and supports pluggable appenders, so people who have some other library can plug that in, right? Basically my preference would be java.util.logging unless there is some known problem with it, otherwise I guess slf4j, and if not that then log4j. Thoughts? -Jay
Re: There is insufficient memory for the Java Runtime Environment to continue.
Do you have anything like Graphite or Ganglia monitoring the box to see exactly what's going on?

On Fri, Jan 31, 2014 at 1:45 PM, David Montgomery davidmontgom...@gmail.com wrote:

Well... I did get kafka to run on a DigitalOcean box with 4 gigs of RAM. All great, but now I am paying 40 USD a month for dev servers when I was paying 5, and I have 5 dev servers around the world. Would be great to get back to 5 USD for boxes that just need to start up rather than do anything substantial. Going back to the original issue: is KAFKA_HEAP_OPTS="-Xmx256M -Xms128M" the only lever I have to get kafka to work on a box with 512 megs of RAM? Any other trick? Thanks

On Sat, Feb 1, 2014 at 5:21 AM, Benjamin Black b...@b3k.us wrote:

Sorry, was looking at pre-release 0.8 code. No idea now why they are not being set as expected.

On Fri, Jan 31, 2014 at 1:20 PM, Benjamin Black b...@b3k.us wrote:

kafka-run-class.sh in 0.8 does not define KAFKA_HEAP_OPTS. I think you want KAFKA_OPTS.

On Fri, Jan 31, 2014 at 1:14 PM, David Montgomery davidmontgom...@gmail.com wrote:

What do you mean? This is appended to kafka-run-class.sh:
KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
Prior to running the below in the shell I run:
export KAFKA_HEAP_OPTS="-Xmx1G -Xms1G"
I start kafka like this:
/var/lib/kafka-0.8.0-src/bin/kafka-server-start.sh /var/lib/kafka-0.8.0-src/config/server.properties
What else is there to do? I did not have an issue with kafka 7, only 8. Thanks

On Fri, Jan 31, 2014 at 4:47 AM, Benjamin Black b...@b3k.us wrote:

Are you sure the java opts are being set as you expect?

On Jan 30, 2014 12:41 PM, David Montgomery davidmontgom...@gmail.com wrote:

Hi, this is a dedicated machine on DO. But I can say I did not have a problem with kafka 7. I just upgraded the machine to 1 gig on DigitalOcean. Same error.

export KAFKA_HEAP_OPTS="-Xmx1G -Xms1G"
root@do-kafka-sf-development-20140130195343:/etc/supervisor/conf.d# /var/lib/kafka-0.8.0-src/bin/kafka-server-start.sh /var/lib/kafka-0.8.0-src/config/server.properties
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0xbad3, 986513408, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 986513408 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /etc/supervisor/conf.d/hs_err_pid11635.log

Below is how I install via chef:

version = '0.8.0'
tar -xzf kafka-#{version}-src.tgz
cd kafka-#{version}-src
./sbt update
./sbt package
./sbt assembly-package-dependency
echo 'KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"' | tee -a /var/lib/kafka-#{version}-src/bin/kafka-run-class.sh
echo 'export JMX_PORT=${JMX_PORT:-}' | tee -a /var/lib/kafka-#{version}-src/bin/kafka-server-start.sh

On Fri, Jan 31, 2014 at 4:04 AM, Jay Kreps jay.kr...@gmail.com wrote:

I think this may be more a general java thing. Can you try running any java class with the same command line options you are using for kafka and confirm that that also doesn't work. -Jay

On Thu, Jan 30, 2014 at 11:23 AM, David Montgomery davidmontgom...@gmail.com wrote:

Hi, why oh why can I not start kafka 8? I am on a machine with 512 megs of RAM on DigitalOcean. What does one have to do to get kafka to work?

root@do-kafka-sf-development-20140130051956: export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
root@do-kafka-sf-development-20140130051956:/etc/supervisor/conf.d# /var/lib/kafka-0.8.0-src/bin/kafka-server-start.sh /var/lib/kafka-0.8.0-src/config/server.properties
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0xbad3, 986513408, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 986513408 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /etc/supervisor/conf.d/hs_err_pid17533.log

Thanks
Re: New Producer Public API
Is the new producer API going to maintain protocol compatibility with the old version of the API under the hood?

On Jan 29, 2014, at 10:15, Jay Kreps jay.kr...@gmail.com wrote:

The challenge of directly exposing ProduceRequestResult is that the offset provided is just the base offset, and there is no way to know for a particular message where it was in relation to that base offset, because the batching is transparent and non-deterministic. So I think we do need some kind of per-message result.

I started with Future<RequestResult>, I think for the same reason you prefer it, but then when I actually looked at some code samples it wasn't too great: checked exceptions, methods that we can't easily implement, etc. I moved away from that for two reasons:
1. When I actually wrote out some code samples of usage they were a little ugly for the reasons I described: checked exceptions, methods we can't implement, no helper methods, etc.
2. I originally intended to make the result of send work like a ListenableFuture, so that you would register the callback on the result rather than as part of the call. I moved away from this primarily because the implementation complexity was a little higher.
Whether or not the code prettiness on its own outweighs the familiarity of a normal Future I don't know, but that was the evolution of my thinking. -Jay

On Wed, Jan 29, 2014 at 10:06 AM, Jay Kreps jay.kr...@gmail.com wrote:

Hey Neha, error handling in RecordSend works as in Future: you will get the exception, if there is one, from any of the accessor methods or await(). The purpose of hasError was that you can write things slightly more simply (which some people expressed preference for):

    if (send.hasError())
        // do something
    long offset = send.offset();

instead of the slightly longer:

    try {
        long offset = send.offset();
    } catch (KafkaException e) {
        // do something
    }

On Wed, Jan 29, 2014 at 10:01 AM, Neha Narkhede neha.narkh...@gmail.com wrote:

Regarding the use of Futures: agreed that there are some downsides to using Futures, but both approaches have some tradeoffs.

- Standardization and usability: Future is a widely used and understood Java API, and given that the functionality that RecordSend hopes to provide is essentially that of Future, I think it makes sense to expose a widely understood public API for our clients. RecordSend, on the other hand, seems to provide some APIs that are very similar to those of Future, in addition to exposing a bunch of APIs that belong to ProduceRequestResult. As a user, I would've really preferred to deal with ProduceRequestResult directly: Future<ProduceRequestResult> send(...)

- Error handling: RecordSend's error handling is quite unintuitive, where the user has to remember to invoke hasError and error, instead of just throwing the exception. Now there are some downsides regarding error handling with the Future as well, where the user has to catch InterruptedException when we would never run into it. However, it seems like a price worth paying for supporting a standard API and error handling.

- Unused APIs: this is a downside of using Future, where the cancel() operation would always return false and mean nothing. But we can mention that caveat in our Java docs.

To summarize, I would prefer to expose a well understood and widely adopted Java API and put up with the overhead of catching one unnecessary checked exception, rather than wrap the useful ProduceRequestResult in a custom async object (RecordSend) and explain that to our many users. Thanks, Neha

On Tue, Jan 28, 2014 at 8:10 PM, Jay Kreps jay.kr...@gmail.com wrote:

Hey Neha, can you elaborate on why you prefer using Java's Future? The downside in my mind is the use of the checked InterruptedException and ExecutionException. ExecutionException is arguable, but forcing you to catch InterruptedException, often in code that can't be interrupted, seems perverse. It also leaves us with the cancel() method, which I don't think we really can implement.

Option 1A, to recap/elaborate, was the following. There is no Serializer or Partitioner api. We take a byte[] key and value and an optional integer partition. If you specify the integer partition it will be used. If you do not specify a key or a partition the partition will be chosen in a round robin fashion. If you specify a key but no partition we will choose a partition based on a hash of the key. In order to let the user find the partition we will need to give them access to the Cluster instance directly from the producer. -Jay

On Tue, Jan 28, 2014 at 6:25 PM, Neha Narkhede neha.narkh...@gmail.com wrote:

Here are more thoughts on the public APIs:
- I suggest we use java's Future instead of a custom Future, especially since it is part of the public API
- Serialization: I like the simplicity of the producer APIs with the absence of
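For comparison with the Future-vs-callback debate above, here is what the callback style looks like in the producer API as it eventually shipped in 0.8.2; this is only a sketch, and the broker address and topic are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class CallbackSendExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            KafkaProducer<byte[], byte[]> producer = new KafkaProducer<byte[], byte[]>(props);
            producer.send(new ProducerRecord<byte[], byte[]>("my-topic", "hello".getBytes()),
                new Callback() {
                    public void onCompletion(RecordMetadata metadata, Exception exception) {
                        // per-message error handling without blocking the sending thread
                        if (exception != null)
                            exception.printStackTrace();
                        else
                            System.out.println("acked at offset " + metadata.offset());
                    }
                });
            producer.close(); // close() waits for outstanding sends to complete
        }
    }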
Re: Kafka @ apachecon Denver?
Are you using mesos?

On Jan 27, 2014, at 8:39, Joe Stein joe.st...@stealth.ly wrote:

I was going to submit a talk on Kafka and Mesos. I am still trying to nail down the dates in my schedule though. Anyone else going? Maybe we could do a meetup or BOF or something?

On Jan 27, 2014, at 9:54 AM, David Corley davidcor...@gmail.com wrote:

Kafka devs, just wondering if there'll be anything in the line of Kafka presentations and/or tutorials at ApacheCon in Denver in April?
Re: Anyone working on a Kafka book?
Shafaq, that's pretty cool. Have you already connected Kafka to Spark RDD/DStream or is that something you have to figure out? -Steve

On Tue, Dec 10, 2013 at 7:10 PM, Shafaq s.abdullah...@gmail.com wrote:

Hi Steve, the first phase would be pretty simple, essentially hooking up the Kafka-DStream consumer to perform KPI aggregation over the streamed data from the Kafka broker cluster in real time. We would like to maximize the throughput by choosing the right message payload size, the correct kafka topic/partition mapping to Spark RDD/DStream, and minimizing I/O. Next, we would be featurizing the stream to be able to develop machine learning models using SVMs (Support Vector Machines) etc. to provide rich insights. Will be giving a talk on this one soon, so stay tuned. Regards, S. Abdullah

On Tue, Dec 10, 2013 at 10:22 AM, Steve Morin st...@stevemorin.com wrote:

Shafaq, what does the architecture of what you're building look like? -Steve

On Tue, Dec 10, 2013 at 10:19 AM, Shafaq s.abdullah...@gmail.com wrote:

Hey guys, I would love to contribute to the book, especially the portion on Kafka-Spark integration, or parts of kafka in general. I am building a Kafka-Spark real-time framework here at Gree Intl Inc, processing on the order of MBs of data per second. My profile: www.linkedin.com/in/shafaqabdullah/ Let me know, I'm open in whatever way. Regards, S. Abdullah

On Tue, Dec 10, 2013 at 9:15 AM, chetan conikee coni...@gmail.com wrote:

Hey guys, yes, Ben Lorica (O'Reilly) and I are planning to pen a Beginning Kafka book. We only finalized this late October and are hoping to start this mid-month. Chetan

On Tue, Dec 10, 2013 at 8:45 AM, Steve Morin st...@stevemorin.com wrote:

I'll let chetan comment if he's up for it. -Steve

On Tue, Dec 10, 2013 at 8:40 AM, David Arthur mum...@gmail.com wrote:

There was some talk a few months ago, not sure what the current status is.

On 12/10/13 10:01 AM, S Ahmed wrote:

Is there a book or was this just an idea?

On Mon, Mar 25, 2013 at 12:42 PM, Chris Curtin curtin.ch...@gmail.com wrote:

Thanks Jun, I've updated the example with this information. I've also removed some of the unnecessary newlines. Thanks, Chris

On Mon, Mar 25, 2013 at 12:04 PM, Jun Rao jun...@gmail.com wrote:

Chris, this looks good. One thing about partitioning: currently, if a message doesn't have a key, we always use the random partitioner (regardless of what partitioner.class is set to). Thanks, Jun

-- Kind Regards, Shafaq
Re: Anyone working on a Kafka book?
Would you mind sharing your connection setup?

On Dec 13, 2013, at 10:36, Shafaq s.abdullah...@gmail.com wrote:

That's already done.

On Dec 13, 2013 9:28 AM, Steve Morin st...@stevemorin.com wrote:

Shafaq, that's pretty cool. Have you already connected Kafka to Spark RDD/DStream or is that something you have to figure out? -Steve
Re: Anyone working on a Kafka book?
I forget, but I think Chetan was with O'Reilly.

On Dec 10, 2013, at 7:01, S Ahmed sahmed1...@gmail.com wrote:

Is there a book or was this just an idea?

On Mon, Mar 25, 2013 at 12:42 PM, Chris Curtin curtin.ch...@gmail.com wrote:

Thanks Jun, I've updated the example with this information. I've also removed some of the unnecessary newlines. Thanks, Chris

On Mon, Mar 25, 2013 at 12:04 PM, Jun Rao jun...@gmail.com wrote:

Chris, this looks good. One thing about partitioning: currently, if a message doesn't have a key, we always use the random partitioner (regardless of what partitioner.class is set to). Thanks, Jun
Re: Anyone working on a Kafka book?
I'll let chetan comment if he's up for it. -Steve

On Tue, Dec 10, 2013 at 8:40 AM, David Arthur mum...@gmail.com wrote:

There was some talk a few months ago, not sure what the current status is.

On 12/10/13 10:01 AM, S Ahmed wrote:

Is there a book or was this just an idea?

On Mon, Mar 25, 2013 at 12:42 PM, Chris Curtin curtin.ch...@gmail.com wrote: Thanks Jun, I've updated the example with this information. I've also removed some of the unnecessary newlines. Thanks, Chris

On Mon, Mar 25, 2013 at 12:04 PM, Jun Rao jun...@gmail.com wrote: Chris, this looks good. One thing about partitioning: currently, if a message doesn't have a key, we always use the random partitioner (regardless of what partitioner.class is set to). Thanks, Jun
Re: Anyone working on a Kafka book?
Shafaq, what does the architecture of what you're building look like? -Steve

On Tue, Dec 10, 2013 at 10:19 AM, Shafaq s.abdullah...@gmail.com wrote:

Hey guys, I would love to contribute to the book, especially the portion on Kafka-Spark integration, or parts of kafka in general. I am building a Kafka-Spark real-time framework here at Gree Intl Inc, processing on the order of MBs of data per second. My profile: www.linkedin.com/in/shafaqabdullah/ Let me know, I'm open in whatever way. Regards, S. Abdullah

On Tue, Dec 10, 2013 at 9:15 AM, chetan conikee coni...@gmail.com wrote:

Hey guys, yes, Ben Lorica (O'Reilly) and I are planning to pen a Beginning Kafka book. We only finalized this late October and are hoping to start this mid-month. Chetan

-- Kind Regards, Shafaq
Re: Using Kafka 0.8 from Scala and Akka
Chetan, are you also releasing a Scala RxJava producer? -Steve

On Tue, Dec 3, 2013 at 10:42 PM, Richard Rodseth rrods...@gmail.com wrote:

Any update on this, Chetan? Thanks.

On Thu, Oct 31, 2013 at 4:11 PM, chetan conikee coni...@gmail.com wrote:

I am in the process of releasing Scala and RxJava consumer(s) on github. Will be releasing them soon. Keep an eye out.

On Thu, Oct 31, 2013 at 3:49 PM, Richard Rodseth rrods...@gmail.com wrote:

So I have the 0.8 Beta 1 consumer Java example running now. Is there a Scala API documented somewhere? What about Akka examples? RxJava? When I Google for that I get old links to using Kafka as a durable Akka mailbox. I don't think that's what I'm looking for. I just want something reactive that doesn't use java.util.concurrent. Any pointers? Thanks.
Re: Loggly's use of Kafka on AWS
Philip this is definitely useful

On Dec 2, 2013, at 14:55, Surendranauth Hiraman suren.hira...@sociocast.com wrote:

S Ahmed, this combination of Kafka and Storm to process streaming data is becoming pretty common. Definitely worth looking at. The throughput will vary depending on your workload (cpu usage, etc.) and whether you're talking to a backend, of course. But it scales very well. -Suren

On Mon, Dec 2, 2013 at 5:49 PM, S Ahmed sahmed1...@gmail.com wrote:

Interesting. So twitter storm is used to basically process the messages on kafka? I'll have to read up on storm b/c I always thought the use case was a bit different.

On Sun, Dec 1, 2013 at 9:59 PM, Joe Stein joe.st...@stealth.ly wrote:

Awesome Philip, thanks for sharing!

On Sun, Dec 1, 2013 at 9:17 PM, Philip O'Toole phi...@loggly.com wrote:

A couple of us here at Loggly recently spoke at AWS re:Invent on how we use Kafka 0.72 in our ingestion pipeline. The slides are at the link below, and may be of interest to people on this list. http://www.slideshare.net/AmazonWebServices/infrastructure-at-scale-apache-kafka-twitter-storm-elastic-search-arc303-aws-reinvent-2013 Any questions, let me know, though I can't promise I can answer everything. Can't give the complete game away. :-) As always, Kafka rocks! Philip

--
SUREN HIRAMAN, VP TECHNOLOGY
Sociocast - Simple. Powerful. Predictions.
96 SPRING STREET, 7TH FLOOR, NEW YORK, NY 10012
O: (917) 525-2466 ext. 105 | F: 646.349.4063
E: suren.hira...@sociocast.com | W: www.sociocast.com
Re: kafka producer - retry messages
Demian, I have been looking at building that into Sparkngin (https://github.com/DemandCube/Sparkngin). What kind of window are you looking for? -Steve

On Thu, Nov 28, 2013 at 7:23 AM, Demian Berjman dberj...@despegar.com wrote:

Joe, I meant that the whole kafka cluster is down, even the replicas of that topic. Thanks,

On Thu, Nov 28, 2013 at 12:10 PM, Joe Stein joe.st...@stealth.ly wrote:

Have you tried the replication feature added in 0.8? http://kafka.apache.org/documentation.html#replication

On Nov 28, 2013, at 9:37 AM, Demian Berjman dberj...@despegar.com wrote:

Hi. Has anyone built a retry system (durable) for the producer in case the kafka cluster is down? We have certain messages that must be sent because they come after a transaction that cannot be undone. We could set the property message.send.max.retries, but if the producer goes down, we lose the messages. Thanks,
Re: kafka producer - retry messages
What I mean by that is: are you looking to have the Kafka cluster able to be down for, say, 5 minutes or up to a day? The problem is estimating how long it will take to recover. Is this work you're doing for a consulting project? Or are you doing something on behalf of an employer? Basically I would like to know more about the use case. You can email me directly at st...@stevemorin.com so we don't clog the message board.

On Thu, Nov 28, 2013 at 8:05 AM, Demian Berjman dberj...@despegar.com wrote:

Steve, thanks for the response! I don't understand what you mean by "what kind of window?" I am looking for something like what I think you did in Sparkngin: log persistence if the log producer connection is down.
Re: kafka producer - retry messages
Philip, do you do that at Loggly? Otis, how was your retry code structured? Have you open sourced it?

On Nov 28, 2013, at 16:08, Philip O'Toole phi...@loggly.com wrote:

By FS I guess you mean file system. In that case, if one is that concerned, why not run a single Kafka broker on the same machine, and connect to it over localhost? And disable ZK mode too, perhaps. I may be missing something, but I have never fully understood why people try really hard to build a stream-to-disk backup approach, when they might be able to couple tightly to Kafka, which, well, just streams to disk. Philip

On Nov 28, 2013, at 3:58 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi, we've done this at Sematext, where we use Kafka in all 3 products/services you see in my signature. When we fail to push a message into Kafka we store it in the FS and from there we can process it later. Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr Elasticsearch Support * http://sematext.com/

On Thu, Nov 28, 2013 at 9:37 AM, Demian Berjman dberj...@despegar.com wrote:

Hi. Has anyone built a retry system (durable) for the producer in case the kafka cluster is down? We have certain messages that must be sent because they come after a transaction that cannot be undone. We could set the property message.send.max.retries, but if the producer goes down, we lose the messages. Thanks,
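A rough sketch of the file-system fallback Otis describes, using the 0.8 producer; the broker list, topic, and spool path are placeholders, and a real implementation would also need a replay process that re-sends the spooled messages once the cluster is back:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Properties;
    import kafka.common.FailedToSendMessageException;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class SpoolingProducer {
        public static void main(String[] args) throws IOException {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
            String msg = "some event";
            try {
                producer.send(new KeyedMessage<String, String>("events", msg));
            } catch (FailedToSendMessageException e) {
                // cluster unreachable after retries: spool to local disk for later replay
                try (FileWriter w = new FileWriter("/var/spool/kafka-failed.log", true)) {
                    w.write(msg + "\n");
                }
            }
            producer.close();
        }
    }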
Re: kafka producer - retry messages
Philip, How would you mirror this to a main Kafka instance? -Steve On Nov 28, 2013, at 16:14, Philip O'Toole phi...@loggly.com wrote: I should add that in our custom producers we buffer in RAM if required, so Kafka can be restarted etc. But I would never code streaming to disk now. I would just run a Kafka instance on the same node. Philip
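To sketch an answer to Steve's mirroring question: the stock MirrorMaker tool that ships with 0.8 can consume from the local buffering broker and produce to the main cluster. The property file names below are illustrative:

# local-consumer.properties points zookeeper.connect at the local broker's ZK;
# main-producer.properties points metadata.broker.list at the main cluster.
bin/kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config local-consumer.properties \
  --producer.config main-producer.properties \
  --whitelist '.*'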
Re: producer exceptions when broker dies
Kane and Aniket, I am interested in knowing what pattern/solution people usually use to implement exactly-once as well. -Steve On Fri, Oct 25, 2013 at 11:39 AM, Kane Kane kane.ist...@gmail.com wrote: Guozhang, but I've posted a piece from the Kafka documentation above: "So effectively Kafka guarantees at-least-once delivery by default and allows the user to implement at-most-once delivery by disabling retries on the producer." What I want is at-most-once, and the docs claim it's possible with certain settings. Did I miss anything here? On Fri, Oct 25, 2013 at 11:35 AM, Guozhang Wang wangg...@gmail.com wrote: Aniket is exactly right. In general, Kafka provides an at-least-once guarantee instead of exactly-once. Guozhang On Fri, Oct 25, 2013 at 11:13 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: As per my understanding, if the broker says the msg is committed, it's guaranteed to have been committed as per your ack config. If it says it did not get committed, then it's very hard to figure out whether this was just a false error. Since there is no concept of unique ids for messages, a replay of the same message will result in duplication. I think it's reasonable behaviour considering Kafka prefers to append data to partitions for performance reasons. The best way right now to deal with duplicate msgs is to build the processing engine (the layer where your consumer sits) to deal with the at-least-once semantics of the broker. On 25 Oct 2013 23:23, Kane Kane kane.ist...@gmail.com wrote: Or, to rephrase it more generally, is there a way to know exactly whether a message was committed or not? On Fri, Oct 25, 2013 at 10:43 AM, Kane Kane kane.ist...@gmail.com wrote: Hello Guozhang, My partitions are split almost evenly between brokers, so, yes - the broker that I shut down is the leader for some of them. Does it mean I can get an exception and the data is still being written? Is there any setting on the broker where I can control this? I.e. can I make the broker replication timeout shorter than the producer timeout, so I can ensure that if I get an exception the data has not been committed? Thanks. On Fri, Oct 25, 2013 at 10:36 AM, Guozhang Wang wangg...@gmail.com wrote: Hello Kane, As discussed in the other thread, even if a timeout response is sent back to the producer, the message may still be committed. Did you shut down the leader broker of the partition or a follower broker? Guozhang On Fri, Oct 25, 2013 at 8:45 AM, Kane Kane kane.ist...@gmail.com wrote: I have a cluster of 3 Kafka brokers. With the following script I send some data to Kafka and in the middle do a controlled shutdown of 1 broker. All 3 brokers are in the ISR before I start sending. When I shut down the broker I get a couple of exceptions, and I expect that data shouldn't be written. Say, I send 1500 lines and get 50 exceptions. I expect to consume 1450 lines, but instead I always consume more, i.e. 1480 or 1490. I want to decide whether to retry sending myself, not using message.send.max.retries. But it looks like if I retry sending when there is an exception, I will end up with duplicates. Am I doing anything wrong, or do I have wrong assumptions about Kafka? Thanks.

import java.util.Properties
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import kafka.serializer.StringEncoder

val file = "data.txt" // placeholder path; the original script defined this elsewhere
val prod = new MyProducer("10.80.42.147:9092,10.80.42.154:9092,10.80.42.156:9092")
var count = 0
for (line <- Source.fromFile(file).getLines()) {
  try {
    // the original accumulated lines into a buffer and sent batches;
    // sending a single line here keeps the snippet self-contained
    prod.send("benchmark", List(line))
    count += 1
    println("sent %s".format(count))
  } catch {
    case _: Throwable => println("Exception!")
  }
}

class MyProducer(brokerList: String) {
  val sync = true
  val requestRequiredAcks = "-1"
  val props = new Properties()
  props.put("metadata.broker.list", brokerList)
  props.put("producer.type", if (sync) "sync" else "async")
  props.put("request.required.acks", requestRequiredAcks)
  props.put("key.serializer.class", classOf[StringEncoder].getName)
  props.put("serializer.class", classOf[StringEncoder].getName)
  props.put("message.send.max.retries", "0")
  props.put("request.timeout.ms", "2000")
  val producer = new Producer[AnyRef, AnyRef](new ProducerConfig(props))

  def send(topic: String, messages: List[String]) = {
    val requests = new ArrayBuffer[KeyedMessage[AnyRef, AnyRef]]
    for (message <- messages) {
      requests += new KeyedMessage[AnyRef, AnyRef](topic, null, message, message)
    }
    producer.send(requests: _*)
  }
}
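To Steve's exactly-once question above: a common pattern (not something from Kane's code) is to accept the broker's at-least-once delivery and make the consumer idempotent, by attaching an application-level unique id to every message and de-duplicating on it. A minimal Scala sketch, with the id scheme and the in-memory 'seen' set purely illustrative; in production the set of processed ids would live in a durable store alongside the output:

import scala.collection.mutable

// Sketch: at-least-once delivery + an idempotent consumer = effectively once.
// The producer embeds a unique id (e.g. "sourceId:sequenceNumber") in each
// message; the consumer applies the side effect only for ids it has not seen.
object Deduplicator {
  private val seen = mutable.Set[String]()

  def process(id: String, payload: String)(handler: String => Unit): Unit = {
    if (!seen.contains(id)) {
      handler(payload) // side effect happens at most once per id
      seen += id
    }
  }
}

This also addresses Kane's retry dilemma: the producer can retry freely on exceptions, because a replayed message carries the same id and is dropped by the consumer.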
Re: Flush configuration per topic
Is there only a time delay, or can it be set to flush for every message, with the obvious performance hit? On Wed, Oct 16, 2013 at 9:49 AM, Jay Kreps jay.kr...@gmail.com wrote: Yes, the change in trunk is that all log configurations are automatically available at both the log level and the global default level, and can be set at topic creation time or changed later without bouncing any servers. -Jay On Tue, Oct 15, 2013 at 5:47 PM, Simon Hørup Eskildsen si...@sirupsen.com wrote: Do you mean that it's possible to override log configurations per topic in trunk? Yeah, you're right. :-) I wasn't sure what to call it if not consistency, even though I know that sort of has another meaning in this context. On Tue, Oct 15, 2013 at 6:53 PM, Jay Kreps jay.kr...@gmail.com wrote: Yeah, looks like you are right, we don't have the per-topic override in 0.8 :-( All log configurations are overridable in trunk, which will be 0.8.1. Just to be totally clear, this setting does not impact consistency (i.e. all replicas will have the same messages in the same order), nor even durability (if you have replication > 1), but just recoverability on a single server in the event of a hard machine crash. -Jay On Tue, Oct 15, 2013 at 2:07 PM, Simon Hørup Eskildsen si...@sirupsen.com wrote: 0.8, we're not on master, but we definitely can be. On Tue, Oct 15, 2013 at 5:03 PM, Jay Kreps jay.kr...@gmail.com wrote: Hey Simon, What version of Kafka are you using? -Jay On Tue, Oct 15, 2013 at 9:56 AM, Simon Hørup Eskildsen si...@sirupsen.com wrote: Hi Kafkas! Reading through the documentation and code of Kafka, it seems there is no feature to set the flushing interval (messages/time) for a specific topic. I am interested in this to get consistency for certain topics by flushing after every message, while having eventual consistency for other topics (the default). Would there be interest in such a feature? Then I might be curious to give the Log module a dive and see if I can muster enough Scala to add this. Thanks! -- Simon http://sirupsen.com/
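For reference, a sketch of the two levels Jay describes, using the broker-wide flush settings from the 0.8 docs and the per-topic override as it appears in 0.8.1/trunk. The topic name and values are illustrative; flush.messages=1 forces a flush on every message, with the performance hit discussed above:

# server.properties - broker-wide defaults
log.flush.interval.messages=10000
log.flush.interval.ms=1000

# per-topic override at creation time (0.8.1+/trunk)
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic critical-events \
  --partitions 1 --replication-factor 2 --config flush.messages=1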
Re: Kafka and Zookeeper node removal from two nodes Kafka cluster
If you have a double broker failure with a replication factor of 2 and only 2 brokers in the cluster, wouldn't every partition be unavailable? On Tue, Oct 15, 2013 at 8:48 AM, Jun Rao jun...@gmail.com wrote: If you have double broker failures with a replication factor of 2, some partitions will not be available. When one of the brokers comes back, the partition is made available again (there is potential data loss), but in an under-replicated mode. After the second broker comes back, it will catch up from the other replica and the partition will eventually be fully replicated. There is no need to change the replication factor during this process. As for ZK, you can always use the full connection string. ZK will pick live servers to establish connections. Thanks, Jun On Tue, Oct 15, 2013 at 3:46 AM, Monika Garg gargmon...@gmail.com wrote: I have a 2-node Kafka cluster with default.replication.factor=2 set in the server.properties file. I removed one node: I killed the Kafka process and removed all the kafka-logs and the bundle from that node. Then I stopped the remaining running node in the cluster and started it again (default.replication.factor is still set to 2 in this node's server.properties file). I was expecting some error/exception, as I no longer have two nodes in my cluster. But I didn't get any error/exception: my node started successfully and I am able to create topics on it. So should default.replication.factor be updated from default.replication.factor=2 to default.replication.factor=1 on the remaining running node? Similarly, if there are two external ZooKeeper nodes (zookeeper.connect=host1:port1,host2:port1) in my cluster and I have removed one ZooKeeper node (host1:port1) from the cluster, should the property zookeeper.connect be updated from zookeeper.connect=host1:port1,host2:port1 to zookeeper.connect=host2:port1? -- *Moniii*
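Jun's answer, restated as a server.properties sketch (host names as in Monika's question): both settings can stay as they are while a node is down.

# Keep the full ensemble in zookeeper.connect even while host1 is down;
# the ZooKeeper client connects to whichever servers are live.
zookeeper.connect=host1:port1,host2:port1

# No need to drop this to 1 during recovery; the partition becomes fully
# replicated again once the second broker catches up from the surviving replica.
default.replication.factor=2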