[
https://issues.apache.org/jira/browse/CRUNCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274583#comment-15274583
]
Micah Whitacre commented on CRUNCH-606:
---------------------------------------
[~joshwills] what's the best way to handle converting data coming out of Kafka?
Currently I'm having the consumers pass in the new Kafka Deserializer class
instances for Key and Value so that when I read data from Kafka I don't get
byte[] but instead get values like String, Avro SpecificRecord, etc. I also am
having them provide PTableTypes<K, V> where the K and V values correspond to
whatever is coming out of Kafka.
Using the existing PTypes, I'm hitting errors like:
{noformat}
java.lang.ClassCastException: java.lang.String cannot be cast to
org.apache.avro.mapred.AvroWrapper
at
org.apache.crunch.types.avro.AvroKeyConverter.convertInput(AvroKeyConverter.java:25)
at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:109)
at org.apache.crunch.impl.mr.run.CrunchMapper.map(CrunchMapper.java:55)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
{noformat}
Is there a workaround I'm missing or should I create something similar to
Orcs/HBaseTypes where I have them provide instances of Kafka
Serializer/Deserializer for the payloads? Downside is requiring them to have a
Serializer when they wouldn't normally but not a deal breaker b/c I'm guessing
most have that already written somewhere on the produce side of things.
> Create a KafkaSource
> --------------------
>
> Key: CRUNCH-606
> URL: https://issues.apache.org/jira/browse/CRUNCH-606
> Project: Crunch
> Issue Type: New Feature
> Components: IO
> Reporter: Micah Whitacre
> Assignee: Micah Whitacre
> Attachments: CRUNCH-606.patch
>
>
> Pulling data out of Kafka is a common use case and some of the ways to do it
> Kafka Connect, Camus, Gobblin do not integrate nicely with existing
> processing pipelines like Crunch. With Kafka 0.9, the consuming API is a lot
> easier so we should build a Source implementation that can read from Kafka.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)