[ 
https://issues.apache.org/jira/browse/CRUNCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274583#comment-15274583
 ] 

Micah Whitacre commented on CRUNCH-606:
---------------------------------------

[~joshwills] what's the best way to handle converting data coming out of Kafka?

Currently I'm having the consumers pass in instances of the new Kafka 
Deserializer class for the Key and Value, so that when I read data from Kafka I 
don't get byte[] but instead get values like String, Avro SpecificRecord, etc.  
I'm also having them provide a PTableType<K, V> where the K and V types 
correspond to whatever is coming out of Kafka. 
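For reference, the conversion the consumer-supplied deserializer does is just byte[] -> T. A minimal sketch of that contract, using a hand-rolled interface that mirrors Kafka 0.9's org.apache.kafka.common.serialization.Deserializer (the interface and class names here are hypothetical stand-ins, since the kafka-clients jar isn't assumed on the classpath):

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in mirroring Kafka 0.9's Deserializer<T> contract. */
interface PayloadDeserializer<T> {
    // Kafka's Deserializer passes the topic name alongside the raw bytes.
    T deserialize(String topic, byte[] data);
}

/** String payloads, analogous to what Kafka's StringDeserializer yields. */
class StringPayloadDeserializer implements PayloadDeserializer<String> {
    @Override
    public String deserialize(String topic, byte[] data) {
        // Kafka conventionally maps a null payload to a null value.
        return data == null ? null : new String(data, StandardCharsets.UTF_8);
    }
}
```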

Using the existing PTypes, I'm hitting errors like:

{noformat}
java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.avro.mapred.AvroWrapper
        at 
org.apache.crunch.types.avro.AvroKeyConverter.convertInput(AvroKeyConverter.java:25)
        at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:109)
        at org.apache.crunch.impl.mr.run.CrunchMapper.map(CrunchMapper.java:55)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
{noformat}

Is there a workaround I'm missing, or should I create something similar to 
Orcs/HBaseTypes where consumers provide instances of the Kafka 
Serializer/Deserializer for the payloads?  The downside is requiring them to 
have a Serializer when they normally wouldn't need one, but that's not a deal 
breaker because I'm guessing most have one already written somewhere on the 
producer side of things.
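To make the Serializer requirement concrete: a Crunch PType needs both directions, because intermediate shuffle data has to be re-serialized, not just read. A rough sketch of the round trip, again with hypothetical stand-in interfaces mirroring Kafka's Serializer/Deserializer pair (not the actual Crunch or Kafka API):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins for Kafka's Serializer/Deserializer pair. A
// KafkaTypes-style PType would wrap one of each: deserialize when reading
// from Kafka, serialize when writing intermediate MapReduce output.
interface KafkaStyleSerializer<T> {
    byte[] serialize(String topic, T data);
}

interface KafkaStyleDeserializer<T> {
    T deserialize(String topic, byte[] data);
}

/** A String codec implementing both halves of the round trip. */
class StringRoundTripCodec
        implements KafkaStyleSerializer<String>, KafkaStyleDeserializer<String> {
    @Override
    public byte[] serialize(String topic, String data) {
        return data == null ? null : data.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public String deserialize(String topic, byte[] data) {
        return data == null ? null : new String(data, StandardCharsets.UTF_8);
    }
}
```

Wrapping a pair like this behind a PType is the same shape as what HBaseTypes does for HBase's Result/Put payloads, which is why the analogy seems to fit.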

> Create a KafkaSource
> --------------------
>
>                 Key: CRUNCH-606
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-606
>             Project: Crunch
>          Issue Type: New Feature
>          Components: IO
>            Reporter: Micah Whitacre
>            Assignee: Micah Whitacre
>         Attachments: CRUNCH-606.patch
>
>
> Pulling data out of Kafka is a common use case, and some of the existing ways 
> to do it (Kafka Connect, Camus, Gobblin) do not integrate nicely with 
> existing processing pipelines like Crunch.  With Kafka 0.9, the consumer API 
> is much easier to use, so we should build a Source implementation that can 
> read from Kafka.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
