[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896186#comment-15896186 ]

Sean Owen commented on SPARK-19656:
-----------------------------------

I guess I mean, have you really tried it? It doesn't result in a compile error,
and you didn't say what the compile error is. This works, yes:

{code}
public static class Foo {}

...
Dataset<Row> ds = spark.read....;
// Casting the lambda to MapFunction selects the Java-friendly map overload;
// the second argument supplies the Encoder for the target type.
ds.map((MapFunction<Row, Foo>) row -> (Foo) row.get(0), new MyFooEncoder());
...
{code}

Meaning, the cast in question works and you can map to a new Dataset if you 
have an encoder for your type.
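For instance, rather than hand-rolling an encoder you can usually get one from
the Encoders factory. A quick sketch, not tested against your code; it assumes
Foo is a JavaBean-style class (no-arg constructor, getters/setters), with
Encoders.kryo(Foo.class) as the fallback for arbitrary classes:

{code}
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Encoders.bean works for bean-style classes; Encoders.kryo can
// serialize anything, but stores it as an opaque binary column.
Encoder<Foo> fooEncoder = Encoders.bean(Foo.class);

Dataset<Foo> foos = ds.map(
        (MapFunction<Row, Foo>) row -> (Foo) row.get(0),
        fooEncoder);
{code}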

The rest of the example you provide above doesn't work; it looks like a Hadoop
API version problem, and that's an issue in your code rather than in Spark:
you're trying to use old-API Hadoop Avro classes with newAPIHadoopFile, which
expects the new (mapreduce) API.
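
For reference, reading Avro through the new API usually goes through
AvroKeyInputFormat from org.apache.avro.mapreduce rather than a custom input
format. A rough sketch of what I mean (untested; it assumes MyCustomClass is an
Avro-generated specific record, so getClassSchema() exists):

{code}
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;

Job job = Job.getInstance(sc.hadoopConfiguration());
// Without a reader schema, Avro hands back GenericData$Record --
// which is exactly the ClassCastException reported below.
AvroJob.setInputKeySchema(job, MyCustomClass.getClassSchema());

// Compiles with an unchecked warning because the class literals are raw.
JavaPairRDD<AvroKey<MyCustomClass>, NullWritable> records =
        sc.newAPIHadoopFile("file:/path/to/datafile.avro",
                AvroKeyInputFormat.class, AvroKey.class, NullWritable.class,
                job.getConfiguration());
{code}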

This should be on the mailing list until it's narrowed down.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> ------------------------------------------------------------------
>
>                 Key: SPARK-19656
>                 URL: https://issues.apache.org/jira/browse/SPARK-19656
>             Project: Spark
>          Issue Type: Question
>          Components: Java API
>    Affects Versions: 2.0.2
>            Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a Scala developer, so I tried to "translate" this to Java as best I
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass> {};
>
> public static class MyCustomAvroReader extends
>         AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
>     // with my custom schema and all the required methods...
> }
>
> public static class MyCustomInputFormat extends
>         FileInputFormat<MyCustomAvroKey, NullWritable> {
>     @Override
>     public RecordReader<MyCustomAvroKey, NullWritable> createRecordReader(
>             InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
>             throws IOException, InterruptedException {
>         return new MyCustomAvroReader();
>     }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>         sc.newAPIHadoopFile("file:/path/to/datafile.avro",
>                 MyCustomInputFormat.class, MyCustomAvroKey.class,
>                 NullWritable.class, sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: "
>         + first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()`
> actually returns a `GenericData$Record` at runtime, not a `MyCustomClass`
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?


