[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896590#comment-15896590 ]

Nira Amit commented on SPARK-19656:
-----------------------------------

I will not, but please consider documenting the correct way to work with newAPIHadoopFile in Java. It is not as easy as working with it in Scala, and I have been googling this long enough to know that it is not clear to many Java developers who try to use it.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> ------------------------------------------------------------------
>
>                 Key: SPARK-19656
>                 URL: https://issues.apache.org/jira/browse/SPARK-19656
>             Project: Spark
>          Issue Type: Question
>          Components: Java API
>    Affects Versions: 2.0.2
>            Reporter: Nira Amit
>
> If I understand correctly, in Scala it's possible to load custom objects from Avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a Scala developer, so I tried to "translate" this to Java as best I could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass> {}
>
> public static class MyCustomAvroReader extends
>         AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
>     // with my custom schema and all the required methods...
> }
>
> public static class MyCustomInputFormat extends
>         FileInputFormat<MyCustomAvroKey, NullWritable> {
>     @Override
>     public RecordReader<MyCustomAvroKey, NullWritable> createRecordReader(
>             InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
>             throws IOException, InterruptedException {
>         return new MyCustomAvroReader();
>     }
> }
> ...
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>         sc.newAPIHadoopFile("file:/path/to/datafile.avro",
>                 MyCustomInputFormat.class, MyCustomAvroKey.class,
>                 NullWritable.class, sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that {{records.first()._1.datum()}} actually returns a {{GenericData$Record}} at runtime, not a {{MyCustomClass}} instance. And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896581#comment-15896581 ]

Nira Amit commented on SPARK-19656:
-----------------------------------

I found a problem in my schema and managed to load my custom type. So the answer to my original question is basically no: there is nothing like
{code}
ctx.hadoopFile("/path/to/the/avro/file.avro",
  classOf[AvroInputFormat[MyClassInAvroFile]],
  classOf[AvroWrapper[MyClassInAvroFile]],
  classOf[NullWritable])
{code}
for loading custom types into RDDs with the Java API. We have to create all the wrapper classes and implement our own RecordReader. I think this should be documented somewhere.
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896343#comment-15896343 ]

Sean Owen commented on SPARK-19656:
-----------------------------------

It accepts it because you tell it that's what the InputFormat will return, but it doesn't. The Class arg is there just for its compile-time type; that doesn't make it so, and there is no way to verify it's what your InputFormat actually returns. newAPIHadoopFile doesn't load the data as anything in particular; the InputFormat does. You are still really talking about Hadoop and Avro APIs.

I'm going to leave the conversation there and close this, as this is as much as is reasonable to consider in the context of Spark. This is not a bug as-is. You can take this info to explore how to work with Avro values elsewhere. The JIRA can be reopened if you have a clear and reproducible problem in what Spark is supposed to return or do versus what it actually does; that does require understanding the operation of the Hadoop APIs. Questions should stay on the mailing list or Stack Overflow while they are still in the realm of "how can I get this to work?"
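The point that the Class argument only fixes a compile-time type is ordinary Java type erasure, not anything Spark-specific. A minimal stand-alone sketch (no Spark or Avro involved; all names here are made up for illustration) reproduces the same compiles-fine-but-fails-at-runtime behavior:

```java
import java.util.List;

public class ErasureDemo {
    // Like newAPIHadoopFile's keyClass argument, the Class<T> parameter only
    // pins the compile-time type T. Nothing checks what the list really holds.
    @SuppressWarnings("unchecked")
    static <T> T first(List<?> records, Class<T> keyClass) {
        return (T) records.get(0); // unchecked cast: erased at runtime
    }

    static String demo() {
        List<Object> records = List.of("a GenericRecord stand-in, not an Integer");
        try {
            // Compiles fine: the compiler trusts the Class token.
            Integer i = first(records, Integer.class);
            return "got " + i;
        } catch (ClassCastException e) {
            // The cast only happens at the assignment above, so it fails here,
            // just like the ClassCastException reported in this issue.
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "ClassCastException"
    }
}
```

The `keyClass` parameter is deliberately unused: that mirrors why Spark cannot verify at runtime what the InputFormat really produces.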
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896314#comment-15896314 ]

Nira Amit commented on SPARK-19656:
-----------------------------------

And by the way, "what is in the file" is bytes. The question is what I load those bytes into. I'm trying to load them into a MyCustomClass; apparently what newAPIHadoopFile actually loads them into is a GenericData$Record, even though the return type it promises is JavaPairRDD<MyCustomAvroKey, NullWritable>.
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896286#comment-15896286 ]

Nira Amit commented on SPARK-19656:
-----------------------------------

But then why does the compiler accept what newAPIHadoopFile returns as a MyCustomClass? If what you are saying is correct, then the only acceptable return type should be GenericData$Record, or something that can be cast to it.
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896282#comment-15896282 ]

Sean Owen commented on SPARK-19656:
-----------------------------------

Well, at the least, I'd suggest posting a compilable example. But the last point is, I think, your problem: you are correctly getting a GenericData$Record, because that is what is in the file. You need to call methods on that object to get your typed object out. That's an Avro usage issue in your code, and you need to investigate it before opening a JIRA.
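To make "call methods on that object to get your type out" concrete: with Avro you would read fields off the GenericData$Record with record.get("fieldName") and build your object yourself. Avro isn't on the classpath in this sketch, so a plain Map stands in for the record, and MyCustomClass / someCustomField are the hypothetical names from the report:

```java
import java.util.Map;

public class RecordToPojo {
    // Hypothetical POJO matching the custom class described in this issue.
    static class MyCustomClass {
        private final int someCustomField;
        MyCustomClass(int someCustomField) { this.someCustomField = someCustomField; }
        int getSomeCustomField() { return someCustomField; }
    }

    // With real Avro this would take an org.apache.avro.generic.GenericRecord
    // and call record.get("someCustomField"); a Map stands in for it here.
    static MyCustomClass fromRecord(Map<String, Object> record) {
        return new MyCustomClass((Integer) record.get("someCustomField"));
    }

    public static void main(String[] args) {
        MyCustomClass first = fromRecord(Map.of("someCustomField", 5));
        System.out.println("someCustomField = " + first.getSomeCustomField());
    }
}
```

In Spark terms, this conversion would run inside a map() over the pair RDD, turning the generic records the InputFormat actually produces into instances of the custom class.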
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896277#comment-15896277 ]

Nira Amit commented on SPARK-19656:
-----------------------------------

The only reason my code sample doesn't compile is that it doesn't include my actual custom class implementation. Otherwise it's a copy-paste of valid code of mine, which compiles, runs, and then crashes with a runtime exception, because the class it gets at runtime isn't the one the compiler promised. I understand that the solution is to migrate my code to Datasets, but this still seems like a problem in the newAPIHadoopFile API.
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896272#comment-15896272 ]

Sean Owen commented on SPARK-19656:
-----------------------------------

PS: I should be concrete about why I think the original code doesn't work -- it doesn't compile because you're using newAPIHadoopFile whereas the example you follow uses hadoopFile. If you adjusted that, then I think you're getting back an Avro GenericRecord as expected: Avro stores its own records in a file, not your objects. You need to get() your type out of it, but that's an issue in your code.

I think the reason this moved to DataFrame / Dataset is that there is first-class support for Avro there, where your types get unpacked. That's the better way to do this anyway, although there shouldn't be much reason you can't do it with RDDs if you must.
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896268#comment-15896268 ]

Sean Owen commented on SPARK-19656:
-----------------------------------

Yes, I just tried to compile your code example above, and it doesn't work, but for more basic reasons. That much is "Not A Problem", because there are more basic usage errors; that is, this is not an example of code that should work but doesn't due to Avro issues. As to the narrower question of whether Datasets and casting work: they do, and I verified that that example compiles.
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896266#comment-15896266 ]

Nira Amit commented on SPARK-19656:
-----------------------------------

[~sowen] I have been trying this for weeks, every way I could possibly think of. Have you (really) tried any of my code samples, with RDDs, not Datasets? If this is not possible with newAPIHadoopFile, then it's not a question for the mailing lists; rather, it should be mentioned explicitly in the documentation, because apparently I'm not the only one who expected this to work and couldn't figure out what I was doing wrong.
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896262#comment-15896262 ]

Nira Amit commented on SPARK-19656:
-----------------------------------

Thanks Eric, but my question is about RDDs. Is it correct that in Java it is not possible to load custom classes directly into RDDs, only into DataFrames?
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896194#comment-15896194 ] Eric Maynard commented on SPARK-19656: -- Here is a complete working example in Java:
{code:title=AvroTest.java|borderStyle=solid}
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.ArrayList;

public class AvroTest {
    public static void main(String[] args) {
        // build spark session:
        System.setProperty("hadoop.home.dir", "C:\\Hadoop"); // windows hack
        SparkSession spark = SparkSession.builder().master("local").appName("Avro Test")
            .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse") // another windows hack
            .getOrCreate();

        // create data:
        ArrayList<CustomClass> list = new ArrayList<>();
        CustomClass cc = new CustomClass();
        cc.setValue(5);
        list.add(cc);
        spark.createDataFrame(list, CustomClass.class).write().format("com.databricks.spark.avro").save("C:\\tmp\\file.avro");

        // read data:
        Row row = spark.read().format("com.databricks.spark.avro").load("C:\\tmp\\file.avro").head();
        System.out.println("Success =\t" + ((Integer) row.get(0) == 5));
    }
}
{code}
With a simple custom class:
{code:title=CustomClass.java|borderStyle=solid}
import java.io.Serializable;

public class CustomClass implements Serializable {
    public int value;
    public void setValue(int value) { this.value = value; }
    public int getValue() { return this.value; }
}
{code}
Everything looks OK to me, and the main function prints "Success = true". In the future, please make sure the issue is not in your own application before opening a JIRA. As an aside, I really recommend picking up some Scala; IMO the Scala API is much friendlier, especially around the edges, for things like the avro library.
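As a side note on why `createDataFrame(list, CustomClass.class)` finds a `value` column at all: Spark infers the schema from JavaBean getter/setter pairs on the class. A stdlib-only sketch of that kind of introspection (illustrative only, not Spark's actual inference code):

```java
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class BeanDemo {
    // Same shape as Eric's CustomClass: public getter/setter pair for "value".
    public static class CustomClass implements Serializable {
        public int value;
        public void setValue(int value) { this.value = value; }
        public int getValue() { return this.value; }
    }

    // List the JavaBean properties a reflective schema-inference pass would see.
    public static List<String> describe(Class<?> beanClass) throws Exception {
        List<String> props = new ArrayList<>();
        for (PropertyDescriptor pd :
                Introspector.getBeanInfo(beanClass, Object.class).getPropertyDescriptors()) {
            props.add(pd.getName() + " : " + pd.getPropertyType().getSimpleName());
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe(CustomClass.class)); // the "value" property becomes the column
    }
}
```

Without the getter/setter pair (or `Serializable`), the DataFrame route above would not pick up the field.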
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896186#comment-15896186 ] Sean Owen commented on SPARK-19656: --- I guess I mean, have you really tried it? It doesn't result in a compile error, and you didn't say what your compile error is. This works, yes:
{code}
public static class Foo {}
...
Dataset<Row> ds = spark.read()...
ds.map((MapFunction<Row, Foo>) row -> (Foo) row.get(0), new MyFooEncoder());
...
{code}
Meaning, the cast in question works, and you can map to a new Dataset if you have an encoder for your type. The rest of the example you provide above doesn't work; it looks like a Hadoop API version problem, but that's up to your code. You're trying to use old Hadoop API Avro classes with newAPIHadoopFile. This should be on the mailing list until it's narrowed down.
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896179#comment-15896179 ] Nira Amit commented on SPARK-19656: --- Yes, I did, and I answered him that it gives a compilation error in Java. Have you tried doing this in Java? If it is possible, please give a working code example; there should be no discussion if there is a correct way of doing this.
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896174#comment-15896174 ] Sean Owen commented on SPARK-19656: --- Have you tried Eric's suggestion? asInstanceOf is just a cast in Java. That is the kind of discussion to have on the mailing list, and homework to do, before opening a JIRA.
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896172#comment-15896172 ] Nira Amit commented on SPARK-19656: --- But if this is not possible to do in Java, then it IS an actionable change, isn't it? I posted this question several weeks ago on StackOverflow and got many upvotes but no answer, which is why I posted it in the "Question" category of your JIRA. Is it possible in Java or isn't it? From Eric's answer it sounds like it should be, yet nobody seems to know how to do it.
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896169#comment-15896169 ] Sean Owen commented on SPARK-19656: --- Mostly, it is that questions should go to the mailing list; I don't keep track of your JIRAs. This should be reserved for actionable changes, not questions.
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896166#comment-15896166 ] Nira Amit commented on SPARK-19656: --- [~emaynard] There is no "asInstanceOf" method in the Java API, and if I try to cast it directly I get a compilation error. [~sowen] Are you not handling tickets about the Java API? This is the second time you have closed a ticket I opened about loading custom objects from Avro in Java and marked it "Not a problem". Either this is not possible in Java, in which case it is at least a missing feature (and a misleading one, because it looks like it should be possible), or I'm not doing it right, in which case you can provide a working code example in Java.
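For what it's worth, Scala's `asInstanceOf[T]` corresponds to a plain `(T)` cast in Java, and that cast is exactly what produces the ClassCastException in the report when the runtime type is a `GenericData$Record`. A minimal sketch of both outcomes, using only stdlib types (no Avro or Spark involved):

```java
public class CastDemo {
    // Returns true when casting the argument to Integer throws
    // ClassCastException at run time (the cast always compiles,
    // since any Object reference *might* hold an Integer).
    public static boolean castToIntegerFails(Object o) {
        try {
            Integer ignored = (Integer) o;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Analogous to datum() returning a statically untyped value:
        Object datum = "not really an Integer";
        String s = (String) datum;      // Java's spelling of asInstanceOf[String]: fine here
        System.out.println("cast to String ok: " + (s == datum));
        // Wrong runtime type: compiles, fails at run time -- the same pattern as
        // GenericData$Record cannot be cast to my.package.containing.MyCustomClass.
        System.out.println("cast to Integer fails: " + castToIntegerFails(datum));
    }
}
```

So the cast itself is expressible in Java; the reported failure is about the runtime object genuinely being a `GenericData$Record`, not about a missing `asInstanceOf`.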
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895998#comment-15895998 ] Eric Maynard commented on SPARK-19656: -- Normally, after getting the `datum` you should call `asInstanceOf` to cast it properly. In any event, in Spark 2.0 the easier way to achieve what you want is probably something like this:
{code:scala}
import com.databricks.spark.avro._

val df = spark.read.avro("file.avro")
val extracted = df.map(row => row(0).asInstanceOf[MyCustomClass])
{code}
[ https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873090#comment-15873090 ] Nira Amit commented on SPARK-19656: --- I also tried to do this without writing my own `AvroKey` and `AvroKeyInputFormat` subclasses:
{code}
JavaPairRDD<AvroKey<MyCustomClass>, NullWritable> records =
    sc.newAPIHadoopFile("file:/path/to/file.avro",
        new AvroKeyInputFormat<MyCustomClass>().getClass(),
        new AvroKey<MyCustomClass>().getClass(),
        NullWritable.class,
        sc.hadoopConfiguration());
{code}
Which I think should have worked, but it instead results in a compilation error:
{code}
Error:(263, 36) java: incompatible types: inferred type does not conform to equality constraint(s)
    inferred: org.apache.avro.mapred.AvroKey<MyCustomClass>
    equality constraints(s): org.apache.avro.mapred.AvroKey<MyCustomClass>, capture#1 of ? extends org.apache.avro.mapred.AvroKey
{code}
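That "capture#1 of ?" complaint is a property of `getClass()` itself, not of Avro: `getClass()` is declared to return `Class<? extends X>`, and the instance's type parameter is erased, so the compiler cannot prove the result is the `Class<AvroKey<MyCustomClass>>` that `newAPIHadoopFile`'s signature demands. The same mismatch can be reproduced with plain collections (a Spark-free sketch; the names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class CaptureDemo {
    // getClass() returns Class<? extends List> here: a wildcard capture,
    // with the element type parameter erased entirely.
    public static Class<?> rawClassOf(List<?> list) {
        return list.getClass();
    }

    public static void main(String[] args) {
        // The String parameter is gone at run time; only the raw class survives,
        // just as new AvroKey<MyCustomClass>().getClass() yields a
        // Class<? extends AvroKey> rather than Class<AvroKey<MyCustomClass>>.
        Class<?> c = rawClassOf(new ArrayList<String>());
        System.out.println(c == ArrayList.class);
    }
}
```

This is why the Scala `classOf[...]` form has no direct Java equivalent for parameterized types; in Java one typically passes the raw `AvroKey.class` token and accepts an unchecked conversion.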