[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896590#comment-15896590
 ] 

Nira Amit commented on SPARK-19656:
---

I will not, but please consider documenting the correct way to work with the 
newAPIHadoopFile in Java. It is not as easy as working with it in Scala and 
I've been googling this enough to know that it is not clear to many Java 
developers who try to use it.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896581#comment-15896581
 ] 

Nira Amit commented on SPARK-19656:
---

I found a problem in my schema and managed to load my custom type. So the 
answer to my original question is basically no, there is nothing like 
{code}
ctx.hadoopFile("/path/to/the/avro/file.avro",
  classOf[AvroInputFormat[MyClassInAvroFile]],
  classOf[AvroWrapper[MyClassInAvroFile]],
  classOf[NullWritable])
{code}
for loading custom types into RDDs with the Java API. We have to create all the 
wrapper classes and implement our own RecordReader.

I think this should be documented somewhere.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896343#comment-15896343
 ] 

Sean Owen commented on SPARK-19656:
---

It accepts it because you tell it that's what the InputFormat will return, but 
it doesn't. The Class arg is there just for its compile-time type. That doesn't 
make it so and it doesn't have a way of verifying it's what your InputFormat 
returns.

newAPIHadoopFile doesn't load as anything in particular; the InputFormat does. 
You are still really talking about Hadoop and Avro APIs.

I'm going to leave the conversation there and close this, as this is as much as 
is reasonable to consider in the context of Spark. This is not a bug as-is. You 
can take this info to explore how to work with Avro values elsewhere. A JIRA 
can be reopened if you have a clear and reproducible problem in what Spark is 
supposed to return or do and what it does. That does require understanding the 
operation of Hadoop APIs. Questions should stay on the mailing list or SO, if 
it's still in the realm of "how can I get this to work?"

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896314#comment-15896314
 ] 

Nira Amit commented on SPARK-19656:
---

And by the way, "what is in the file" is bytes. The question is what I load 
these bytes into. I'm trying to load them into a MyCustomClass, apparently what 
newAPIHadoopFile is loading them into is GenericData$Record. Even though the 
return type it promises is JavaPairRDD. 

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896286#comment-15896286
 ] 

Nira Amit commented on SPARK-19656:
---

But then why does the compiler accepts what newAPIHadoopFile returns as 
MyCustomClass? If what you are saying is correct, than the only return type 
that should be acceptable is a GenericData$Record or something that can be 
casted to it.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896282#comment-15896282
 ] 

Sean Owen commented on SPARK-19656:
---

Well, at the least, I'd suggest posting a more compilable example. But the last 
point is I think your problem: you are correctly getting a GenericData$Record 
because that is what is in the file. You need to call methods on that object to 
get your type object out. That's an Avro usage issue in your code. You need to 
investigate that before opening a JIRA.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896277#comment-15896277
 ] 

Nira Amit commented on SPARK-19656:
---

The only reason my code sample doesn't compile is because it doesn't include my 
actual custom class implementation. Otherwise it's a copy-paste of my valid 
code which compiles, runs, and then crushes due to a RuntimeException. And that 
is because the class it's getting in runtime isn't what the compiler gave it. I 
understand that the solution is to migrate my code to Datasets. But this seems 
like a problem in the newAPIHadoopFile API.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896272#comment-15896272
 ] 

Sean Owen commented on SPARK-19656:
---

PS I should be concrete about why I think the original code doesn't work -- it 
doesn't compile because you're using newAPIHadoopFile whereas the example you 
follow uses hadoopFile. If you adjusted that, then I think you're getting back 
an Avro GenericRecord as expected. Avro has its own records in a file, not your 
objects. You need to get() your type out of it?

But that's an issue in your code. I think the reason this went to DataFrame / 
Dataset is that there is first-class support for Avro there where your types 
get unpacked. That's the righter way to do this anyway, although, shouldn't be 
much reason you can't do this with RDDs if you must.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896268#comment-15896268
 ] 

Sean Owen commented on SPARK-19656:
---

Yes, I just tried to compile your code example above, and it doesn't work, but, 
for more basic reasons. That much is "Not A Problem" because you've got more 
basic usage errors. That is, this is not an example of code that should work 
but doesn't due to Avro issues. To the additional narrower question of  whether 
Datasets and casting works, it does, and I verified that it compiles.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896266#comment-15896266
 ] 

Nira Amit commented on SPARK-19656:
---

[~sowen] I have been trying this for weeks every way I could possibly think of. 
Have you (really) tried any of my code samples? With RDDs, not Datasets? If 
this is not possible with the newAPIHadoopFile then it's not an issue for 
mailing lists but rather be mentioned explicitly in the documentation because 
apparently I'm not the only one who expected this to work and couldn't figure 
out what I'm doing wrong.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896262#comment-15896262
 ] 

Nira Amit commented on SPARK-19656:
---

Thanks Eric, but my question is about RDDs. Is it correct that it is not 
possible to load custom classes directly to RDDs in Java? Only to Dataframes?

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896194#comment-15896194
 ] 

Eric Maynard commented on SPARK-19656:
--

Here is a complete working example in Java:

{code:title=AvroTest.java|borderStyle=solid}
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.ArrayList;

public class AvroTest {

public static void main(String[] args){

//build spark session:
System.setProperty("hadoop.home.dir", "C:\\Hadoop");//windows hack
SparkSession spark = 
SparkSession.builder().master("local").appName("Avro Test")
.config("spark.sql.warehouse.dir", 
"file:///c:/tmp/spark-warehouse")//another windows hack
.getOrCreate();

//create data:
ArrayList list = new ArrayList();
CustomClass cc = new CustomClass();
cc.setValue(5);
list.add(cc);
spark.createDataFrame(list, 
CustomClass.class).write().format("com.databricks.spark.avro").save("C:\\tmp\\file.avro");

//read data:
Row row = 
(spark.read().format("com.databricks.spark.avro").load("C:\\tmp\\file.avro").head());
System.out.println("Success =\t" + ((Integer)row.get(0) == 5));
}
}



{code}

With a simple custom class:
{code:title=CustomClass.java|borderStyle=solid}
import java.io.Serializable;

public class CustomClass implements Serializable {
public int value;
public void setValue(int value){this.value = value;}
public int getValue(){return this.value;}
}
{code}  
  
Everything looks ok to me, and the main function prints "Success = true". In 
the future please make sure that you don't have an issue in your application 
before opening a JIRA. Also, as an aside, I really recommend picking up some 
Scala as IMO the Scala API is much friendlier, esp. around the edges for things 
like the avro library.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896186#comment-15896186
 ] 

Sean Owen commented on SPARK-19656:
---

I guess I mean, have you really tried it? it doesn't result in a compile error, 
and you didn't say what the compile error is. This works, yes:

{code}
  public static class Foo {}

 ...
Dataset ds = spark.read;
ds.map((MapFunction) row -> (Foo) row.get(0), new MyFooEncoder());
 ...
{code}

Meaning, the cast in question works and you can map to a new Dataset if you 
have an encoder for your type.

The rest of the example you provide above doesn't work; it looks like a Hadoop 
API version problem. That's up to your code though. You're trying to use old 
Hadoop API Avro classes with newAPIHadoopFile.

This should be on the mailing list until it's narrowed down.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896179#comment-15896179
 ] 

Nira Amit commented on SPARK-19656:
---

Yes, I did, and answered him that it gives a compilation error in Java. Have 
you tried doing this in Java? If this is possible then please give a working 
code example, there should be no discussion if there is a correct way of doing 
this.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896174#comment-15896174
 ] 

Sean Owen commented on SPARK-19656:
---

Have you tried Eric's suggestion? asInstanceOf is just casting in Java. That is 
the kind of discussion to have on the mailing list and homework to do before 
opening a JIRA.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896172#comment-15896172
 ] 

Nira Amit commented on SPARK-19656:
---

But if this is not possible to do in Java then it IS an actionable change, 
isn't it? I already posted this question several weeks ago in StackOverflow and 
got many upvotes but no answer, which is why I posted it in the "Question" 
category of your Jira. Is it possible in Java or isn't it? From Eric's answer 
it sounds like it should be, yet nobody seems to know how to do it.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896169#comment-15896169
 ] 

Sean Owen commented on SPARK-19656:
---

Mostly, it is that questions should go to the mailing list. I don't keep track 
of your JIRAs. This should be reserved for actionable changes not questions.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-05 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896166#comment-15896166
 ] 

Nira Amit commented on SPARK-19656:
---

[~emaynard] There is no "asInstanceOf" method in the Java API. And if I try to 
cast it directly I get a compilation error.
[~sowen] Are you not handling tickets about the Java API? It's the second time 
you close a ticket I open about loading custom objects from Avro in Java and 
mark it as "Not a problem". Either this is not possible in Java, in which case 
it's at least a missing feature (and misleading, because it looks like it 
should be possible), or I'm not doing it right and in this case you can provide 
a working code example in Java.

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-03-04 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895998#comment-15895998
 ] 

Eric Maynard commented on SPARK-19656:
--

Normally after getting the `datum` you should call `asInstanceOf` to cast it 
properly.
  
In any event in Spark 2.0 the easier way to achieve what you want is probably 
something like this:

{code:scala}
import com.databricks.spark.avro._
val df = spark.read.avro("file.avro")
val extracted = df.map(row => (row(0).asInstanceOf[MyCustomClass]))
{code}

> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19656) Can't load custom type from avro file to RDD with newAPIHadoopFile

2017-02-18 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873090#comment-15873090
 ] 

Nira Amit commented on SPARK-19656:
---

I also tried to do this without writing my own `AvroKey` and 
`AvroKeyInputFormat`:

{code}
JavaPairRDD, NullWritable> records =
sc.newAPIHadoopFile("file:/path/to/file.avro",
new AvroKeyInputFormat().getClass(), new 
AvroKey().getClass(), NullWritable.class,
sc.hadoopConfiguration());
{code}

Which I think should have worked but instead results in a compilation error:
{code}
Error:(263, 36) java: incompatible types: inferred type does not conform to 
equality constraint(s)
inferred: 
org.apache.avro.mapred.AvroKey
equality constraints(s): 
org.apache.avro.mapred.AvroKey,capture#1 
of ? extends org.apache.avro.mapred.AvroKey
{code}


> Can't load custom type from avro file to RDD with newAPIHadoopFile
> --
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Nira Amit
>
> If I understand correctly, in scala it's possible to load custom objects from 
> avro files to RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a scala developer, so I tried to "translate" this to java as best I 
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey{};
> public static class MyCustomAvroReader extends 
> AvroRecordReaderBase {
> // with my custom schema and all the required methods...
> }
> public static class MyCustomInputFormat extends 
> FileInputFormat{
> @Override
> public RecordReader 
> createRecordReader(InputSplit inputSplit, TaskAttemptContext 
> taskAttemptContext) throws IOException, InterruptedException {
> return new MyCustomAvroReader();
> }
> }
> ...
> JavaPairRDD records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> MyCustomInputFormat.class, MyCustomAvroKey.class,
> NullWritable.class,
> sc.hadoopConfiguration());
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " + 
> first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()` 
> actually returns a `GenericData$Record` in runtime, not a `MyCustomClass` 
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record 
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org