Re: Using Hadoop InputFormat in Python
Yes, thanks, great. This seems to be the issue. At least running with spark-submit works as well. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-Hadoop-InputFormat-in-Python-tp12067p12126.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Using Hadoop InputFormat in Python
Good timing! I ran into the same issue recently. To address it, I changed the default Class.forName call to Utils.classForName; see my patch at https://github.com/apache/spark/pull/1916. After that change, bin/pyspark --jars worked for me.

On Wed, Aug 13, 2014 at 11:47 PM, Tassilo Klein wrote:
> Thanks. This was already helping a bit. But the examples don't use custom
> InputFormats. Rather, org.apache fully qualified InputFormat. If I want to
> use my own custom InputFormat in form of .class (or jar) how can I use it?
> I tried providing it to pyspark with --jars
>
> and then using sc.newAPIHadoopFile(path, , .)
>
> However, that didn't work as it couldn't find the class.
>
> Any other idea?
>
> Thanks so far,
> -Tassilo
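The --jars option mentioned in this message is also accepted by spark-submit, which was reported elsewhere in this thread to work. As a rough sketch only, the launch command could be assembled as below; the script name, jar name, and Spark home directory are hypothetical placeholders and not taken from the thread.

```python
import os


def spark_submit_argv(script, jar, spark_home="."):
    """Build the argument list for launching `script` with `jar` on the classpath.

    All three arguments are placeholders; nothing here is specific to the
    patch discussed in this message.
    """
    launcher = os.path.join(spark_home, "bin", "spark-submit")
    return [launcher, "--jars", jar, script]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(spark_submit_argv("read_custom.py", "my-inputformat.jar")))
```

The returned list can be handed to subprocess.check_call if the command should actually be run.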
Re: Using Hadoop InputFormat in Python
Thanks, that already helped a bit. But the examples don't use custom InputFormats; they use fully qualified org.apache InputFormats. If I want to use my own custom InputFormat in the form of a .class file (or jar), how can I use it? I tried providing it to pyspark with --jars and then using sc.newAPIHadoopFile(path, , .). However, that didn't work, as it couldn't find the class. Any other ideas? Thanks so far, -Tassilo
Re: Using Hadoop InputFormat in Python
Tassilo, newAPIHadoopRDD has been added to PySpark in master and in the yet-to-be-released 1.1 branch. It allows you to specify your custom InputFormat. Examples of using it include hbase_inputformat.py and cassandra_inputformat.py in examples/src/main/python. Check them out.

On Wed, Aug 13, 2014 at 3:12 PM, Sunny Khatri wrote:
> Not that much familiar with Python APIs, but You should be able to
> configure a job object with your custom InputFormat and pass in the
> required configuration (:- job.getConfiguration()) to newAPIHadoopRDD to
> get the required RDD
>
> On Wed, Aug 13, 2014 at 2:59 PM, Tassilo Klein wrote:
>
>> Hi,
>>
>> I'd like to read in a (binary) file from Python for which I have defined a
>> Java InputFormat (.java) definition. However, now I am stuck in how to use
>> that in Python and didn't find anything in newsgroups either.
>> As far as I know, I have to use this newAPIHadoopRDD function. However, I
>> am not sure how to use that in combination with my custom InputFormat.
>> Does anybody have a short snipped of code how to do it?
>> Thanks in advance.
>> Best,
>> Tassilo
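A minimal sketch of what such a call might look like from PySpark, assuming the jar with the custom InputFormat is already visible to the JVM. The class name com.example.MyBinaryInputFormat, the key and value Writable types, and the path in the usage note are hypothetical placeholders and do not come from the thread. The sketch uses the file-based variant, newAPIHadoopFile, because the input here is a file.

```python
# Hypothetical names: replace with your own InputFormat and its key/value types.
INPUT_FORMAT = "com.example.MyBinaryInputFormat"    # assumed custom class
KEY_CLASS = "org.apache.hadoop.io.LongWritable"     # assumed key type
VALUE_CLASS = "org.apache.hadoop.io.BytesWritable"  # assumed value type


def read_with_custom_format(sc, path):
    """Return an RDD of (key, value) pairs read through the custom InputFormat.

    `sc` is an existing SparkContext; the jar containing INPUT_FORMAT must
    already be on the classpath (for example via --jars).
    """
    return sc.newAPIHadoopFile(path, INPUT_FORMAT, KEY_CLASS, VALUE_CLASS)
```

With a live SparkContext, something like read_with_custom_format(sc, "hdfs:///data/sample.bin").take(1) would show the first record, provided Spark can convert the key and value types to Python objects.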
Re: Using Hadoop InputFormat in Python
Yes, that seems logical. But where and how do I pass the InputFormat definition (.jar/.java/.class) to Spark? I mean, when using Hadoop I need to call something like 'hadoop jar -inFormat other stuff' to register the file format definition file.
Re: Using Hadoop InputFormat in Python
I'm not that familiar with the Python APIs, but you should be able to configure a job object with your custom InputFormat and pass the required configuration (i.e. job.getConfiguration()) to newAPIHadoopRDD to get the required RDD.

On Wed, Aug 13, 2014 at 2:59 PM, Tassilo Klein wrote:
> Hi,
>
> I'd like to read in a (binary) file from Python for which I have defined a
> Java InputFormat (.java) definition. However, now I am stuck in how to use
> that in Python and didn't find anything in newsgroups either.
> As far as I know, I have to use this newAPIHadoopRDD function. However, I
> am not sure how to use that in combination with my custom InputFormat.
> Does anybody have a short snipped of code how to do it?
> Thanks in advance.
> Best,
> Tassilo
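In PySpark there is no job object to configure; the closest equivalent, as far as I can tell, is passing the Hadoop configuration as a dict through the conf argument of newAPIHadoopRDD. The sketch below assumes this. The class names and the input path are hypothetical, and the input-directory key is the Hadoop 2 name used by the new-API FileInputFormat, so it only applies if the custom format extends that class.

```python
# Hypothetical names throughout; only newAPIHadoopRDD itself comes from the thread.
INPUT_FORMAT = "com.example.MyBinaryInputFormat"    # assumed custom class
KEY_CLASS = "org.apache.hadoop.io.LongWritable"     # assumed key type
VALUE_CLASS = "org.apache.hadoop.io.BytesWritable"  # assumed value type


def build_job_conf(input_path):
    """Build the Hadoop configuration as a plain dict of string keys and values.

    The key below is the Hadoop 2 input-directory property; older Hadoop
    versions use a different key (assumption to check against your cluster).
    """
    return {"mapreduce.input.fileinputformat.inputdir": input_path}


def read_via_conf(sc, input_path):
    """Pass the configuration dict to newAPIHadoopRDD on an existing SparkContext."""
    return sc.newAPIHadoopRDD(
        INPUT_FORMAT, KEY_CLASS, VALUE_CLASS, conf=build_job_conf(input_path)
    )
```

Any further properties that the custom InputFormat reads from its Configuration can be added to the same dict.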
Using Hadoop InputFormat in Python
Hi, I'd like to read a (binary) file from Python for which I have defined a Java InputFormat (.java). However, I am now stuck on how to use it in Python, and I didn't find anything in the newsgroups either. As far as I know, I have to use the newAPIHadoopRDD function, but I am not sure how to use it in combination with my custom InputFormat. Does anybody have a short snippet of code showing how to do it? Thanks in advance. Best, Tassilo