Re: Using Hadoop InputFormat in Python

2014-08-14 Thread Tassilo Klein
Thanks, that already helped a bit. But the examples don't use custom
InputFormats; they only reference fully qualified org.apache classes. If I want
to use my own custom InputFormat in the form of a .class (or jar) file, how can
I do that? I tried providing it to pyspark with --jars myCustomInputFormat.jar

and then calling sc.newAPIHadoopFile(path,
myCustomFullyQualifiedPackageName.ClassName, ...)

However, that didn't work: it couldn't find the class.
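
Concretely, what I'm trying looks roughly like this (the jar name, package,
and key/value classes below are just placeholders for my own):

    # launched as: bin/pyspark --jars myCustomInputFormat.jar
    rdd = sc.newAPIHadoopFile(
        "hdfs:///path/to/data",
        "com.example.MyCustomInputFormat",      # placeholder for my custom InputFormat
        "org.apache.hadoop.io.LongWritable",    # key class the format produces
        "org.apache.hadoop.io.BytesWritable")   # value class the format produces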

Any other idea?

Thanks so far,
 -Tassilo 






Re: Using Hadoop InputFormat in Python

2014-08-14 Thread Kan Zhang
Good timing! I encountered that same issue recently, and to address it I
changed the default Class.forName call to Utils.classForName (so the lookup
goes through Spark's class loader, which can see jars added with --jars). See
my patch at https://github.com/apache/spark/pull/1916. After that change,
bin/pyspark --jars worked for me.









Re: Using Hadoop InputFormat in Python

2014-08-14 Thread TJ Klein
Yes, thanks, great. This seems to be the issue.
At least running with spark-submit works as well now.






Using Hadoop InputFormat in Python

2014-08-13 Thread Tassilo Klein
Hi,

I'd like to read a (binary) file from Python for which I have defined a Java
InputFormat (.java). However, I am stuck on how to use it from Python and
didn't find anything in the newsgroups either.
As far as I know, I have to use the newAPIHadoopRDD function, but I am not sure
how to use it in combination with my custom InputFormat.
Does anybody have a short snippet of code showing how to do this?
Thanks in advance.
Best,
 Tassilo






Re: Using Hadoop InputFormat in Python

2014-08-13 Thread Sunny Khatri
I'm not that familiar with the Python APIs, but you should be able to
configure a Job object with your custom InputFormat and pass the resulting
configuration (i.e. job.getConfiguration()) to newAPIHadoopRDD to get the RDD
you need.
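
In PySpark I believe the rough equivalent is passing a plain dict of Hadoop
configuration properties instead of a Job object, something along these lines
(all names below are placeholders):

    conf = {"my.custom.input.option": "value"}    # whatever properties your format reads
    rdd = sc.newAPIHadoopRDD(
        "com.example.MyCustomInputFormat",    # fully qualified custom InputFormat
        "org.apache.hadoop.io.LongWritable",  # key class
        "org.apache.hadoop.io.BytesWritable", # value class
        conf=conf)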









Re: Using Hadoop InputFormat in Python

2014-08-13 Thread Kan Zhang
Tassilo, newAPIHadoopRDD has been added to PySpark in master and the
yet-to-be-released 1.1 branch. It allows you to specify your custom
InputFormat. Examples of its use include hbase_inputformat.py and
cassandra_inputformat.py in examples/src/main/python. Check them out.
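
The call shape in those examples is roughly the following; with a custom
format you would substitute your own class names (the ones below are
placeholders), and the converters are only needed if your key/value types
aren't standard Writables:

    conf = {"my.custom.input.option": "value"}         # Hadoop properties your format needs
    rdd = sc.newAPIHadoopRDD(
        "com.example.MyCustomInputFormat",             # your InputFormat
        "com.example.MyKeyWritable",                   # key class
        "com.example.MyValueWritable",                 # value class
        keyConverter="com.example.MyKeyConverter",     # optional Converter implementations
        valueConverter="com.example.MyValueConverter", # for non-standard key/value types
        conf=conf)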










Re: Using Hadoop InputFormat in Python

2014-08-13 Thread Tassilo Klein
Yes, that seems logical. But where/how do I pass the InputFormat definition
(.jar/.java/.class) to Spark?
I mean, when using Hadoop I need to call something like 'hadoop jar
myInputFormat.jar -inFormat myFormat ...' to register the file format
definition.


