Re: Spark 1.1.0 hbase_inputformat.py not work

2014-10-01 Thread Kan Zhang
Thanks 2014-10-01 1:37 GMT-03:00 Kan Zhang kzh...@apache.org: I somehow missed this. Do you still have this problem? You probably didn't specify the correct spark-examples jar using --driver-class-path. See the following for an example. MASTER=local ./bin/spark-submit --driver-class-path ./examples

Re: pyspark cassandra examples

2014-09-30 Thread Kan Zhang
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected. Most likely it is the Hadoop 1 vs Hadoop 2 issue. The example was given for Hadoop 1 (the default Hadoop version for Spark). You may try to set the output format class in conf for
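
For reference, a hedged sketch of what setting the output format class in the job conf looks like in PySpark 1.1, following the bundled cassandra_outputformat.py example from memory (the conf keys, class names, and record shape below are assumptions to adapt, not a verified fix for the Hadoop 2 case; assumes a live SparkContext sc):

    # Sketch only: keys/classes recalled from the Spark 1.1
    # cassandra_outputformat.py example; adjust for your Hadoop build.
    conf = {
        "mapreduce.outputformat.class":
            "org.apache.cassandra.hadoop.cql3.CqlOutputFormat",
        "mapreduce.job.output.key.class": "java.util.Map",
        "mapreduce.job.output.value.class": "java.util.List",
        # plus the cassandra.output.* connection settings for your cluster
    }
    # Hypothetical record shape; it depends on your CQL schema.
    rdd = sc.parallelize([({"user_id": "1"}, ["Alice"])])
    rdd.saveAsNewAPIHadoopDataset(conf=conf)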

Re: Spark 1.1.0 hbase_inputformat.py not work

2014-09-30 Thread Kan Zhang
I somehow missed this. Do you still have this problem? You probably didn't specify the correct spark-examples jar using --driver-class-path. See the following for an example. MASTER=local ./bin/spark-submit --driver-class-path ./examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar
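
Reconstructed from the truncated command above together with the example's usage string, the full invocation was likely along these lines (the trailing <host> <table> arguments are an assumption):

    MASTER=local ./bin/spark-submit \
      --driver-class-path ./examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar \
      ./examples/src/main/python/hbase_inputformat.py <host> <table>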

Re: pyspark and cassandra

2014-09-10 Thread Kan Zhang
There are a couple of strategies for storage level. In case my data set is quite big and I don't have enough memory to process it, can I use the DISK_ONLY option without Hadoop (having only Cassandra)? Thanks, Oleg On Wed, Sep 3, 2014 at 3:08 AM, Kan Zhang kzh...@apache.org wrote: In Spark 1.1
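
A minimal PySpark sketch of the DISK_ONLY option being asked about; DISK_ONLY keeps partitions in Spark's local scratch directories (spark.local.dir), so it needs neither HDFS nor a Hadoop cluster:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="DiskOnlyDemo")  # hypothetical app name
    rdd = sc.parallelize(range(1000000))
    # Persist to local disk only; nothing is cached in memory.
    rdd.persist(StorageLevel.DISK_ONLY)
    print(rdd.count())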

Re: pyspark and cassandra

2014-09-02 Thread Kan Zhang
In Spark 1.1, it is possible to read from Cassandra using Hadoop jobs. See examples/src/main/python/cassandra_inputformat.py for an example. You may need to write your own key/value converters. On Tue, Sep 2, 2014 at 11:10 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi All, Is it
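
The heart of that example, recalled from the 1.1 branch (treat the exact conf keys and converter class names as approximate, and the connection values as placeholders; assumes a live SparkContext sc):

    host, keyspace, cf = "127.0.0.1", "test_ks", "users"  # placeholders
    conf = {
        "cassandra.input.thrift.address": host,
        "cassandra.input.thrift.port": "9160",
        "cassandra.input.keyspace": keyspace,
        "cassandra.input.columnfamily": cf,
        "cassandra.input.partitioner.class": "Murmur3Partitioner",
        "cassandra.input.page.row.size": "3",
    }
    cass_rdd = sc.newAPIHadoopRDD(
        "org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat",
        "java.util.Map",
        "java.util.Map",
        keyConverter="org.apache.spark.examples.pythonconverters.CassandraCQLKeyConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.CassandraCQLValueConverter",
        conf=conf)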

Re: Using Hadoop InputFormat in Python

2014-08-14 Thread Kan Zhang
Good timing! I encountered that same issue recently, and to address it I changed the default Class.forName call to Utils.classForName. See my patch at https://github.com/apache/spark/pull/1916. After that change, my bin/pyspark --jars worked. On Wed, Aug 13, 2014 at 11:47 PM, Tassilo Klein
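
With that patch applied, pointing PySpark at a jar of custom input formats is a one-liner (the jar path is hypothetical):

    bin/pyspark --jars /path/to/my-inputformats.jar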

Re: Using Hadoop InputFormat in Python

2014-08-13 Thread Kan Zhang
Tassilo, newAPIHadoopRDD has been added to PySpark in master and the yet-to-be-released 1.1 branch. It allows you to specify your custom InputFormat. Examples of using it include hbase_inputformat.py and cassandra_inputformat.py in examples/src/main/python. Check it out. On Wed, Aug 13, 2014 at 3:12
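
The HBase variant, recalled from examples/src/main/python/hbase_inputformat.py in the 1.1 branch (class names and conf keys approximate; connection values are placeholders; assumes a live SparkContext sc):

    host, table = "127.0.0.1", "test_table"  # placeholders
    conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
        conf=conf)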

Re: hdfs replication on saving RDD

2014-07-15 Thread Kan Zhang
Andrew, there are overloaded versions of saveAsHadoopFile or saveAsNewAPIHadoopFile that allow you to pass in a per-job Hadoop conf. saveAsTextFile is just a convenience wrapper on top of saveAsHadoopFile. On Mon, Jul 14, 2014 at 11:22 PM, Andrew Ash and...@andrewash.com wrote: In general it
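
PySpark exposes the same hook as a plain dict; a hedged sketch (the thread itself discusses the Scala overloads, and the output path is hypothetical; assumes a live SparkContext sc):

    conf = {"dfs.replication": "2"}  # per-job override of the HDFS default
    # saveAs*HadoopFile expects key/value pairs.
    pairs = sc.parallelize(["a", "b"]).map(lambda x: (None, x))
    pairs.saveAsNewAPIHadoopFile(
        "hdfs:///tmp/replication-demo",
        "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.apache.hadoop.io.Text",
        conf=conf)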

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread Kan Zhang
Yes, it can if you set the output format to SequenceFileOutputFormat. The difference is that saveAsSequenceFile does the conversion to Writable for you if needed and then calls saveAsHadoopFile. On Fri, Jun 20, 2014 at 12:43 AM, abhiguruvayya sharath.abhis...@gmail.com wrote: Does
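
The same relationship sketched in PySpark terms (the thread concerns the Java API; output paths are hypothetical; assumes a live SparkContext sc):

    pairs = sc.parallelize([("a", 1), ("b", 2)])
    # Convenience form: converts keys/values to Writables, then writes a SequenceFile.
    pairs.saveAsSequenceFile("hdfs:///tmp/seq-convenient")
    # Spelled-out form: name the output format and Writable classes explicitly.
    pairs.saveAsNewAPIHadoopFile(
        "hdfs:///tmp/seq-explicit",
        "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.IntWritable")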

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-19 Thread Kan Zhang
Can you use saveAsObjectFile? On Thu, Jun 19, 2014 at 5:54 PM, abhiguruvayya sharath.abhis...@gmail.com wrote: I want to store a JavaRDD as a sequence file instead of a text file, but I don't see any Java API for that. Is there a way to do this? Please let me know. Thanks!
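
In the Java API, saveAsObjectFile writes Java-serialized objects and objectFile reads them back; PySpark's closest analogue is pickle-based (a sketch with a hypothetical path; assumes a live SparkContext sc):

    rdd = sc.parallelize([1, 2, 3])
    rdd.saveAsPickleFile("hdfs:///tmp/objs")  # hypothetical path
    back = sc.pickleFile("hdfs:///tmp/objs")
    print(back.collect())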