Re: Spark 1.1.0 hbase_inputformat.py not working

2014-10-01 Thread Kan Zhang
Thanks > > 2014-10-01 1:37 GMT-03:00 Kan Zhang : > >> I somehow missed this. Do you still have a problem? You probably didn't >> specify the correct spark-examples jar using --driver-class-path. See >> the following for an example. >> >> MASTER=local ./b

Re: Spark 1.1.0 hbase_inputformat.py not working

2014-09-30 Thread Kan Zhang
I somehow missed this. Do you still have a problem? You probably didn't specify the correct spark-examples jar using --driver-class-path. See the following for an example. MASTER=local ./bin/spark-submit --driver-class-path ./examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar
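
For reference, the read in hbase_inputformat.py boils down to roughly the following sketch; the ZooKeeper host and table name are placeholders, and the converter classes ship in the spark-examples jar, which is why that jar must be on the driver classpath:

    from pyspark import SparkContext

    sc = SparkContext(appName="HBaseInputFormat")

    conf = {"hbase.zookeeper.quorum": "localhost",       # placeholder ZK host
            "hbase.mapreduce.inputtable": "test_table"}  # placeholder table

    # The converter classes live in the spark-examples jar,
    # hence the --driver-class-path requirement above.
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
        conf=conf)
    print(hbase_rdd.count())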

Re: pyspark cassandra examples

2014-09-30 Thread Kan Zhang
> java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected Most likely it is the Hadoop 1 vs Hadoop 2 issue. The example was given for Hadoop 1 (default Hadoop version for Spark). You may try to set the output format class in conf for
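
A sketch of the output-side setup being pointed at, modeled on examples/src/main/python/cassandra_outputformat.py from the same examples directory; the address, keyspace, table, and CQL here are all placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="CassandraOutputFormat")

    conf = {"cassandra.output.thrift.address": "localhost",
            "cassandra.output.thrift.port": "9160",
            "cassandra.output.keyspace": "test_ks",
            "cassandra.output.partitioner.class": "Murmur3Partitioner",
            "cassandra.output.cql": "UPDATE test_ks.users SET fname = ?, lname = ?",
            "mapreduce.output.basename": "users",
            # The key line: select the output format class via the job conf.
            "mapreduce.outputformat.class": "org.apache.cassandra.hadoop.cql3.CqlOutputFormat",
            "mapred.reduce.tasks": "1"}

    key = {"user_id": 1}  # row key fields expected by the CQL above
    sc.parallelize([(key, ["John", "Doe"])]).saveAsNewAPIHadoopDataset(
        conf=conf,
        keyConverter="org.apache.spark.examples.pythonconverters.ToCassandraCQLKeyConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.ToCassandraCQLValueConverter")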

Re: pyspark and cassandra

2014-09-10 Thread Kan Zhang
possible to use only >> cassandra - input/output without hadoop? >> 3) I know there are a couple of strategies for storage level, in case my >> data set is quite big and I don't have enough memory to process - can I use >> the DISK_ONLY option without hadoop (having only cassandra)
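
On the storage-level point, DISK_ONLY persistence uses local disk rather than HDFS, so it needs no Hadoop cluster; a minimal sketch:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="DiskOnlyDemo")
    rdd = sc.parallelize(range(1000000))
    # DISK_ONLY spills cached partitions to local disk, not HDFS.
    rdd.persist(StorageLevel.DISK_ONLY)
    print(rdd.count())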

Re: pyspark and cassandra

2014-09-02 Thread Kan Zhang
In Spark 1.1, it is possible to read from Cassandra using Hadoop jobs. See examples/src/main/python/cassandra_inputformat.py for an example. You may need to write your own key/value converters. On Tue, Sep 2, 2014 at 11:10 AM, Oleg Ruchovets wrote: > Hi All, > Is it possible to have cassand
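
The core of cassandra_inputformat.py, roughly; the address, keyspace, and column family are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="CassandraInputFormat")
    conf = {"cassandra.input.thrift.address": "localhost",   # placeholder
            "cassandra.input.thrift.port": "9160",
            "cassandra.input.keyspace": "test_ks",           # placeholder
            "cassandra.input.columnfamily": "users",         # placeholder
            "cassandra.input.partitioner.class": "Murmur3Partitioner",
            "cassandra.input.page.row.size": "3"}
    cass_rdd = sc.newAPIHadoopRDD(
        "org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat",
        "java.util.Map",
        "java.util.Map",
        keyConverter="org.apache.spark.examples.pythonconverters.CassandraCQLKeyConverter",
        valueConverter="org.apache.spark.examples.pythonconverters.CassandraCQLValueConverter",
        conf=conf)
    print(cass_rdd.collect())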

Re: Using Hadoop InputFormat in Python

2014-08-14 Thread Kan Zhang
Good timing! I encountered that same issue recently and to address it, I changed the default Class.forName call to Utils.classForName. See my patch at https://github.com/apache/spark/pull/1916. After that change, my bin/pyspark --jars worked. On Wed, Aug 13, 2014 at 11:47 PM, Tassilo Klein wrote

Re: Using Hadoop InputFormat in Python

2014-08-13 Thread Kan Zhang
Tassilo, newAPIHadoopRDD has been added to PySpark in master and the yet-to-be-released 1.1 branch. It allows you to specify your custom InputFormat. Examples of using it include hbase_inputformat.py and cassandra_inputformat.py in examples/src/main/python. Check it out. On Wed, Aug 13, 2014 at 3:12 PM,
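
The call shape, given an existing SparkContext sc; all com.example.* names are hypothetical stand-ins for your own InputFormat and converter classes:

    rdd = sc.newAPIHadoopRDD(
        "com.example.MyInputFormat",                     # your InputFormat
        "org.apache.hadoop.io.Text",                     # key class it emits
        "org.apache.hadoop.io.Text",                     # value class it emits
        keyConverter="com.example.MyKeyConverter",       # optional converter
        valueConverter="com.example.MyValueConverter",   # optional converter
        conf={"my.custom.option": "value"})              # hypothetical conf key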

Re: hdfs replication on saving RDD

2014-07-15 Thread Kan Zhang
Andrew, there are overloaded versions of saveAsHadoopFile or saveAsNewAPIHadoopFile that allow you to pass in a per-job Hadoop conf. saveAsTextFile is just a convenience wrapper on top of saveAsHadoopFile. On Mon, Jul 14, 2014 at 11:22 PM, Andrew Ash wrote: > In general it would be nice to be a
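
In PySpark terms, the same per-job override might look like the sketch below; the path is a placeholder, and it assumes the HDFS client honors dfs.replication from the job conf:

    from pyspark import SparkContext

    sc = SparkContext(appName="ReplicationDemo")
    pairs = sc.parallelize([(1, "a"), (2, "b")])
    pairs.saveAsNewAPIHadoopFile(
        "hdfs:///tmp/output",  # placeholder path
        "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
        keyClass="org.apache.hadoop.io.IntWritable",
        valueClass="org.apache.hadoop.io.Text",
        conf={"dfs.replication": "2"})  # per-job replication override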

Re: zip in pyspark truncates RDD to number of processors

2014-06-21 Thread Kan Zhang
I couldn't reproduce your issue locally, but I suspect it has something to do with partitioning. zip() operates partition by partition and assumes the two RDDs have the same number of partitions and the same number of elements in each partition. By default, map() doesn't preserve partitioning. Try setting pres
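
A minimal sketch of that suggestion, assuming the truncated advice is to pass preservesPartitioning=True to map():

    from pyspark import SparkContext

    sc = SparkContext(appName="ZipDemo")
    rdd = sc.parallelize(range(8), 4)
    # Mark the map as partition-preserving so zip()'s assumption of
    # matching partitions holds on both sides.
    doubled = rdd.map(lambda x: x * 2, preservesPartitioning=True)
    print(rdd.zip(doubled).collect())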

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread Kan Zhang
Yes, it can if you set the output format to SequenceFileOutputFormat. The difference is that saveAsSequenceFile does the conversion to Writable for you if needed and then calls saveAsHadoopFile. On Fri, Jun 20, 2014 at 12:43 AM, abhiguruvayya wrote: > Does JavaPairRDD.saveAsHadoopFile store data as
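
The same relationship, sketched in PySpark for consistency with the other examples (the Java calls are analogous; paths are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="SeqFileDemo")
    pairs = sc.parallelize([(1, "a"), (2, "b")])
    # Convenience wrapper: converts to Writables, then saves via the Hadoop path.
    pairs.saveAsSequenceFile("hdfs:///tmp/seq_a")
    # Same effect by naming the (old-API) output format explicitly.
    pairs.saveAsHadoopFile(
        "hdfs:///tmp/seq_b",
        "org.apache.hadoop.mapred.SequenceFileOutputFormat",
        keyClass="org.apache.hadoop.io.IntWritable",
        valueClass="org.apache.hadoop.io.Text")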

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-19 Thread Kan Zhang
Can you use saveAsObjectFile? On Thu, Jun 19, 2014 at 5:54 PM, abhiguruvayya wrote: > I want to store JavaRDD as a sequence file instead of textfile. But i don't > see any Java API for that. Is there a way for this? Please let me know. > Thanks! > > > > -- > View this message in context: > http