Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-15 Thread Wush Wu
are super slow in Spark, 100x slower than Hadoop. Sent from my iPhone. On 14-Jul-2015, at 10:59 PM, Wush Wu wush...@gmail.com wrote: I don't understand. By the way, `joinWithCassandraTable` does improve my query time from 40 minutes to 3 minutes. 2015-07-15 13:19 GMT+08:00 ÐΞ€ρ@Ҝ
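A minimal sketch of the `joinWithCassandraTable` approach credited with the 40-minute-to-3-minute improvement, assuming spark-cassandra-connector 1.3 and hypothetical keyspace, table, and connection settings:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object JoinWithCassandraTableSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection settings and names, for illustration only.
    val conf = new SparkConf()
      .setAppName("JoinWithCassandraTableSketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Small RDD of keys loaded from a text file.
    val keys = sc.textFile("keys.txt").map(line => Tuple1(line.trim))

    // Instead of reading the whole Cassandra table and calling leftOuterJoin,
    // joinWithCassandraTable fetches only the rows matching these keys.
    val joined = keys.joinWithCassandraTable("my_keyspace", "my_table")
    joined.take(10).foreach(println)

    sc.stop()
  }
}
```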

Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
Dear all, I am trying to join two RDDs, named rdd1 and rdd2. rdd1 is loaded from a text file with about 33,000 records. rdd2 is loaded from a table in Cassandra which has about 3 billion records. I tried the following code: ```scala val rdd1 : (String, XXX) = sc.textFile(...).map(...) import
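The code preview above is cut off; a minimal sketch of the kind of leftOuterJoin the question describes, assuming spark-cassandra-connector 1.3 and hypothetical table, key, and column names (the placeholder type XXX from the original stays abstract as a simple case class):

```scala
import com.datastax.spark.connector._
import org.apache.spark.SparkContext

// Hypothetical stand-in for the XXX value type in the original snippet.
case class Record(value: String)

object LeftOuterJoinSketch {
  def run(sc: SparkContext): Unit = {
    // rdd1: ~33,000 (key, value) pairs parsed from a text file.
    val rdd1 = sc.textFile("input.txt").map { line =>
      val fields = line.split("\t")
      (fields(0), Record(fields(1)))
    }

    // rdd2: keyed rows read from a ~3 billion row Cassandra table.
    // Joining this way scans and shuffles far more data than needed,
    // which is why the thread ends up recommending joinWithCassandraTable.
    val rdd2 = sc.cassandraTable[(String, String)]("my_keyspace", "my_table")
      .select("key", "value")

    val joined = rdd1.leftOuterJoin(rdd2)
    joined.take(10).foreach(println)
  }
}
```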

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
/datastax/spark-cassandra-connector/blob/v1.3.0-M2/doc/2_loading.md Wush 2015-07-15 12:15 GMT+08:00 Wush Wu wush...@gmail.com: Dear all, I am trying to join two RDDs, named rdd1 and rdd2. rdd1 is loaded from a text file with about 33,000 records. rdd2 is loaded from a table in Cassandra which has

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
, 2015 at 9:35 PM, Wush Wu wush...@gmail.com wrote: Dear all, I have found a post discussing the same thing: https://groups.google.com/a/lists.datastax.com/forum/#!searchin/spark-connector-user/join/spark-connector-user/q3GotS-n0Wk/g-LPTteCEg0J The solution is using joinWithCassandraTable

Difference behaviour of DateType in SparkSQL between 1.2 and 1.3

2015-03-26 Thread Wush Wu
Dear all, I am trying to upgrade Spark from 1.2 to 1.3 and switch the existing API for creating a SchemaRDD over to DataFrame. After testing, I noticed that the following behaviour has changed: ``` import java.sql.Date import com.bridgewell.SparkTestUtils import org.apache.spark.rdd.RDD import
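The snippet is truncated before the actual test; a minimal sketch of the kind of DateType round-trip being compared, assuming Spark 1.3, the reflection-based toDF() conversion, and a hypothetical case class (the original com.bridgewell.SparkTestUtils helper is not reproduced):

```scala
import java.sql.Date

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record with a DateType column, for illustration only.
case class Event(name: String, day: Date)

object DateTypeCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("DateTypeCheck"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val rdd: RDD[Event] = sc.parallelize(Seq(Event("a", Date.valueOf("2015-03-26"))))

    // Under 1.2 this would have gone through the implicit SchemaRDD
    // conversion; under 1.3 the same data becomes a DataFrame.
    val df = rdd.toDF()
    df.collect().foreach(println)

    sc.stop()
  }
}
```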

Construct model matrix from SchemaRDD automatically

2015-03-05 Thread Wush Wu
Dear all, I am a new Spark user coming from R. After exploring the SchemaRDD, I noticed that it is similar to a data.frame. Is there a feature like `model.matrix` in R to convert a SchemaRDD to a model matrix automatically according to the column types, without explicitly converting them one by one? Thanks, Wush
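There was no direct model.matrix equivalent in MLlib at the time; a minimal sketch of the column-by-column conversion the question hopes to avoid, assuming Spark 1.2, MLlib's Vectors, and a hypothetical registered table with one label and two numeric feature columns:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SQLContext

object ManualModelMatrix {
  def run(sc: SparkContext): Unit = {
    val sqlContext = new SQLContext(sc)
    // Hypothetical table with columns label, x1, x2, all Double.
    val schemaRdd = sqlContext.sql("SELECT label, x1, x2 FROM my_table")

    // Every column is extracted by position and type, one by one --
    // exactly the boilerplate a model.matrix-style helper would remove.
    val points = schemaRdd.map { row =>
      LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1), row.getDouble(2)))
    }
    points.take(5).foreach(println)
  }
}
```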

Re: Global sequential access of elements in RDD

2015-02-27 Thread Wush Wu
:38 GMT+08:00 Wush Wu w...@bridgewell.com: Dear all, I want to implement a sequential algorithm on an RDD. For example: val conf = new SparkConf() conf.setMaster("local[2]").setAppName("SequentialSuite") val sc = new SparkContext(conf) val rdd = sc.parallelize(Array(1, 3, 2, 7, 1, 4

Global sequential access of elements in RDD

2015-02-26 Thread Wush Wu
Dear all, I want to implement a sequential algorithm on an RDD. For example: val conf = new SparkConf() conf.setMaster("local[2]").setAppName("SequentialSuite") val sc = new SparkContext(conf) val rdd = sc.parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2).sortBy(x => x, true)
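The preview stops at the sortBy; a minimal sketch of one common way to visit the sorted elements in global order, using RDD.toLocalIterator to stream one partition at a time to the driver (the data is the toy array from the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SequentialAccess {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SequentialSuite")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2)
      .sortBy(x => x, ascending = true)

    // toLocalIterator pulls one partition at a time to the driver, so the
    // sorted elements can be visited in global order without collect().
    rdd.toLocalIterator.foreach(println)

    sc.stop()
  }
}
```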

Re: Extract hour from Timestamp in Spark SQL

2015-02-16 Thread Wush Wu
Dear Cheng Hao, You are right! After switching to the HiveContext, the issue is solved. Thanks, Wush 2015-02-15 10:42 GMT+08:00 Cheng, Hao hao.ch...@intel.com: Are you using the SQLContext? I think the HiveContext is recommended. Cheng Hao *From:* Wush Wu [mailto:w...@bridgewell.com
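A minimal sketch of the fix described here, assuming Spark 1.2, a HiveContext, and a hypothetical table with a timestamp column named ts; the Hive hour() UDF is reachable through HiveContext but not the plain SQLContext, which is what produced the error in the original question:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object ExtractHour {
  def run(sc: SparkContext): Unit = {
    // HiveContext brings in Hive's built-in UDFs, including hour().
    val hiveContext = new HiveContext(sc)

    // Hypothetical table and column names, for illustration only.
    val hours = hiveContext.sql("SELECT hour(ts) FROM events")
    hours.take(10).foreach(println)
  }
}
```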

Extract hour from Timestamp in Spark SQL

2015-02-11 Thread Wush Wu
Dear all, I am new to Spark SQL and have no experience with Hive. I tried to use the built-in Hive function to extract the hour from a timestamp in Spark SQL, but got: java.util.NoSuchElementException: key not found: hour. How should I extract the hour from a timestamp? And I am very confused about

Re: Using String Dataset for Logistic Regression

2014-06-02 Thread Wush Wu
Dear all, Does Spark support sparse matrices/vectors for LR now? Best, Wush On 2014/6/2 at 3:19 PM, praveshjain1991 praveshjain1...@gmail.com wrote: Thank you for your replies. I've now been using integer datasets but ran into another issue.
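MLlib gained sparse vector support around Spark 1.0; a minimal sketch of feeding sparse features into logistic regression, assuming MLlib's Vectors.sparse and LogisticRegressionWithSGD with made-up toy data:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object SparseLRSketch {
  def run(sc: SparkContext): Unit = {
    // Toy data: 10-dimensional sparse feature vectors given as (indices, values).
    val data = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.sparse(10, Array(0, 3), Array(1.0, 2.0))),
      LabeledPoint(0.0, Vectors.sparse(10, Array(1, 7), Array(3.0, 1.0)))
    ))

    val model = LogisticRegressionWithSGD.train(data, numIterations = 100)
    println(model.weights)
  }
}
```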

Recommended way to develop spark application with both java and python

2014-04-07 Thread Wush Wu
Dear all, We have a Spark 0.8.1 cluster on Mesos 0.15. Some of my colleagues are familiar with Python, but some features are developed in Java. I am looking for a way to integrate Java and Python on Spark. I notice that the initialization of pyspark does not include a field to distribute