Compute the global rank of the column

2016-05-31 Thread Dai, Kevin
Hi, all, I want to compute the rank of a column in a table. Currently I use a window function to do it, but then all the data ends up in one partition. Is there a better solution? Regards, Kevin.
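
One way to avoid the single-partition window: sortBy keeps the data distributed, and zipWithIndex then assigns a global position without coalescing everything into one partition. A minimal sketch in Scala (key and value names are illustrative, not from the thread; note this gives row_number-style positions, so ties get distinct ranks):

    import org.apache.spark.SparkContext

    // Sketch: global rank without a one-partition window.
    def globalRank(sc: SparkContext): Unit = {
      val rows = sc.parallelize(Seq(("a", 3.0), ("b", 1.0), ("c", 2.0)))
      val ranked = rows
        .sortBy { case (_, v) => v } // stays distributed across partitions
        .zipWithIndex()              // stable 0-based global position
        .map { case ((k, v), idx) => (k, v, idx + 1) } // 1-based rank
      ranked.collect().foreach(println)
    }

zipWithIndex runs an extra job to count the elements per partition, but that is far cheaper than shuffling the whole table into one partition.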

java.lang.OutOfMemoryError: Direct buffer memory when using broadcast join

2016-03-21 Thread Dai, Kevin
Hi, all, I'm joining a small table (about 200 MB) with a huge table using a broadcast join; however, Spark throws the following exception: 16/03/20 22:32:06 WARN TransportChannelHandler: Exception in connection from java.lang.OutOfMemoryError: Direct buffer memory at
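
Direct-buffer OOMs during a broadcast usually point at the JVM's direct-memory cap being too small for Netty's transfer of the broadcast blocks. A hedged sketch of the settings commonly raised in this situation (the 512m value is illustrative, not a verified fix for this particular trace):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("broadcast-join")
      // Netty allocates direct buffers; lift the per-JVM cap.
      .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=512m")
      .set("spark.driver.extraJavaOptions", "-XX:MaxDirectMemorySize=512m")
      // Auto-broadcast tables only up to ~200 MB (value in bytes).
      .set("spark.sql.autoBroadcastJoinThreshold", (200 * 1024 * 1024).toString)
    val sc = new SparkContext(conf)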

RE: Use pig load function in spark

2015-03-23 Thread Dai, Kevin
-avro, and spark-csv (https://github.com/databricks/spark-csv). Thanks, Yin On Mon, Mar 23, 2015 at 7:14 PM, Dai, Kevin yun...@ebay.com wrote: Hi, Paul, you are right. The story is that we have a lot of Pig load functions to load our different data, and now we want to use Spark
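
A minimal sketch of the spark-csv route the reply points at, using the read API of later Spark versions (the path and the header option are placeholders):

    import org.apache.spark.sql.{DataFrame, SQLContext}

    // Sketch: load CSV through the spark-csv data source.
    def loadCsv(sqlContext: SQLContext): DataFrame =
      sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load("hdfs:///path/to/data.csv")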

Use pig load function in spark

2015-03-23 Thread Dai, Kevin
Hi, all, can Spark use Pig's load functions to load data? Best Regards, Kevin.

RE: Use pig load function in spark

2015-03-23 Thread Dai, Kevin
From: Paul Brown [mailto:p...@mult.ifario.us] Sent: 2015-03-24 4:11 To: Dai, Kevin Subject: Re: Use pig load function in spark The answer is Maybe, but you probably don't want to do that. A typical Pig load function is devoted to bridging external data into Pig's type system, but you don't
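
Following Paul's point, the usual alternative is to skip the LoadFunc and read through the Hadoop InputFormat it wraps. A sketch with TextInputFormat as a stand-in; substitute whatever format your LoadFunc actually delegates to:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Sketch: bypass the Pig LoadFunc and use the underlying InputFormat.
    def loadViaInputFormat(sc: SparkContext): RDD[String] =
      sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
          "hdfs:///path/to/input")
        .map { case (_, line) => line.toString }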

RE: A strange problem in spark sql join

2015-03-09 Thread Dai, Kevin
No, I don't have two master instances. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: 2015-03-09 15:03 To: Dai, Kevin Cc: user@spark.apache.org Subject: Re: A strange problem in spark sql join Make sure you don't have two master instances running on the same machine. It could happen

A strange problem in spark sql join

2015-03-09 Thread Dai, Kevin
Hi, guys, I encountered a strange problem: I joined two tables (both Parquet files) and then did a groupBy, and the groupBy took 19 hours to finish. However, when I killed this job twice in the groupBy stage, the third try would succeed. But after I killed this job and ran it again, it

RE: Implement customized Join for SparkSQL

2015-01-09 Thread Dai, Kevin
Best Regards, Kevin From: Rishi Yadav [mailto:ri...@infoobjects.com] Sent: 2015-01-09 6:52 To: Dai, Kevin Cc: user@spark.apache.org Subject: Re: Implement customized Join for SparkSQL Hi Kevin, Say A has 10 ids, so you are pulling data from B's data source only for these 10 ids? What if you

Implement customized Join for SparkSQL

2015-01-05 Thread Dai, Kevin
Hi, all, suppose I want to join two tables A and B as follows: SELECT * FROM A JOIN B ON A.id = B.id. A is a file, while B is a database indexed by id, which I wrapped with the data source API. The desired join flow is: 1. generate A's RDD[Row]; 2. generate B's RDD[Row] from A by
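
The flow described here is essentially a lookup join: drive B's reads from A's ids. A minimal sketch with mapPartitions, where fetchFromB is a hypothetical stand-in for the indexed lookup behind the data source wrapper (one query per partition of A):

    import org.apache.spark.rdd.RDD

    // Sketch: for each partition of A, fetch only the matching ids from B.
    def lookupJoin[A, B](
        aRdd: RDD[(Long, A)],
        fetchFromB: Seq[Long] => Map[Long, B]): RDD[(Long, (A, B))] =
      aRdd.mapPartitions { iter =>
        val rows = iter.toSeq
        val bSide = fetchFromB(rows.map(_._1).distinct) // one lookup per partition
        rows.iterator.collect {
          case (id, a) if bSide.contains(id) => (id, (a, bSide(id)))
        }
      }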

Window function by Spark SQL

2014-12-04 Thread Dai, Kevin
Hi, all, how can I group by one column, order by another, and then select the first row of each group (just what a window function does) with Spark SQL? Best Regards, Kevin.
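
Spark SQL only grew window functions later (1.4), so at the time the usual answer was to do this on the RDD side: reduceByKey keeping, per group, the row with the smallest ordering column. A sketch with illustrative field names:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    case class Row3(group: String, order: Int, payload: String)

    // Sketch: first row per group, ordered by `order`, without windows.
    def firstPerGroup(rows: RDD[Row3]): RDD[Row3] =
      rows
        .keyBy(_.group)
        .reduceByKey((a, b) => if (a.order <= b.order) a else b)
        .values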

Transform RDD.groupBY result to multiple RDDs

2014-11-19 Thread Dai, Kevin
Hi, all, suppose I have an RDD of (K, V) tuples and I do a groupBy on the key K. My question is how to make each groupBy result, which is (K, Iterable[V]), an RDD of its own. By the way, can we transform it into a DStream in which each groupBy result is an RDD? Best Regards, Kevin.
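
There is no built-in operator that splits one RDD into many by key; the usual sketch collects the distinct keys and runs one filter pass per key, which is workable for a small, known key set but costs one scan of the data per key:

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Sketch: one RDD per key, via a filter per key.
    def splitByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, RDD[V]] = {
      val keys = rdd.keys.distinct().collect()
      keys.map(k => k -> rdd.filter(_._1 == k).values).toMap
    }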

Is there setup and cleanup function in spark?

2014-11-13 Thread Dai, Kevin
Hi, all, does Spark have setup and cleanup functions, as in Hadoop MapReduce, for doing some initialization and cleanup work? Best Regards, Kevin.
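
The common substitute is mapPartitions: open the resource once per partition, process the partition's iterator, and close the resource afterwards. A sketch with a hypothetical Conn resource (the partition is materialized before close so cleanup cannot run while elements are still being pulled lazily):

    import org.apache.spark.rdd.RDD

    // Hypothetical per-partition resource, for illustration only.
    trait Conn { def lookup(s: String): String; def close(): Unit }

    // Sketch: emulate Hadoop's setup()/cleanup() with mapPartitions.
    def withConn(rdd: RDD[String], open: () => Conn): RDD[String] =
      rdd.mapPartitions { iter =>
        val conn = open()                      // setup, once per partition
        val out = iter.map(conn.lookup).toList // materialize before cleanup
        conn.close()                           // cleanup, once per partition
        out.iterator
      }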

ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId not found

2014-10-31 Thread Dai, Kevin
Hi, all, my job failed, and there are a lot of "ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId not found" messages in the log. Can anyone tell me what's wrong and how to fix it? Best Regards, Kevin.

Use RDD like a Iterator

2014-10-28 Thread Dai, Kevin
Hi, all, I have an RDD[T]; can I use it like an iterator? That is, can I compute every element of this RDD lazily? Best Regards, Kevin.
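
RDD.toLocalIterator does roughly this: it streams one partition at a time to the driver, so elements can be consumed lazily without a full collect(). A minimal sketch (it runs one job per partition, trading latency for driver memory):

    import org.apache.spark.rdd.RDD

    // Sketch: consume an RDD element by element on the driver.
    def consumeLazily[T](rdd: RDD[T])(f: T => Unit): Unit =
      rdd.toLocalIterator.foreach(f)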

SchemaRDD Convert

2014-10-22 Thread Dai, Kevin
Hi, all, I have an RDD of a case class T, which contains several primitive types and a Map. How can I convert this to a SchemaRDD? Best Regards, Kevin.
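
With the Spark 1.x API, the implicit createSchemaRDD conversion on SQLContext infers the schema from the case class; the Map field should come out as a MapType, assuming the reflection-based inference of that era handles it. A sketch with an illustrative case class:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.{SchemaRDD, SQLContext}

    // Illustrative case class: primitives plus a Map field.
    case class Record(id: Int, name: String, props: Map[String, String])

    def toSchemaRDD(sc: SparkContext): SchemaRDD = {
      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD // implicit RDD[Product] => SchemaRDD

      val rdd = sc.parallelize(Seq(Record(1, "a", Map("k" -> "v"))))
      val schemaRdd: SchemaRDD = rdd // conversion happens here
      schemaRdd.registerTempTable("records")
      schemaRdd
    }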

Convert Iterable to RDD

2014-10-20 Thread Dai, Kevin
Hi, all, is there any way to convert an Iterable to an RDD? Thanks, Kevin.

RE: Convert Iterable to RDD

2014-10-20 Thread Dai, Kevin
In addition, how do I convert Iterable[Iterable[T]] to RDD[T]? Thanks, Kevin. From: Dai, Kevin [mailto:yun...@ebay.com] Sent: 2014-10-21 10:58 To: user@spark.apache.org Subject: Convert Iterable to RDD Hi, all, is there any way to convert an Iterable to an RDD? Thanks, Kevin.
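
A sketch covering both questions in the thread. parallelize takes a Seq, so the Iterable is materialized on the driver first; that is fine only for data that fits in driver memory:

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Sketch: Iterable[T] => RDD[T].
    def toRdd[T: ClassTag](sc: SparkContext, it: Iterable[T]): RDD[T] =
      sc.parallelize(it.toSeq)

    // Sketch: Iterable[Iterable[T]] => RDD[T], flattening on the driver.
    def flattenToRdd[T: ClassTag](sc: SparkContext,
                                  nested: Iterable[Iterable[T]]): RDD[T] =
      sc.parallelize(nested.flatten.toSeq)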

Interactive interface tool for spark

2014-10-08 Thread Dai, Kevin
Hi, all, we need an interactive interface tool for Spark in which we can run Spark jobs and plot graphs to explore the data interactively. The IPython notebook is good, but it only supports Python (we want one supporting Scala)... BR, Kevin.