Compute the global rank of the column

2016-05-31 Thread Dai, Kevin
Hi, All


I want to compute the rank of a column in a table.


Currently, I use a window function to do it.


However, all the data ends up in one partition.


Is there a better solution?


Regards,

Kevin.
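
One way to avoid pushing every row through a single-partition window is to compute the rank at the RDD level: sort by the column globally (which stays distributed) and attach positions with zipWithIndex. A minimal sketch, assuming a DataFrame df with a numeric column score (both names are placeholders); note that ties get distinct positions rather than equal ranks:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sort the underlying RDD globally, then attach 1-based positions.
val rankedRows = df.rdd
  .sortBy(_.getAs[Double]("score"), ascending = false)
  .zipWithIndex()
  .map { case (row, idx) => Row.fromSeq(row.toSeq :+ (idx + 1L)) }

// Rebuild a DataFrame with an extra global_rank column.
val rankedDF = sqlContext.createDataFrame(
  rankedRows,
  StructType(df.schema.fields :+ StructField("global_rank", LongType, nullable = false)))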


java.lang.OutOfMemoryError: Direct buffer memory when using broadcast join

2016-03-21 Thread Dai, Kevin
Hi,  All


I'm joining a small table (about 200m) with a huge table using a broadcast join; 
however, Spark throws the following exception:


16/03/20 22:32:06 WARN TransportChannelHandler: Exception in connection from
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:651)
at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
at io.netty.buffer.CompositeByteBuf.allocBuffer(CompositeByteBuf.java:1345)
at io.netty.buffer.CompositeByteBuf.consolidateIfNeeded(CompositeByteBuf.java:276)
at io.netty.buffer.CompositeByteBuf.addComponent(CompositeByteBuf.java:116)
at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:82)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

Can anyone tell me what's wrong and how to fix it?

Best Regards,
Kevin.
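
For reference, this error usually means Netty ran out of off-heap (direct) memory while moving the broadcast blocks. A hedged sketch of settings that are commonly adjusted for it; the values are placeholders rather than recommendations, and whether they help depends on the cluster:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Give the executor JVMs more room for the direct (off-heap) buffers Netty uses.
  .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=1g")
  // Or lower the threshold (in bytes) so a table of a few hundred MB is no
  // longer auto-broadcast and the join falls back to a shuffle join.
  .set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)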



RE: Use pig load function in spark

2015-03-23 Thread Dai, Kevin
Hi, Yin

But our data is in a customized sequence-file format which can be read by our 
customized load function in Pig.

And I want to reuse these load functions in Spark to read the data and turn it 
into RDDs.

Best Regards,
Kevin.
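
If the customized loader is ultimately built on a Hadoop InputFormat, that InputFormat can usually be handed to Spark directly, so only the record-decoding logic needs to be reused. A rough sketch; CustomSequenceInputFormat, the Writable key/value types, and MyRecord.parse are hypothetical stand-ins for whatever the Pig load function wraps:

import org.apache.hadoop.io.{BytesWritable, NullWritable}

// Read the raw (key, value) pairs with the same InputFormat the Pig loader uses.
val raw = sc.newAPIHadoopFile(
  "/path/to/data",
  classOf[CustomSequenceInputFormat],
  classOf[NullWritable],
  classOf[BytesWritable])

// Reuse the loader's deserialization logic to build domain objects.
val records = raw.map { case (_, bytes) => MyRecord.parse(bytes.copyBytes()) }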

From: Yin Huai [mailto:yh...@databricks.com]
Sent: 2015-03-24 11:53
To: Dai, Kevin
Cc: Paul Brown; user@spark.apache.org
Subject: Re: Use pig load function in spark

Hello Kevin,

You can take a look at our generic load function 
(https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#generic-loadsave-functions).

For example, you can use

val df = sqlContext.load("/myData", "parquet")

to load a parquet dataset stored in /myData as a 
DataFrame (https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#dataframes).

You can use it to load data stored in various formats, like json (Spark 
built-in), parquet (Spark built-in), 
avro (https://github.com/databricks/spark-avro), and 
csv (https://github.com/databricks/spark-csv).
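
For instance (the paths are placeholders, and the avro and csv providers come from those external packages):

val json = sqlContext.load("/myData/events.json", "json")
val avro = sqlContext.load("/myData/events.avro", "com.databricks.spark.avro")
val csv  = sqlContext.load("/myData/events.csv", "com.databricks.spark.csv")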

Thanks,

Yin

On Mon, Mar 23, 2015 at 7:14 PM, Dai, Kevin yun...@ebay.com wrote:
Hi, Paul

You are right.

The story is that we have a lot of Pig load functions for loading our different kinds of data.

And now we want to use Spark to read and process this data.

So we want to figure out a way to reuse our existing load functions in Spark to 
read it.

Any idea?

Best Regards,
Kevin.

From: Paul Brown [mailto:p...@mult.ifario.us]
Sent: 2015-03-24 4:11
To: Dai, Kevin
Subject: Re: Use pig load function in spark


The answer is "Maybe, but you probably don't want to do that."

A typical Pig load function is devoted to bridging external data into Pig's 
type system, but you don't really need to do that in Spark because it is 
(thankfully) not encumbered by Pig's type system.  What you probably want to do 
is to figure out a way to use native Spark facilities (e.g., textFile) coupled 
with some of the logic out of your Pig load function necessary to turn your 
external data into an RDD.


—
p...@mult.ifario.us | Multifarious, Inc. | 
http://mult.ifario.us/

On Mon, Mar 23, 2015 at 2:29 AM, Dai, Kevin yun...@ebay.com wrote:
Hi, all

Can Spark use Pig's load function to load data?

Best Regards,
Kevin.




Use pig load function in spark

2015-03-23 Thread Dai, Kevin
Hi, all

Can Spark use Pig's load function to load data?

Best Regards,
Kevin.


RE: Use pig load function in spark

2015-03-23 Thread Dai, Kevin
Hi, Paul

You are right.

The story is that we have a lot of Pig load functions for loading our different kinds of data.

And now we want to use Spark to read and process this data.

So we want to figure out a way to reuse our existing load functions in Spark to 
read it.

Any idea?

Best Regards,
Kevin.

From: Paul Brown [mailto:p...@mult.ifario.us]
Sent: 2015-03-24 4:11
To: Dai, Kevin
Subject: Re: Use pig load function in spark


The answer is "Maybe, but you probably don't want to do that."

A typical Pig load function is devoted to bridging external data into Pig's 
type system, but you don't really need to do that in Spark because it is 
(thankfully) not encumbered by Pig's type system.  What you probably want to do 
is to figure out a way to use native Spark facilities (e.g., textFile) coupled 
with some of the logic out of your Pig load function necessary to turn your 
external data into an RDD.


—
p...@mult.ifario.us | Multifarious, Inc. | 
http://mult.ifario.us/

On Mon, Mar 23, 2015 at 2:29 AM, Dai, Kevin yun...@ebay.com wrote:
Hi, all

Can Spark use Pig's load function to load data?

Best Regards,
Kevin.



RE: A strange problem in spark sql join

2015-03-09 Thread Dai, Kevin
No, I don't have two master instances.

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 2015-03-09 15:03
To: Dai, Kevin
Cc: user@spark.apache.org
Subject: Re: A strange problem in spark sql join

Make sure you don't have two master instances running on the same machine. This 
can happen if you were running the job, tried to stop the cluster in the middle 
(which didn't stop it completely), and then ran start-all again; you end up with 
two master instances running, and the former one may still be holding your 
computed/cached data somewhere in memory.

Thanks
Best Regards

On Mon, Mar 9, 2015 at 11:45 AM, Dai, Kevin yun...@ebay.com wrote:
Hi, guys

I encounter a strange problem as follows:

I joined two tables (both parquet files) and then did a groupBy. The groupBy 
took 19 hours to finish.

However, after I killed this job twice in the groupBy stage and ran it again, 
the third try succeeded and finished in 15 minutes.

What’s wrong with it?

Best Regards,
Kevin.




A strange problem in spark sql join

2015-03-09 Thread Dai, Kevin
Hi, guys

I encounter a strange problem as follows:

I joined two tables (both parquet files) and then did a groupBy. The groupBy 
took 19 hours to finish.

However, after I killed this job twice in the groupBy stage and ran it again, 
the third try succeeded and finished in 15 minutes.

What's wrong with it?

Best Regards,
Kevin.



RE: Implement customized Join for SparkSQL

2015-01-09 Thread Dai, Kevin
Hi, Rishi

You are right. But there may be tens of thousands of ids, and B is a database 
with an index on id, which means querying it by id is very fast.

In fact, we load A and B as separate SchemaRDDs as you suggested. But we hope we 
can extend the join implementation to achieve this in the parsing stage.

Best Regards,
Kevin

From: Rishi Yadav [mailto:ri...@infoobjects.com]
Sent: 2015-01-09 6:52
To: Dai, Kevin
Cc: user@spark.apache.org
Subject: Re: Implement customized Join for SparkSQL

Hi Kevin,

Say A has 10 ids; so you are pulling data from B's data source only for these 
10 ids?

What if you load A and B as separate SchemaRDDs and then do the join? Spark will 
optimize the path anyway when an action is fired.

On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin yun...@ebay.com wrote:
Hi, All

Suppose I want to join two tables A and B as follows:

Select * from A join B on A.id = B.id

A is a file, while B is a database which is indexed by id and which I wrapped 
with the Data Source API.
The desired join flow is:

1.   Generate A's RDD[Row]

2.   Generate B's RDD[Row] from A, by using A's ids and B's data source API 
to get the rows from the database

3.   Merge these two RDDs into the final RDD[Row]

However, it seems the existing join strategies don't support this.

Any way to achieve it?

Best Regards,
Kevin.
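
Until the planner supports such a strategy, the flow above can be expressed directly on the RDDs: batch the ids of each partition of A and query B's index once per batch. A sketch under the assumption that B's data source can be queried by a batch of ids; aRdd, ARow, BRow and queryBByIds are hypothetical stand-ins:

case class ARow(id: Long, payload: String)
case class BRow(id: Long, attrs: Map[String, String])

// Hypothetical: one indexed query against B for a batch of ids.
def queryBByIds(ids: Seq[Long]): Map[Long, BRow] = ???

// aRdd: RDD[ARow] built from file A.
val joined = aRdd.mapPartitions { rows =>
  rows.grouped(1000).flatMap { batch =>          // keep each query bounded
    val bById = queryBByIds(batch.map(_.id))
    batch.flatMap(a => bById.get(a.id).map(b => (a, b)))
  }
}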



Implement customized Join for SparkSQL

2015-01-05 Thread Dai, Kevin
Hi, All

Suppose I want to join two tables A and B as follows:

Select * from A join B on A.id = B.id

A is a file, while B is a database which is indexed by id and which I wrapped 
with the Data Source API.
The desired join flow is:

1.   Generate A's RDD[Row]

2.   Generate B's RDD[Row] from A, by using A's ids and B's data source API 
to get the rows from the database

3.   Merge these two RDDs into the final RDD[Row]

However, it seems the existing join strategies don't support this.

Any way to achieve it?

Best Regards,
Kevin.


Window function by Spark SQL

2014-12-04 Thread Dai, Kevin
Hi, ALL

How can I group by one column, order by another, and then select the first row 
of each group (which is just what a window function does) in Spark SQL?

Best Regards,
Kevin.
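
Spark SQL had no window functions at the time, so a common workaround is to do it on the RDD: key each row by the grouping column and keep, per key, the row with the smallest value in the ordering column. A sketch, where schemaRdd is the table's SchemaRDD and the column positions and types (0 = group column as String, 1 = order column as Long) are hypothetical:

import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x style)

val firstPerGroup = schemaRdd
  .map(row => (row.getString(0), row))                          // (groupCol, row)
  .reduceByKey((a, b) => if (a.getLong(1) <= b.getLong(1)) a else b)
  .values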


Transform RDD.groupBY result to multiple RDDs

2014-11-19 Thread Dai, Kevin
Hi, all

Suppose I have an RDD of (K, V) tuples and I do a groupBy on it with the key K.

My question is how to make each groupBy result, which is a (K, Iterable[V]), an 
RDD of its own.

BTW, can we turn it into a DStream in which each groupBy result is one RDD?

Best Regards,
Kevin.
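
There is no built-in way to split one RDD into an RDD per key, but the effect can be had by filtering once per key; note this rescans the data for every key, so it is only reasonable for a small number of keys. A sketch, where pairs is a hypothetical RDD[(String, Int)]:

import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x style)
import org.apache.spark.rdd.RDD

val keys: Array[String] = pairs.keys.distinct().collect()
val rddByKey: Map[String, RDD[Int]] =
  keys.map(k => k -> pairs.filter(_._1 == k).values).toMap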


Is there setup and cleanup function in spark?

2014-11-13 Thread Dai, Kevin
HI, all

Are there setup and cleanup functions in Spark, as in Hadoop MapReduce, that do 
some initialization and cleanup work?

Best Regards,
Kevin.
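
There is no dedicated setup()/cleanup() hook, but the usual substitute is to wrap each partition's work in mapPartitions and do the initialization and teardown at the partition boundaries. A sketch, where openConnection, closeConnection and process are hypothetical helpers:

val results = rdd.mapPartitions { records =>
  val conn = openConnection()                    // "setup", once per partition
  try {
    // Materialize so the connection can be closed before returning; for very
    // large partitions, close lazily when the iterator is exhausted instead.
    records.map(r => process(conn, r)).toList.iterator
  } finally {
    closeConnection(conn)                        // "cleanup", once per partition
  }
}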


ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId not found

2014-10-31 Thread Dai, Kevin
Hi, all

My job failed, and there are a lot of "ERROR ConnectionManager: Corresponding 
SendingConnection to ConnectionManagerId not found" messages in the log.

Can anyone tell me what's wrong and how to fix it?

Best Regards,
Kevin.


Use RDD like a Iterator

2014-10-28 Thread Dai, Kevin
Hi, ALL

I have an RDD[T]; can I use it like an iterator?
That is, can I compute every element of this RDD lazily?

Best Regards,
Kevin.
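
RDD.toLocalIterator gets close to this: it streams the RDD back to the driver one partition at a time, so elements are pulled lazily per partition (each partition is still computed as a whole when first touched). A sketch, with rdd a hypothetical RDD[String]:

val it: Iterator[String] = rdd.toLocalIterator
it.take(10).foreach(println)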


SchemaRDD Convert

2014-10-22 Thread Dai, Kevin
Hi, ALL

I have an RDD of a case class T, where T contains several primitive types and a Map.
How can I convert this to a SchemaRDD?

Best Regards,
Kevin.
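
A sketch using the reflection-based conversion of the 1.x SchemaRDD API; T here is a hypothetical case class mixing primitives and a Map (the Map field becomes a MapType column, assuming the Spark version's schema inference supports it):

import org.apache.spark.sql.{SchemaRDD, SQLContext}

case class T(id: Long, name: String, tags: Map[String, String])

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD        // implicit RDD[T] => SchemaRDD

val rdd = sc.parallelize(Seq(T(1L, "a", Map("k" -> "v"))))
val schemaRdd: SchemaRDD = rdd           // conversion happens here
schemaRdd.registerTempTable("t")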


Convert Iterable to RDD

2014-10-20 Thread Dai, Kevin
Hi, All

Is there any way to convert an Iterable to an RDD?

Thanks,
Kevin.


RE: Convert Iterable to RDD

2014-10-20 Thread Dai, Kevin
In addition, how can I convert an Iterable[Iterable[T]] to an RDD[T]?

Thanks,
Kevin.

From: Dai, Kevin [mailto:yun...@ebay.com]
Sent: 2014-10-21 10:58
To: user@spark.apache.org
Subject: Convert Iterable to RDD

Hi, All

Is there any way to convert an Iterable to an RDD?

Thanks,
Kevin.
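
On the driver side, sc.parallelize over a materialized copy is the straightforward route, which only makes sense if the collection fits in driver memory. A sketch covering both questions above:

val it: Iterable[Int] = Seq(1, 2, 3)
val rdd = sc.parallelize(it.toSeq)

// For Iterable[Iterable[T]], flatten before parallelizing to get an RDD[T].
val nested: Iterable[Iterable[Int]] = Seq(Seq(1, 2), Seq(3, 4))
val flat = sc.parallelize(nested.flatten.toSeq)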


Interactive interface tool for spark

2014-10-08 Thread Dai, Kevin
Hi, All

We need an interactive interface tool for Spark in which we can run Spark jobs 
and plot graphs to explore the data interactively.
The IPython notebook is good, but it only supports Python (we want one that 
supports Scala)...

BR,
Kevin.