Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-18 Thread Ji ZHANG
Hi, We switched from ParallelGC to CMS, and the symptom is gone. On Thu, Jun 4, 2015 at 3:37 PM, Ji ZHANG zhangj...@gmail.com wrote: Hi, I set spark.shuffle.io.preferDirectBufs to false in SparkConf and this setting can be seen in the web UI's Environment tab. But it still eats memory, i.e
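For reference, a minimal sketch of switching the executors and driver to the CMS collector on YARN (the application jar name is a placeholder):

    spark-submit \
      --master yarn-cluster \
      --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
      --conf "spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
      streaming-app.jar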

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-04 Thread Ji ZHANG
set spark.shuffle.io.preferDirectBufs to false to turn off the off-heap allocation of netty? Best Regards, Shixiong Zhu 2015-06-03 11:58 GMT+08:00 Ji ZHANG zhangj...@gmail.com: Hi, Thanks for your information. I'll give Spark 1.4 a try when it's released. On Wed, Jun 3, 2015 at 11:31 AM
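A minimal sketch of setting this flag programmatically (the app name is a placeholder; it can equally go in spark-defaults.conf or on the spark-submit command line):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("StreamingApp")  // placeholder name
      // ask Netty to allocate shuffle buffers on-heap instead of off-heap
      .set("spark.shuffle.io.preferDirectBufs", "false")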

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-02 Thread Ji ZHANG
is easy. Diffing two dumps can be done in JVisualVM; it will show the diff in the objects that got added in the later dump. That makes it easy to debug what is not getting cleaned. TD On Tue, Jun 2, 2015 at 7:33 PM, Ji ZHANG zhangj...@gmail.com wrote: Hi, Thanks for your reply. Here's the top
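A sketch of capturing the two dumps with jmap (the executor PID is a placeholder; load both .hprof files in JVisualVM and use its compare view):

    jmap -dump:format=b,file=before.hprof <executor-pid>
    # ... let the job run for a while, then take a second dump ...
    jmap -dump:format=b,file=after.hprof <executor-pid>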

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-02 Thread Ji ZHANG
reveal something. On Thursday, May 28, 2015, Ji ZHANG zhangj...@gmail.com wrote: Hi, Unfortunately, they're still growing, both driver and executors. I ran the same job in local mode, and everything is fine. On Thu, May 28, 2015 at 5:26 PM, Akhil Das ak...@sigmoidanalytics.com wrote

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Ji ZHANG
Thanks. On Thu, May 28, 2015 at 3:07 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Zhang, Could you paste your code in a gist? Not sure what you are doing inside the code to fill up memory. Thanks Best Regards On Thu, May 28, 2015 at 10:08 AM, Ji ZHANG zhangj...@gmail.com wrote

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Ji ZHANG
=> logger.info(rdd.count())) Thanks Best Regards On Thu, May 28, 2015 at 1:02 PM, Ji ZHANG zhangj...@gmail.com wrote: Hi, I wrote a simple test job; it only does very basic operations. For example: val lines = KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> 1)).map(_._2
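Filled out, the test job presumably looked something like this minimal sketch (Zookeeper quorum, group and topic values are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.slf4j.LoggerFactory

    val logger = LoggerFactory.getLogger("MemoryTest")
    val conf = new SparkConf().setAppName("MemoryTest")
    val ssc = new StreamingContext(conf, Seconds(2))

    val zkQuorum = "localhost:2181"   // placeholder
    val group = "test-group"          // placeholder
    val topic = "test-topic"          // placeholder

    // one receiver thread for the topic; take just the message value
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> 1)).map(_._2)
    lines.foreachRDD(rdd => logger.info(rdd.count().toString))
    ssc.start()
    ssc.awaitTermination()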

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Ji ZHANG
the createStream or createDirectStream api? If it's the former, you can try setting the StorageLevel to MEMORY_AND_DISK (it might slow things down though). Another way would be to try the latter one. Thanks Best Regards On Wed, May 27, 2015 at 1:00 PM, Ji ZHANG zhangj...@gmail.com wrote: Hi
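A sketch of the two options (ssc, zkQuorum, group and topic are the placeholders from the thread's setup; the broker address is also a placeholder):

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // receiver-based stream, spilling received blocks to disk under memory pressure
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> 1),
      StorageLevel.MEMORY_AND_DISK).map(_._2)

    // or the receiver-less direct stream (available since Spark 1.3)
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder
    val direct = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set(topic)).map(_._2)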

Re: Spark Streaming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Ji ZHANG
with spark. Thanks Best Regards On Wed, May 27, 2015 at 11:51 AM, Ji ZHANG zhangj...@gmail.com wrote: Hi, I'm using Spark Streaming 1.3 on CDH 5.1 with yarn-cluster mode. I found out that YARN is killing the driver and executor processes because of excessive use of memory. Here's something I

Re: Join DStream With Other Datasets

2015-01-18 Thread Ji ZHANG
This is just something you write in Java/Scala. On Jan 17, 2015 2:06 PM, Ji ZHANG zhangj...@gmail.com wrote: Hi, I want to join a DStream with some other dataset, e.g. join a click stream with a spam IP list. I can think of two possible solutions: one is to use a broadcast variable, and the other

Re: Join DStream With Other Datasets

2015-01-18 Thread Ji ZHANG
solution. On Sun, Jan 18, 2015 at 2:12 AM, Jörn Franke jornfra...@gmail.com wrote: Can't you send a special event through spark streaming once the list is updated? So you have your normal events and a special reload event. On Jan 17, 2015 at 15:06, Ji ZHANG zhangj...@gmail.com wrote: Hi, I

Join DStream With Other Datasets

2015-01-17 Thread Ji ZHANG
Hi, I want to join a DStream with some other dataset, e.g. join a click stream with a spam IP list. I can think of two possible solutions: one is to use a broadcast variable, and the other is to use the transform operation as described in the manual. But the problem is the spam IP list will be updated
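A sketch of the transform-based approach with a list that can be swapped out between batches (clickStream and loadSpamIps() are hypothetical placeholders; transform runs its body on the driver once per batch, so the latest list is picked up each time):

    import org.apache.spark.streaming.dstream.DStream

    // hypothetical: reload the list periodically from HDFS, a database, etc.
    @volatile var spamIps: Set[String] = loadSpamIps()

    // assume clickStream: DStream[(String, String)] of (ip, url) pairs
    val clean: DStream[(String, String)] = clickStream.transform { rdd =>
      val spam = spamIps  // read the current list for this batch
      rdd.filter { case (ip, _) => !spam.contains(ip) }
    }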

Spark SQL 1.2 with CDH 4, Hive UDF is not working.

2014-12-22 Thread Ji ZHANG
Hi, Recently I've been migrating from Shark 0.9 to Spark SQL 1.2; my CDH version is 4.5, with Hive 0.11. I've managed to set up the Spark SQL Thriftserver, and normal queries work fine, but custom UDFs are not usable. The symptom is that when executing CREATE TEMPORARY FUNCTION, the query hangs on a lock request:
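For context, a minimal sketch of the kind of statement that hangs (jar path, function and class names are hypothetical):

    ADD JAR /path/to/my-udfs.jar;
    CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
    SELECT my_lower(name) FROM users LIMIT 10;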

Re: Executor Log Rotation Is Not Working?

2014-11-06 Thread Ji ZHANG
-Dspark.executor.logs.rolling.maxRetainedFiles=3 Maybe in yarn mode the spark-defaults.conf would be sufficient, but I've not tested. On Tue, Nov 4, 2014 at 12:24 PM, Ji ZHANG zhangj...@gmail.com wrote: Hi, I'm using Spark Streaming 1.1, and I have the following log that keeps growing: /opt/spark-1.1.0-bin-cdh4/work/app
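In standalone mode the executor stderr/stdout rolling is handled by the worker process, so one place these flags can go is spark-env.sh on each worker (a sketch, assuming the time-based strategy from the original post):

    export SPARK_WORKER_OPTS="-Dspark.executor.logs.rolling.strategy=time \
      -Dspark.executor.logs.rolling.time.interval=daily \
      -Dspark.executor.logs.rolling.maxRetainedFiles=3"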

Executor Log Rotation Is Not Working?

2014-11-03 Thread Ji ZHANG
Hi, I'm using Spark Streaming 1.1, and I have the following log that keeps growing: /opt/spark-1.1.0-bin-cdh4/work/app-20141029175309-0005/2/stderr I think it is the executor log, so I set up the following options in spark-defaults.conf: spark.executor.logs.rolling.strategy time
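The settings presumably continued along these lines (a sketch of the time-based rolling options; only the first line is confirmed by the post, the rest are assumed values):

    spark.executor.logs.rolling.strategy          time
    spark.executor.logs.rolling.time.interval     daily
    spark.executor.logs.rolling.maxRetainedFiles  3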

Implement Count by Minute in Spark Streaming

2014-10-26 Thread Ji ZHANG
Hi, Suppose I have a stream of logs and I want to count them by minute. The result is like: 2014-10-26 18:38:00 100 2014-10-26 18:39:00 150 2014-10-26 18:40:00 200 One way to do this is to set the batch interval to 1 min, but each batch would be quite large. Or I can use updateStateByKey where
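A sketch of the updateStateByKey approach with a small batch interval (the field extraction is a placeholder, and ssc.checkpoint(...) must be set for stateful operations):

    // key each log line by its minute, count within the batch, then fold into state
    val byMinute = logs
      .map(line => (line.substring(0, 16), 1L))  // assumes a "yyyy-MM-dd HH:mm" prefix
      .reduceByKey(_ + _)
      .updateStateByKey[Long] { (counts: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + counts.sum)
      }
    byMinute.print()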

Spark Streaming: Calculate PV/UV by Minute and by Day?

2014-09-20 Thread Ji ZHANG
Hi, I'm using Spark Streaming 1.0. Say I have a source of website click stream, like the following: ('2014-09-19 00:00:00', '192.168.1.1', 'home_page') ('2014-09-19 00:00:01', '192.168.1.2', 'list_page') ... And I want to calculate the page views (PV, number of logs) and unique users (UV,
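A per-minute sketch (assuming the stream is already parsed into (time, ip, page) triples; exact distinct counts can get heavy, and HyperLogLog-style sketches are a common substitute):

    // PV: one count per log line, keyed by minute
    val pv = events
      .map { case (time, _, _) => (time.substring(0, 16), 1L) }
      .reduceByKey(_ + _)

    // UV: distinct IPs per minute
    val uv = events
      .map { case (time, ip, _) => (time.substring(0, 16), ip) }
      .transform(_.distinct())
      .map { case (minute, _) => (minute, 1L) }
      .reduceByKey(_ + _)

Daily figures could be accumulated the same way with updateStateByKey over a day-sized key, at the cost of checkpointed state.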

How to Exclude Spark Dependencies from spark-streaming-kafka?

2014-09-20 Thread Ji ZHANG
Hi, I'm developing an application with spark-streaming-kafka, which depends on spark-streaming and kafka. Since spark-streaming is provided at runtime, I want to exclude the jars from the assembly. I tried the following configuration: libraryDependencies ++= { val sparkVersion = "1.0.2" Seq(
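A sketch of how the build.sbt presumably needs to look, marking Spark itself "provided" while keeping the Kafka connector in the assembly:

    libraryDependencies ++= {
      val sparkVersion = "1.0.2"
      Seq(
        "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
        "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
      )
    }

Because spark-streaming-kafka pulls spark-streaming back in transitively, the Spark jars can still end up in the assembly; excluding them explicitly on the connector, or filtering in the assembly plugin, is a common workaround.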

Do I Need to Set Checkpoint Interval for Every DStream?

2014-09-17 Thread Ji ZHANG
Hi, I'm using Spark Streaming 1.0. I create a DStream with KafkaUtils and apply some operations on it. There's a reduceByWindow operation at the end, so I suppose the checkpoint interval should be automatically set to more than 10 seconds. But what I see is that it still checkpoints every 2 seconds (my batch
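If the automatic interval doesn't take effect, it can also be set explicitly per stream (a sketch; the stream name and value are placeholders):

    import org.apache.spark.streaming.Seconds

    // widen the checkpoint interval for the windowed stream; a common rule of
    // thumb is 5-10x the batch interval
    windowed.checkpoint(Seconds(10))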