Re: Is there a list of missing optimizations for typed functions?

2017-02-27 Thread lihu
Hi, you can refer to https://issues.apache.org/jira/browse/SPARK-14083 for more detail. For performance issues, it is better to use the DataFrame API than the Dataset API. On Sat, Feb 25, 2017 at 2:45 AM, Jacek Laskowski wrote: > Hi Justin, > > I have never seen such a list. I think
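
A minimal sketch of the difference, assuming Spark 2.x and an illustrative Event case class and input path: the typed Dataset lambda is opaque to Catalyst, while the equivalent DataFrame expression can be analyzed and optimized (e.g., column pruning and predicate pushdown).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("typed-vs-untyped").getOrCreate()
    import spark.implicits._

    case class Event(id: Long, category: String, value: Double)
    val ds = spark.read.parquet("/path/to/events").as[Event]   // placeholder path

    // Typed API: the predicate is an arbitrary Scala function the optimizer cannot inspect.
    val typed = ds.filter(e => e.category == "click").map(_.value)

    // Untyped API: the predicate is a Column expression Catalyst can push down and optimize.
    val untyped = ds.toDF().filter($"category" === "click").select($"value")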

Re: Graphx

2016-03-11 Thread lihu
Hi, John: I am very interested in your experiment. How did you determine that RDD serialization costs a lot of time, from the log or some other tools? On Fri, Mar 11, 2016 at 8:46 PM, John Lilley wrote: > Andrew, > > > > We conducted some tests for using Graphx to solve

how to use a local repository in spark [dev]

2015-11-27 Thread lihu
Hi, all: I modified the Spark code and am trying to use some extra jars in Spark. The extra jars are published in my local Maven repository using mvn install. However, sbt cannot find these jar files, even though I can find them under /home/myname/.m2/repository. I can guarantee
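
A minimal sketch of the usual sbt-side fix, assuming the jar was published with mvn install; the coordinates below are placeholders: add the local Maven repository as a resolver in the build definition.

    // build.sbt (sketch): let sbt resolve artifacts from ~/.m2/repository.
    resolvers += Resolver.mavenLocal

    // Placeholder coordinates for the locally installed jar.
    libraryDependencies += "com.example" %% "my-extra-lib" % "0.1.0-SNAPSHOT"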

Re: high GC in the Kmeans algorithm

2015-02-17 Thread lihu
. -Xiangrui On Wed, Feb 11, 2015 at 1:35 AM, lihu lihu...@gmail.com wrote: I just want to make the best use of the CPU and test the performance of Spark when there are a lot of tasks on a single node. On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen so...@cloudera.com wrote: Good, worth double

Task not serializable problem in the multi-thread SQL query

2015-02-12 Thread lihu
I am trying to run Spark SQL queries from multiple threads. Some sample code looks like this: val sqlContext = new SQLContext(sc); val rdd_query = sc.parallelize(data, part); rdd_query.registerTempTable("MyTable"); sqlContext.cacheTable("MyTable"); val serverPool = Executors.newFixedThreadPool(3); val
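
A minimal runnable sketch of the pattern being described, assuming a Spark 1.3+ style API and illustrative names (Record, MyTable): each thread submits an independent SQL query against the same cached temp table.

    import java.util.concurrent.Executors
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Record(key: Int, value: String)

    val sc = new SparkContext(new SparkConf().setAppName("multi-thread-sql"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Register and cache a small illustrative table.
    sc.parallelize((1 to 1000).map(i => Record(i, s"v$i")), 8).toDF().registerTempTable("MyTable")
    sqlContext.cacheTable("MyTable")

    // Submit three queries concurrently from a fixed-size thread pool.
    val pool = Executors.newFixedThreadPool(3)
    (0 until 3).foreach { i =>
      pool.submit(new Runnable {
        def run(): Unit = {
          val result = sqlContext.sql(s"SELECT COUNT(*) FROM MyTable WHERE key % 3 = $i")
          println(result.collect().mkString(","))
        }
      })
    }
    pool.shutdown()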

Re: Task not serializable problem in the multi-thread SQL query

2015-02-12 Thread lihu
executors for more information. On Thu, Feb 12, 2015 at 2:34 AM, lihu lihu...@gmail.com wrote: I am trying to run Spark SQL queries from multiple threads. Some sample code looks like this: val sqlContext = new SQLContext(sc); val rdd_query = sc.parallelize(data, part); rdd_query.registerTempTable

Re: high GC in the Kmeans algorithm

2015-02-11 Thread lihu
have 24 cores? On Wed, Feb 11, 2015 at 9:03 AM, lihu lihu...@gmail.com wrote: I give 50GB to the executor, so it seems there is no reason the memory would not be enough. On Wed, Feb 11, 2015 at 4:50 PM, Sean Owen so...@cloudera.com wrote: Meaning, you have 128GB per machine but how much

high GC in the Kmeans algorithm

2015-02-11 Thread lihu
Hi, I run KMeans (MLlib) on a cluster with 12 workers. Every worker has 128GB RAM and 24 cores. I run 48 tasks on one machine; the total data is just 40GB. When the dimension of the data set is about 10^7, every task takes about 30s, of which about 20s is spent in GC. When I
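
A rough sketch of the workload described, with placeholder paths and illustrative tuning values; serialized caching and giving more of the heap to task execution are common ways to reduce GC pressure, but they are assumptions here rather than a confirmed fix.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("kmeans-gc")
      .set("spark.executor.memory", "50g")          // heap given to each executor
      .set("spark.storage.memoryFraction", "0.4")   // pre-1.6 setting: leave more heap to task execution
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///path/to/vectors")                    // placeholder path
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .persist(StorageLevel.MEMORY_AND_DISK_SER)                         // serialized caching reduces GC pressure

    val model = KMeans.train(data, 100, 10)   // k and maxIterations are illustrative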

Re: use netty shuffle for network cause high gc time

2015-01-14 Thread lihu
and much more thoroughly tested version under the property spark.shuffle.blockTransferService, which is set to netty by default. On Tue, Jan 13, 2015 at 9:26 PM, lihu lihu...@gmail.com wrote: Hi, I just tested the groupByKey method on 100GB of data; the cluster is 20 machines, each with 125GB RAM

use netty shuffle for network cause high gc time

2015-01-13 Thread lihu
Hi, I just tested the groupByKey method on 100GB of data; the cluster is 20 machines, each with 125GB RAM. At first I set conf.set("spark.shuffle.use.netty", "false") and ran the experiment, and then I set conf.set("spark.shuffle.use.netty", "true") and re-ran the experiment, but in the latter
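
For reference, a sketch of the two settings involved in this thread: the older boolean flag being toggled here, and the newer property mentioned in the reply above. Values are passed as strings.

    import org.apache.spark.SparkConf

    val conf = new SparkConf().setAppName("shuffle-netty-test")

    // Spark <= 1.1 flag toggled in this experiment:
    conf.set("spark.shuffle.use.netty", "true")

    // Spark 1.2+ property from the reply; "netty" is already the default there:
    conf.set("spark.shuffle.blockTransferService", "netty")   // or "nio"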

Re: Save RDD with partition information

2015-01-13 Thread lihu
There is no way to avoid the shuffle if you use combineByKey, no matter whether your data is cached in memory, because the shuffle write must write the data to disk. And it seems that Spark cannot guarantee that a given key (K1) goes to Container_X. You can use tmpfs for your shuffle dir; this
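
A minimal sketch of the two suggestions, with a placeholder tmpfs mount point: point spark.local.dir (where shuffle files and spills are written) at a tmpfs mount, and use an explicit partitioner so the key-to-partition mapping is at least deterministic.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-on-tmpfs")
      .set("spark.local.dir", "/mnt/tmpfs/spark")   // placeholder tmpfs mount

    val sc = new SparkContext(conf)

    // combineByKey with an explicit partitioner: the key -> partition mapping is fixed,
    // even though Spark does not pin a partition to a particular container.
    val pairs = sc.parallelize(Seq(("k1", 1), ("k1", 2), ("k2", 3)), 4)
    val combined = pairs.combineByKey[Int](
      (v: Int) => v,
      (acc: Int, v: Int) => acc + v,
      (a: Int, b: Int) => a + b,
      new HashPartitioner(4))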

Re: Save RDD with partition information

2015-01-13 Thread lihu
By the way, I am not sure whether the shuffled key will go to the same container.

Re: Is It Feasible for Spark 1.1 Broadcast to Fully Utilize the Ethernet Card Throughput?

2015-01-12 Thread lihu
What about your scenario? Do you need to use a lot of broadcasts? If not, it would be better to focus on other things. At this time, there is no better method than TorrentBroadcast. Though blocks are transferred one by one, once a node has fetched the data it can act as a data source immediately.
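
A minimal broadcast sketch (the lookup table is illustrative, and sc is an existing SparkContext); with the torrent implementation, executors that have already fetched blocks can serve them to others, which is why the one-by-one transfer is less of a bottleneck than it sounds.

    // sc is an existing SparkContext.
    val lookup = Map("a" -> 1, "b" -> 2)
    val bc = sc.broadcast(lookup)

    val total = sc.parallelize(Seq("a", "b", "a"))
      .map(k => bc.value.getOrElse(k, 0))
      .sum()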

Re: do not assemble the spark example jar

2014-12-09 Thread lihu
Can this assembly get faster if we do not need Spark SQL or some other components of Spark, for example if we only need Spark core? On Wed, Nov 26, 2014 at 3:37 PM, lihu lihu...@gmail.com wrote: Matei, sorry for the typo in my last message. And the tip saves about another 30s on my machine

Re: How to convert RDD to JSON?

2014-12-08 Thread lihu
An RDD is conceptually just a distributed collection. Maybe you can use the .collect() method to bring it back as a local Scala collection, and then convert that to JSON with a Scala JSON library.
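
A minimal sketch, assuming an illustrative Person record and hand-built JSON strings (to avoid assuming any particular JSON library); collect() is only sensible when the result fits in driver memory.

    // sc is an existing SparkContext.
    case class Person(name: String, age: Int)

    val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))

    // Build one JSON string per record, then collect to the driver.
    val jsonLines = people
      .map(p => s"""{"name":"${p.name}","age":${p.age}}""")
      .collect()

    val jsonArray = jsonLines.mkString("[", ",", "]")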

Re: Viewing web UI after fact

2014-12-02 Thread lihu
Institute for Interdisciplinary Information Sciences (IIIS, http://iiis.tsinghua.edu.cn/), Tsinghua University, China. Email: lihu...@gmail.com. Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/

do not assemble the spark example jar

2014-11-25 Thread lihu
Hi, the Spark assembly is time-consuming. I only need spark-assembly-1.1.0-hadoop2.3.0.jar and do not need spark-examples-1.1.0-hadoop2.3.0.jar. How do I configure Spark to avoid assembling the examples jar? I know the export SPARK_PREPEND_CLASSES=true method can reduce the assembly time, but

Re: do not assemble the spark example jar

2014-11-25 Thread lihu
, 2014, at 7:50 PM, lihu lihu...@gmail.com wrote: Hi, the Spark assembly is time-consuming. I only need spark-assembly-1.1.0-hadoop2.3.0.jar and do not need spark-examples-1.1.0-hadoop2.3.0.jar. How do I configure Spark to avoid assembling the examples jar? I know the export

Re: do not assemble the spark example jar

2014-11-25 Thread lihu
Matei, sorry for the typo in my last message. And the tip saves about another 30s on my machine. On Wed, Nov 26, 2014 at 3:34 PM, lihu lihu...@gmail.com wrote: Matei, thank you very much! After taking your advice, the assembly time went from about 20 minutes down to 6 minutes on my machine. That's a very big

Re: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-07-24 Thread lihu
Which code did you use? Is the problem caused by your own code or by something in Spark itself? On Tue, Jul 22, 2014 at 8:50 AM, hsy...@gmail.com wrote: I have the same problem. On Sat, Jul 19, 2014 at 12:31 AM, lihu lihu...@gmail.com wrote: Hi, everyone. I have a piece

Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-07-19 Thread lihu
Hi, everyone. I have the following piece of code. When I run it, it produces the error shown below; it seems that the SparkContext is not serializable, but I do not try to use the SparkContext except for the broadcast. [In fact, this code is in MLlib; I just try to broadcast the
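
A sketch of the usual cause and fix, with illustrative class and field names: the closure captures the enclosing object (and with it the SparkContext); copying the broadcast handle into a local val keeps `this` out of the closure.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    class Scorer(@transient val sc: SparkContext) extends Serializable {
      def score(data: RDD[Array[Double]], weights: Array[Double]): RDD[Double] = {
        val bcWeights = sc.broadcast(weights)
        // Local val: the closure below captures only the broadcast handle, not `this`
        // (which would drag in the non-serializable SparkContext).
        val w = bcWeights
        data.map(x => x.zip(w.value).map { case (a, b) => a * b }.sum)
      }
    }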

which function can generate a ShuffleMapTask

2014-06-23 Thread lihu
I see that a task will be either a ShuffleMapTask or a ResultTask. I wonder which function generates a ShuffleMapTask and which generates a ResultTask?
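
As I understand it, the split is per stage rather than per function: stages upstream of a shuffle run ShuffleMapTasks, and the final stage of a job (the one producing the action's result) runs ResultTasks. A small illustration, with a placeholder input path:

    // sc is an existing SparkContext; the path is a placeholder.
    val wordCount = sc.textFile("hdfs:///path/to/input")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // shuffle boundary: the upstream stage runs ShuffleMapTasks
      .count()              // the final stage of the count() job runs ResultTasks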

spark-env.sh does not take effect.

2014-05-12 Thread lihu
Hi, I set up a small cluster with 3 machines, each with 64GB RAM and 11 cores, and I use Spark 0.9. I have set spark-env.sh as follows: SPARK_MASTER_IP=192.168.35.2, SPARK_MASTER_PORT=7077, SPARK_MASTER_WEBUI_PORT=12306, SPARK_WORKER_CORES=3,

spark.akka.frameSize setting problem

2014-03-28 Thread lihu
Hi, I just ran a simple example to generate some data for the ALS algorithm. My Spark version is 0.9, in local mode, and my node has 108GB of memory. But when I set conf.set("spark.akka.frameSize", "4096"), the following problem occurred; when I do not set this, it runs well.
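
For reference, a sketch of how this setting is usually passed, with an illustrative smaller value; spark.akka.frameSize is interpreted in megabytes, and a value as large as 4096 likely exceeds what the underlying Akka transport accepts.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("als-data-gen")
      .set("spark.akka.frameSize", "512")   // value in MB, passed as a string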

Re: how to use the log4j for the standalone app

2014-03-10 Thread lihu
Thanks, but I do not want to log my own program's info; I just do not want Spark to output all of its logging to my console. I want Spark to write its log to a file that I specify. On Tue, Mar 11, 2014 at 11:49 AM, Robin Cjc cjcro...@gmail.com wrote: Hi lihu, you can extend
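
The usual route is to edit conf/log4j.properties, but a programmatic sketch with the log4j 1.x API (the log file path is a placeholder) does the same thing: send Spark's own log events to a file and keep them off the console.

    import org.apache.log4j.{FileAppender, Level, Logger, PatternLayout}

    val layout = new PatternLayout("%d{yy/MM/dd HH:mm:ss} %p %c: %m%n")
    val fileAppender = new FileAppender(layout, "/path/to/spark-app.log", true)   // placeholder path

    val sparkLogger = Logger.getLogger("org.apache.spark")
    sparkLogger.setLevel(Level.INFO)
    sparkLogger.setAdditivity(false)        // keep Spark's events from also reaching the console appender
    sparkLogger.addAppender(fileAppender)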