Hi Aliaksandr,

Thank you very much for your answer.
In my test I do reuse the SparkContext: it is initialized once when the
application starts, and it is not initialized again for the later throughput
tests. Yet when I increase the number of workers, the throughput does not
increase.
I read the link you posted, but it only describes how command-line tools can
be faster than a Hadoop cluster; I didn't find the key point that explains my
question.
If SparkContext initialization doesn't affect my test case, is there anything
else? Does job initialization or dispatch take time? Thank you!
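For what it's worth, here is a back-of-the-envelope sketch of why a fixed per-job dispatch cost would cap throughput regardless of worker count. The 28 ms figure below is purely hypothetical (chosen to roughly match the observed ~35/sec); the point is only the arithmetic: if each tiny job carries a fixed driver-side overhead and jobs are dispatched roughly serially, throughput is bounded by 1000 / overhead.

```java
// Back-of-the-envelope sketch: if every rddData.first() call becomes a
// Spark job with a fixed scheduling/dispatch cost, that cost caps
// throughput no matter how many workers exist, because the actual work
// per job is near zero.
public class ThroughputSketch {
    // Hypothetical fixed per-job overhead in milliseconds (driver-side
    // scheduling, task serialization, result fetch). Illustrative only.
    static final double PER_JOB_OVERHEAD_MS = 28.0;

    // With the driver dispatching trivial jobs one after another,
    // throughput is bounded by 1000 / overhead, independent of workers.
    static double maxJobsPerSecond(double overheadMs) {
        return 1000.0 / overheadMs;
    }

    public static void main(String[] args) {
        // 1000 / 28 ms of assumed overhead comes out near the ~35/sec
        // ceiling seen in the JMeter test.
        System.out.printf("~%.0f jobs/sec%n",
                maxJobsPerSecond(PER_JOB_OVERHEAD_MS));
    }
}
```

If this model is right, adding workers would not help; only reducing per-job overhead (or batching work into fewer, larger jobs) would.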


-----Original Message-----
From: Bedrytski Aliaksandr [mailto:sp...@bedryt.ski] 
Sent: Wednesday, August 31, 2016 8:45 PM
To: Xie, Feng
Cc: user@spark.apache.org
Subject: Re: Why does spark take so much time for simple task without 
calculation?

Hi xiefeng,

SparkContext initialization takes some time, and Spark does not really shine
at small-data computations:
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

But, when working with terabytes (petabytes) of data, those 35 seconds of 
initialization don't really matter. 
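To put numbers on that amortization argument: a minimal sketch, assuming a hypothetical 35 seconds of SparkContext startup (the figure mentioned above), showing how startup's share of total wall time shrinks as the job grows.

```java
// Sketch: a fixed startup cost amortized over job duration. The 35 s
// startup figure and the job durations are illustrative assumptions.
public class StartupAmortization {
    // Fraction of total wall time spent on startup.
    static double startupShare(double startupSec, double computeSec) {
        return startupSec / (startupSec + computeSec);
    }

    public static void main(String[] args) {
        // A 2-second toy job: startup dominates the wall time.
        System.out.printf("toy job: %.0f%%%n",
                100 * startupShare(35, 2));
        // A 2-hour large-scale job: startup is under one percent.
        System.out.printf("big job: %.2f%%%n",
                100 * startupShare(35, 7200));
    }
}
```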

Regards,

--
  Bedrytski Aliaksandr
  sp...@bedryt.ski

On Wed, Aug 31, 2016, at 11:45, xiefeng wrote:
> I installed Spark standalone and run the cluster (one master and one
> worker) on a Windows 2008 server with 16 cores and 24 GB of memory.
> 
> I ran a simple test: just create a String RDD and return it. I use
> JMeter to measure throughput, but the highest I see is around 35/sec.
> I know Spark is powerful at distributed calculation, so why is the
> throughput so limited in a scenario that involves only simple task
> dispatch and no calculation?
> 
> 1. In JMeter I tested both 10 threads and 100 threads; the difference
> is small, around 2-3/sec.
> 2. I tested both caching and not caching the RDD; there is little
> difference.
> 3. During the test, CPU and memory usage stay low.
> 
> Below is my test code:
> 
> @RestController
> public class SimpleTest {
>       @RequestMapping(value = "/SimpleTest", method = RequestMethod.GET)
>       @ResponseBody
>       public String testProcessTransaction() {
>               return SparkShardTest.simpleRDDTest();
>       }
> }
> 
> // In SparkShardTest:
> final static Map<String, JavaRDD<String>> simpleRDDs = initSimpleRDDs();
> 
> public static Map<String, JavaRDD<String>> initSimpleRDDs()
> {
>       Map<String, JavaRDD<String>> result =
>               new ConcurrentHashMap<String, JavaRDD<String>>();
>       JavaRDD<String> rddData = JavaSC.parallelize(data);
>       rddData.cache().count();    // this cache improves throughput by 1-2/sec
>       result.put("MyRDD", rddData);
>       return result;
> }
> 
> public static String simpleRDDTest()
> {
>       JavaRDD<String> rddData = simpleRDDs.get("MyRDD");
>       return rddData.first();
> }
> 
> 
> 
> 
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
