Hi Aliaksandr,

Thank you very much for your answer. In my test I do reuse the Spark context: it is initialized once when the application starts and is not re-initialized for the later throughput runs. Also, when I increase the number of workers, the throughput does not increase. I read the link you posted, but it only describes command-line tools being faster than a Hadoop cluster for small data; I didn't find the key point that explains my question. If Spark context initialization doesn't affect my test case, what else could? Does job initialization or dispatch take time?

Thank you!
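For reference, here is a minimal sketch of how I could try to isolate the per-job dispatch latency on its own, outside JMeter. It assumes an already-running JavaSparkContext and a small cached RDD like the one in my test; the master URL "spark://master:7077" and the class name are placeholders, not my actual setup:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class DispatchLatencyProbe {
        public static void main(String[] args) {
            // Reuse one context for all runs, just like in my throughput test.
            JavaSparkContext sc = new JavaSparkContext("spark://master:7077", "dispatch-probe");
            JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a", "b", "c")).cache();
            rdd.count(); // materialize the cache once, outside the timed loop

            int runs = 100;
            long start = System.nanoTime();
            for (int i = 0; i < runs; i++) {
                rdd.first(); // each call submits a separate Spark job to the scheduler
            }
            long avgMs = (System.nanoTime() - start) / runs / 1_000_000;
            System.out.println("Average latency per job: " + avgMs + " ms");
            sc.stop();
        }
    }

If the average stays at tens of milliseconds even on a trivial cached RDD, that would suggest the per-job scheduling/dispatch overhead, rather than any calculation, is what caps my throughput.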
-----Original Message-----
From: Bedrytski Aliaksandr [mailto:sp...@bedryt.ski]
Sent: Wednesday, August 31, 2016 8:45 PM
To: Xie, Feng
Cc: user@spark.apache.org
Subject: Re: Why does spark take so much time for simple task without calculation?

Hi xiefeng,

Spark Context initialization takes some time and the tool does not really shine for small data computations:
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

But, when working with terabytes (petabytes) of data, those 35 seconds of initialization don't really matter.

Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski

On Wed, Aug 31, 2016, at 11:45, xiefeng wrote:
> I installed a Spark standalone cluster (one master and one worker) on a
> Windows 2008 server with 16 cores and 24 GB of memory.
>
> I have done a simple test: just create a string RDD and simply return
> it. I use JMeter to test throughput, but the highest I get is around 35/sec.
> I think Spark is powerful at distributed calculation, but why is the
> throughput so limited in such a simple test scenario that contains only
> simple task dispatch and no calculation?
>
> 1. In JMeter I tested both 10 threads and 100 threads; there is little
> difference, around 2-3/sec.
> 2. I tested both caching and not caching the RDD; there is little difference.
> 3. During the test, CPU and memory usage stay at a low level.
>
> Below is my test code:
>
> @RestController
> public class SimpleTest {
>     @RequestMapping(value = "/SimpleTest", method = RequestMethod.GET)
>     @ResponseBody
>     public String testProcessTransaction() {
>         return SparkShardTest.simpleRDDTest();
>     }
> }
>
> final static Map<String, JavaRDD<String>> simpleRDDs = initSimpleRDDs();
>
> public static Map<String, JavaRDD<String>> initSimpleRDDs()
> {
>     Map<String, JavaRDD<String>> result = new ConcurrentHashMap<String, JavaRDD<String>>();
>     JavaRDD<String> rddData = JavaSC.parallelize(data);
>     rddData.cache().count(); // this cache improves throughput by 1-2/sec
>     result.put("MyRDD", rddData);
>     return result;
> }
>
> public static String simpleRDDTest()
> {
>     JavaRDD<String> rddData = simpleRDDs.get("MyRDD");
>     return rddData.first();
> }
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.