I am not sure what you are comparing here. You would need to provide additional 
details, such as the algorithms and functionality supported by your framework. For 
instance, Spark has built-in fault tolerance and is a generic framework, which 
is an advantage with respect to development and operations, but may be a 
disadvantage in certain use cases with respect to performance.
Another concern is the SDN, which could be configured in a way that 
disadvantages either your approach or Spark. I would not use it for a generic 
performance comparison, unless it is the production network of your company and 
you want results that apply only to your company.

I doubt that focusing only on performance makes scientific sense for a 
framework comparison.
As described, your approach sounds too simple to be of scientific value and 
more suited to marketing purposes. That being said, it could be that you simply 
did not provide all the details.

> On 01 Mar 2016, at 06:25, yasincelik <yasinceli...@gmail.com> wrote:
> 
> Hello,
> 
> I am working on a project as a part of my research. The system I am working
> on is basically an in-memory computing system, and I want to compare its
> performance with Spark. Here is how I conduct the experiments. For my project,
> I have a software-defined network (SDN) that allows HPC applications to share
> data, such as sending and receiving messages through this network. For
> example, in a word count application, a master reads a 10GB text file from
> the hard drive, slices it into small chunks, and distributes the chunks. Each
> worker fetches some chunks, processes them, and sends them back to the SDN.
> Then the master collects the results.
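
To make the protocol concrete, here is a minimal single-process sketch in Java
of the chunk-and-merge word count described above. The chunk size, thread
count, and input path are assumptions, the SDN transport is replaced by an
in-process thread pool, and reading the whole file at once is only to keep the
sketch short (not realistic for 10GB):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedWordCount {
    public static void main(String[] args) throws Exception {
        // Master reads the file (all at once here only for brevity).
        List<String> lines = Files.readAllLines(Paths.get("data.txt"));
        int chunkSize = 10_000;                                     // lines per chunk (assumed)
        ExecutorService workers = Executors.newFixedThreadPool(8);  // stand-in for SDN workers
        List<Future<Map<String, Long>>> partials = new ArrayList<>();

        // Master slices the input into chunks and distributes them to workers.
        for (int i = 0; i < lines.size(); i += chunkSize) {
            List<String> chunk = lines.subList(i, Math.min(i + chunkSize, lines.size()));
            partials.add(workers.submit(() -> countWords(chunk)));
        }

        // Master collects the per-chunk results and merges them.
        Map<String, Long> totals = new HashMap<>();
        for (Future<Map<String, Long>> partial : partials) {
            partial.get().forEach((word, n) -> totals.merge(word, n, Long::sum));
        }
        workers.shutdown();
        System.out.println("distinct words: " + totals.size());
    }

    private static Map<String, Long> countWords(List<String> chunk) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : chunk) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }
}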
> 
> To compare with Spark, I run a word count application. I run Spark in
> standalone mode, without any cluster manager, and there is no pre-installed
> HDFS. I use PBS to reserve nodes, which gives me a list of nodes, and then I
> simply run Spark on those nodes. Here is the command to run Spark:
> ~/SPARK/bin/spark-submit --class word.JavaWordCount  --num-executors 1
> spark.jar ~/data.txt  > ~/wc
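
For reference, a word.JavaWordCount class like the one submitted above
typically looks like the sketch below, modeled on Spark's standard Java word
count example and the Spark 1.x Java API (the actual class may differ):

package word;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class JavaWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("JavaWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // args[0] is the input path, e.g. ~/data.txt in the command above.
        JavaRDD<String> lines = sc.textFile(args[0]);
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split(" ")))  // Spark 1.x: flatMap returns an Iterable
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        // Printing to stdout matches the "> ~/wc" redirection in the command.
        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<String, Integer> pair : output) {
            System.out.println(pair._1() + "\t" + pair._2());
        }
        sc.stop();
    }
}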
> 
> Technically, these experiments are run under the same conditions: read the
> file, cut it into small chunks, distribute the chunks, process them, and
> collect the results.
> Do you think this is a reasonable comparison? Could someone object with this
> claim: "Well, Spark is designed to work on top of HDFS, in which the data is
> already stored on the nodes, and Spark jobs are submitted to these nodes to
> take advantage of data locality"?
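
On that last point: yes, such a claim can be made. Without HDFS, sc.textFile
with a local path requires the file to be readable at that path on every node:
if it is missing there, tasks fail, and if it lives on a shared filesystem such
as an NFS home directory (common on PBS clusters), every read goes over the
network to a single file server, so the benchmark measures data movement as
much as computation. A small sketch of the difference (the paths and namenode
address are made up):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalityDemo {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("LocalityDemo"));

        // Without HDFS: this path must be readable on EVERY executor node, or
        // tasks scheduled on other nodes cannot read their partitions.
        JavaRDD<String> local = sc.textFile("file:///home/user/data.txt");

        // With HDFS: blocks are already spread across the cluster, and Spark
        // schedules each task on a node that holds its block (data locality).
        JavaRDD<String> replicated = sc.textFile("hdfs://namenode:8020/data.txt");

        System.out.println(local.count() + " / " + replicated.count());
        sc.stop();
    }
}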
> 
> 
> Any comment is appreciated.
> 
> Thanks
> 
