Re: Research ideas using spark
Ok… after some off-line exchanges with Shashidhar Rao, I came up with an idea: apply machine learning to implement or improve autoscaling (up or down) within a Storm/Akka cluster. While I don't know what constitutes an acceptable PhD thesis, or a senior project for undergrads… this is a real-life problem that has real value.

First, Storm doesn't scale down. Unless there have been improvements in the last year, you can't easily reduce the number of workers and transfer state to another worker. Looking at Akka, that would be an easier task because of the actor model. However, I don't know Akka that well, so I can't say whether this is already implemented.

So besides the mechanism to scale (up and down), you then have the machine-learning issue: what load to measure, and how to scale properly. This could be as simple as a PID function that watches the queues between spouts and bolts (and between bolts), or something more advanced. This is where the research part of the project comes in. (What do you monitor, and how do you calculate and determine when to scale up or down, weighing in the cost(s) of the scaling action itself?)

Again, it's a worthwhile project, something that actually has business value, especially in terms of Lambda and other groovy Greek-lettered cluster designs (Zeta? ;-) ) where you have both M/R (computational) and subjective real-time (including micro-batch) workloads occurring either on the same cluster or within the same DC infrastructure.

I don't know if this is worthy of a PhD thesis, a Master's thesis, or a senior project, but it is something that one could sink one's teeth into, and it could potentially lead to a commercial-grade project if done properly. Good luck with it. HTH

-Mike

On Jul 15, 2015, at 12:40 PM, vaquar khan vaquar.k...@gmail.com wrote:

I would suggest studying Spark, Flink, and Storm, and preparing your research paper based on your understanding and findings.
Maybe you will invent a new Spark ☺

Regards,
Vaquar khan
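The PID idea Mike sketches can be made concrete in a few lines. The code below is a minimal, hypothetical illustration (the class name, gains, and setpoint are invented for the example, not taken from Storm or Akka): sample a queue-depth signal, compute the proportional, integral, and derivative terms, and quantize the output into a suggested worker delta.

```python
# Hypothetical sketch of a PID-style autoscaler: watch a queue-depth
# signal and emit a scaling suggestion. Gains and setpoint are invented.

class QueuePID:
    def __init__(self, kp=0.5, ki=0.1, kd=0.0, setpoint=100.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target queue depth per worker
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, queue_depth, dt=1.0):
        """Return a suggested worker delta (positive means scale up)."""
        error = queue_depth - self.setpoint
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        signal = (self.kp * error
                  + self.ki * self.integral
                  + self.kd * derivative)
        # Quantize the raw signal to whole workers.
        return round(signal / self.setpoint)

pid = QueuePID()
print(pid.update(350))   # deep queue: suggests scaling up
print(pid.update(100))   # back at setpoint: suggests no change
```

A real controller would also damp oscillation and, as the thread points out, weigh the cost of the scaling action itself before acting on the suggestion.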
Re: Research ideas using spark
Hi Daniel,

Well said.

Regards,
Vineel

On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos daniel.dara...@lynxanalytics.com wrote:

Hi Shahid, to be honest I think this question is better suited for Stack Overflow than for a PhD thesis.

---
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Research ideas using spark
Try repartitioning to a higher number of partitions (at least 3-4 times the total number of CPU cores). What operation are you doing? If you are doing a join/groupBy sort of operation, it may be that the task which is taking so long is the one receiving all the values for a skewed key; in that case you need a Partitioner that distributes the keys evenly across machines to speed things up.

Thanks
Best Regards

On Tue, Jul 14, 2015 at 11:12 AM, shahid ashraf sha...@trialx.com wrote:

Hi, I have a 10-node cluster. I loaded the data onto HDFS, so the number of partitions I get is 9. I am running a Spark application and it gets stuck on one of the tasks; looking at the UI, it seems the application is not using all nodes for the calculations. Attached is a screenshot of the tasks; it seems tasks are placed on each node more than once. Eight tasks complete within 7-8 minutes, and one task takes around 30 minutes, causing the delay in the results.

--
with Regards
Shahid Ashraf
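One common way to get the even key distribution Akhil describes is key salting. The sketch below is a hypothetical, single-machine illustration (helper names like `salt_key` are invented): a hot key is split across several synthetic sub-keys so a hash partitioner spreads its records over multiple partitions, partial aggregates are computed per sub-key, and a second pass merges them under the original key.

```python
# Hypothetical sketch of "salting" a skewed key to avoid one straggler
# task receiving all of a hot key's records.

import random
from collections import Counter

NUM_SALTS = 4

def salt_key(key):
    # "hot" becomes "hot#0" .. "hot#3", so a hash partitioner can send
    # the hot key's records to up to NUM_SALTS different partitions.
    return f"{key}#{random.randrange(NUM_SALTS)}"

def unsalt_key(salted):
    return salted.rsplit("#", 1)[0]

# 9 records for a hot key, 1 for a cold key: the classic straggler shape.
records = [("hot", 1)] * 9 + [("cold", 1)]

# Pass 1: partial counts on salted keys (what each task would compute).
partial = Counter()
for k, v in records:
    partial[salt_key(k)] += v

# Pass 2: merge the partials back under the original key.
final = Counter()
for sk, v in partial.items():
    final[unsalt_key(sk)] += v

assert final == Counter({"hot": 9, "cold": 1})
```

This only helps for operations whose aggregates can be merged in a second pass (counts, sums, and the like); a skewed join needs the salting applied to both sides.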
Re: Research ideas using spark
Sorry guys! I mistakenly added my question to this thread (Research ideas using spark). Moreover, people can ask any question; this Spark user group is for that.

Cheers!
Re: Research ideas using spark
Look at this:
http://www.forbes.com/sites/lisabrownlee/2015/07/10/the-11-trillion-internet-of-things-big-data-and-pattern-of-life-pol-analytics/
Re: Research ideas using spark
Silly question… when thinking about a PhD thesis, do you want to tie it to a specific technology, or do you want to investigate an idea and then pick a specific technology? Or is that an outdated way of thinking?

"I am doing my PhD thesis on large scale machine learning, e.g. online learning, batch and mini-batch learning."

So before we look at technologies like Spark… could the OP break down a more specific concept or idea that he wants to pursue?

Looking at what Jörn said… using machine learning to better predict workloads in terms of managing clusters… this could be interesting, but is it enough for a PhD thesis, and is it of interest to the OP?
Re: Research ideas using spark
I would suggest studying Spark, Flink, and Storm, and preparing your research paper based on your understanding and findings. Maybe you will invent a new Spark ☺

Regards,
Vaquar khan
Re: Research ideas using spark
Well, one of the strengths of Spark is standardized, general distributed processing that allows many different types of workloads, such as graph processing, stream processing, etc. The limitation is that it is less performant than a system focused on only one type of processing (e.g. graph processing). What I miss, and this may not be Spark-specific, is some artificial intelligence to manage a cluster: e.g. predicting workloads, or how long a job may run based on previously executed similar jobs. Furthermore, many optimizations have to be done manually, e.g. Bloom filters, partitioning, etc. If you could add some intelligence here as well, applying these automatically based on previously executed jobs, while taking into account that the optimizations themselves change over time, that would be great... You may also explore feature interaction.

On Tue, Jul 14, 2015 at 7:19 AM, Shashidhar Rao raoshashidhar...@gmail.com wrote:

Hi, I am doing my PhD thesis on large scale machine learning, e.g. online learning, batch and mini-batch learning. Could somebody help me with ideas, especially in the context of Spark as applied to the above learning methods? For example: improvements to existing algorithms, implementing new features (especially for the above learning methods), or algorithms that have not been implemented yet. If somebody could help me with some ideas it would really accelerate my work. A few pointers to research papers regarding Spark or Mahout would also help. Thanks in advance.

Regards
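Jörn's runtime-prediction idea can be prototyped very simply. The sketch below is an invented illustration (the history format and the features, job type and input size, are assumptions; a real system would use richer features such as the stage DAG or input statistics): estimate a job's runtime from the per-GB rate of the most similar previously executed jobs of the same type.

```python
# Hypothetical sketch of predicting a job's runtime from previously
# executed similar jobs (nearest neighbours on input size, same job type).

def predict_runtime(history, job, k=2):
    """history: list of (job_type, input_gb, runtime_s) tuples.
    job: (job_type, input_gb). Returns an estimated runtime in seconds,
    or None when no history exists for that job type."""
    job_type, input_gb = job
    same_type = [(gb, rt) for (t, gb, rt) in history if t == job_type]
    if not same_type:
        return None
    # Pick the k past jobs closest in input size.
    same_type.sort(key=lambda p: abs(p[0] - input_gb))
    neighbours = same_type[:k]
    # Average the per-GB rate over the neighbours, then scale linearly.
    rate = sum(rt / gb for gb, rt in neighbours) / len(neighbours)
    return rate * input_gb

history = [("etl", 10, 120), ("etl", 20, 260), ("ml", 10, 600)]
est = predict_runtime(history, ("etl", 15))
assert 120 <= est <= 260   # falls between the two observed neighbours
```

Even a toy predictor like this exposes the research questions: which features matter, how to handle jobs with no close neighbours, and how the model should adapt as the cluster and the optimizations themselves change over time.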
Re: Research ideas using spark
Well said, Will. I would add that you might want to investigate GraphChi, which claims to be able to run a number of large-scale graph processing tasks on a single workstation much quicker than a very large Hadoop cluster. It would be interesting to know how widely applicable the approach GraphChi takes is, and what implications it has for parallel/distributed computing approaches. A rich seam to mine indeed.

Robin
Re: Research ideas using spark
There seems to be a bit of confusion here: the OP (doing the PhD) had the thread hijacked by someone with a similar name asking a mundane question. It would be a shame to send away so rudely someone who may do valuable work on Spark.

Sashidar (not Sashid!), I'm personally interested in running graph algorithms for image segmentation using MLlib and Spark. I've got many questions, though, like: is it even going to give me a speed-up? (http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html) It's not obvious to me which classes of graph algorithms can be implemented correctly and efficiently in a highly parallel manner. There's tons of work to be done here, I'm sure. Also, look at parallel geospatial algorithms; there's a lot of work being done on this.

Best,
Will
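As one concrete data point for Will's question about which graph algorithms parallelize well: connected components via iterative min-label propagation is a classic example, because each round touches every edge independently and so distributes naturally. The single-machine sketch below is an illustration, not Spark code; it is also the kind of laptop baseline the linked COST article suggests measuring before reaching for a cluster.

```python
# Illustrative sketch: connected components by label propagation.
# Each vertex starts labelled with its own id; every round, both
# endpoints of an edge adopt the smaller of their two labels.
# Rounds repeat until no label changes.

def connected_components(vertices, edges):
    label = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            if label[u] != low or label[v] != low:
                label[u] = label[v] = low
                changed = True
    return label

# Two components: {1, 2, 3} and {4, 5}.
labels = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
assert labels == {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

For graph-based image segmentation the same shape applies: pixels are vertices, similarity links are edges, and the components become segments. Whether distributing this beats a tuned single-machine implementation is exactly the COST question.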
Re: Research ideas using spark
Hi Shahid,

To be honest, I think this question is better suited for Stack Overflow than for a PhD thesis.