Re: Minimum cost flow problem solving in Spark
You might be interested in "Maximum Flow implementation on Spark GraphX", done by a Colorado School of Mines grad student a couple of years ago. http://datascienceassn.org/2016-01-27-maximum-flow-implementation-spark-graphx From: Swapnil Shinde To: user@spark.apache.org; d...@spark.apache.org Sent: Wednesday, September 13, 2017 9:41 AM Subject: Minimum cost flow problem solving in Spark Hello, has anyone used Spark to solve minimum cost flow problems? I am quite new to combinatorial optimization algorithms, so any help, suggestions, or libraries are much appreciated. Thanks, Swapnil
Re: Shortest path with directed and weighted graphs
Chapter 6 of my book implements Dijkstra's Algorithm. The source code is available to download for free. https://www.manning.com/books/spark-graphx-in-action From: Brian Wilson To: user@spark.apache.org Sent: Monday, October 24, 2016 7:11 AM Subject: Shortest path with directed and weighted graphs I have been looking at the ShortestPaths function built into Spark. Am I correct in saying there is no support for weighted graphs with this function? By that I mean that it assumes all edges carry a weight of 1. Many thanks, Brian
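For reference, the built-in ShortestPaths does indeed treat every edge as weight 1. A minimal Scala sketch of weighted single-source shortest paths using GraphX's Pregel operator (essentially the SSSP example from the GraphX documentation) is below; it assumes the graph's edge attribute already holds the weight as a Double:

    import org.apache.spark.graphx._

    def weightedSssp[VD](graph: Graph[VD, Double], source: VertexId): Graph[Double, Double] = {
      // Start every vertex at infinity except the source
      val init = graph.mapVertices((id, _) => if (id == source) 0.0 else Double.PositiveInfinity)
      init.pregel(Double.PositiveInfinity)(
        (_, dist, newDist) => math.min(dist, newDist),          // vertex program: keep the shorter distance
        triplet =>                                              // send message: propagate improved distances
          if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
            Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
          else Iterator.empty,
        (a, b) => math.min(a, b)                                // merge messages: keep the minimum
      )
    }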
Re: GraphX drawing algorithm
In chapter 10 of Spark GraphX In Action, we describe how to use Zeppelin with d3.js to render graphs using d3's force-directed rendering algorithm. The source code can be downloaded for free from https://www.manning.com/books/spark-graphx-in-action From: agc studio To: user@spark.apache.org Sent: Sunday, September 11, 2016 5:59 PM Subject: GraphX drawing algorithm Hi all, I was wondering if a force-directed graph drawing algorithm has been implemented for GraphX? Thanks
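Not the book's code, but to sketch the hand-off the reply describes (hedged, with assumed property types of String vertex names and Int edge weights): collect a small GraphX graph into the nodes/links JSON shape that d3's force layout consumes, and embed the resulting string in a Zeppelin %html or %angular paragraph.

    import org.apache.spark.graphx.Graph

    // Collects the whole graph to the driver, so only suitable for small graphs
    def toD3Json(graph: Graph[String, Int]): String = {
      val vertices = graph.vertices.collect()
      val index = vertices.map(_._1).zipWithIndex.toMap   // d3 links refer to node array positions
      val nodes = vertices.map { case (_, name) => s"""{"name":"$name"}""" }.mkString(",")
      val links = graph.edges.collect().map { e =>
        s"""{"source":${index(e.srcId)},"target":${index(e.dstId)},"value":${e.attr}}"""
      }.mkString(",")
      s"""{"nodes":[$nodes],"links":[$links]}"""
    }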
Re: GraphX Java API
Yes, it is possible to use GraphX from Java, but it requires 10x the amount of code and involves using obscure typing and pre-defined lambda prototype facilities. I give an example of it in my book, the source code for which can be downloaded for free from https://www.manning.com/books/spark-graphx-in-action. The relevant example is EdgeCount.java in chapter 10. As I suggest in my book, likely the only reason you'd want to put yourself through that torture is corporate mandate or compatibility with Java bytecode tools. From: Sean Owen To: Takeshi Yamamuro; "Kumar, Abhishek (US - Bengaluru)" Cc: "user@spark.apache.org" Sent: Monday, May 30, 2016 7:07 AM Subject: Re: GraphX Java API No, you can call any Scala API in Java. It is somewhat less convenient if the method was not written with Java in mind, but it does work. On Mon, May 30, 2016, 00:32 Takeshi Yamamuro wrote: These packages are used only for Scala. On Mon, May 30, 2016 at 2:23 PM, Kumar, Abhishek (US - Bengaluru) wrote: Hey, I see some GraphX packages listed here: http://spark.apache.org/docs/latest/api/java/index.html (org.apache.spark.graphx, org.apache.spark.graphx.impl, org.apache.spark.graphx.lib, org.apache.spark.graphx.util). Aren't they meant to be used with Java? Thanks From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com] Sent: Friday, May 27, 2016 4:52 PM To: Kumar, Abhishek (US - Bengaluru); user@spark.apache.org Subject: RE: GraphX Java API GraphX APIs are available only in Scala. If you need to use GraphX you need to switch to Scala. From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com] Sent: 27 May 2016 19:59 To: user@spark.apache.org Subject: GraphX Java API Hi, We are trying to consume the Java API for GraphX, but there is no documentation available online on the usage or examples. It would be great if we could get some examples in Java. Thanks and regards, Abhishek Kumar -- Takeshi Yamamuro
Re: Adhoc queries on Spark 2.0 with Structured Streaming
At first glance, it looks like the only streaming data sources available out of the box from the GitHub master branch are https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala and https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala . Out of the Jira epic for Structured Streaming https://issues.apache.org/jira/browse/SPARK-8360 it would seem the still-open https://issues.apache.org/jira/browse/SPARK-10815 "API design: data sources and sinks" is relevant here. In short, it would seem the code is not there yet to create a Kafka-fed DataFrame/Dataset that can be queried with Structured Streaming; or if it is, it's not obvious how to write such code. From: Anthony May To: Deepak Sharma; Sunita Arvind Cc: "user@spark.apache.org" Sent: Friday, May 6, 2016 11:50 AM Subject: Re: Adhoc queries on Spark 2.0 with Structured Streaming Yeah, there isn't even an RC yet and no documentation, but you can work off the code base and test suites: https://github.com/apache/spark And this might help: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/DataFrameReaderWriterSuite.scala On Fri, 6 May 2016 at 11:07 Deepak Sharma wrote: Spark 2.0 is yet to come out for public release. I am waiting to get hands on it as well. Please do let me know if I can download the source and build Spark 2.0 from GitHub. Thanks Deepak On Fri, May 6, 2016 at 9:51 PM, Sunita Arvind wrote: Hi All, We are evaluating a few real-time streaming query engines and Spark is my personal choice. The addition of adhoc queries is what is getting me further excited about it, however the talks I have heard so far only mention it but do not provide details. I need to build a prototype to ensure it works for our use cases. Can someone point me to relevant material for this? regards Sunita -- Thanks Deepak www.bigdatabig.com www.keosha.net
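For what it's worth, here is a minimal sketch of the file source plus in-memory sink combination mentioned above, written against the Structured Streaming API as it eventually shipped in Spark 2.x (the pre-release API discussed in this thread differed, and the Kafka source arrived later); the path and table name are placeholders:

    import org.apache.spark.sql.SparkSession

    object FileStreamSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("FileStreamSketch").getOrCreate()
        import spark.implicits._

        // FileStreamSource: pick up new text files dropped into a directory
        val lines = spark.readStream.text("/tmp/incoming").as[String]
        val counts = lines.flatMap(_.split("\\s+")).groupBy("value").count()

        // In-memory sink: the latest result is queryable as a temp table for adhoc SQL
        val query = counts.writeStream
          .outputMode("complete")
          .format("memory")
          .queryName("word_counts")
          .start()

        spark.sql("SELECT * FROM word_counts ORDER BY count DESC").show()
        query.awaitTermination()
      }
    }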
Re: Spark 2.0 forthcoming features
http://go.databricks.com/apache-spark-2.0-presented-by-databricks-co-founder-reynold-xin From: Sourav Mazumder To: user Sent: Wednesday, April 20, 2016 11:07 AM Subject: Spark 2.0 forthcoming features Hi All, Is there somewhere we can get an idea of the upcoming features in Spark 2.0? I got a list for Spark ML from https://issues.apache.org/jira/browse/SPARK-12626. Are there other links where I can find similar enhancements planned for Spark SQL, Spark Core, Spark Streaming, GraphX, etc.? Thanks in advance. Regards, Sourav
Re: Apache Flink
As with all history, "what if"s are not scientifically testable hypotheses, but my speculation is the energy (VCs, startups, big Internet companies, universities) within Silicon Valley contrasted to Germany. From: Mich Talebzadeh <mich.talebza...@gmail.com> To: Michael Malak <michaelma...@yahoo.com>; "user @spark" <user@spark.apache.org> Sent: Sunday, April 17, 2016 3:55 PM Subject: Re: Apache Flink Assuming that both Spark and Flink are contemporaries what are the reasons that Flink has not been adopted widely? (this may sound obvious and or prejudged). I mean Spark has surged in popularity in the past year if I am correct Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 17 April 2016 at 22:49, Michael Malak <michaelma...@yahoo.com> wrote: In terms of publication date, a paper on Nephele was published in 2009, prior to the 2010 USENIX paper on Spark. Nephele is the execution engine of Stratosphere, which became Flink. From: Mark Hamstra <m...@clearstorydata.com> To: Mich Talebzadeh <mich.talebza...@gmail.com> Cc: Corey Nolet <cjno...@gmail.com>; "user @spark" <user@spark.apache.org> Sent: Sunday, April 17, 2016 3:30 PM Subject: Re: Apache Flink To be fair, the Stratosphere project from which Flink springs was started as a collaborative university research project in Germany about the same time that Spark was first released as Open Source, so they are near contemporaries rather than Flink having been started only well after Spark was an established and widely-used Apache project. On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: Also it always amazes me why they are so many tangential projects in Big Data space? Would not it be easier if efforts were spent on adding to Spark functionality rather than creating a new product like Flink? Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 17 April 2016 at 21:08, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: Thanks Corey for the useful info. I have used Sybase Aleri and StreamBase as commercial CEPs engines. However, there does not seem to be anything close to these products in Hadoop Ecosystem. So I guess there is nothing there? Regards. Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 17 April 2016 at 20:43, Corey Nolet <cjno...@gmail.com> wrote: i have not been intrigued at all by the microbatching concept in Spark. I am used to CEP in real streams processing environments like Infosphere Streams & Storm where the granularity of processing is at the level of each individual tuple and processing units (workers) can react immediately to events being received and processed. The closest Spark streaming comes to this concept is the notion of "state" that that can be updated via the "updateStateBykey()" functions which are only able to be run in a microbatch. Looking at the expected design changes to Spark Streaming in Spark 2.0.0, it also does not look like tuple-at-a-time processing is on the radar for Spark, though I have seen articles stating that more effort is going to go into the Spark SQL layer in Spark streaming which may make it more reminiscent of Esper. For these reasons, I have not even tried to implement CEP in Spark. I feel it's a waste of time without immediate tuple-at-a-time processing. 
Without this, they avoid the whole problem of "back pressure" (though keep in mind, it is still very possible to overload the Spark streaming layer with stages that will continue to pile up and never get worked off) but they lose the granular control that you get in CEP environments by allowing the rules & processors to react with the receipt of each tuple, right away. Awhile back, I did attempt to implement an InfoSphere Streams-like API [1] on top of Apache Storm as an example of what such a design may look like. It looks like Storm is going to be replaced in the not so distant future by Twitter's new design called Heron. IIRC, Heron does not have an open source implementation as of yet. [1] https://github.com/calrissian/flowmix On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: Hi Corey, Can you please point me to docs on using Spark for CEP? Do we have a set of CEP libraries somewhere. I am keen on getting hold of adaptor libraries for Spark something like below Thanks Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 17 April 2016 at 16:07, Corey Nolet <cjno...@gma
Re: Apache Flink
There have been commercial CEP solutions for decades, including from my employer. From: Mich TalebzadehTo: Mark Hamstra Cc: Corey Nolet ; "user @spark" Sent: Sunday, April 17, 2016 3:48 PM Subject: Re: Apache Flink The problem is that the strength and wider acceptance of a typical Open source project is its sizeable user and development community. When the community is small like Flink, then it is not a viable solution to adopt I am rather disappointed that no big data project can be used for Complex Event Processing as it has wider use in Algorithmic trading among others. Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 17 April 2016 at 22:30, Mark Hamstra wrote: To be fair, the Stratosphere project from which Flink springs was started as a collaborative university research project in Germany about the same time that Spark was first released as Open Source, so they are near contemporaries rather than Flink having been started only well after Spark was an established and widely-used Apache project. On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh wrote: Also it always amazes me why they are so many tangential projects in Big Data space? Would not it be easier if efforts were spent on adding to Spark functionality rather than creating a new product like Flink? Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 17 April 2016 at 21:08, Mich Talebzadeh wrote: Thanks Corey for the useful info. I have used Sybase Aleri and StreamBase as commercial CEPs engines. However, there does not seem to be anything close to these products in Hadoop Ecosystem. So I guess there is nothing there? Regards. Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 17 April 2016 at 20:43, Corey Nolet wrote: i have not been intrigued at all by the microbatching concept in Spark. I am used to CEP in real streams processing environments like Infosphere Streams & Storm where the granularity of processing is at the level of each individual tuple and processing units (workers) can react immediately to events being received and processed. The closest Spark streaming comes to this concept is the notion of "state" that that can be updated via the "updateStateBykey()" functions which are only able to be run in a microbatch. Looking at the expected design changes to Spark Streaming in Spark 2.0.0, it also does not look like tuple-at-a-time processing is on the radar for Spark, though I have seen articles stating that more effort is going to go into the Spark SQL layer in Spark streaming which may make it more reminiscent of Esper. For these reasons, I have not even tried to implement CEP in Spark. I feel it's a waste of time without immediate tuple-at-a-time processing. Without this, they avoid the whole problem of "back pressure" (though keep in mind, it is still very possible to overload the Spark streaming layer with stages that will continue to pile up and never get worked off) but they lose the granular control that you get in CEP environments by allowing the rules & processors to react with the receipt of each tuple, right away. Awhile back, I did attempt to implement an InfoSphere Streams-like API [1] on top of Apache Storm as an example of what such a design may look like. 
It looks like Storm is going to be replaced in the not so distant future by Twitter's new design called Heron. IIRC, Heron does not have an open source implementation as of yet. [1] https://github.com/calrissian/flowmix On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh wrote: Hi Corey, Can you please point me to docs on using Spark for CEP? Do we have a set of CEP libraries somewhere. I am keen on getting hold of adaptor libraries for Spark something like below Thanks Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 17 April 2016 at 16:07, Corey Nolet wrote: One thing I've noticed about Flink in my following of the project has been that it has established, in a few cases, some novel ideas and improvements over Spark. The problem with it, however, is that both the development team and the community around it are very small and many of those novel improvements have been rolled directly into Spark in subsequent versions. I was considering changing over my architecture to Flink at one point to get better, more real-time CEP streaming support, but in the end I
Re: Apache Flink
In terms of publication date, a paper on Nephele was published in 2009, prior to the 2010 USENIX paper on Spark. Nephele is the execution engine of Stratosphere, which became Flink. From: Mark HamstraTo: Mich Talebzadeh Cc: Corey Nolet ; "user @spark" Sent: Sunday, April 17, 2016 3:30 PM Subject: Re: Apache Flink To be fair, the Stratosphere project from which Flink springs was started as a collaborative university research project in Germany about the same time that Spark was first released as Open Source, so they are near contemporaries rather than Flink having been started only well after Spark was an established and widely-used Apache project. On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh wrote: Also it always amazes me why they are so many tangential projects in Big Data space? Would not it be easier if efforts were spent on adding to Spark functionality rather than creating a new product like Flink? Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 17 April 2016 at 21:08, Mich Talebzadeh wrote: Thanks Corey for the useful info. I have used Sybase Aleri and StreamBase as commercial CEPs engines. However, there does not seem to be anything close to these products in Hadoop Ecosystem. So I guess there is nothing there? Regards. Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 17 April 2016 at 20:43, Corey Nolet wrote: i have not been intrigued at all by the microbatching concept in Spark. I am used to CEP in real streams processing environments like Infosphere Streams & Storm where the granularity of processing is at the level of each individual tuple and processing units (workers) can react immediately to events being received and processed. The closest Spark streaming comes to this concept is the notion of "state" that that can be updated via the "updateStateBykey()" functions which are only able to be run in a microbatch. Looking at the expected design changes to Spark Streaming in Spark 2.0.0, it also does not look like tuple-at-a-time processing is on the radar for Spark, though I have seen articles stating that more effort is going to go into the Spark SQL layer in Spark streaming which may make it more reminiscent of Esper. For these reasons, I have not even tried to implement CEP in Spark. I feel it's a waste of time without immediate tuple-at-a-time processing. Without this, they avoid the whole problem of "back pressure" (though keep in mind, it is still very possible to overload the Spark streaming layer with stages that will continue to pile up and never get worked off) but they lose the granular control that you get in CEP environments by allowing the rules & processors to react with the receipt of each tuple, right away. Awhile back, I did attempt to implement an InfoSphere Streams-like API [1] on top of Apache Storm as an example of what such a design may look like. It looks like Storm is going to be replaced in the not so distant future by Twitter's new design called Heron. IIRC, Heron does not have an open source implementation as of yet. [1] https://github.com/calrissian/flowmix On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh wrote: Hi Corey, Can you please point me to docs on using Spark for CEP? Do we have a set of CEP libraries somewhere. 
I am keen on getting hold of adaptor libraries for Spark something like below Thanks Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 17 April 2016 at 16:07, Corey Nolet wrote: One thing I've noticed about Flink in my following of the project has been that it has established, in a few cases, some novel ideas and improvements over Spark. The problem with it, however, is that both the development team and the community around it are very small and many of those novel improvements have been rolled directly into Spark in subsequent versions. I was considering changing over my architecture to Flink at one point to get better, more real-time CEP streaming support, but in the end I decided to stick with Spark and just watch Flink continue to pressure it into improvement. On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers wrote: i never found much info that flink was actually designed to be fault tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that doesn't bode well for large scale data processing. spark was designed with fault tolerance in mind from the beginning. On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh
Re: Spark with Druid
Will Spark 2.0 Structured Streaming obviate some of the Druid/Spark use cases? From: Raymond Honderdors To: "yuzhih...@gmail.com" Cc: "user@spark.apache.org" Sent: Wednesday, March 23, 2016 8:43 AM Subject: Re: Spark with Druid I saw these, but I fail to understand how to direct the code to use the index JSON. Sent from Outlook Mobile On Wed, Mar 23, 2016 at 7:19 AM -0700, "Ted Yu" wrote: Please see: https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani which references https://github.com/SparklineData/spark-druid-olap On Wed, Mar 23, 2016 at 5:59 AM, Raymond Honderdors wrote: Does anyone have a good overview on how to integrate Spark and Druid? I am now struggling with the creation of a Druid data source in Spark. Raymond Honderdors, Team Lead Analytics BI, Business Intelligence Developer, raymond.honderd...@sizmek.com, t +972.7325.3569, Herzliya
Re: Build k-NN graph for large dataset
Yes. And a paper that describes using grids (actually varying grids) is http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf In the Spark GraphX In Action book that Robin East and I are writing, we implement a drastically simplified version of this in chapter 7, which should become available in the MEAP mid-September. http://www.manning.com/books/spark-graphx-in-action From: Kristina Rogale Plazonic kpl...@gmail.com To: Jaonary Rabarisoa jaon...@gmail.com Cc: user user@spark.apache.org Sent: Wednesday, August 26, 2015 7:24 AM Subject: Re: Build k-NN graph for large dataset If you don't want to compute all N^2 similarities, you need to implement some kind of blocking first. For example, LSH (locality-sensitive hashing). A quick search gave this link to a Spark implementation: http://stackoverflow.com/questions/2771/spark-implementation-for-locality-sensitive-hashing On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely, I have a large set of high-dimensional vectors (say d 1) and I want to build a graph where those high-dimensional points are the vertices and each one is linked to its k nearest neighbors based on some kind of similarity defined on the vertex space. My problem is to implement an efficient algorithm to compute the weight matrix of the graph. I need to compute N*N similarities and the only way I know is to use a cartesian operation followed by a map operation on the RDD. But this is very slow when N is large. Is there a more clever way to do this for an arbitrary similarity function? Cheers, Jao
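To make the blocking idea concrete, here is a hedged Scala sketch (not from the book or the linked implementations; the function names, random-hyperplane signatures, and choice of cosine similarity are illustrative): points hashing to the same bucket are the only candidate pairs compared, so the full cartesian blow-up is avoided at the cost of approximate results.

    import org.apache.spark.rdd.RDD
    import scala.util.Random

    object ApproxKnnSketch {
      def cosine(a: Array[Double], b: Array[Double]): Double = {
        val dot = a.zip(b).map { case (x, y) => x * y }.sum
        dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
      }

      def approxKnn(points: RDD[(Long, Array[Double])], dim: Int, k: Int,
                    numPlanes: Int = 16, seed: Long = 42L): RDD[(Long, Seq[(Long, Double)])] = {
        val rnd = new Random(seed)
        val planes = Array.fill(numPlanes)(Array.fill(dim)(rnd.nextGaussian()))
        val bcPlanes = points.sparkContext.broadcast(planes)

        // Bucket each point by the signs of its projections onto random hyperplanes
        val bucketed = points.map { case (id, v) =>
          val sig = bcPlanes.value.map { p =>
            if (p.zip(v).map { case (a, b) => a * b }.sum >= 0) '1' else '0'
          }.mkString
          (sig, (id, v))
        }

        // Compare only within a bucket; neighbors falling in other buckets are missed (approximation)
        bucketed.groupByKey().flatMap { case (_, bucket) =>
          val items = bucket.toArray
          items.map { case (id, v) =>
            val neighbors = items.filter(_._1 != id)
              .map { case (oid, ov) => (oid, cosine(v, ov)) }
              .sortBy(-_._2)
              .take(k)
              .toSeq
            (id, neighbors)
          }
        }
      }
    }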
RE: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?
I would also add, from a data-locality-theoretic standpoint, mapPartitions() provides for node-local computation that plain old map-reduce does not. From my Android phone on T-Mobile. The first nationwide 4G network. Original message From: Ashic Mahtab as...@live.com Date: 06/28/2015 10:51 AM (GMT-07:00) To: YaoPau jonrgr...@gmail.com, Apache Spark user@spark.apache.org Subject: RE: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce? Spark comes with quite a few components. At its core is... surprise... Spark Core. This provides the core things required to run Spark jobs. Spark provides a lot of operators out of the box... take a look at https://spark.apache.org/docs/latest/programming-guide.html#transformations https://spark.apache.org/docs/latest/programming-guide.html#actions While all of them can be implemented with variations of rdd.map().reduce(), there are optimisations to be gained in terms of data locality, etc., and the additional operators simply make life simpler. In addition to the core stuff, Spark also brings things like Spark Streaming, Spark SQL and data frames, MLlib, GraphX, etc. Spark Streaming gives you microbatches of RDDs at periodic intervals. Think "give me the last 15 seconds of events every 5 seconds." You can then program towards the small collection, and the job will run in a fault tolerant manner on your cluster. Spark SQL provides Hive-like functionality that works nicely with various data sources and RDDs. MLlib provides a lot of out-of-the-box machine learning algorithms, and the new Spark ML project provides a nice, elegant pipeline API to take care of a lot of common machine learning tasks. GraphX allows you to represent data in graphs, and run graph algorithms on it. e.g. you can represent your data as RDDs of vertices and edges, and run PageRank on a distributed cluster. And there's more. So, yeah... Spark is definitely not just MapReduce. :) Date: Sun, 28 Jun 2015 09:13:18 -0700 From: jonrgr...@gmail.com To: user@spark.apache.org Subject: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce? I've heard "Spark is not just MapReduce" mentioned during Spark talks, but it seems like every method that Spark has is really doing something like (Map -> Reduce) or (Map -> Map -> Map -> Reduce) etc. behind the scenes, with the performance benefit of keeping RDDs in memory between stages. Am I wrong about that? Is Spark doing anything more efficiently than a series of Maps followed by a Reduce in memory? What methods does Spark have that can't easily be mapped (with somewhat similar efficiency) to Map and Reduce in memory? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-does-Spark-is-not-just-MapReduce-mean-Isn-t-every-Spark-job-a-form-of-MapReduce-tp23518.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
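As a small illustration of the data-locality point (the file path and date format are made up, and an existing SparkContext `sc` is assumed): mapPartitions lets per-task setup happen once per partition on the node holding the data, rather than once per record as a plain map would force.

    import java.text.SimpleDateFormat

    val timestamps = sc.textFile("/data/dates.txt").mapPartitions { lines =>
      // Expensive, non-thread-safe object built once per partition, then reused for every line
      val fmt = new SimpleDateFormat("yyyy-MM-dd")
      lines.map(line => fmt.parse(line.trim).getTime)
    }
    timestamps.take(5).foreach(println)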
Re: Why Spark is much faster than Hadoop MapReduce even on disk
http://www.datascienceassn.org/content/making-sense-making-sense-performance-data-analytics-frameworks From: bit1...@163.com bit1...@163.com To: user user@spark.apache.org Sent: Monday, April 27, 2015 8:33 PM Subject: Why Spark is much faster than Hadoop MapReduce even on disk Hi, I am frequently asked why Spark is also much faster than Hadoop MapReduce on disk (without the use of memory cache). I have no convincing answer for this question; could you guys elaborate on this? Thanks!
Re: How to restrict foreach on a streaming RDD only once upon receiver completion
You could have your receiver send a magic value when it is done. I discuss this Spark Streaming pattern in my presentation "Spark Gotchas and Anti-Patterns". In the PDF version, it's slides 34-36. http://www.datascienceassn.org/content/2014-11-05-spark-gotchas-and-anti-patterns-julia-language YouTube version cued to that place: http://www.youtube.com/watch?v=W5Uece_JmNs&t=23m18s From: Hari Polisetty hpoli...@icloud.com To: Tathagata Das t...@databricks.com Cc: user user@spark.apache.org Sent: Monday, April 6, 2015 2:02 PM Subject: Re: How to restrict foreach on a streaming RDD only once upon receiver completion Yes, I'm using updateStateByKey and it works. But then I need to perform further computation on this stateful RDD (see code snippet below). I perform foreach on the final RDD and get the top 10 records. I just don't want the foreach to be performed every time a new batch is received. Only when the receiver is done fetching all the records. My requirements are to programmatically invoke the E.S query (it varies by use case), get all the records, apply certain transformations, and get the top 10 results based on certain criteria back into the driver program for further processing. I'm able to apply the transformations on the batches of records fetched from E.S using streaming. So, I don't need to wait for all the records to be fetched. The RDD transformations are happening all the time and the top k results are getting updated constantly until all the records are fetched by the receiver. Is there any drawback with this approach? Can you give more pointers on what you mean by creating a custom RDD that reads from ElasticSearch? Here is the relevant portion of my Spark streaming code: //Create a custom streaming receiver to query for relevant data from E.S JavaReceiverInputDStream<String> jsonStrings = ssc.receiverStream( new ElasticSearchResponseReceiver(query…….)); //Apply JSON Paths to extract specific value(s) from each record JavaDStream<String> fieldVariations = jsonStrings.flatMap(new FlatMapFunction<String, String>() { private static final long serialVersionUID = 465237345751948L; @Override public Iterable<String> call(String jsonString) { List<String> r = JsonPath.read(jsonString, attributeDetail.getJsonPath()); return r; } }); //Perform a stateful map reduce on each variation JavaPairDStream<String, Integer> fieldVariationCounts = fieldVariations.mapToPair( new PairFunction<String, String, Integer>() { private static final long serialVersionUID = -1241276515559408238L; @Override public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); } }).updateStateByKey(new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() { private static final long serialVersionUID = 7598681835161199865L; public Optional<Integer> call(List<Integer> nums, Optional<Integer> current) { Integer sum = current.or((int) 0L); return (Optional<Integer>) Optional.of(sum + nums.size()); } }).reduceByKey(new Function2<Integer, Integer, Integer>() { private static final long serialVersionUID = -5906059838295609562L; @Override public Integer call(Integer i1, Integer i2) { return i1 + i2; } }); //Swap the Map from <Enum String, Int> to <Int, Enum String>. This is so that we can sort on frequencies JavaPairDStream<Integer, String> swappedPair = fieldVariationCounts.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() { private static final long serialVersionUID = -5889774695187619957L; @Override public Tuple2<Integer, String> call(Tuple2<String, Integer> item) throws Exception { return item.swap(); } }); //Sort based on Key i.e., frequency JavaPairDStream<Integer, String> sortedCounts = swappedPair.transformToPair( new Function<JavaPairRDD<Integer, String>, JavaPairRDD<Integer, String>>() { private static final long serialVersionUID = -4172702039963232779L; public JavaPairRDD<Integer, String> call(JavaPairRDD<Integer, String> in) throws Exception { //false to denote sort in descending order return in.sortByKey(false); } }); //Iterate through the RDD and get the top 20 values in the sorted pair and write to results list sortedCounts.foreach( new Function<JavaPairRDD<Integer, String>, Void>() { private static final long serialVersionUID = 2186144129973051920L; public Void call(JavaPairRDD<Integer, String> rdd) { resultList.clear(); for (Tuple2<Integer, String> t : rdd.take(MainDriver.NUMBER_OF_TOP_VARIATIONS)) { resultList.add(new Tuple3<String, Integer, Double>(t._2(), t._1(), (double) (100 * t._1()) / totalProcessed.value())); } return null; } } ); On Apr 7, 2015, at 1:14 AM, Tathagata Das t...@databricks.com wrote: So you want to sort based on the total count of all the records received through the receiver? In that case, you have to combine all the counts using updateStateByKey (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala) But stepping back, if you want to
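As a rough, hedged sketch of the magic-value idea suggested at the top of this thread (the sentinel string, the socket source standing in for the custom receiver, and the batch interval are all made up for illustration): the receiver emits a known sentinel record when it has fetched everything, and the driver runs the final foreach logic only for the batch in which that sentinel appears.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MagicValueSketch {
      val DoneMarker = "__DONE__"   // sentinel the receiver sends once it has fetched all records

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("magic-value"), Seconds(3))
        val lines = ssc.socketTextStream("localhost", 9999)   // stand-in for the custom receiver

        val counts = lines.map((_, 1)).reduceByKey(_ + _)

        counts.foreachRDD { rdd =>
          // Only do the expensive final work once the sentinel shows up
          if (rdd.keys.filter(_ == DoneMarker).count() > 0) {
            rdd.filter(_._1 != DoneMarker)
              .map(_.swap)
              .sortByKey(ascending = false)
              .take(10)
              .foreach(println)
            ssc.stop(stopSparkContext = true, stopGracefully = false)
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }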
Spark GraphX In Action on documentation page?
Can my new book, Spark GraphX In Action, which is currently in MEAP http://manning.com/malak/, be added to https://spark.apache.org/documentation.html and, if appropriate, to https://spark.apache.org/graphx/ ? Michael Malak
Re: spark challenge: zip with next???
But isn't foldLeft() overkill for the originally stated use case of max diff of adjacent pairs? Isn't foldLeft() for recursive, non-commutative, non-associative accumulation, as opposed to an embarrassingly parallel operation such as this one? This use case reminds me of FIR filtering in DSP. It seems that RDDs could use something that serves the same purpose as scala.collection.Iterator.sliding. From: Koert Kuipers ko...@tresata.com To: Mohit Jaggi mohitja...@gmail.com Cc: Tobias Pfeiffer t...@preferred.jp; Ganelin, Ilya ilya.gane...@capitalone.com; derrickburns derrickrbu...@gmail.com; user@spark.apache.org user@spark.apache.org Sent: Friday, January 30, 2015 7:11 AM Subject: Re: spark challenge: zip with next??? assuming the data can be partitioned, then you have many timeseries for which you want to detect potential gaps. also assuming the resulting gaps info per timeseries is much smaller data than the timeseries data itself, then this is a classical example to me of a sorted (streaming) foldLeft, requiring an efficient secondary sort in the spark shuffle. I am trying to get that into spark here: https://issues.apache.org/jira/browse/SPARK-3655 On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi mohitja...@gmail.com wrote: http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E you can use the MLLib function or do the following (which is what I had done): - In the first pass over the data, using mapPartitionsWithIndex, gather the first item in each partition. You can use collect (or aggregator) for this. "Key" them by the partition index. At the end, you will have a map (partition index) -> first item. - In the second pass over the data, using mapPartitionsWithIndex again, look at two items at a time (or, in the general case, N items at a time; you can use scala's sliding iterator) and check the time difference (or any sliding window computation). To this mapPartitions, pass the map created in the previous step. You will need to use it to check the last item in each partition. If you can tolerate a few inaccuracies, then you can just do the second step. You will miss the "boundaries" of the partitions, but it might be acceptable for your use case. On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Make a copy of your RDD with an extra entry in the beginning to offset. Then you can zip the two RDDs and run a map to generate an RDD of differences. Does that work? I recently tried something to compute differences between each entry and the next, so I did val rdd1 = ... // null element + rdd val rdd2 = ... // rdd + null element but got an error message about zip requiring data sizes in each partition to match. Tobias
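For the narrow use case that started the thread (max difference of adjacent elements), a hedged sketch that avoids both foldLeft and the partition-boundary bookkeeping, at the cost of a join, might look like this (assumes `sc` is a SparkContext and the sequence order is the order of interest):

    def maxAdjacentDiff(sc: org.apache.spark.SparkContext, values: Seq[Double]): Double = {
      val indexed = sc.parallelize(values).zipWithIndex().map(_.swap)   // (index, value)
      val shifted = indexed.map { case (i, v) => (i - 1, v) }           // align element i+1 onto key i
      indexed.join(shifted)                                             // (i, (current, next))
        .map { case (_, (cur, nxt)) => math.abs(nxt - cur) }
        .max()
    }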
Re: Rdd of Rdds
On Wednesday, October 22, 2014 9:06 AM, Sean Owen so...@cloudera.com wrote: No, there's no such thing as an RDD of RDDs in Spark. Here though, why not just operate on an RDD of Lists? or a List of RDDs? Usually one of these two is the right approach whenever you feel inclined to operate on an RDD of RDDs. Depending on one's needs, one could also consider the matrix (RDD[Vector]) operations provided by MLLib, such as https://spark.apache.org/docs/latest/mllib-statistics.html
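A tiny illustration of the suggested alternative (assumes an existing SparkContext `sc`; data is made up): keep the inner collections as ordinary Lists and let Spark parallelize across them.

    val groups = sc.parallelize(Seq(List(1, 2, 3), List(10, 20), List(5)))
    val perGroupSums = groups.map(_.sum)        // parallel across groups, local within a group
    perGroupSums.collect().foreach(println)     // prints 6, 30, 5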
Re: UpdateStateByKey - How to improve performance?
Depending on the density of your keys, the alternative signature def updateStateByKey[S](updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)], partitioner: Partitioner, rememberPartitioner: Boolean)(implicit arg0: ClassTag[S]): DStream[(K, S)] at least iterates by key rather than by (old) value. I believe your thinking is correct that there might be a performance improvement opportunity for your case if there were an updateStateByKey() that instead iterated by (new) value. BTW, my impression from the stock examples is that the signature I pasted above was intended to be the more typically called updateStateByKey(), as opposed to the one you pasted, for which my impression is that it is the more general purpose one. I have used the more general purpose one but only when I needed to peek into the entire set of states for some unusual reason. On Wednesday, August 6, 2014 2:30 PM, Venkat Subramanian vsubr...@gmail.com wrote: The method def updateStateByKey[S: ClassTag] ( updateFunc: (Seq[V], Option[S]) => Option[S] ): DStream[(K, S)] takes a DStream (K,V) and produces a DStream (K,S) in Spark Streaming. We have an input DStream (K,V) that has 40,000 elements. We update on average 1000 of them in every 3-second batch, but based on how this updateStateByKey function is defined, we are looping through 40,000 elements (Seq[V]) to make an update for just 1000 elements and not updating 39,000 elements. I think looping through the extra 39,000 elements is a waste of performance. Isn't there a better way to update this efficiently by just figuring out a hash map for the 1000 elements that are required to be updated and just updating it (without looping through the unwanted elements)? Shouldn't there be a Streaming update function provided that updates selective members, or are we missing some concepts here? I think updateStateByKey may be causing a lot of performance degradation in our app as we keep doing this again and again for every batch. Please let us know if my thought process is correct here. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/UpdateStateByKey-How-to-improve-performance-tp11575.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
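A hedged sketch of calling the iterator-based overload quoted above (assumes `pairs` is an existing DStream[(String, Int)]; the partition count is arbitrary): the update function walks (key, new values, old state) triples, so keys are visited once each rather than their values being scanned wholesale.

    import org.apache.spark.HashPartitioner

    val updated = pairs.updateStateByKey[Long](
      (it: Iterator[(String, Seq[Int], Option[Long])]) => it.map {
        case (key, newValues, state) =>
          // Keys with no new values in this batch just carry their old state forward
          (key, state.getOrElse(0L) + newValues.sum)
      },
      new HashPartitioner(8),
      rememberPartitioner = true
    )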
Re: relationship of RDD[Array[String]] to Array[Array[String]]
It's really more of a Scala question than a Spark question, but the standard OO (not Scala-specific) way is to create your own custom supertype (e.g. MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD and MyArray), each of which manually forwards method calls to the corresponding pre-existing library implementations. Writing all those forwarding method calls is tedious, but Scala provides at least one bit of syntactic sugar, which alleviates having to type in twice the parameter lists for each method: http://stackoverflow.com/questions/8230831/is-method-parameter-forwarding-possible-in-scala I'm not seeing a way to utilize implicit conversions in this case. Since Scala is statically (albeit inferred) typed, I don't see a way around having a common supertype. On Monday, July 21, 2014 11:01 AM, Philip Ogren philip.og...@oracle.com wrote: It is really nice that Spark RDD's provide functions that are often equivalent to functions found in Scala collections. For example, I can call: myArray.map(myFx) and equivalently myRdd.map(myFx) Awesome! My question is this. Is it possible to write code that works on either an RDD or a local collection without having to have parallel implementations? I can't tell that RDD or Array share any supertypes or traits by looking at the respective scaladocs. Perhaps implicit conversions could be used here. What I would like to do is have a single function whose body is like this: myData.map(myFx) where myData could be an RDD[Array[String]] (for example) or an Array[Array[String]]. Has anyone had success doing this? Thanks, Philip
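A hedged sketch of the forwarding approach described above (the trait and class names come from the reply itself and are illustrative, not a library API):

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    trait MyCollectionTrait[T] {
      def map[U: ClassTag](f: T => U): MyCollectionTrait[U]
      def collectAll(): Array[T]
    }

    class MyRDD[T: ClassTag](rdd: RDD[T]) extends MyCollectionTrait[T] {
      def map[U: ClassTag](f: T => U): MyCollectionTrait[U] = new MyRDD(rdd.map(f))
      def collectAll(): Array[T] = rdd.collect()
    }

    class MyArray[T: ClassTag](arr: Array[T]) extends MyCollectionTrait[T] {
      def map[U: ClassTag](f: T => U): MyCollectionTrait[U] = new MyArray(arr.map(f))
      def collectAll(): Array[T] = arr
    }

    // Code written against the trait runs unchanged on either backing collection:
    def lengths(data: MyCollectionTrait[String]): Array[Int] = data.map(_.length).collectAll()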
Re: parallel Reduce within a key
How about a treeReduceByKey? :-) On Friday, June 20, 2014 11:55 AM, DB Tsai dbt...@stanford.edu wrote: Currently, the reduce operation combines the result from mapper sequentially, so it's O(n). Xiangrui is working on treeReduce which is O(log(n)). Based on the benchmark, it dramatically increase the performance. You can test the code in his own branch. https://github.com/apache/spark/pull/1110 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Jun 20, 2014 at 6:57 AM, ansriniv ansri...@gmail.com wrote: Hi, I am on Spark 0.9.0 I have a 2 node cluster (2 worker nodes) with 16 cores on each node (so, 32 cores in the cluster). I have an input rdd with 64 partitions. I am running sc.mapPartitions(...).reduce(...) I can see that I get full parallelism on the mapper (all my 32 cores are busy simultaneously). However, when it comes to reduce(), the outputs of the mappers are all reduced SERIALLY. Further, all the reduce processing happens only on 1 of the workers. I was expecting that the outputs of the 16 mappers on node 1 would be reduced in parallel in node 1 while the outputs of the 16 mappers on node 2 would be reduced in parallel on node 2 and there would be 1 final inter-node reduce (of node 1 reduced result and node 2 reduced result). Isn't parallel reduce supported WITHIN a key (in this case I have no key) ? (I know that there is parallelism in reduce across keys) Best Regards Anand -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallel-Reduce-within-a-key-tp7998.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
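There is no treeReduceByKey, but for a plain reduce the tree-based variant under discussion ended up in later Spark releases; a hedged sketch (assumes `sc` is a SparkContext, and the depth parameter controls how many levels of partial aggregation happen on the executors before the driver combines the remainder):

    val nums = sc.parallelize(1L to 1000000L, 64)
    val plainSum = nums.reduce(_ + _)                // partial results combined sequentially on the driver
    val treeSum  = nums.treeReduce(_ + _, depth = 3) // partial results combined in a multi-level tree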
Re: rdd ordering gets scrambled
Mohit Jaggi: A workaround is to use zipWithIndex (to appear in Spark 1.0, but if you're still on 0.9x you can swipe the code from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala ), map it to (x => (x._2, x._1)) and then sortByKey. Spark developers: The lack of ordering guarantee for RDDs should be better documented, and the presence of a method called first() is a bit deceiving, in my opinion, if that same first element doesn't survive a map(). On Tuesday, April 29, 2014 3:45 PM, Mohit Jaggi mohitja...@gmail.com wrote: Hi, I started with a text file (CSV) of sorted data (by first column), parsed it into Scala objects using a map operation in Scala. Then I used more maps to add some extra info to the data and saved it as a text file. The final text file is not sorted. What do I need to do to keep the order from the original input intact? My code looks like: csvFile = sc.textFile(..) //file is CSV and ordered by first column splitRdd = csvFile map { line => line.split(",", -1) } parsedRdd = rdd map { parts => { key = parts(0) //use first column as key value = new MyObject(parts(0), parts(1)) //parse into scala objects (key, value) } augmentedRdd = parsedRdd map { x => key = x._1 value = //add extra fields to x._2 (key, value) } augmentedRdd.saveAsFile(...) //this file is not sorted Mohit.
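A minimal sketch of that workaround applied to the code above (assumes `csvFile` is the ordered input RDD; the parsing step is a stand-in for the real MyObject construction): tag each line with its original position, carry the tag through the transformations, and sort by it before saving.

    val withIndex = csvFile.zipWithIndex().map { case (line, idx) => (idx, line) }
    val augmented = withIndex.mapValues { line =>
      val parts = line.split(",", -1)
      parts(0) + "," + parts(1)          // stand-in for parsing into MyObject and augmenting
    }
    augmented.sortByKey().values.saveAsTextFile("/tmp/output-ordered")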
Re: Bug when zip with longs and too many partitions?
I've discovered that it was noticed a year ago that RDD zip() does not work when the number of partitions does not evenly divide the total number of elements in the RDD: https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ I will enter a JIRA ticket just as soon as the ASF Jira system will let me reset my password. On Sunday, May 11, 2014 4:40 AM, Michael Malak michaelma...@yahoo.com wrote: Is this a bug? scala> sc.parallelize(1 to 2,4).zip(sc.parallelize(11 to 12,4)).collect res0: Array[(Int, Int)] = Array((1,11), (2,12)) scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect res1: Array[(Long, Int)] = Array((2,11))
Re: Opinions stratosphere
"looks like Spark outperforms Stratosphere fairly consistently in the experiments" There was one exception the paper noted, which was when memory resources were constrained. In that case, Stratosphere seemed to have degraded more gracefully than Spark, but the author did not explore it further. The author did insert into his conclusion section, though: "However, in our experiments, for iterative algorithms, the Spark programs may show the poor results in performance in the environment of limited memory resources." I recently blogged a fuller list of alternatives/competitors to Spark: http://datascienceassn.org/content/alternatives-spark-memory-distributed-computing On Friday, May 2, 2014 10:39 AM, Philip Ogren philip.og...@oracle.com wrote: Great reference! I just skimmed through the results without reading much of the methodology - but it looks like Spark outperforms Stratosphere fairly consistently in the experiments. It's too bad the data sources only range from 2GB to 8GB. Who knows if the apparent pattern would extend out to 64GB, 128GB, 1TB, and so on... On 05/01/2014 06:02 PM, Christopher Nguyen wrote: Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted such a comparative study as a Masters thesis: http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf According to this snapshot (c. 2013), Stratosphere is different from Spark in not having an explicit concept of an in-memory dataset (e.g., RDD). In principle this could be argued to be an implementation detail; the operators and execution plan/data flow are of primary concern in the API, and the data representation/materializations are otherwise unspecified. But in practice, for long-running interactive applications, I consider RDDs to be of fundamental, first-class citizen importance, and the key distinguishing feature of Spark's model vs other in-memory approaches that treat memory merely as an implicit cache. -- Christopher T. Nguyen Co-founder CEO, Adatao linkedin.com/in/ctnguyen On Tue, Nov 26, 2013 at 1:26 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I don't know a lot about it except from the research side, where the team has done interesting optimization stuff for these types of applications. In terms of the engine, one thing I'm not sure of is whether Stratosphere allows explicit caching of datasets (similar to RDD.cache()) and interactive queries (similar to spark-shell). But it's definitely an interesting project to watch. Matei On Nov 22, 2013, at 4:17 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, That's what I thought, but as per the slides on http://www.stratosphere.eu they seem to know about Spark and the Scala API does look similar. I found the PACT model interesting. Would like to know if Matei or other core committers have something to weigh in on. -- Ankur On 22 Nov 2013, at 16:05, Patrick Wendell pwend...@gmail.com wrote: I've never seen that project before; would be interesting to get a comparison. Seems to offer a much lower-level API. For instance this is a wordcount program: https://github.com/stratosphere/stratosphere/blob/master/pact/pact-examples/src/main/java/eu/stratosphere/pact/example/wordcount/WordCount.java On Thu, Nov 21, 2013 at 3:15 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, I was just curious about https://github.com/stratosphere/stratosphere and how Spark compares to it. Anyone has any experience with it to make any comments? -- Ankur