Re: Minimum cost flow problem solving in Spark

2017-09-13 Thread Michael Malak
You might be interested in "Maximum Flow implementation on Spark GraphX" done 
by a Colorado School of Mines grad student a couple of years ago.


http://datascienceassn.org/2016-01-27-maximum-flow-implementation-spark-graphx


From: Swapnil Shinde 
To: user@spark.apache.org; d...@spark.apache.org 
Sent: Wednesday, September 13, 2017 9:41 AM
Subject: Minimum cost flow problem solving in Spark



Hello
Has anyone used Spark to solve minimum cost flow problems? I am quite new to 
combinatorial optimization algorithms, so any help, suggestions, or libraries 
would be much appreciated. 

Thanks
Swapnil

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Shortest path with directed and weighted graphs

2016-10-24 Thread Michael Malak
Chapter 6 of my book implements Dijkstra's Algorithm. The source code is 
available to download for free. 
https://www.manning.com/books/spark-graphx-in-action
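
For anyone who wants the general shape of it without the book, here is a minimal Pregel-based sketch of weighted single-source shortest paths in GraphX (this is not the chapter 6 listing; the Double edge attributes and graph types are assumed for illustration):

import org.apache.spark.graphx._

// Minimal sketch (not the book's chapter 6 listing): single-source shortest
// paths over Double edge weights, using GraphX's Pregel operator.
def weightedSssp(graph: Graph[Long, Double], source: VertexId): Graph[Double, Double] = {
  // Start every vertex at infinity except the source.
  val init = graph.mapVertices((id, _) => if (id == source) 0.0 else Double.PositiveInfinity)
  init.pregel(Double.PositiveInfinity)(
    (_, dist, newDist) => math.min(dist, newDist),      // vertex program: keep the shorter distance
    t => if (t.srcAttr + t.attr < t.dstAttr)            // relax each edge; send improved distances
           Iterator((t.dstId, t.srcAttr + t.attr))
         else Iterator.empty,
    (a, b) => math.min(a, b)                            // merge concurrent messages
  )
}

The built-in ShortestPaths in graphx.lib, by contrast, effectively counts hops, i.e. treats every edge as weight 1, which is what prompted the question below.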



  From: Brian Wilson 
 To: user@spark.apache.org 
 Sent: Monday, October 24, 2016 7:11 AM
 Subject: Shortest path with directed and weighted graphs
   
I have been looking at the ShortestPaths function built into Spark.
Am I correct in saying there is no support for weighted graphs with this 
function? By that I mean that it assumes all edges carry a weight of 1.
Many thanks
Brian 

   

Re: GraphX drawing algorithm

2016-09-11 Thread Michael Malak
In chapter 10 of Spark GraphX In Action, we describe how to use Zeppelin with 
d3.js to render graphs using d3's force-directed rendering algorithm. The 
source code can be downloaded for free from 
https://www.manning.com/books/spark-graphx-in-action
  From: agc studio 
 To: user@spark.apache.org 
 Sent: Sunday, September 11, 2016 5:59 PM
 Subject: GraphX drawing algorithm
   
Hi all,
I was wondering if a force-directed graph drawing algorithm has been 
implemented for GraphX?

Thanks

   

Re: GraphX Java API

2016-05-30 Thread Michael Malak
Yes, it is possible to use GraphX from Java but it requires 10x the amount of 
code and involves using obscure typing and pre-defined lambda prototype 
facilities. I give an example of it in my book, the source code for which can 
be downloaded for free from 
https://www.manning.com/books/spark-graphx-in-action The relevant example is 
EdgeCount.java in chapter 10.
As I suggest in my book, likely the only reason you'd want to put yourself 
through that torture is corporate mandate or compatibility with Java bytecode 
tools.
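
For contrast, here is a hedged Scala-side illustration of the kind of trivial GraphX task (counting edges that satisfy a predicate) that the Java version must wrap in ClassTag evidence and anonymous function classes; the names and threshold predicate are made up for illustration, and this is not the EdgeCount.java listing itself:

import org.apache.spark.graphx._

// Hedged illustration only (not the book's EdgeCount.java): the Scala side of a
// trivial GraphX task -- counting edges whose attribute exceeds a threshold.
// The Java equivalent of even this one-liner needs scala.reflect.ClassTag
// evidence and anonymous Function classes for the lambda.
def heavyEdgeCount(graph: Graph[Int, Double], threshold: Double): Long =
  graph.edges.filter(e => e.attr > threshold).count()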

  From: Sean Owen 
 To: Takeshi Yamamuro ; "Kumar, Abhishek (US - 
Bengaluru)"  
Cc: "user@spark.apache.org" 
 Sent: Monday, May 30, 2016 7:07 AM
 Subject: Re: GraphX Java API
   
No, you can call any Scala API in Java. It is somewhat less convenient if the 
method was not written with Java in mind but does work. 

On Mon, May 30, 2016, 00:32 Takeshi Yamamuro  wrote:

These packages are used only for Scala.
On Mon, May 30, 2016 at 2:23 PM, Kumar, Abhishek (US - Bengaluru) 
 wrote:

Hey,
I see some graphx packages listed here:
http://spark.apache.org/docs/latest/api/java/index.html
·  org.apache.spark.graphx
·  org.apache.spark.graphx.impl
·  org.apache.spark.graphx.lib
·  org.apache.spark.graphx.util
Aren't they meant to be used with JAVA?
Thanks

From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) ; user@spark.apache.org
Subject: RE: GraphX Java API

GraphX APIs are available only in Scala. If you need to use GraphX you need to switch to Scala.

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,
We are trying to consume the Java API for GraphX, but there is no documentation available online on the usage or examples. It would be great if we could get some examples in Java.
Thanks and regards,
Abhishek Kumar



-- 
---
Takeshi Yamamuro



  

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Michael Malak
At first glance, it looks like the only streaming data sources available out of 
the box from the github master branch are 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 and 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala
 . Out of the Jira epic for Structured Streaming 
https://issues.apache.org/jira/browse/SPARK-8360 it would seem the still-open 
https://issues.apache.org/jira/browse/SPARK-10815 "API design: data sources and 
sinks" is relevant here.
In short, it would seem the code is not there yet to create a Kafka-fed 
DataFrame/Dataset that can be queried with Structured Streaming; or if it is, 
it's not obvious how to write such code.
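
For what it's worth, here is a sketch of driving the file source into the memory sink using the DataStreamReader/DataStreamWriter API as it settled in the released 2.0 line; method names may differ from the master branch of the time, and the input path is just an illustrative placeholder:

import org.apache.spark.sql.SparkSession

// Hedged sketch against the DataStreamReader/Writer API as it settled in the
// released 2.0 line (names may differ from the master branch of the time).
// Only the file-based source is shown, since that is what is available out of the box.
val spark = SparkSession.builder
  .master("local[*]")               // local mode, just for the sketch
  .appName("structured-file-stream")
  .getOrCreate()

val lines = spark.readStream
  .format("text")                   // FileStreamSource under the covers
  .load("/tmp/incoming")            // watched directory; illustrative path

val query = lines.writeStream
  .format("memory")                 // in-memory sink, queryable with ordinary SQL
  .queryName("lines_table")
  .outputMode("append")
  .start()

// Once some batches have been processed, the sink can be queried ad hoc:
spark.sql("SELECT count(*) FROM lines_table").show()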

  From: Anthony May 
 To: Deepak Sharma ; Sunita Arvind 
 
Cc: "user@spark.apache.org" 
 Sent: Friday, May 6, 2016 11:50 AM
 Subject: Re: Adhoc queries on Spark 2.0 with Structured Streaming
   
Yeah, there isn't even an RC yet and no documentation, but you can work off the 
code base and test suites:
https://github.com/apache/spark
And this might help:
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/DataFrameReaderWriterSuite.scala

On Fri, 6 May 2016 at 11:07 Deepak Sharma  wrote:

Spark 2.0 is yet to come out for public release.
I am waiting to get my hands on it as well.
Please do let me know if I can download the source and build Spark 2.0 from GitHub.

Thanks
Deepak

On Fri, May 6, 2016 at 9:51 PM, Sunita Arvind  wrote:

Hi All,

We are evaluating a few real-time streaming query engines, and Spark is my 
personal choice. The addition of ad hoc queries is what is getting me further 
excited about it; however, the talks I have heard so far only mention it but do 
not provide details. I need to build a prototype to ensure it works for our use cases.

Can someone point me to relevant material for this?

regards
Sunita




-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


  

Re: Spark 2.0 forthcoming features

2016-04-20 Thread Michael Malak
http://go.databricks.com/apache-spark-2.0-presented-by-databricks-co-founder-reynold-xin



  From: Sourav Mazumder 
 To: user  
 Sent: Wednesday, April 20, 2016 11:07 AM
 Subject: Spark 2.0 forthcoming features
   
Hi All,

Is there somewhere we can get an idea of the upcoming features in Spark 2.0?

I got a list for Spark ML from here 
https://issues.apache.org/jira/browse/SPARK-12626.

Are there other links where I can see similar enhancements planned for Spark SQL, 
Spark Core, Spark Streaming, GraphX, etc.?

Thanks in advance.

Regards,
Sourav


   

Re: Apache Flink

2016-04-17 Thread Michael Malak
As with all history, "what if"s are not scientifically testable hypotheses, but 
my speculation is that it comes down to the energy (VCs, startups, big Internet 
companies, universities) within Silicon Valley as contrasted with Germany.

  From: Mich Talebzadeh <mich.talebza...@gmail.com>
 To: Michael Malak <michaelma...@yahoo.com>; "user @spark" 
<user@spark.apache.org> 
 Sent: Sunday, April 17, 2016 3:55 PM
 Subject: Re: Apache Flink
   
Assuming that both Spark and Flink are contemporaries, what are the reasons that 
Flink has not been adopted widely? (This may sound obvious and/or prejudged.) I 
mean, Spark has surged in popularity in the past year, if I am correct.
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 22:49, Michael Malak <michaelma...@yahoo.com> wrote:

In terms of publication date, a paper on Nephele was published in 2009, prior 
to the 2010 USENIX paper on Spark. Nephele is the execution engine of 
Stratosphere, which became Flink.

  From: Mark Hamstra <m...@clearstorydata.com>
 To: Mich Talebzadeh <mich.talebza...@gmail.com> 
Cc: Corey Nolet <cjno...@gmail.com>; "user @spark" <user@spark.apache.org>
 Sent: Sunday, April 17, 2016 3:30 PM
 Subject: Re: Apache Flink
  
To be fair, the Stratosphere project from which Flink springs was started as a 
collaborative university research project in Germany about the same time that 
Spark was first released as Open Source, so they are near contemporaries rather 
than Flink having been started only well after Spark was an established and 
widely-used Apache project.
On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:

Also, it always amazes me why there are so many tangential projects in the Big Data 
space. Would it not be easier if efforts were spent on adding to Spark 
functionality rather than creating a new product like Flink?
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 21:08, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Thanks Corey for the useful info.
I have used Sybase Aleri and StreamBase as commercial CEP engines. However, 
there does not seem to be anything close to these products in the Hadoop ecosystem. 
So I guess there is nothing there?
Regards.

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 20:43, Corey Nolet <cjno...@gmail.com> wrote:

I have not been intrigued at all by the microbatching concept in Spark. I am 
used to CEP in real streams processing environments like Infosphere Streams & 
Storm where the granularity of processing is at the level of each individual 
tuple and processing units (workers) can react immediately to events being 
received and processed. The closest Spark streaming comes to this concept is 
the notion of "state" that that can be updated via the "updateStateBykey()" 
functions which are only able to be run in a microbatch. Looking at the 
expected design changes to Spark Streaming in Spark 2.0.0, it also does not 
look like tuple-at-a-time processing is on the radar for Spark, though I have 
seen articles stating that more effort is going to go into the Spark SQL layer 
in Spark streaming which may make it more reminiscent of Esper.
For these reasons, I have not even tried to implement CEP in Spark. I feel it's 
a waste of time without immediate tuple-at-a-time processing. Without this, 
they avoid the whole problem of "back pressure" (though keep in mind, it is 
still very possible to overload the Spark streaming layer with stages that will 
continue to pile up and never get worked off) but they lose the granular 
control that you get in CEP environments by allowing the rules & processors to 
react with the receipt of each tuple, right away. 
Awhile back, I did attempt to implement an InfoSphere Streams-like API [1] on 
top of Apache Storm as an example of what such a design may look like. It looks 
like Storm is going to be replaced in the not so distant future by Twitter's 
new design called Heron. IIRC, Heron does not have an open source 
implementation as of yet. 
[1] https://github.com/calrissian/flowmix
On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:

Hi Corey,
Can you please point me to docs on using Spark for CEP? Do we have a set of CEP 
libraries somewhere. I am keen on getting hold of adaptor libraries for Spark 
something like below


​Thanks

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 16:07, Corey Nolet <cjno...@gma

Re: Apache Flink

2016-04-17 Thread Michael Malak
There have been commercial CEP solutions for decades, including from my 
employer.

  From: Mich Talebzadeh 
 To: Mark Hamstra  
Cc: Corey Nolet ; "user @spark" 
 Sent: Sunday, April 17, 2016 3:48 PM
 Subject: Re: Apache Flink
   
The problem is that the strength and wider acceptance of a typical Open source 
project is its sizeable user and development community. When the community is 
small like Flink, then it is not a viable solution to adopt.
I am rather disappointed that no big data project can be used for Complex Event 
Processing as it has wider use in Algorithmic trading among others.

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 22:30, Mark Hamstra  wrote:

To be fair, the Stratosphere project from which Flink springs was started as a 
collaborative university research project in Germany about the same time that 
Spark was first released as Open Source, so they are near contemporaries rather 
than Flink having been started only well after Spark was an established and 
widely-used Apache project.
On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh  
wrote:

Also, it always amazes me why there are so many tangential projects in the Big Data 
space. Would it not be easier if efforts were spent on adding to Spark 
functionality rather than creating a new product like Flink?
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 21:08, Mich Talebzadeh  wrote:

Thanks Corey for the useful info.
I have used Sybase Aleri and StreamBase as commercial CEP engines. However, 
there does not seem to be anything close to these products in the Hadoop ecosystem. 
So I guess there is nothing there?
Regards.

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 20:43, Corey Nolet  wrote:

I have not been intrigued at all by the microbatching concept in Spark. I am 
used to CEP in real streams processing environments like Infosphere Streams & 
Storm where the granularity of processing is at the level of each individual 
tuple and processing units (workers) can react immediately to events being 
received and processed. The closest Spark streaming comes to this concept is 
the notion of "state" that that can be updated via the "updateStateBykey()" 
functions which are only able to be run in a microbatch. Looking at the 
expected design changes to Spark Streaming in Spark 2.0.0, it also does not 
look like tuple-at-a-time processing is on the radar for Spark, though I have 
seen articles stating that more effort is going to go into the Spark SQL layer 
in Spark streaming which may make it more reminiscent of Esper.
For these reasons, I have not even tried to implement CEP in Spark. I feel it's 
a waste of time without immediate tuple-at-a-time processing. Without this, 
they avoid the whole problem of "back pressure" (though keep in mind, it is 
still very possible to overload the Spark streaming layer with stages that will 
continue to pile up and never get worked off) but they lose the granular 
control that you get in CEP environments by allowing the rules & processors to 
react with the receipt of each tuple, right away. 
Awhile back, I did attempt to implement an InfoSphere Streams-like API [1] on 
top of Apache Storm as an example of what such a design may look like. It looks 
like Storm is going to be replaced in the not so distant future by Twitter's 
new design called Heron. IIRC, Heron does not have an open source 
implementation as of yet. 
[1] https://github.com/calrissian/flowmix
On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh  
wrote:

Hi Corey,
Can you please point me to docs on using Spark for CEP? Do we have a set of CEP 
libraries somewhere. I am keen on getting hold of adaptor libraries for Spark 
something like below


​Thanks

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 16:07, Corey Nolet  wrote:

One thing I've noticed about Flink in my following of the project has been that 
it has established, in a few cases, some novel ideas and improvements over 
Spark. The problem with it, however, is that both the development team and the 
community around it are very small and many of those novel improvements have 
been rolled directly into Spark in subsequent versions. I was considering 
changing over my architecture to Flink at one point to get better, more 
real-time CEP streaming support, but in the end I 

Re: Apache Flink

2016-04-17 Thread Michael Malak
In terms of publication date, a paper on Nephele was published in 2009, prior 
to the 2010 USENIX paper on Spark. Nephele is the execution engine of 
Stratosphere, which became Flink.

  From: Mark Hamstra 
 To: Mich Talebzadeh  
Cc: Corey Nolet ; "user @spark" 
 Sent: Sunday, April 17, 2016 3:30 PM
 Subject: Re: Apache Flink
   
To be fair, the Stratosphere project from which Flink springs was started as a 
collaborative university research project in Germany about the same time that 
Spark was first released as Open Source, so they are near contemporaries rather 
than Flink having been started only well after Spark was an established and 
widely-used Apache project.
On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh  
wrote:

Also, it always amazes me why there are so many tangential projects in the Big Data 
space. Would it not be easier if efforts were spent on adding to Spark 
functionality rather than creating a new product like Flink?
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 21:08, Mich Talebzadeh  wrote:

Thanks Corey for the useful info.
I have used Sybase Aleri and StreamBase as commercial CEP engines. However, 
there does not seem to be anything close to these products in the Hadoop ecosystem. 
So I guess there is nothing there?
Regards.

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 20:43, Corey Nolet  wrote:

I have not been intrigued at all by the microbatching concept in Spark. I am 
used to CEP in real streams processing environments like Infosphere Streams & 
Storm where the granularity of processing is at the level of each individual 
tuple and processing units (workers) can react immediately to events being 
received and processed. The closest Spark streaming comes to this concept is 
the notion of "state" that that can be updated via the "updateStateBykey()" 
functions which are only able to be run in a microbatch. Looking at the 
expected design changes to Spark Streaming in Spark 2.0.0, it also does not 
look like tuple-at-a-time processing is on the radar for Spark, though I have 
seen articles stating that more effort is going to go into the Spark SQL layer 
in Spark streaming which may make it more reminiscent of Esper.
For these reasons, I have not even tried to implement CEP in Spark. I feel it's 
a waste of time without immediate tuple-at-a-time processing. Without this, 
they avoid the whole problem of "back pressure" (though keep in mind, it is 
still very possible to overload the Spark streaming layer with stages that will 
continue to pile up and never get worked off) but they lose the granular 
control that you get in CEP environments by allowing the rules & processors to 
react with the receipt of each tuple, right away. 
Awhile back, I did attempt to implement an InfoSphere Streams-like API [1] on 
top of Apache Storm as an example of what such a design may look like. It looks 
like Storm is going to be replaced in the not so distant future by Twitter's 
new design called Heron. IIRC, Heron does not have an open source 
implementation as of yet. 
[1] https://github.com/calrissian/flowmix
On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh  
wrote:

Hi Corey,
Can you please point me to docs on using Spark for CEP? Do we have a set of CEP 
libraries somewhere. I am keen on getting hold of adaptor libraries for Spark 
something like below


​Thanks

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 16:07, Corey Nolet  wrote:

One thing I've noticed about Flink in my following of the project has been that 
it has established, in a few cases, some novel ideas and improvements over 
Spark. The problem with it, however, is that both the development team and the 
community around it are very small and many of those novel improvements have 
been rolled directly into Spark in subsequent versions. I was considering 
changing over my architecture to Flink at one point to get better, more 
real-time CEP streaming support, but in the end I decided to stick with Spark 
and just watch Flink continue to pressure it into improvement.
On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers  wrote:

I never found much info that Flink was actually designed to be fault tolerant. 
If fault tolerance is more of a bolt-on/add-on/afterthought, then that doesn't bode 
well for large-scale data processing. Spark was designed with fault tolerance 
in mind from the beginning.

On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh 

Re: Spark with Druid

2016-03-23 Thread Michael Malak
Will Spark 2.0 Structured Streaming obviate some of the Druid/Spark use cases?

  From: Raymond Honderdors 
 To: "yuzhih...@gmail.com"  
Cc: "user@spark.apache.org" 
 Sent: Wednesday, March 23, 2016 8:43 AM
 Subject: Re: Spark with Druid
   
I saw these, but I fail to understand how to direct the code to use the index 
JSON.
Sent from Outlook Mobile



On Wed, Mar 23, 2016 at 7:19 AM -0700, "Ted Yu"  wrote:

Please see:
https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani

which references https://github.com/SparklineData/spark-druid-olap
On Wed, Mar 23, 2016 at 5:59 AM, Raymond Honderdors 
 wrote:

Does anyone have a good overview on how to integrate Spark and Druid? I am now 
struggling with the creation of a Druid data source in Spark.

Raymond Honderdors
Team Lead Analytics BI
Business Intelligence developer
raymond.honderd...@sizmek.com
t +972.7325.3569
Herzliya 



  

Re: Build k-NN graph for large dataset

2015-08-26 Thread Michael Malak
Yes. And a paper that describes using grids (actually varying grids) is 
http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf
 In the Spark GraphX In Action book that Robin East and I are writing, we 
implement a drastically simplified version of this in chapter 7, which should 
become available in the MEAP mid-September. 
http://www.manning.com/books/spark-graphx-in-action
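
The chapter 7 listing aside, the blocking idea itself can be sketched in a few lines: bucket the points into coarse grid cells and only run the quadratic comparison within a cell. This toy version is 2-D and ignores cell boundaries, which a real implementation (or the varying-grid approach in the paper) must handle:

import org.apache.spark.rdd.RDD

// Toy sketch of the blocking idea (not the chapter 7 listing): bucket 2-D points
// into coarse grid cells and only run the quadratic comparison within a cell.
// A real implementation must also consider neighboring cells (or overlapping,
// varying grids as in the paper) so neighbors near a cell boundary are not missed.
def approxKnn(points: RDD[(Long, (Double, Double))], cellSize: Double, k: Int)
    : RDD[(Long, Seq[(Long, Double)])] = {

  def dist(a: (Double, Double), b: (Double, Double)): Double =
    math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))

  // Key each point by the grid cell it falls into.
  val byCell = points.map { case (id, p) =>
    ((math.floor(p._1 / cellSize).toLong, math.floor(p._2 / cellSize).toLong), (id, p))
  }

  // Within each cell, brute-force the k nearest neighbors.
  byCell.groupByKey().flatMap { case (_, cellPoints) =>
    val arr = cellPoints.toArray
    arr.map { case (id, p) =>
      val neighbors = arr.filter(_._1 != id)
        .map { case (otherId, q) => (otherId, dist(p, q)) }
        .sortBy(_._2)
        .take(k)
        .toSeq
      (id, neighbors)
    }
  }
}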

  From: Kristina Rogale Plazonic kpl...@gmail.com
 To: Jaonary Rabarisoa jaon...@gmail.com 
Cc: user user@spark.apache.org 
 Sent: Wednesday, August 26, 2015 7:24 AM
 Subject: Re: Build k-NN graph for large dataset
   
If you don't want to compute all N^2 similarities, you need to implement some 
kind of blocking first. For example, LSH (locality-sensitive hashing). A quick 
search gave this link to a Spark implementation: 
http://stackoverflow.com/questions/2771/spark-implementation-for-locality-sensitive-hashing


On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

Dear all,
I'm trying to find an efficient way to build a k-NN graph for a large dataset. 
Precisely, I have a large set of high-dimensional vectors (say d  1), and 
I want to build a graph where those high-dimensional points are the vertices 
and each one is linked to its k nearest neighbors based on some kind of similarity 
defined on the vertex space. 
My problem is to implement an efficient algorithm to compute the weight matrix 
of the graph. I need to compute N*N similarities, and the only way I know is 
to use a cartesian operation followed by a map operation on the RDD. But this is 
very slow when N is large. Is there a more clever way to do this for an 
arbitrary similarity function? 
Cheers,
Jao



  

RE: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Michael Malak
I would also add, from a data locality theoretic standpoint, mapPartitions() 
provides for node-local computation that plain old map-reduce does not.
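
An illustrative sketch of that point: with mapPartitions(), per-partition setup and a partition-local pre-aggregation run on the node that holds the data, and only the much smaller partial results are shuffled (the word-count framing here is just for illustration):

import org.apache.spark.rdd.RDD

// With mapPartitions(), per-partition state is built once on the node holding
// the data, and only the (much smaller) partial counts are shuffled for the
// final reduceByKey.
def countTokens(lines: RDD[String]): RDD[(String, Int)] =
  lines.mapPartitions { iter =>
    val localCounts = scala.collection.mutable.HashMap.empty[String, Int]  // per-partition state
    iter.foreach { line =>
      line.split("\\s+").foreach { tok =>
        localCounts(tok) = localCounts.getOrElse(tok, 0) + 1
      }
    }
    localCounts.iterator          // emit node-local partial counts
  }.reduceByKey(_ + _)            // combine partials across nodes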


From my Android phone on T-Mobile. The first nationwide 4G network.

 Original message 
From: Ashic Mahtab as...@live.com 
Date: 06/28/2015  10:51 AM  (GMT-07:00) 
To: YaoPau jonrgr...@gmail.com,Apache Spark user@spark.apache.org 
Subject: RE: What does "Spark is not just MapReduce" mean? Isn't every Spark 
job a form of MapReduce? 
 
Spark comes with quite a few components. At its core is... surprise... Spark Core. 
This provides the core things required to run Spark jobs. Spark provides 
a lot of operators out of the box... take a look at 
https://spark.apache.org/docs/latest/programming-guide.html#transformations
https://spark.apache.org/docs/latest/programming-guide.html#actions

While all of them can be implemented with variations of rdd.map().reduce(), 
there are optimisations to be gained in terms of data locality, etc., and the 
additional operators simply make life simpler.

In addition to the core stuff, Spark also brings things like Spark Streaming, 
Spark SQL and DataFrames, MLlib, GraphX, etc. Spark Streaming gives you 
microbatches of RDDs at periodic intervals. Think "give me the last 15 seconds 
of events every 5 seconds." You can then program towards the small collection, 
and the job will run in a fault-tolerant manner on your cluster. Spark SQL 
provides Hive-like functionality that works nicely with various data sources 
and RDDs. MLlib provides a lot of out-of-the-box machine learning algorithms, and the new 
Spark ML project provides a nice elegant pipeline API to take care of a lot of 
common machine learning tasks. GraphX allows you to represent data in graphs, 
and run graph algorithms on it. e.g. you can represent your data as RDDs of 
vertexes and edges, and run PageRank on a distributed cluster.

And there's more. So, yeah... Spark is definitely not just MapReduce. :)

 Date: Sun, 28 Jun 2015 09:13:18 -0700
 From: jonrgr...@gmail.com
 To: user@spark.apache.org
 Subject: What does "Spark is not just MapReduce" mean? Isn't every Spark job 
  a form of MapReduce?
 
 I've heard "Spark is not just MapReduce" mentioned during Spark talks, but it
 seems like every method that Spark has is really doing something like (Map
 -> Reduce) or (Map -> Map -> Map -> Reduce) etc. behind the scenes, with the
 performance benefit of keeping RDDs in memory between stages.
 
 Am I wrong about that? Is Spark doing anything more efficiently than a
 series of Maps followed by a Reduce in memory? What methods does Spark have
 that can't easily be mapped (with somewhat similar efficiency) to Map and
 Reduce in memory?
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/What-does-Spark-is-not-just-MapReduce-mean-Isn-t-every-Spark-job-a-form-of-MapReduce-tp23518.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread Michael Malak
http://www.datascienceassn.org/content/making-sense-making-sense-performance-data-analytics-frameworks
  
  From: bit1...@163.com bit1...@163.com
 To: user user@spark.apache.org 
 Sent: Monday, April 27, 2015 8:33 PM
 Subject: Why Spark is much faster than Hadoop MapReduce even on disk
   
Hi,
I am frequently asked why Spark is also much faster than Hadoop MapReduce on 
disk (without the use of memory cache). I have no convincing answer for this 
question; could you guys elaborate on this? Thanks!



  

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Michael Malak
You could have your receiver send a magic value when it is done. I discuss 
this Spark Streaming pattern in my presentation "Spark Gotchas and 
Anti-Patterns." In the PDF version, it's slides 34-36.
http://www.datascienceassn.org/content/2014-11-05-spark-gotchas-and-anti-patterns-julia-language

YouTube version cued to that place: 
http://www.youtube.com/watch?v=W5Uece_JmNs&t=23m18s
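
A rough sketch of the pattern (not the slides' exact code): the custom receiver stores one agreed-upon sentinel record after its last real record, and the driver runs the final action only in the batch where the sentinel shows up. The sentinel value and helper name below are assumptions for illustration.

import org.apache.spark.streaming.dstream.DStream

// The receiver stores Sentinel after its last real record; the driver checks
// each batch for it and only then triggers the one-time final action.
val Sentinel = "<<EOF>>"

def runOnceOnCompletion(stream: DStream[String])(finalAction: () => Unit): Unit =
  stream.foreachRDD { rdd =>
    val done = rdd.filter(_ == Sentinel).count() > 0   // did this batch carry the marker?
    if (done) finalAction()                            // runs on the driver, once the receiver signals completion
  }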
  From: Hari Polisetty hpoli...@icloud.com
 To: Tathagata Das t...@databricks.com 
Cc: user user@spark.apache.org 
 Sent: Monday, April 6, 2015 2:02 PM
 Subject: Re: How to restrict foreach on a streaming RDD only once upon 
receiver completion
   
Yes, I’m using updateStateByKey and it works. But then I need to perform 
further computation on this Stateful RDD (see code snippet below). I perform 
forEach on the final RDD and get the top 10 records. I just don’t want the 
foreach to be performed every time a new batch is received. Only when the 
receiver is done fetching all the records.
My requirements are to programmatically invoke the E.S query (it varies by 
usecase) , get all the records and apply certain transformations and get the 
top 10 results based on certain criteria back into the driver program for 
further processing. I’m able to apply the transformations on the batches of 
records fetched from E.S  using streaming. So, I don’t need to wait for all the 
records to be fetched. The RDD transformations are happening all the time and 
the top k results are getting updated constantly until all the records are 
fetched by the receiver. Is there any drawback with this approach?
Can you give more pointers on what you mean by creating a custom RDD that reads 
from ElasticSearch? 
Here is the relevant portion of my Spark streaming code:
//Create a custom streaming receiver to query for relevant data from E.S
JavaReceiverInputDStream<String> jsonStrings = ssc.receiverStream(
    new ElasticSearchResponseReceiver(query…….));

//Apply JSON Paths to extract specific value(s) from each record
JavaDStream<String> fieldVariations = jsonStrings.flatMap(
    new FlatMapFunction<String, String>() {
      private static final long serialVersionUID = 465237345751948L;

      @Override
      public Iterable<String> call(String jsonString) {
        List<String> r = JsonPath.read(jsonString, attributeDetail.getJsonPath());
        return r;
      }
    });

//Perform a stateful map reduce on each variation
JavaPairDStream<String, Integer> fieldVariationCounts = fieldVariations.mapToPair(
    new PairFunction<String, String, Integer>() {
      private static final long serialVersionUID = -1241276515559408238L;

      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    }).updateStateByKey(new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
      private static final long serialVersionUID = 7598681835161199865L;

      public Optional<Integer> call(List<Integer> nums, Optional<Integer> current) {
        Integer sum = current.or((int) 0L);
        return (Optional<Integer>) Optional.of(sum + nums.size());
      }
    }).reduceByKey(new Function2<Integer, Integer, Integer>() {
      private static final long serialVersionUID = -5906059838295609562L;

      @Override
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });

//Swap the Map from <Enum String, Int> to <Int, Enum String>. This is so that we can sort on frequencies
JavaPairDStream<Integer, String> swappedPair = fieldVariationCounts.mapToPair(
    new PairFunction<Tuple2<String, Integer>, Integer, String>() {
      private static final long serialVersionUID = -5889774695187619957L;

      @Override
      public Tuple2<Integer, String> call(Tuple2<String, Integer> item) throws Exception {
        return item.swap();
      }
    });

//Sort based on Key i.e., frequency
JavaPairDStream<Integer, String> sortedCounts = swappedPair.transformToPair(
    new Function<JavaPairRDD<Integer, String>, JavaPairRDD<Integer, String>>() {
      private static final long serialVersionUID = -4172702039963232779L;

      public JavaPairRDD<Integer, String> call(JavaPairRDD<Integer, String> in) throws Exception {
        //False to denote sort in descending order
        return in.sortByKey(false);
      }
    });

//Iterate through the RDD and get the top 20 values in the sorted pair and write to results list
sortedCounts.foreach(new Function<JavaPairRDD<Integer, String>, Void>() {
  private static final long serialVersionUID = 2186144129973051920L;

  public Void call(JavaPairRDD<Integer, String> rdd) {
    resultList.clear();
    for (Tuple2<Integer, String> t : rdd.take(MainDriver.NUMBER_OF_TOP_VARIATIONS)) {
      resultList.add(new Tuple3<String, Integer, Double>(
          t._2(), t._1(), (double) (100 * t._1()) / totalProcessed.value()));
    }
    return null;
  }
});



On Apr 7, 2015, at 1:14 AM, Tathagata Das t...@databricks.com wrote:
So you want to sort based on the total count of all the records received 
through the receiver? In that case, you have to combine all the counts using 
updateStateByKey 
(https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala)
 But stepping back, if you want to 

Spark GraphX In Action on documentation page?

2015-03-24 Thread Michael Malak
Can my new book, Spark GraphX In Action, which is currently in MEAP 
http://manning.com/malak/, be added to 
https://spark.apache.org/documentation.html and, if appropriate, to 
https://spark.apache.org/graphx/ ?

Michael Malak

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark challenge: zip with next???

2015-01-30 Thread Michael Malak
But isn't foldLeft() overkill for the originally stated use case of max diff of 
adjacent pairs? Isn't foldLeft() for recursive non-commutative non-associative 
accumulation as opposed to an embarrassingly parallel operation such as this 
one?
This use case reminds me of FIR filtering in DSP. It seems that RDDs could use 
something that serves the same purpose as scala.collection.Iterator.sliding.
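
Until something like sliding exists on RDDs, a minimal "zip with next" can be faked with zipWithIndex plus a self-join, which avoids the partition-boundary bookkeeping of a per-partition sliding window at the cost of a shuffle; a rough sketch, assuming the RDD is already in the desired order:

import org.apache.spark.rdd.RDD

// Index the elements, shift the index by one, and self-join.
// zipWithIndex preserves the existing RDD order.
def adjacentDiffs(values: RDD[Double]): RDD[Double] = {
  val indexed = values.zipWithIndex().map { case (v, i) => (i, v) }
  val shifted = indexed.map { case (i, v) => (i - 1, v) }   // element i+1, re-keyed as i
  indexed.join(shifted)                                     // (i, (x_i, x_{i+1}))
    .map { case (_, (a, b)) => b - a }
}

The max diff of adjacent pairs is then just adjacentDiffs(rdd).max().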
  From: Koert Kuipers ko...@tresata.com
 To: Mohit Jaggi mohitja...@gmail.com 
Cc: Tobias Pfeiffer t...@preferred.jp; Ganelin, Ilya 
ilya.gane...@capitalone.com; derrickburns derrickrbu...@gmail.com; 
user@spark.apache.org user@spark.apache.org 
 Sent: Friday, January 30, 2015 7:11 AM
 Subject: Re: spark challenge: zip with next???
   
Assuming the data can be partitioned, then you have many time series for which 
you want to detect potential gaps. Also assuming the resulting gaps info per 
time series is much smaller data than the time series data itself, then this is a 
classical example to me of a sorted (streaming) foldLeft, requiring an 
efficient secondary sort in the Spark shuffle. I am trying to get that into 
Spark here:
https://issues.apache.org/jira/browse/SPARK-3655



On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi mohitja...@gmail.com wrote:

http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
you can use the MLLib function or do the following (which is what I had done):
- In the first pass over the data, using mapPartitionsWithIndex, gather the first 
item in each partition. You can use collect (or an aggregator) for this. "Key" 
them by the partition index. At the end, you will have a map: (partition 
index) --> first item.
- In the second pass over the data, using mapPartitionsWithIndex again, look at 
two items at a time (or in the general case N items at a time; you can use 
Scala's sliding iterator) and check the time difference (or any sliding-window 
computation). To this mapPartitions call, pass the map created in the previous 
step. You will need it to check the last item in this partition.
If you can tolerate a few inaccuracies then you can just do the second step. 
You will miss the “boundaries” of the partitions but it might be acceptable for 
your use case.



On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,

On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya ilya.gane...@capitalone.com 
wrote:

Make a copy of your RDD with an extra entry in the beginning to offset. Then you 
can zip the two RDDs and run a map to generate an RDD of differences.


Does that work? I recently tried something to compute differences between each 
entry and the next, so I did
  val rdd1 = ... // null element + rdd
  val rdd2 = ... // rdd + null element
but got an error message about zip requiring data sizes in each partition to match.
Tobias






  

Re: Rdd of Rdds

2014-10-22 Thread Michael Malak
On Wednesday, October 22, 2014 9:06 AM, Sean Owen so...@cloudera.com wrote:

 No, there's no such thing as an RDD of RDDs in Spark.
 Here though, why not just operate on an RDD of Lists? or a List of RDDs?
 Usually one of these two is the right approach whenever you feel
 inclined to operate on an RDD of RDDs.


Depending on one's needs, one could also consider the matrix (RDD[Vector]) 
operations provided by MLLib, such as
https://spark.apache.org/docs/latest/mllib-statistics.html
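
A small sketch of that alternative: instead of nesting RDDs, represent each inner collection as a Vector row and use MLlib's column-wise statistics over the resulting RDD[Vector] "matrix" (assuming the nested collections can be flattened into fixed-width rows):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Each inner collection becomes one Vector row; colStats then summarizes columns.
def columnMeans(sc: SparkContext, rows: Seq[Array[Double]]) = {
  val matrix = sc.parallelize(rows.map(r => Vectors.dense(r)))
  Statistics.colStats(matrix).mean   // one mean per column
}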

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: UpdateStateByKey - How to improve performance?

2014-08-06 Thread Michael Malak
Depending on the density of your keys, the alternative signature

def updateStateByKey[S](updateFunc: (Iterator[(K, Seq[V], Option[S])]) => 
Iterator[(K, S)], partitioner: Partitioner, rememberPartitioner: 
Boolean)(implicit arg0: ClassTag[S]): DStream[(K, S)] 

at least iterates by key rather than by (old) value.

I believe your thinking is correct that there might be a performance 
improvement opportunity for your case if there were an updateStateByKey() that 
instead iterated by (new) value.

BTW, my impression from the stock examples is that the signature I pasted above 
was intended to be the more typically called updateStateByKey(), as opposed to 
the one you pasted, for which my impression is that it is the more general 
purpose one. I have used the more general purpose one but only when I needed to 
peek into the entire set of states for some unusual reason.
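
A hedged usage sketch of that iterator-based overload, with a running count as the state purely for illustration; the point is that the update function iterates by key and can cheaply skip keys that received no new values in the batch:

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// One (key, newValues, oldState) triple per key; untouched keys are passed through cheaply.
def runningCounts(pairs: DStream[(String, Int)], numPartitions: Int): DStream[(String, Int)] =
  pairs.updateStateByKey[Int](
    (iter: Iterator[(String, Seq[Int], Option[Int])]) => iter.map {
      case (key, newValues, state) =>
        if (newValues.isEmpty) (key, state.getOrElse(0))    // untouched key: carry state forward
        else (key, state.getOrElse(0) + newValues.sum)      // updated key: fold in the new values
    },
    new HashPartitioner(numPartitions),
    rememberPartitioner = true
  )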



On Wednesday, August 6, 2014 2:30 PM, Venkat Subramanian vsubr...@gmail.com 
wrote:
 


The method

def updateStateByKey[S: ClassTag] ( updateFunc: (Seq[V], Option[S]) =>
Option[S] ): DStream[(K, S)]

takes a DStream[(K, V)] and produces a DStream[(K, S)] in Spark Streaming.

We have an input DStream[(K, V)] that has 40,000 elements. We update on average
1,000 of them in every 3-second batch, but based on how this
updateStateByKey function is defined, we are looping through 40,000 elements
(Seq[V]) to make an update for just 1000 elements and not updating 39000
elements. I think looping through extra 39000 elements is a waste of
performance.

Isn't there a better way to update this efficiently by just figuring out a
hash map for the 1,000 elements that are required to be updated and just
updating it (without looping through the unwanted elements)? Shouldn't
there be a Streaming update function provided that updates selective members
or are we missing some concepts here?

I think updateStateByKey may be causing lot of performance degradation in
our app as we keep doing this again and again for every batch. Please let us
know if my thought process is correct here.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/UpdateStateByKey-How-to-improve-performance-tp11575.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread Michael Malak
It's really more of a Scala question than a Spark question, but the standard OO 
(not Scala-specific) way is to create your own custom supertype (e.g. 
MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD 
and MyArray), each of which manually forwards method calls to the corresponding 
pre-existing library implementations. Writing all those forwarding method calls 
is tedious, but Scala provides at least one bit of syntactic sugar, which 
alleviates having to type in twice the parameter lists for each method:
http://stackoverflow.com/questions/8230831/is-method-parameter-forwarding-possible-in-scala

I'm not seeing a way to utilize implicit conversions in this case. Since Scala 
is statically (albeit inferred) typed, I don't see a way around having a common 
supertype.
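
A bare-bones sketch of the forwarding approach, with made-up names and only map() forwarded, just to show the shape of it:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Custom supertype plus two concrete wrappers that forward to the underlying
// implementations (RDD vs. local Seq).
trait MyCollection[A] {
  def map[B: ClassTag](f: A => B): MyCollection[B]
}

class MyRddCollection[A](rdd: RDD[A]) extends MyCollection[A] {
  def map[B: ClassTag](f: A => B): MyCollection[B] = new MyRddCollection(rdd.map(f))
}

class MyLocalCollection[A](xs: Seq[A]) extends MyCollection[A] {
  def map[B: ClassTag](f: A => B): MyCollection[B] = new MyLocalCollection(xs.map(f))
}

// Code written against the trait runs unchanged on either backing store:
def trimAll(data: MyCollection[Array[String]]): MyCollection[Array[String]] =
  data.map(_.map(_.trim))

Every additional operation has to be forwarded the same way, which is where the tedium (and the parameter-forwarding sugar in the Stack Overflow link) comes in.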



On Monday, July 21, 2014 11:01 AM, Philip Ogren philip.og...@oracle.com wrote:
It is really nice that Spark RDDs provide functions that are often 
equivalent to functions found in Scala collections.  For example, I can 
call:

myArray.map(myFx)

and equivalently

myRdd.map(myFx)

Awesome!

My question is this.  Is it possible to write code that works on either 
an RDD or a local collection without having to have parallel 
implementations?  I can't tell that RDD or Array share any supertypes or 
traits by looking at the respective scaladocs. Perhaps implicit 
conversions could be used here.  What I would like to do is have a 
single function whose body is like this:

myData.map(myFx)

where myData could be an RDD[Array[String]] (for example) or an 
Array[Array[String]].

Has anyone had success doing this?

Thanks,
Philip


Re: parallel Reduce within a key

2014-06-20 Thread Michael Malak
How about a treeReduceByKey? :-)


On Friday, June 20, 2014 11:55 AM, DB Tsai dbt...@stanford.edu wrote:
 


Currently, the reduce operation combines the results from the mappers
sequentially, so it's O(n).

Xiangrui is working on treeReduce which is O(log(n)). Based on the
benchmark, it dramatically increase the performance. You can test the
code in his own branch.
https://github.com/apache/spark/pull/1110

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
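
As a usage sketch of the tree-shaped combine as it eventually shipped (RDD.treeReduce/treeAggregate in later releases), with a sum chosen purely for illustration:

import org.apache.spark.rdd.RDD

// depth controls how many intermediate merge levels run on the executors
// before the final result reaches the driver.
def distributedSum(values: RDD[Double]): Double =
  values.treeAggregate(0.0)(
    seqOp  = (acc, x) => acc + x,   // combine within a partition
    combOp = (a, b) => a + b,       // merge partial sums, arranged as a tree
    depth  = 2
  )

For the per-key case in the original question below, reduceByKey/aggregateByKey already do map-side combining in parallel on each node before the shuffle; it is the final key-less combine of reduce() that the tree-shaped variant parallelizes further.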



On Fri, Jun 20, 2014 at 6:57 AM, ansriniv ansri...@gmail.com wrote:
 Hi,

 I am on Spark 0.9.0

 I have a 2 node cluster (2 worker nodes) with 16 cores on each node (so, 32
 cores in the cluster).
 I have an input rdd with 64 partitions.

 I am running  sc.mapPartitions(...).reduce(...)

 I can see that I get full parallelism on the mapper (all my 32 cores are
 busy simultaneously). However, when it comes to reduce(), the outputs of the
 mappers are all reduced SERIALLY. Further, all the reduce processing happens
 only on 1 of the workers.

 I was expecting that the outputs of the 16 mappers on node 1 would be
 reduced in parallel in node 1 while the outputs of the 16 mappers on node 2
 would be reduced in parallel on node 2 and there would be 1 final inter-node
 reduce (of node 1 reduced result and node 2 reduced result).

 Isn't parallel reduce supported WITHIN a key (in this case I have no key) ?
 (I know that there is parallelism in reduce across keys)

 Best Regards
 Anand



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/parallel-Reduce-within-a-key-tp7998.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: rdd ordering gets scrambled

2014-05-28 Thread Michael Malak
Mohit Jaggi:

A workaround is to use zipWithIndex (to appear in Spark 1.0, but if you're 
still on 0.9x you can swipe the code from 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala
 ), map it to (x => (x._2, x._1)) and then sortByKey.
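
Spelled out as code, the workaround looks roughly like this (assuming Spark 1.0's zipWithIndex):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Remember each element's original position, carry it through the
// transformation, and sort on it at the end.
def keepOriginalOrder[A: ClassTag](rdd: RDD[A])(transform: A => A): RDD[A] =
  rdd.zipWithIndex()                          // (element, originalPosition)
     .map { case (x, i) => (i, transform(x)) }
     .sortByKey()                             // restore the input order
     .values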

Spark developers:

The lack of ordering guarantee for RDDs should be better documented, and the 
presence of a method called first() is a bit deceiving, in my opinion, if that 
same first element doesn't survive a map().
 


On Tuesday, April 29, 2014 3:45 PM, Mohit Jaggi mohitja...@gmail.com wrote:
 


Hi,
I started with a text file (CSV) of data sorted by the first column, parsed it 
into Scala objects using a map operation in Scala. Then I used more maps to add 
some extra info to the data and saved it as a text file.
The final text file is not sorted. What do I need to do to keep the order from 
the original input intact?

My code looks like:

csvFile = sc.textFile(..) //file is CSV and ordered by first column
splitRdd = csvFile map { line => line.split(",", -1) }
parsedRdd = splitRdd map { parts =>
  {
    key = parts(0) //use first column as key
    value = new MyObject(parts(0), parts(1)) //parse into scala objects
    (key, value)
  }
}

augmentedRdd = parsedRdd map { x =>
   key = x._1
   value = //add extra fields to x._2
   (key, value)
}
augmentedRdd.saveAsFile(...) //this file is not sorted

Mohit.

Re: Bug when zip with longs and too many partitions?

2014-05-12 Thread Michael Malak


I've discovered that it was noticed a year ago that RDD zip() does not work 
when the number of partitions does not evenly divide the total number of 
elements in the RDD:

https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ

I will enter a JIRA ticket just as soon as the ASF Jira system will let me 
reset my password.



On Sunday, May 11, 2014 4:40 AM, Michael Malak michaelma...@yahoo.com wrote:

Is this a bug?

scala> sc.parallelize(1 to 2,4).zip(sc.parallelize(11 to 12,4)).collect
res0: Array[(Int, Int)] = Array((1,11), (2,12))

scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
res1: Array[(Long, Int)] = Array((2,11))
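
Until that is fixed, one workaround is to avoid relying on zip()'s matching-partition requirement altogether and key both sides by position instead; a rough sketch (at the cost of a shuffle):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Key both sides by position and join, which does not require equal-sized partitions.
def safeZip[A: ClassTag, B: ClassTag](left: RDD[A], right: RDD[B]): RDD[(A, B)] =
  left.zipWithIndex().map(_.swap)
    .join(right.zipWithIndex().map(_.swap))
    .sortByKey()
    .values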


Re: Opinions stratosphere

2014-05-02 Thread Michael Malak
"looks like Spark outperforms Stratosphere fairly consistently in the 
experiments"

There was one exception the paper noted, which was when memory resources were 
constrained. In that case, Stratosphere seemed to have degraded more gracefully 
than Spark, but the author did not explore it deeper. The author did insert 
into his conclusion section, though, "However, in our experiments, for 
iterative algorithms, the Spark programs may show the poor results in 
performance in the environment of limited memory resources."

I recently blogged a fuller list of alternatives/competitors to Spark:
http://datascienceassn.org/content/alternatives-spark-memory-distributed-computing

 
On Friday, May 2, 2014 10:39 AM, Philip Ogren philip.og...@oracle.com wrote:
 
Great reference!  I just skimmed through the results without reading much of 
the methodology - but it looks like Spark outperforms Stratosphere fairly 
consistently in the experiments.  It's too bad the data sources only range from 
2GB to 8GB.  Who knows if the apparent pattern would extend out to 64GB, 128GB, 
1TB, and so on...




On 05/01/2014 06:02 PM, Christopher Nguyen wrote:

Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted such a 
comparative study as a Masters thesis: 


http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf



According to this snapshot (c. 2013), Stratosphere is different from Spark in 
not having an explicit concept of an in-memory dataset (e.g., RDD).


In principle this could be argued to be an implementation detail; the 
operators and execution plan/data flow are of primary concern in the API, and 
the data representation/materializations are otherwise unspecified.


But in practice, for long-running interactive applications, I consider RDDs to 
be of fundamental, first-class citizen importance, and the key distinguishing 
feature of Spark's model vs other in-memory approaches that treat memory 
merely as an implicit cache.


--

Christopher T. Nguyen
Co-founder & CEO, Adatao
linkedin.com/in/ctnguyen




On Tue, Nov 26, 2013 at 1:26 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

I don’t know a lot about it except from the research side, where the team has 
done interesting optimization stuff for these types of applications. In terms 
of the engine, one thing I’m not sure of is whether Stratosphere allows 
explicit caching of datasets (similar to RDD.cache()) and interactive queries 
(similar to spark-shell). But it’s definitely an interesting project to watch.

Matei
 

On Nov 22, 2013, at 4:17 PM, Ankur Chauhan achau...@brightcove.com wrote:

 Hi,

 That's what I thought, but as per the slides on http://www.stratosphere.eu 
 they seem to know about Spark, and the Scala API does look similar.
 I found the PACT model interesting. Would like to
  know if Matei or other core committers have something
  to weigh in on.

 -- Ankur
 On 22 Nov 2013, at 16:05, Patrick Wendell pwend...@gmail.com wrote:

 I've never seen that project before, would be
  interesting to get a
 comparison. Seems to offer a much lower level
  API. For instance this
 is a wordcount program:

 https://github.com/stratosphere/stratosphere/blob/master/pact/pact-examples/src/main/java/eu/stratosphere/pact/example/wordcount/WordCount.java

 On Thu, Nov 21, 2013 at 3:15 PM, Ankur
  Chauhan achau...@brightcove.com wrote:
 Hi,

 I was just curious about https://github.com/stratosphere/stratosphere
 and how Spark compares to it. Does anyone
  have any experience with it to make
 any comments?

 -- Ankur