Re: Minimum cost flow problem solving in Spark

2017-09-13 Thread Michael Malak
You might be interested in "Maximum Flow implementation on Spark GraphX" done by a Colorado School of Mines grad student a couple of years ago. http://datascienceassn.org/2016-01-27-maximum-flow-implementation-spark-graphx

Re: Shortest path with directed and weighted graphs

2016-10-24 Thread Michael Malak
Chapter 6 of my book implements Dijkstra's Algorithm. The source code is available to download for free. https://www.manning.com/books/spark-graphx-in-action
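
For orientation, here is a generic single-source shortest-path sketch over a toy weighted, directed graph using GraphX's Pregel API. It is not the book's Dijkstra code; the edge list and source vertex are made up.

import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// Generic SSSP sketch: relax edges until distances stop improving.
def shortestPaths(sc: SparkContext): Unit = {
  val edges = sc.parallelize(Seq(
    Edge(1L, 2L, 4.0), Edge(1L, 3L, 1.0), Edge(3L, 2L, 1.0), Edge(2L, 4L, 2.0)))
  val graph = Graph.fromEdges(edges, Double.PositiveInfinity)
  val sourceId: VertexId = 1L

  // Distance 0 at the source, infinity everywhere else.
  val init = graph.mapVertices((id, _) =>
    if (id == sourceId) 0.0 else Double.PositiveInfinity)

  val sssp = init.pregel(Double.PositiveInfinity)(
    (id, dist, newDist) => math.min(dist, newDist),   // vertex program
    triplet =>                                        // send improved distances along edges
      if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
        Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
      else Iterator.empty,
    (a, b) => math.min(a, b))                         // merge incoming messages

  sssp.vertices.collect().foreach(println)
}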

Re: GraphX drawing algorithm

2016-09-11 Thread Michael Malak
In chapter 10 of Spark GraphX In Action, we describe how to use Zeppelin with d3.js to render graphs using d3's force-directed rendering algorithm. The source code can be downloaded for free from https://www.manning.com/books/spark-graphx-in-action

Re: GraphX Java API

2016-05-30 Thread Michael Malak
Yes, it is possible to use GraphX from Java but it requires 10x the amount of code and involves using obscure typing and pre-defined lambda prototype facilities. I give an example of it in my book, the source code for which can be downloaded for free from 

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Michael Malak
At first glance, it looks like the only streaming data sources available out of the box from the github master branch are  https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala  and 
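
For context, a minimal sketch of wiring the file source into a query that ad hoc SQL can be run against, assuming Spark 2.0 APIs; the input path and table name here are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("adhoc-demo").getOrCreate()

// Stream text files dropped into a directory (the file source linked above).
val lines = spark.readStream.text("/tmp/incoming")

// Write to an in-memory table so ad hoc SQL can be run against the latest results.
val query = lines.writeStream
  .format("memory")
  .queryName("lines_table")
  .outputMode("append")
  .start()

spark.sql("SELECT count(*) FROM lines_table").show()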

Re: Spark 2.0 forthcoming features

2016-04-20 Thread Michael Malak
http://go.databricks.com/apache-spark-2.0-presented-by-databricks-co-founder-reynold-xin

Re: Apache Flink

2016-04-17 Thread Michael Malak
As with all history, "what if"s are not scientifically testable hypotheses, but my speculation is that it comes down to the energy (VCs, startups, big Internet companies, universities) within Silicon Valley as contrasted with Germany.

Re: Apache Flink

2016-04-17 Thread Michael Malak
There have been commercial CEP solutions for decades, including from my employer.

Re: Apache Flink

2016-04-17 Thread Michael Malak
In terms of publication date, a paper on Nephele was published in 2009, prior to the 2010 USENIX paper on Spark. Nephele is the execution engine of Stratosphere, which became Flink.

Re: Spark with Druid

2016-03-23 Thread Michael Malak
Will Spark 2.0 Structured Streaming obviate some of the Druid/Spark use cases?

Re: Build k-NN graph for large dataset

2015-08-26 Thread Michael Malak
Yes. And a paper that describes using grids (actually varying grids) is  http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf  In the Spark GraphX In Action book that Robin East and I are writing, we implement a drastically simplified version of this in chapter
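
The grid idea can be sketched crudely in a few lines. The following is a hypothetical simplification (the cell size, k, and point layout are made up), not the paper's method or the book's implementation: bucket 2-D points into coarse cells and only compare points that share a cell.

import org.apache.spark.SparkContext

def approximateKnn(sc: SparkContext, points: Seq[(Long, (Double, Double))],
                   cellSize: Double = 1.0, k: Int = 3) = {
  // Assign each point to a grid cell.
  val byCell = sc.parallelize(points).map { case (id, (x, y)) =>
    ((math.floor(x / cellSize).toInt, math.floor(y / cellSize).toInt), (id, x, y))
  }.groupByKey()

  // Within each cell, brute-force the k nearest neighbors of every point.
  byCell.flatMap { case (_, cellPoints) =>
    val pts = cellPoints.toArray
    pts.map { case (id, x, y) =>
      val neighbors = pts.filter(_._1 != id)
        .sortBy { case (_, nx, ny) => (nx - x) * (nx - x) + (ny - y) * (ny - y) }
        .take(k).map(_._1)
      (id, neighbors.toSeq)
    }.toSeq
  }
}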

RE: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Michael Malak
I would also add, from a data-locality standpoint, mapPartitions() provides for node-local computation that plain old map-reduce does not.
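
A small sketch of the mapPartitions point: per-partition setup happens once on the node holding the data, rather than once per record as with a plain map. The lookup table here is just a stand-in for any expensive, node-local resource such as a connection.

import org.apache.spark.SparkContext

def perPartitionWork(sc: SparkContext): Unit = {
  val data = sc.parallelize(1 to 100, numSlices = 4)

  val result = data.mapPartitions { iter =>
    // Expensive, node-local setup performed once per partition.
    val lookup = (1 to 10).map(i => i -> i * i).toMap
    iter.map(x => lookup.getOrElse(x % 10 + 1, 0) + x)
  }

  println(result.sum())
}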

Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread Michael Malak
http://www.datascienceassn.org/content/making-sense-making-sense-performance-data-analytics-frameworks

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Michael Malak
You could have your receiver send a magic value when it is done. I discuss this Spark Streaming pattern in my presentation "Spark Gotchas and Anti-Patterns". In the PDF version, it's slides 34-36. http://www.datascienceassn.org/content/2014-11-05-spark-gotchas-and-anti-patterns-julia-language
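
Roughly, the "magic value" pattern looks like the sketch below. The sentinel string and the socket source are placeholders standing in for whatever custom receiver and completion marker the application actually uses.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("sentinel"), Seconds(5))
val stream = ssc.socketTextStream("localhost", 9999)  // stand-in for the custom receiver

stream.foreachRDD { rdd =>
  // Run the one-time completion logic only when the receiver's sentinel appears.
  if (rdd.filter(_ == "__DONE__").count() > 0) {
    println("Receiver finished; running the final action exactly once.")
  }
}

ssc.start()
ssc.awaitTermination()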

Spark GraphX In Action on documentation page?

2015-03-24 Thread Michael Malak
Can my new book, Spark GraphX In Action, which is currently in MEAP http://manning.com/malak/, be added to https://spark.apache.org/documentation.html and, if appropriate, to https://spark.apache.org/graphx/ ? Michael Malak

Re: spark challenge: zip with next???

2015-01-30 Thread Michael Malak
But isn't foldLeft() overkill for the originally stated use case of max diff of adjacent pairs? Isn't foldLeft() for recursive, non-commutative, non-associative accumulation, as opposed to an embarrassingly parallel operation such as this one? This use case reminds me of FIR filtering in DSP. It
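
One common way to "zip with next" (a sketch, not necessarily what the thread settled on): index the RDD, shift the index by one, and join, which keeps the adjacent-difference computation embarrassingly parallel.

import org.apache.spark.SparkContext

def maxAdjacentDiff(sc: SparkContext): Double = {
  val values = sc.parallelize(Seq(1.0, 4.0, 9.0, 11.0, 20.0))

  val indexed = values.zipWithIndex().map { case (v, i) => (i, v) }
  val shifted = indexed.map { case (i, v) => (i - 1, v) }  // element i+1 keyed to slot i

  indexed.join(shifted)                        // (i, (x_i, x_{i+1}))
    .map { case (_, (a, b)) => math.abs(b - a) }
    .max()
}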

Re: Rdd of Rdds

2014-10-22 Thread Michael Malak
On Wednesday, October 22, 2014 9:06 AM, Sean Owen so...@cloudera.com wrote: No, there's no such thing as an RDD of RDDs in Spark. Here though, why not just operate on an RDD of Lists? Or a List of RDDs? Usually one of these two is the right approach whenever you feel inclined to operate on an
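
A tiny sketch of the first suggestion, an RDD whose elements are plain Scala collections; the data is illustrative.

import org.apache.spark.SparkContext

def rddOfLists(sc: SparkContext): Unit = {
  val groups = sc.parallelize(Seq(List(1, 2, 3), List(10, 20), List(7)))

  // Operate on each inner collection with ordinary Scala methods.
  val sums = groups.map(_.sum)
  println(sums.collect().mkString(", "))
}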

Re: UpdateStateByKey - How to improve performance?

2014-08-06 Thread Michael Malak
Depending on the density of your keys, the alternative signature def updateStateByKey[S](updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)], partitioner: Partitioner, rememberPartitioner: Boolean)(implicit arg0: ClassTag[S]): DStream[(K, S)] at least iterates by key rather than
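
For reference, a sketch of calling that iterator-based overload to keep a running count per key; the socket source, checkpoint path, and partitioner choice are placeholders.

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("state"), Seconds(5))
ssc.checkpoint("/tmp/checkpoint")  // required for stateful operations

val pairs = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))

// Update function receives an iterator over (key, new values, previous state).
val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Long])]) =>
  iter.map { case (key, newValues, state) =>
    (key, state.getOrElse(0L) + newValues.sum)
  }

val counts = pairs.updateStateByKey(
  updateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  rememberPartitioner = true)

counts.print()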

Re: relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread Michael Malak
It's really more of a Scala question than a Spark question, but the standard OO (not Scala-specific) way is to create your own custom supertype (e.g. MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD and MyArray), each of which manually forwards method calls to the
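
A bare-bones sketch of that forwarding approach; the trait and method names are made up for illustration.

import org.apache.spark.rdd.RDD

trait MyCollectionTrait {
  def mapStrings(f: String => String): MyCollectionTrait
  def count(): Long
}

class MyRDD(rdd: RDD[Array[String]]) extends MyCollectionTrait {
  // Forward each call to the underlying RDD.
  def mapStrings(f: String => String) = new MyRDD(rdd.map(_.map(f)))
  def count(): Long = rdd.count()
}

class MyArray(arr: Array[Array[String]]) extends MyCollectionTrait {
  // Forward each call to the underlying local array.
  def mapStrings(f: String => String) = new MyArray(arr.map(_.map(f)))
  def count(): Long = arr.length.toLong
}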

Re: parallel Reduce within a key

2014-06-20 Thread Michael Malak
How about a treeReduceByKey? :-) On Friday, June 20, 2014 11:55 AM, DB Tsai dbt...@stanford.edu wrote: Currently, the reduce operation combines the result from mapper sequentially, so it's O(n). Xiangrui is working on treeReduce which is O(log(n)). Based on the benchmark, it dramatically
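
treeReduce eventually landed in the core RDD API; a minimal usage sketch (data size and depth are arbitrary) looks roughly like this.

import org.apache.spark.SparkContext

def treeReduceDemo(sc: SparkContext): Unit = {
  val data = sc.parallelize(1 to 1000000, numSlices = 64)

  // Partial results are combined in a tree of the given depth
  // rather than sequentially at the driver.
  val total = data.treeReduce(_ + _, depth = 3)
  println(total)
}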

Re: rdd ordering gets scrambled

2014-05-28 Thread Michael Malak
Mohit Jaggi: A workaround is to use zipWithIndex (to appear in Spark 1.0, but if you're still on 0.9.x you can swipe the code from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala ), map it to (x => (x._2, x._1)) and then sortByKey.
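
Spelled out, that workaround looks roughly like the sketch below: attach indices, swap them into the key position, and sort to recover the original order.

import org.apache.spark.SparkContext

def restoreOrder(sc: SparkContext): Unit = {
  val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 3)

  val reordered = rdd.zipWithIndex()             // (value, originalIndex)
    .map { case (value, idx) => (idx, value) }   // (originalIndex, value)
    .sortByKey()
    .values

  println(reordered.collect().mkString(", "))
}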

Re: Bug when zip with longs and too many partitions?

2014-05-12 Thread Michael Malak
as the ASF Jira system will let me reset my password. On Sunday, May 11, 2014 4:40 AM, Michael Malak michaelma...@yahoo.com wrote: Is this a bug? scala> sc.parallelize(1 to 2,4).zip(sc.parallelize(11 to 12,4)).collect res0: Array[(Int, Int)] = Array((1,11), (2,12)) scala> sc.parallelize(1L to 2L,4
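
A self-contained version of the quoted experiment, for anyone who wants to rerun the comparison; the Long case is the one whose result was in question, so no output is asserted for it here.

import org.apache.spark.SparkContext

def zipComparison(sc: SparkContext): Unit = {
  // Two elements spread over four partitions, first with Ints, then with Longs.
  val ints  = sc.parallelize(1 to 2, 4).zip(sc.parallelize(11 to 12, 4)).collect()
  val longs = sc.parallelize(1L to 2L, 4).zip(sc.parallelize(11L to 12L, 4)).collect()

  println(ints.mkString(", "))   // (1,11), (2,12), as reported above
  println(longs.mkString(", "))  // compare against the Int case
}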

Re: Opinions stratosphere

2014-05-02 Thread Michael Malak
"looks like Spark outperforms Stratosphere fairly consistently in the experiments." There was one exception the paper noted, which was when memory resources were constrained. In that case, Stratosphere seemed to have degraded more gracefully than Spark, but the author did not explore it further.