You might be interested in "Maximum Flow implementation on Spark GraphX" done
by a Colorado School of Mines grad student a couple of years ago.
http://datascienceassn.org/2016-01-27-maximum-flow-implementation-spark-graphx
From: Swapnil Shinde
Chapter 6 of my book implements Dijkstra's Algorithm. The source code is
available to download for free.
https://www.manning.com/books/spark-graphx-in-action
From: Brian Wilson
To: user@spark.apache.org
Sent: Monday, October 24, 2016 7:11 AM
Subject:
In chapter 10 of Spark GraphX In Action, we describe how to use Zeppelin with
d3.js to render graphs using d3's force-directed rendering algorithm. The
source code can be downloaded for free from
https://www.manning.com/books/spark-graphx-in-action
From: agc studio
It's been reduced to a single line of code.
http://technicaltidbit.blogspot.com/2016/03/dataframedataset-swap-places-in-spark-20.html
From: Gerhard Fiedler
To: "dev@spark.apache.org"
Sent: Friday, June 3, 2016 9:01 AM
Subject: Where
Yes, it is possible to use GraphX from Java but it requires 10x the amount of
code and involves using obscure typing and pre-defined lambda prototype
facilities. I give an example of it in my book, the source code for which can
be downloaded for free from
At first glance, it looks like the only streaming data sources available out of
the box from the github master branch are
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
and
http://go.databricks.com/apache-spark-2.0-presented-by-databricks-co-founder-reynold-xin
From: Sourav Mazumder
To: user
Sent: Wednesday, April 20, 2016 11:07 AM
Subject: Spark 2.0 forthcoming features
Hi All,
Is there
As with all history, "what if"s are not scientifically testable hypotheses, but
my speculation is that it comes down to the energy (VCs, startups, big Internet
companies, universities) of Silicon Valley as contrasted with Germany.
From: Mich Talebzadeh <mich.talebza...@gmail.com>
To: Michael
There have been commercial CEP solutions for decades, including from my
employer.
From: Mich Talebzadeh
To: Mark Hamstra
Cc: Corey Nolet ; "user @spark"
Sent: Sunday, April 17, 2016 3:48 PM
In terms of publication date, a paper on Nephele was published in 2009, prior
to the 2010 USENIX paper on Spark. Nephele is the execution engine of
Stratosphere, which became Flink.
From: Mark Hamstra
To: Mich Talebzadeh
Cc: Corey
I see you've been burning the midnight oil.
From: Reynold Xin
To: "dev@spark.apache.org"
Sent: Friday, April 1, 2016 1:15 AM
Subject: [discuss] using deep learning to improve Spark
Hi all,
Hope you all enjoyed the Tesla 3 unveiling
Will Spark 2.0 Structured Streaming obviate some of the Druid/Spark use cases?
From: Raymond Honderdors
To: "yuzhih...@gmail.com"
Cc: "user@spark.apache.org"
Sent: Wednesday, March 23, 2016 8:43 AM
Subject:
Would it make sense (in terms of feasibility, code organization, and
politically) to have a JavaDataFrame, as a way to isolate the 1000+ extra lines
to a Java compatibility layer/class?
From: Reynold Xin
To: "dev@spark.apache.org"
Sent:
[
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990391#comment-14990391
]
Michael Malak commented on SPARK-3789:
--
My publisher tells me the MEAP for Spark GraphX In Action has
[
https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Malak updated SPARK-11278:
--
Component/s: GraphX
> PageRank fails with unified memory mana
[
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950679#comment-14950679
]
Michael Malak commented on SPARK-2365:
--
It's off-topic for IndexedRDD, but you can have a look
[
https://issues.apache.org/jira/browse/SPARK-10939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948596#comment-14948596
]
Michael Malak commented on SPARK-10939:
---
Here Matei explains the explicit design decision to prefer
Michael Malak created SPARK-10972:
-
Summary: UDFs in SQL joins
Key: SPARK-10972
URL: https://issues.apache.org/jira/browse/SPARK-10972
Project: Spark
Issue Type: New Feature
[
https://issues.apache.org/jira/browse/SPARK-10972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Malak updated SPARK-10972:
--
Description:
Currently expressions used to .join() in DataFrames are limited to column names
[
https://issues.apache.org/jira/browse/SPARK-10722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909883#comment-14909883
]
Michael Malak commented on SPARK-10722:
---
I have seen this in a small Hello World type program
[
https://issues.apache.org/jira/browse/SPARK-10489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739681#comment-14739681
]
Michael Malak commented on SPARK-10489:
---
Feynman Liang:
Link https://github.com/databricks/spark
Yes. And a paper that describes using grids (actually varying grids) is
http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf
In the Spark GraphX In Action book that Robin East and I are writing, we
implement a drastically simplified version of this in chapter
I would also add, from a data locality theoretic standpoint, mapPartitions()
provides for node-local computation that plain old map-reduce does not.
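A minimal sketch of the point, assuming spark-shell's sc (the data and sizes are invented):
val data = sc.parallelize(1 to 1000000, 8)
// Per-partition setup (e.g. a DB connection or lookup table) happens once per
// partition, node-locally, rather than once per element as under plain map().
val partialSums = data.mapPartitions { iter =>
  var acc = 0L
  iter.foreach(acc += _)
  Iterator(acc)          // one partial result per partition
}
partialSums.collect()    // 8 partial sums, combined on the driver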
From my Android phone on T-Mobile. The first nationwide 4G network.
Original message
From: Ashic Mahtab as...@live.com
Date:
http://www.datascienceassn.org/content/making-sense-making-sense-performance-data-analytics-frameworks
From: bit1...@163.com bit1...@163.com
To: user user@spark.apache.org
Sent: Monday, April 27, 2015 8:33 PM
Subject: Why Spark is much faster than Hadoop MapReduce even on disk
You could have your receiver send a magic value when it is done. I discuss
this Spark Streaming pattern in my presentation Spark Gotchas and
Anti-Patterns. In the PDF version, it's slides
34-36. http://www.datascienceassn.org/content/2014-11-05-spark-gotchas-and-anti-patterns-julia-language
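The pattern, roughly (a sketch with invented names, not the slides' exact code): the receiver store()s a sentinel when its source is exhausted, and the driver watches for it and stops the StreamingContext.
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver
val EndOfStream = "\u0000EOS\u0000"   // the "magic value"; pick one that cannot occur in real data
class FiniteReceiver(lines: Seq[String])
    extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  def onStart(): Unit = new Thread("finite-receiver") {
    override def run(): Unit = { lines.foreach(store); store(EndOfStream) }
  }.start()
  def onStop(): Unit = ()
}
val ssc = new StreamingContext(sc, Seconds(1))
@volatile var done = false
ssc.receiverStream(new FiniteReceiver(Seq("a", "b", "c"))).foreachRDD { rdd =>
  if (rdd.filter(_ == EndOfStream).count() > 0) done = true
}
ssc.start()
while (!done) Thread.sleep(500)
ssc.stop(stopSparkContext = false, stopGracefully = true)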
Michael Malak created SPARK-6710:
Summary: Wrong initial bias in GraphX SVDPlusPlus
Key: SPARK-6710
URL: https://issues.apache.org/jira/browse/SPARK-6710
Project: Spark
Issue Type: Bug
I believe that in the initialization portion of GraphX SVDPlusPlus, the
initialization of biases is incorrect. Specifically, in line
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala#L96
instead of
(vd._1, vd._2, msg.get._2 /
Can my new book, Spark GraphX In Action, which is currently in MEAP
http://manning.com/malak/, be added to
https://spark.apache.org/documentation.html and, if appropriate, to
https://spark.apache.org/graphx/ ?
Michael Malak
[
https://issues.apache.org/jira/browse/SPARK-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365758#comment-14365758
]
Michael Malak commented on SPARK-6388:
--
Isn't it Hadoop 2.7 that is supposed
Since RDDs are generally unordered, aren't things like textFile().first() not
guaranteed to return the first row (such as looking for a header row)? If so,
doesn't that make the example in
http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading?
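A common workaround (a sketch, not from the quick-start docs) is to drop the header positionally rather than trusting first():
val lines = sc.textFile("data.csv")   // hypothetical input
val noHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter   // in practice the header lands in partition 0 for a text file
}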
[
https://issues.apache.org/jira/browse/SPARK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309459#comment-14309459
]
Michael Malak commented on SPARK-4279:
--
Is there another place where I might be able
1. Is IndexedRDD planned for 1.3?
https://issues.apache.org/jira/browse/SPARK-2365
2. Once IndexedRDD is in, is it planned to convert Word2VecModel to it from its
current Map[String,Array[Float]]?
But isn't foldLeft() overkill for the originally stated use case of max diff of
adjacent pairs? Isn't foldLeft() for recursive non-commutative non-associative
accumulation as opposed to an embarrassingly parallel operation such as this
one?
This use case reminds me of FIR filtering in DSP. It
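For the adjacent-pairs case itself, MLlib's sliding() developer API keeps things embarrassingly parallel. A sketch, assuming spark-shell's sc and an RDD already in the desired order:
import org.apache.spark.mllib.rdd.RDDFunctions._
val xs = sc.parallelize(Seq(1.0, 4.0, 9.0, 11.0, 20.0))
// sliding(2) yields windows that span partition boundaries, so no shuffle-time
// bookkeeping is needed to pair each element with its neighbor.
val maxAdjacentDiff =
  xs.sliding(2).map { case Array(a, b) => math.abs(b - a) }.max()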
- Original Message -
From: Evan R. Sparks evan.spa...@gmail.com
To: Matei Zaharia matei.zaha...@gmail.com
Cc: Koert Kuipers ko...@tresata.com; Michael Malak michaelma...@yahoo.com;
Patrick Wendell pwend...@gmail.com; Reynold Xin r...@databricks.com;
dev@spark.apache.org dev@spark.apache.org
Sent: Tuesday
Michael Malak created SPARK-5343:
Summary: ShortestPaths traverses backwards
Key: SPARK-5343
URL: https://issues.apache.org/jira/browse/SPARK-5343
Project: Spark
Issue Type: Bug
I created https://issues.apache.org/jira/browse/SPARK-5343 for this.
- Original Message -
From: Michael Malak michaelma...@yahoo.com
To: dev@spark.apache.org dev@spark.apache.org
Cc:
Sent: Monday, January 19, 2015 5:09 PM
Subject: GraphX ShortestPaths backwards?
GraphX ShortestPaths
GraphX ShortestPaths seems to be following edges backwards instead of forwards:
import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))),
sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))
lib.ShortestPaths.run(g,Array(3)).vertices.collect
res1:
But wouldn't the gain be greater under something similar to EdgePartition1D
(but perhaps better load-balanced based on number of edges for each vertex) and
an algorithm that primarily follows edges in the forward direction?
From: Ankur Dave ankurd...@gmail.com
To: Michael Malak michaelma
Does GraphX make an effort to co-locate vertices onto the same workers as the
majority (or even some) of its edges?
According to:
https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting
Note that TriangleCount requires the edges to be in canonical orientation
(srcId < dstId)
But isn't this overstating the requirement? Isn't the requirement really that
IF there are duplicate
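One way to force canonical orientation first (a sketch, not a library helper):
import scala.reflect.ClassTag
import org.apache.spark.graphx._
def canonicalize[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]): Graph[VD, ED] =
  Graph(g.vertices,
        g.edges.map(e => if (e.srcId < e.dstId) e else Edge(e.dstId, e.srcId, e.attr)))
// Duplicate edges, if any remain, can then be merged with partitionBy() + groupEdges().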
Thank you. I created
https://issues.apache.org/jira/browse/SPARK-5064
- Original Message -
From: xhudik xhu...@gmail.com
To: dev@spark.apache.org
Cc:
Sent: Saturday, January 3, 2015 2:04 PM
Subject: Re: GraphX rmatGraph hangs
Hi Michael,
yes, I can confirm the behavior.
It gets stuck
The following single line just hangs, when executed in either Spark Shell or
standalone:
org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 4, 8)
It just outputs 0 edges and then locks up.
The only other information I've found via Google is:
Michael Malak created SPARK-5064:
Summary: GraphX rmatGraph hangs
Key: SPARK-5064
URL: https://issues.apache.org/jira/browse/SPARK-5064
Project: Spark
Issue Type: Bug
Components
On Wednesday, October 22, 2014 9:06 AM, Sean Owen so...@cloudera.com wrote:
No, there's no such thing as an RDD of RDDs in Spark.
Here though, why not just operate on an RDD of Lists? or a List of RDDs?
Usually one of these two is the right approach whenever you feel
inclined to operate on an
Depending on the density of your keys, the alternative signature
def updateStateByKey[S](updateFunc: (Iterator[(K, Seq[V], Option[S])]) =>
Iterator[(K, S)], partitioner: Partitioner, rememberPartitioner:
Boolean)(implicit arg0: ClassTag[S]): DStream[(K, S)]
at least iterates by key rather than
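A hedged usage sketch of that overload; pairs here is a hypothetical DStream[(String, Int)], and the state is a running count per key (updateStateByKey also requires ssc.checkpoint(...) to be set):
import org.apache.spark.HashPartitioner
val updateFunc = (entries: Iterator[(String, Seq[Int], Option[Long])]) =>
  entries.map { case (key, newVals, state) =>
    (key, state.getOrElse(0L) + newVals.size)   // add this batch's arrivals to the count
  }
val counts = pairs.updateStateByKey[Long](
  updateFunc, new HashPartitioner(8), rememberPartitioner = true)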
It's really more of a Scala question than a Spark question, but the standard OO
(not Scala-specific) way is to create your own custom supertype (e.g.
MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD
and MyArray), each of which manually forwards method calls to the
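A minimal sketch of that pattern using the names above, with just a couple of forwarded methods shown:
import org.apache.spark.rdd.RDD
trait MyCollectionTrait {
  def filterPositive(): MyCollectionTrait
  def count(): Long
}
class MyRDD(rdd: RDD[Int]) extends MyCollectionTrait {
  def filterPositive() = new MyRDD(rdd.filter(_ > 0))   // forwards to the RDD
  def count(): Long = rdd.count()
}
class MyArray(a: Array[Int]) extends MyCollectionTrait {
  def filterPositive() = new MyArray(a.filter(_ > 0))   // forwards to the Array
  def count(): Long = a.length.toLong
}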
At Spark Summit, Patrick Wendell indicated the number of MLlib algorithms would
roughly double in 1.1 from the current approx. 15.
http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
What are the planned additional algorithms?
In Jira, I only see two when
How about a treeReduceByKey? :-)
On Friday, June 20, 2014 11:55 AM, DB Tsai dbt...@stanford.edu wrote:
Currently, the reduce operation combines the result from mapper
sequentially, so it's O(n).
Xiangrui is working on treeReduce which is O(log(n)). Based on the
benchmark, it dramatically
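treeReduce did eventually land in core as RDD.treeReduce (Spark 1.3+). A quick sketch of the difference:
val v = sc.parallelize(1L to 1000000L, 64)
val total     = v.reduce(_ + _)                // partials combined one-by-one on the driver
val treeTotal = v.treeReduce(_ + _, depth = 2) // partials combined in O(log n) rounds on executors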
Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or am I
missing something fundamental?
val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L,
"N4"), (5L, "N5")))
val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"),
Edge(2L, 4L, "E3"), Edge(3L,
[
https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011492#comment-14011492
]
Michael Malak commented on SPARK-1199:
--
See also additional test cases in
https
[
https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Malak resolved SPARK-1836.
--
Resolution: Duplicate
REPL $outer type mismatch causes lookup() and equals() problems
Mohit Jaggi:
A workaround is to use zipWithIndex (to appear in Spark 1.0, but if you're
still on 0.9x you can swipe the code from
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala
), map it to (x => (x._2,x._1)) and then sortByKey.
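Spelled out as a sketch (rdd is any RDD whose current order you want to key on):
import org.apache.spark.SparkContext._   // pair-RDD implicits, needed on pre-1.3 Spark
val byPosition = rdd.zipWithIndex()      // (element, index)
  .map(x => (x._2, x._1))                // (index, element)
  .sortByKey()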
[
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007565#comment-14007565
]
Michael Malak commented on SPARK-1867:
--
Thank you, sam, that fixed it for me!
FYI, I
While developers may appreciate 1.0 == API stability, I'm not sure that will
be the understanding of the VP who gives the green light to a Spark-based
development effort.
I fear a bug that silently produces erroneous results will be perceived like
the FDIV bug, but in this case without the
[
https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Malak updated SPARK-1836:
-
Description:
Anand Avati partially traced the cause to REPL wrapping classes in $outer
classes
[
https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998807#comment-13998807
]
Michael Malak commented on SPARK-1836:
--
Michael Armbrust: Indeed. Do you think I
Michael Malak created SPARK-1817:
Summary: RDD zip erroneous when partitions do not divide RDD count
Key: SPARK-1817
URL: https://issues.apache.org/jira/browse/SPARK-1817
Project: Spark
Reposting here on dev since I didn't see a response on user:
I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In
the Spark Shell, equals() fails when I use the canonical equals() pattern of
match{}, but works when I substitute with isInstanceOf[]. I am using Spark
as the ASF Jira system will let me
reset my password.
On Sunday, May 11, 2014 4:40 AM, Michael Malak michaelma...@yahoo.com wrote:
Is this a bug?
scala> sc.parallelize(1 to 2,4).zip(sc.parallelize(11 to 12,4)).collect
res0: Array[(Int, Int)] = Array((1,11), (2,12))
scala> sc.parallelize(1L to 2L,4
looks like Spark outperforms Stratosphere fairly consistently in the
experiments
There was one exception the paper noted: when memory resources were
constrained, Stratosphere seemed to degrade more gracefully than Spark, but the
author did not explore this further.
; Michael Malak
michaelma...@yahoo.com
Sent: Tuesday, July 30, 2013 7:06 PM
Subject: Re: UDFs with package names
It might be a better idea to use your own package com.mystuff.x. You might be
running into an issue where java is not finding the file because it assumes the
relation between package
Thus far, I've been able to create Hive UDFs, but now I need to define them
within a Java package name (as opposed to the default Java package as I had
been doing), but once I do that, I'm no longer able to load them into Hive.
First off, this works:
add jar
Perhaps you can first create a temp table that contains only the records that
will match? See the UNION ALL trick at
http://www.mail-archive.com/hive-user@hadoop.apache.org/msg01906.html
From: Brad Ruderman bruder...@radiumone.com
To: user@hive.apache.org
Untested:
SELECT a.c100, a.c300, b.c400
FROM t1 a
JOIN t2 b
ON a.c200 = b.c200
JOIN (SELECT DISTINCT a.c100
FROM t1 a2
JOIN t2 b2
ON a2.c200 = b2.c200
WHERE b2.c400 = SYSDATE - 1) a3
ON a.c100 = a3.c100
WHERE b.c400 = SYSDATE - 1
AND a.c300 = 0
I have found that for output larger than a few GB, redirecting stdout results
in an incomplete file. For very large output, I do CREATE TABLE MYTABLE AS
SELECT ... and then copy the resulting HDFS files directly out of
/user/hive/warehouse.
From: Bertrand
Just copy and paste the whole long expressions to their second occurrences.
From: dyuti a hadoop.hiv...@gmail.com
To: user@hive.apache.org
Sent: Friday, June 28, 2013 10:58 AM
Subject: Fwd: Need urgent help in hive query
Hi Experts,
I'm trying with the
I've created
https://issues.apache.org/jira/browse/HIVE-4771
to track this issue.
- Original Message -
From: Michael Malak michaelma...@yahoo.com
To: user@hive.apache.org user@hive.apache.org
Cc:
Sent: Wednesday, June 19, 2013 2:35 PM
Subject: Re: INSERT non-static data to array
From: Edward Capriolo edlinuxg...@gmail.com
To: user@hive.apache.org user@hive.apache.org; Michael Malak
michaelma...@yahoo.com
Sent: Thursday, June 20, 2013 9:15 PM
Subject: Re: INSERT non-static data to array?
i think you could select into as sub query and then use
Michael Malak created HIVE-4771:
---
Summary: Support subqueries in INSERT for array types
Key: HIVE-4771
URL: https://issues.apache.org/jira/browse/HIVE-4771
Project: Hive
Issue Type
[]);
INSERT INTO table_a
SELECT a, b, ARRAY(SELECT c FROM table_c WHERE table_c.parent = table_b.id)
FROM table_b
From: Edward Capriolo edlinuxg...@gmail.com
To: user@hive.apache.org user@hive.apache.org; Michael Malak
michaelma...@yahoo.com
Sent: Wednesday
--- On Sun, 5/5/13, Peter Chu pete@outlook.com wrote:
I am wondering if there is any way to do this without resorting to
using left outer join and finding nulls.
I have found this to be an acceptable substitute. Is it not working for you?
[
https://issues.apache.org/jira/browse/HIVE-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582662#comment-13582662
]
Michael Malak commented on HIVE-4022:
-
Note that there is a workaround for the case
[
https://issues.apache.org/jira/browse/HIVE-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582664#comment-13582664
]
Michael Malak commented on HIVE-3528:
-
As noted in the first comment from
https
If no one has any objection, I'm going to update HIVE-4022, which I entered a
week ago when I thought the behavior was Avro-specific, to indicate it actually
affects even native Hive tables.
https://issues.apache.org/jira/browse/HIVE-4022
--- On Fri, 2/15/13, Michael Malak michaelma
[
https://issues.apache.org/jira/browse/HIVE-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Malak updated HIVE-4022:
Description:
Originally thought to be Avro-specific, and first noted with respect to
HIVE-3528
It seems that all Hive columns (at least those of primitive types) are always
NULLable? What about columns of type STRUCT?
The following:
echo 1,2 > twovalues.csv
hive
CREATE TABLE tc (x INT, y INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH 'twovalues.csv' INTO TABLE
[
https://issues.apache.org/jira/browse/HIVE-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578523#comment-13578523
]
Michael Malak commented on HIVE-3528:
-
I've tried the latest Avro SerDe from GitHub
[
https://issues.apache.org/jira/browse/HIVE-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578538#comment-13578538
]
Michael Malak commented on HIVE-3528:
-
Sean:
I mean
https://github.com/apache/hive
[
https://issues.apache.org/jira/browse/HIVE-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578783#comment-13578783
]
Michael Malak commented on HIVE-3528:
-
Sean:
OK, I've researched the problem further
Michael Malak created HIVE-4022:
---
Summary: Avro SerDe queries don't handle hard-coded nulls for
optional/nullable structs
Key: HIVE-4022
URL: https://issues.apache.org/jira/browse/HIVE-4022
Project
any files already there. I'm paranoid, so I
would write to a different directory and then move the files over...
dean
On Wed, Feb 13, 2013 at 1:26 PM, Michael Malak michaelma...@yahoo.com wrote:
Is it possible to INSERT INTO TABLE t SELECT ... FROM ... where t has a column with a
STRUCT?
Based
[
https://issues.apache.org/jira/browse/AVRO-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573711#comment-13573711
]
Michael Malak commented on AVRO-1035:
-
ha...@cloudera.com has provided example code
take it as
append-able?
I don't think it's possible for Avro to carry it since Avro
(core) does
not reverse-depend on Hadoop. Should we document it
somewhere though?
Do you have any ideas on the best place to do that?
On Thu, Feb 7, 2013 at 6:12 AM, Michael Malak michaelma...@yahoo.com
in
the API:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataOutputStream.html.
Here's a sample program that works on Hadoop 2.x in my
tests:
https://gist.github.com/QwertyManiac/4724582
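For reference, the append call itself is small. A sketch (the path is hypothetical, the target file must already exist, and older releases need dfs.support.append enabled):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val fs  = FileSystem.get(new Configuration())
val out = fs.append(new Path("/user/me/existing.log"))  // returns FSDataOutputStream
out.writeBytes("one more record\n")
out.close()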
On Wed, Feb 6, 2013 at 9:00 AM, Michael Malak michaelma...@yahoo.com
wrote:
I
changes in Avro.
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#append(org.apache.hadoop.fs.Path)
I have no recent personal experience with append in
HDFS. Does anyone
else here?
Doug
On Tue, Feb 5, 2013 at 4:10 PM, Michael Malak michaelma...@yahoo.com
wrote
I'm new to Pig, and it looks like there is no provision to declare relations
inline in a Pig script (without LOADing from an external file)?
Based on
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Constants
I would have thought the following would constitute Hello World for Pig:
A =