Hi,
LynxKite is a graph analytics application built on Apache Spark. (From very
early on, like Spark 0.9.) We have talked about it on occasion on Spark
Summits.
So I wanted to let you know that it's now open-source!
https://github.com/lynxkite/lynxkite
You should totally check it out if you
On Fri, Aug 4, 2017 at 4:36 PM, Jean Georges Perrin wrote:
> Thanks Daniel,
>
> I like your answer for #1. It makes sense.
>
> However, I don't get why you say that there are always pending
> transformations... After you call an action, you should be "clean" from
> pending
On Wed, Aug 2, 2017 at 2:16 PM, Jean Georges Perrin wrote:
> Hi Sparkians,
>
> I understand the lazy evaluation mechanism with transformations and
> actions. My question is simpler: 1) are show() and/or printSchema()
> actions? I would assume so...
>
show() is an action (it prints
immensely. (See the last slide of
http://www.slideshare.net/SparkSummit/interactive-graph-analytics-daniel-darabos#33.)
The large advantage is due to the lower number of necessary iterations.
For why this is failing even with one iteration: I would first check your
partitioning. Too many
Hi Antoaneta,
I believe the difference is not due to Datasets being slower (DataFrames
are just an alias for Datasets now), but rather to using a user-defined
function for filtering vs using Spark builtins. The builtin can use tricks
from Project Tungsten, such as only deserializing the "event_type"
You could run the application in a Docker container constrained to one CPU
with --cpuset-cpus (
https://docs.docker.com/engine/reference/run/#/cpuset-constraint).
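For example (a sketch; the image name and command are made up):
$ docker run --cpuset-cpus=0 my-spark-image spark-submit my-app.jar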
On Thu, Aug 4, 2016 at 8:51 AM, Sun Rui wrote:
> I don't think it's possible, as Spark does not support thread to
Another possible explanation is that by accident you are still running
Spark 1.6.1. Which download are you using? This is what I see:
$ ~/spark-1.6.2-bin-hadoop2.6/bin/spark-shell
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN
Mich's invocation is for starting a Spark application against an already
running Spark standalone cluster. It will not start the cluster for you.
We used to not use "spark-submit", but we started using it when it solved
some problem for us. Perhaps that day has also come for you? :)
On Fri, Jul
Hi Lokesh,
There is no way to do that. SqlContext.newSession documentation says:
Returns a SQLContext as new session, with separated SQL configurations,
temporary tables, registered functions, but sharing the same SparkContext,
CacheManager, SQLListener and SQLTab.
You have two options:
Spark is a software product. In software a "core" is something that a
process can run on. So it's a "virtual core". (Do not call these "threads".
A "thread" is not something a process can run on.)
local[*] uses java.lang.Runtime.availableProcessors()
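For example, a quick sketch of what local[*] would resolve to on your machine:

// Sketch: the core count "local[*]" resolves to.
val cores = Runtime.getRuntime.availableProcessors()
println(s"local[*] would use $cores cores")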
On Sun, Jun 5, 2016 at 9:51 PM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:
> If you fill up the cache, 1.6.0+ will suffer performance degradation from
> GC thrashing. You can set spark.memory.useLegacyMode to true, or
> spark.memory.f
If you fill up the cache, 1.6.0+ will suffer performance degradation from
GC thrashing. You can set spark.memory.useLegacyMode to true, or
spark.memory.fraction to 0.66, or spark.executor.extraJavaOptions to
-XX:NewRatio=3 to avoid this issue.
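A minimal sketch of setting one of these (they are alternatives, pick one;
shown with SparkConf, but --conf on the command line works too):

import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.memory.useLegacyMode", "true")
  // or: .set("spark.memory.fraction", "0.66")
  // or: .set("spark.executor.extraJavaOptions", "-XX:NewRatio=3")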
I think my colleague filed a ticket for this issue,
On Thu, Mar 17, 2016 at 3:51 AM, charles li wrote:
> Hi, Alexander,
>
> that's awesome, and when will that feature be released? I want to
> know the opportunity cost between waiting for that release and using Caffe
> or TensorFlow.
>
I don't expect MLlib will be
On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers wrote:
> in spark, every partition needs to fit in the memory available to the core
> processing it.
>
That does not agree with my understanding of how it works. I think you
could do
f things for it still to exist. Link to
>> article:
>> https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
>>
>> Hopefully a little more stability will come out with the upcoming Spark
>> 1.6 release on EMR (I think that is happening sometime soon).
>
Have you tried setting spark.emr.dropCharacters to a lower value? (It
defaults to 8.)
:) Just joking, sorry! Fantastic bug.
What data source do you have for this DataFrame? I could imagine for
example that it's a Parquet file and on EMR you are running with the wrong
version of the Parquet
Hi,
How do you know it doesn't work? The log looks roughly normal to me. Is
Spark not running at the printed address? Can you not start jobs?
On Mon, Jan 18, 2016 at 11:51 AM, Oleg Ruchovets
wrote:
> Hi,
> I tried to follow the Spark 1.6.0 instructions to install Spark on EC2.
>
>
On Mon, Jan 18, 2016 at 5:24 PM, Oleg Ruchovets
wrote:
> I thought the script tries to install Hadoop/HDFS as well, and it looks like
> that failed. The installation is only standalone Spark, without Hadoop. Is
> that the correct behaviour?
>
Yes, it also sets up two HDFS clusters. Are they
Does Hive JDBC work if you are not using Spark as a backend? I just had
very bad experience with Hive JDBC in general. E.g. half the JDBC protocol
is not implemented (https://issues.apache.org/jira/browse/HIVE-3175, filed
in 2012).
On Fri, Jan 15, 2016 at 2:15 AM, sdevashis
You can cause a failure by throwing an exception in the code running on the
executors. The task will be retried (if spark.task.maxFailures > 1), and
then the stage is failed. No further tasks are processed after that, and an
exception is thrown on the driver. You could catch the exception and see
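Roughly this pattern (a sketch; shouldAbort is a made-up predicate):

// Sketch: fail the job from the executors, catch the failure on the driver.
import org.apache.spark.SparkException
try {
  rdd.foreach { x =>
    if (shouldAbort(x)) throw new RuntimeException(s"aborting on $x")
  }
} catch {
  case e: SparkException =>
    // The executor-side exception message is included in e.
    println("Job failed: " + e.getMessage)
}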
> For example, does Spark try to merge the small partitions first, or is the
> selection of partitions to merge random?
It is quite smart as Iulian has pointed out. But it does not try to merge
small partitions first. Spark doesn't know the size of partitions. (The
partitions are represented as
Hi Praveen,
On Mon, Aug 17, 2015 at 12:34 PM, praveen S mylogi...@gmail.com wrote:
What does this mean in .setMaster("local[2]")?
Local mode (executor in the same JVM) with 2 executor threads.
Is this applicable only for standalone Mode?
It is not applicable for standalone mode, only for
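For reference, a minimal sketch of such a local-mode setup (app name made up):

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setMaster("local[2]").setAppName("example")
val sc = new SparkContext(conf)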
Hi Shahid,
To be honest I think this question is better suited for Stack Overflow than
for a PhD thesis.
On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf sha...@trialx.com wrote:
Hi,
I have a 10-node cluster. I loaded the data onto HDFS, so the number of
partitions I get is 9. I am running a Spark
I recommend testing it for yourself. Even if you have no application, you
can just run the spark-ec2 script, log in, run spark-shell and try reading
files from an S3 bucket and from hdfs://<master IP>:9000/. (This is the
ephemeral HDFS cluster, which uses SSD.)
I just tested our application this
This comes up so often. I wonder if the documentation or the API could be
changed to answer this question.
The solution I found is from
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job.
You basically write the items into two directories in a single
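The linked Stack Overflow answer does this in a single pass with a custom
MultipleTextOutputFormat. If you only have two keys, a simpler two-pass
sketch also works (key values and output paths are made up):

// Sketch: split a pair RDD into two output directories by key.
val pairs = rdd.cache()  // cache so the two filters don't recompute it
pairs.filter(_._1 == "a").values.saveAsTextFile("out/a")
pairs.filter(_._1 == "b").values.saveAsTextFile("out/b")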
Indeed Spark does not have .NET bindings.
On Thu, Jul 2, 2015 at 10:33 AM, Zwits daniel.van...@ortec-finance.com
wrote:
I'm currently looking into a way to run a program/code (DAG) written in
.NET
on a cluster using Spark. However, I ran into problems concerning the coding
language; Spark has
Spark creates a RecordReader and uses next() on it when you call
input.next(). (See
https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L215)
How
the RecordReader works is an HDFS question, but it's safe to say there is
no difference between using
It would be even faster to load the data on the driver and sort it there
without using Spark :). Using reduce() is cheating, because it only works
as long as the data fits on one machine. That is not the targeted use case
of a distributed computation system. You can repeat your test with more
data
I suggest you include your code and the error message! It's not even
immediately clear what programming language you mean to ask about.
On Mon, Jun 8, 2015 at 2:50 PM, elbehery elbeherymust...@gmail.com wrote:
Hi,
I have two datasets of customer types, and I would like to apply coGrouping
on
Hello! I just had a very similar stack trace. It was caused by an Akka
version mismatch. (From trying to use Play 2.3 with Spark 1.1 by accident
instead of 1.2.)
On Mon, Nov 24, 2014 at 7:15 PM, Blackeye black...@iit.demokritos.gr
wrote:
I created an application in spark. When I run it with
Spark 1.2.0 is coming in the next 48 hours according to
http://apache-spark-developers-list.1001551.n3.nabble.com/RESULT-VOTE-Release-Apache-Spark-1-2-0-RC2-tc9815.html
On Wed, Dec 17, 2014 at 10:11 AM, Madabhattula Rajesh Kumar
mrajaf...@gmail.com wrote:
Hi All,
When will the Spark 1.2.0 be
Yes, this is perfectly legal. This is what RDD.foreach() is for! You may
be encountering an IO exception while writing, and maybe using() suppresses
it. (?) I'd try writing the files with java.nio.file.Files.write() -- I'd
expect there is less that can go wrong with that simple call.
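Something like this sketch (the output path scheme is made up):

// Sketch: write one local file per record from the executors.
import java.nio.file.{Files, Paths}
rdd.foreach { record =>
  Files.write(Paths.get(s"/tmp/out-${record.hashCode}.txt"),
    record.toString.getBytes("UTF-8"))
}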
On Thu, Dec
only one implementation
which takes a List - requiring an in-memory representation
On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
Hi,
I think you have the right idea. I would not even worry about flatMap.
val rdd = sc.parallelize(1 to 100
/prefer_reducebykey_over_groupbykey.html
On Mon, Dec 8, 2014 at 3:47 PM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
Could you not use a groupByKey instead of the join? I mean something like
this:
val byDst = rdd.map { case (src, dst, w) => dst -> (src, w) }
byDst.groupByKey.map { case (dst, edges
distributions?
On Mon, Dec 8, 2014 at 3:59 PM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
I do not see how you hope to generate all incoming edge pairs without
repartitioning the data by dstID. You need to perform this shuffle for
joining too. Otherwise two incoming edges could
Hi,
I think you have the right idea. I would not even worry about flatMap.
val rdd = sc.parallelize(1 to 100, numSlices = 1000).map(x =>
generateRandomObject(x))
Then when you try to evaluate something on this RDD, it will happen
partition-by-partition. So 1000 random objects will be
Hi, Alexey,
I'm getting the same error on startup with Spark 1.1.0. Everything works
fine fortunately.
The error is mentioned in the logs in
https://issues.apache.org/jira/browse/SPARK-4498, so maybe it will also be
fixed in Spark 1.2.0 and 1.1.2. I have no insight into it unfortunately.
On Tue,
It is controlled by spark.task.maxFailures. See
http://spark.apache.org/docs/latest/configuration.html#scheduling.
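For example (a sketch; the value 8 is arbitrary):

import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.task.maxFailures", "8")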
On Fri, Dec 5, 2014 at 11:02 AM, shahab shahab.mok...@gmail.com wrote:
Hello,
By some (unknown) reasons some of my tasks, that fetch data from
Cassandra, are failing so often,
On Fri, Dec 5, 2014 at 7:12 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Rahul,
On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish
rahul.bindl...@nectechnologies.in wrote:
I have done so; that's why Spark is able to load the objectfile [e.g.
person_obj]
and Spark has maintained the serialVersionUID
On Wed, Dec 3, 2014 at 10:52 AM, shahab shahab.mok...@gmail.com wrote:
Hi,
I noticed that rdd.cache() does not happen immediately; rather, due to the
lazy evaluation in Spark, it happens just at the moment you perform some
map/reduce actions. Is this true?
Yes, this is correct.
If this is
Akhil, I think Aniket uses the word "persisted" in a different way than
what you mean. I.e. not in the RDD.persist() way. Aniket asks if running
combineByKey on a sorted RDD will result in a sorted RDD. (I.e. the sorting
is preserved.)
I think the answer is no. combineByKey uses AppendOnlyMap,
=> (1, i)).sortBy(t => t._2).foldByKey(0)((a,
b) => b).collect
res0: Array[(Int, Int)] = Array((1,8))
The fold always gives me the last value, 8, which suggests the values
preserve the sorting defined earlier in the DAG?
On Wed Nov 19 2014 at 18:10:11 Daniel Darabos
daniel.dara...@lynxanalytics.com
Even though the stage UI has min, 25th%, median, 75th%, and max durations,
I am often still left clueless about the distribution. For example, 100 out
of 200 tasks (started at the same time) have completed in 1 hour. How much
longer do I have to wait? I cannot guess well based on the five numbers.
How about this?
Class.forName("[Lorg.apache.spark.util.collection.CompactBuffer;")
On Tue, Sep 30, 2014 at 5:33 PM, Andras Barjak
andras.bar...@lynxanalytics.com wrote:
Hi,
what is the correct scala code to register an Array of this private spark
class to Kryo?
Can you share some example code of what you are doing?
BTW Gmail flags your mail as spam, saying it cannot verify it came from
yahoo.com. You might want to check your mail client settings. (It could be a
Gmail or Yahoo bug too, of course.)
On Fri, Jun 27, 2014 at 4:29 PM, harsh2005_7
I think you need to implement a timeout in your code. As far as I know,
Spark will not interrupt the execution of your code as long as the driver
is connected. Might be an idea though.
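A sketch of such a timeout with plain Scala futures (runJob is a made-up
stand-in for your Spark action):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
// Throws a TimeoutException after 10 minutes. Note that this only stops
// the waiting; it does not interrupt the job itself.
val result = Await.result(Future { runJob() }, 10.minutes)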
On Tue, Jun 17, 2014 at 7:54 PM, Peng Cheng pc...@uow.edu.au wrote:
I've tried enabling the speculative jobs,
I've been wondering about this. Is there a difference in performance
between these two?
val rdd1 = sc.textFile(files.mkString(","))
val rdd2 = sc.union(files.map(sc.textFile(_)))
I don't know about your use-case, Meethu, but it may be worth trying to see
if reading all the files into one RDD
What I mean is, let's say I run this:
sc.parallelize(Seq(0 -> 3, 0 -> 2, 0 -> 1), 3).partitionBy(new HashPartitioner(3)).collect
Will the result always be Array((0,3), (0,2), (0,1))? Or could I
possibly get a different order?
I'm pretty sure the shuffle files are taken in the order of the source
, at 12:14 PM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
What I mean is, let's say I run this:
sc.parallelize(Seq(0 -> 3, 0 -> 2, 0 -> 1),
3).partitionBy(new HashPartitioner(3)).collect
Will the result always be Array((0,3), (0,2), (0,1))? Or could I possibly get
a different order?
I'm
Check out SparkContext.getPersistentRDDs!
On Fri, Jun 13, 2014 at 1:06 PM, mrm ma...@skimlinks.com wrote:
Hi,
How do I check the rdds that I have persisted? I have some code that looks
like:
rd1.cache()
rd2.cache()
...
rdN.cache()
How can I unpersist all rdd's at once? And
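For example:

// Sketch: unpersist every RDD registered as persistent with this context.
sc.getPersistentRDDs.values.foreach(_.unpersist())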
These stack traces come from the stuck node? Looks like it's waiting on
data in BlockFetcherIterator. Waiting for data from another node. But you
say all other nodes were done? Very curious.
Maybe you could try turning on debug logging, and try to figure out what
happens in BlockFetcherIterator (
About more succeeded tasks than total tasks:
- This can happen if you have enabled speculative execution. Some
partitions can get processed multiple times.
- More commonly, the result of the stage may be used in a later
calculation, and has to be recalculated. This happens if some of the
results
On Wed, Jun 4, 2014 at 10:39 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
That’s a good idea too, maybe we can change CallSiteInfo to do that.
I've filed an issue: https://issues.apache.org/jira/browse/SPARK-2035
Matei
On Jun 4, 2014, at 8:44 AM, Daniel Darabos
daniel.dara
On Tue, Jun 3, 2014 at 8:46 PM, Marek Wiewiorka marek.wiewio...@gmail.com
wrote:
Hi All,
I've been experiencing a very strange error after upgrading from Spark 0.9
to 1.0 - it seems that the saveAsTextFile function is throwing
java.lang.UnsupportedOperationException, which I have never seen before.
Oh, this would be super useful for us too!
Actually wouldn't it be best if you could see the whole call stack on the
UI, rather than just one line? (Of course you would have to click to expand
it.)
On Wed, Jun 4, 2014 at 2:38 AM, John Salvatier jsalvat...@gmail.com wrote:
Ok, I will probably
I keep bumping into a problem with persisting RDDs. Consider this (silly)
example:
def everySecondFromBehind(input: RDD[Int]): RDD[Int] = {
  val count = input.count
  if (count % 2 == 0) {
    return input.filter(_ % 2 == 1)
  } else {
    return input.filter(_ % 2 == 0)
  }
}
The situation is
Graph.subgraph() allows you to apply a filter to edges and/or vertices.
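A rough sketch (made-up attribute types and predicates):

// Sketch: keep "friend" edges between vertices with a positive attribute.
val filtered = graph.subgraph(
  epred = triplet => triplet.attr == "friend",
  vpred = (id, attr) => attr > 0)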
On Thu, May 1, 2014 at 8:52 AM, Николай Кинаш peroksi...@gmail.com wrote:
Hello.
How to remove vertex or edges from graph in GraphX?
Cool intro, thanks! One question. On slide 23 it says "Standalone (local
mode)". That sounds a bit confusing without hearing the talk.
Standalone mode is not local. It just does not depend on cluster
software. I think it's the best mode for EC2/GCE, because they provide a
distributed filesystem
From: Daniel Darabos [mailto:daniel.dara...@lynxanalytics.com]
I have no idea why shuffle spill is so large. But this might make it
smaller:
val addition = (a: Int, b: Int) => a + b
val wordsCount = wordsPair.combineByKey(identity, addition, addition)
This way only one entry per distinct word
Create a key and join on that.
val callPricesByHour = callPrices.map(p => ((p.year, p.month, p.day,
p.hour), p))
val callsByHour = calls.map(c => ((c.year, c.month, c.day, c.hour), c))
val bills = callPricesByHour.join(callsByHour).mapValues({ case (p, c) =>
BillRow(c.customer, c.hour, c.minutes *
That is quite mysterious, and I do not think we have enough information to
answer. JavaPairRDD<String, Tuple2>.lookup() works fine on a remote Spark
cluster:
$ MASTER=spark://localhost:7077 bin/spark-shell
scala> val rdd = org.apache.spark.api.java.JavaPairRDD.fromRDD(sc.makeRDD(0
until 10, 3).map(x
Good question! I am also new to the JVM and would appreciate some tips.
On Sun, Apr 27, 2014 at 5:19 AM, wxhsdp wxh...@gmail.com wrote:
Hi, all
i have some questions about debug in spark:
1) when application finished, application UI is shut down, i can not see
the details about the app,
With the right program you can always exhaust any amount of memory :).
There is no silver bullet. You have to figure out what is happening in your
code that causes a high memory use and address that. I spent all of last
week doing this for a simple program of my own. Lessons I learned that may
or
This is caused by https://issues.apache.org/jira/browse/SPARK-1188. I think
the fix will be in the next release. But until then, do:
g.edges.map(_.copy()).distinct.count
On Wed, Apr 23, 2014 at 2:26 AM, Ryan Compton compton.r...@gmail.comwrote:
Try this:
On Mon, Apr 21, 2014 at 7:59 PM, Jim Carroll jimfcarr...@gmail.com wrote:
I'm experimenting with a few things trying to understand how it's working.
I
took the JavaSparkPi example as a starting point and added a few System.out
lines.
I added a System.out to the main body of the driver
Most likely the data is not just too big. For most operations the data is
processed partition by partition. The partitions may be too big. This is
what your last question hints at too:
val numWorkers = 10
val data = sc.textFile("somedirectory/data.csv", numWorkers)
This will work, but not quite
On Wed, Apr 23, 2014 at 12:06 AM, jaeholee jho...@lbl.gov wrote:
How do you determine the number of partitions? For example, I have 16
workers, and the number of cores and the worker memory set in spark-env.sh
are:
CORE = 8
MEMORY = 16g
So you have the capacity to work on 16 * 8 = 128
Looks like NumericRange in Scala is just a joke.
scala> val x = 0.0 to 1.0 by 0.1
x: scala.collection.immutable.NumericRange[Double] = NumericRange(0.0, 0.1,
0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7, 0.7999999999999999,
0.8999999999999999, 0.9999999999999999)
scala> x.take(3)
res1:
To make up for mocking Scala, I've filed a bug (
https://issues.scala-lang.org/browse/SI-8518) and will try to patch this.
On Fri, Apr 18, 2014 at 9:24 PM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
Looks like NumericRange in Scala is just a joke.
scala> val x = 0.0 to 1.0 by 0.1
I'm quite new myself (just subscribed to the mailing list today :)), but
this happens to be something we've had success with. So let me know if you
hit any problems with this sort of usage.
On Thu, Apr 17, 2014 at 9:11 PM, Jim Carroll jimfcarr...@gmail.com wrote:
Daniel,
I'm new to Spark but