LynxKite is now open-source

2020-06-24 Thread Daniel Darabos
Hi, LynxKite is a graph analytics application built on Apache Spark. (From very early on, like Spark 0.9.) We have talked about it on occasion on Spark Summits. So I wanted to let you know that it's now open-source! https://github.com/lynxkite/lynxkite You should totally check it out if you work

Re: Quick one on evaluation

2017-08-04 Thread Daniel Darabos
On Fri, Aug 4, 2017 at 4:36 PM, Jean Georges Perrin wrote: > Thanks Daniel, > > I like your answer for #1. It makes sense. > > However, I don't get why you say that there are always pending > transformations... After you call an action, you should be "clean" from > pending transformations, no? >

Re: Quick one on evaluation

2017-08-03 Thread Daniel Darabos
On Wed, Aug 2, 2017 at 2:16 PM, Jean Georges Perrin wrote: > Hi Sparkians, > > I understand the lazy evaluation mechanism with transformations and > actions. My question is simpler: 1) are show() and/or printSchema() > actions? I would assume so... > show() is an action (it prints data) but prin
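
For quick reference, with df standing in for any DataFrame, the difference shows up as whether a Spark job is launched:

  df.show()         // triggers a job: rows must be computed before they can be printed
  df.printSchema()  // no job: the schema is known without evaluating any data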

Re: Strongly Connected Components

2016-11-11 Thread Daniel Darabos
ation immensely. (See the last slide of http://www.slideshare.net/SparkSummit/interactive-graph-analytics-daniel-darabos#33.) The large advantage is due to the lower number of necessary iterations. For why this is failing even with one iteration: I would first check your partitioning. Too

Re: Spark 2.0 - DataFrames vs Dataset performance

2016-10-24 Thread Daniel Darabos
Hi Antoaneta, I believe the difference is not due to Datasets being slower (DataFrames are just an alias to Datasets now), but rather using a user defined function for filtering vs using Spark builtins. The builtin can use tricks from Project Tungsten, such as only deserializing the "event_type" co
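
A minimal sketch of the contrast; the "event_type" column name comes from the thread, while the DataFrame/Dataset names and the case-class field are assumptions:

  import org.apache.spark.sql.functions.col
  // Builtin predicate: Catalyst sees the expression and can read only the needed column.
  df.filter(col("event_type") === "open").count()
  // User-defined Scala function: every row is deserialized into an object before the lambda runs.
  ds.filter(event => event.eventType == "open").count()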

Re: how to run local[k] threads on a single core

2016-08-04 Thread Daniel Darabos
You could run the application in a Docker container constrained to one CPU with --cpuset-cpus ( https://docs.docker.com/engine/reference/run/#/cpuset-constraint). On Thu, Aug 4, 2016 at 8:51 AM, Sun Rui wrote: > I don’t think it possible as Spark does not support thread to CPU affinity. > > On A

Re: Spark 1.6.2 version displayed as 1.6.1

2016-07-25 Thread Daniel Darabos
Another possible explanation is that by accident you are still running Spark 1.6.1. Which download are you using? This is what I see: $ ~/spark-1.6.2-bin-hadoop2.6/bin/spark-shell log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). log4j:WARN P

Re: spark.executor.cores

2016-07-15 Thread Daniel Darabos
Mich's invocation is for starting a Spark application against an already running Spark standalone cluster. It will not start the cluster for you. We used to not use "spark-submit", but we started using it when it solved some problem for us. Perhaps that day has also come for you? :) On Fri, Jul 1

Re: How to Register Permanent User-Defined-Functions (UDFs) in SparkSQL

2016-07-12 Thread Daniel Darabos
Hi Lokesh, There is no way to do that. SqlContext.newSession documentation says: Returns a SQLContext as new session, with separated SQL configurations, temporary tables, registered functions, but sharing the same SparkContext, CacheManager, SQLListener and SQLTab. You have two options: eithe

Re: What is the interpretation of Cores in Spark doc

2016-06-12 Thread Daniel Darabos
Spark is a software product. In software a "core" is something that a process can run on. So it's a "virtual core". (Do not call these "threads". A "thread" is not something a process can run on.) local[*] uses java.lang.Runtime.availableProcessors()

Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-07 Thread Daniel Darabos
On Sun, Jun 5, 2016 at 9:51 PM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > If you fill up the cache, 1.6.0+ will suffer performance degradation from > GC thrashing. You can set spark.memory.useLegacyMode to true, or > spark.memory.fract

Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-05 Thread Daniel Darabos
If you fill up the cache, 1.6.0+ will suffer performance degradation from GC thrashing. You can set spark.memory.useLegacyMode to true, or spark.memory.fraction to 0.66, or spark.executor.extraJavaOptions to -XX:NewRatio=3 to avoid this issue. I think my colleague filed a ticket for this issue, bu
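
These are ordinary configuration keys, so a sketch of setting them (pick one; the values are the ones suggested above):

  val conf = new org.apache.spark.SparkConf()
    .set("spark.memory.useLegacyMode", "true")
    // or: .set("spark.memory.fraction", "0.66")
    // or: .set("spark.executor.extraJavaOptions", "-XX:NewRatio=3")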

Re: best way to do deep learning on spark ?

2016-03-19 Thread Daniel Darabos
On Thu, Mar 17, 2016 at 3:51 AM, charles li wrote: > Hi, Alexander, > > that's awesome, and when will that feature be released ? Since I want to > know the opportunity cost between waiting for that release and use caffe or > tensorFlow ? > I don't expect MLlib will be able to compete with major

Re: SPARK-9559

2016-02-18 Thread Daniel Darabos
YARN may be a workaround. On Thu, Feb 18, 2016 at 4:13 PM, Ashish Soni wrote: > Hi All , > > Just wanted to know if there is any work around or resolution for below > issue in Stand alone mode > > https://issues.apache.org/jira/browse/SPARK-9559 > > Ashish >

Re: coalesce and executor memory

2016-02-13 Thread Daniel Darabos
On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers wrote: > in spark, every partition needs to fit in the memory available to the core > processing it. > That does not agree with my understanding of how it works. I think you could do sc.textFile("input").coalesce(1).map(_.replace("A", "B")).saveAsT

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-29 Thread Daniel Darabos
>> article: >> https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/ >> >> Hopefully a little more stability will come out with the upcoming Spark >> 1.6 release on EMR (I think that is happening sometime soon). >> >> Thanks again for the advic

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-26 Thread Daniel Darabos
Have you tried setting spark.emr.dropCharacters to a lower value? (It defaults to 8.) :) Just joking, sorry! Fantastic bug. What data source do you have for this DataFrame? I could imagine for example that it's a Parquet file and on EMR you are running with two wrong version of the Parquet librar

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Daniel Darabos
On Mon, Jan 18, 2016 at 5:24 PM, Oleg Ruchovets wrote: > I thought script tries to install hadoop / hdfs also. And it looks like it > failed. Installation is only standalone spark without hadoop. Is it correct > behaviour? > Yes, it also sets up two HDFS clusters. Are they not working? Try to se

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Daniel Darabos
Hi, How do you know it doesn't work? The log looks roughly normal to me. Is Spark not running at the printed address? Can you not start jobs? On Mon, Jan 18, 2016 at 11:51 AM, Oleg Ruchovets wrote: > Hi , >I try to follow the spartk 1.6.0 to install spark on EC2. > > It doesn't work properl

Re: Using JDBC clients with "Spark on Hive"

2016-01-15 Thread Daniel Darabos
Does Hive JDBC work if you are not using Spark as a backend? I just had very bad experience with Hive JDBC in general. E.g. half the JDBC protocol is not implemented (https://issues.apache.org/jira/browse/HIVE-3175, filed in 2012). On Fri, Jan 15, 2016 at 2:15 AM, sdevashis wrote: > Hello Expert

Re: stopping a process usgin an RDD

2016-01-04 Thread Daniel Darabos
You can cause a failure by throwing an exception in the code running on the executors. The task will be retried (if spark.task.maxFailures > 1), and then the stage is failed. No further tasks are processed after that, and an exception is thrown on the driver. You could catch the exception and see i
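
A rough sketch of the pattern; the RDD, the stop condition, and the recovery logic are placeholders:

  def shouldStop(record: String): Boolean = record.isEmpty  // placeholder condition
  try {
    rdd.foreach { record =>
      if (shouldStop(record)) throw new RuntimeException("stop requested")  // fails this task
    }
  } catch {
    case e: org.apache.spark.SparkException =>
      // After spark.task.maxFailures retries the stage fails and the failure surfaces here.
  }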

Re: Is coalesce smart while merging partitions?

2015-10-08 Thread Daniel Darabos
> For example does spark try to merge the small partitions first or the election of partitions to merge is random? It is quite smart as Iulian has pointed out. But it does not try to merge small partitions first. Spark doesn't know the size of partitions. (The partitions are represented as Iterato

Re: Meaning of local[2]

2015-08-17 Thread Daniel Darabos
Hi Praveen, On Mon, Aug 17, 2015 at 12:34 PM, praveen S wrote: > What does this mean in .setMaster("local[2]") > Local mode (executor in the same JVM) with 2 executor threads. > Is this applicable only for standalone Mode? > It is not applicable for standalone mode, only for local. > Can I do

Re: Research ideas using spark

2015-07-14 Thread Daniel Darabos
Hi Shahid, To be honest I think this question is better suited for Stack Overflow than for a PhD thesis. On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf wrote: > hi > > I have a 10 node cluster i loaded the data onto hdfs, so the no. of > partitions i get is 9. I am running a spark application ,

Re: S3 vs HDFS

2015-07-09 Thread Daniel Darabos
I recommend testing it for yourself. Even if you have no application, you can just run the spark-ec2 script, log in, run spark-shell and try reading files from an S3 bucket and from hdfs://:9000/. (This is the ephemeral HDFS cluster, which uses SSD.) I just tested our application this way yesterda

Re: Split RDD into two in a single pass

2015-07-06 Thread Daniel Darabos
This comes up so often. I wonder if the documentation or the API could be changed to answer this question. The solution I found is from http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job. You basically write the items into two directories in a single p
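
A condensed sketch of that approach, keying each record by its target subdirectory and letting a MultipleTextOutputFormat split the output; the predicate, types, and paths are assumptions:

  import org.apache.hadoop.io.NullWritable
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

  class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
    override def generateFileNameForKeyValue(key: Any, value: Any, name: String) =
      key.asInstanceOf[String] + "/" + name            // the key becomes the subdirectory
    override def generateActualKey(key: Any, value: Any) = NullWritable.get()
  }

  def keep(line: String): Boolean = line.nonEmpty       // placeholder predicate
  rdd.map(line => (if (keep(line)) "kept" else "dropped", line))
     .saveAsHadoopFile("out", classOf[String], classOf[String], classOf[KeyBasedOutput])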

Re: .NET on Apache Spark?

2015-07-02 Thread Daniel Darabos
Indeed Spark does not have .NET bindings. On Thu, Jul 2, 2015 at 10:33 AM, Zwits wrote: > I'm currently looking into a way to run a program/code (DAG) written in > .NET > on a cluster using Spark. However I ran into problems concerning the coding > language, Spark has no .NET API. > I tried look

Re: map vs mapPartitions

2015-06-25 Thread Daniel Darabos
Spark creates a RecordReader and uses next() on it when you call input.next(). (See https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L215) How the RecordReader works is an HDFS question, but it's safe to say there is no difference between using ma

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Daniel Darabos
It would be even faster to load the data on the driver and sort it there without using Spark :). Using reduce() is cheating, because it only works as long as the data fits on one machine. That is not the targeted use case of a distributed computation system. You can repeat your test with more data

Re: coGroup on RDD

2015-06-08 Thread Daniel Darabos
I suggest you include your code and the error message! It's not even immediately clear what programming language you mean to ask about. On Mon, Jun 8, 2015 at 2:50 PM, elbehery wrote: > Hi, > > I have two datasets of customer types, and I would like to apply coGrouping > on them. > > I could not

Re: Spark error in execution

2015-01-06 Thread Daniel Darabos
Hello! I just had a very similar stack trace. It was caused by an Akka version mismatch. (From trying to use Play 2.3 with Spark 1.1 by accident instead of 1.2.) On Mon, Nov 24, 2014 at 7:15 PM, Blackeye wrote: > I created an application in spark. When I run it with spark, everything > works > f

Re: when will the spark 1.3.0 be released?

2014-12-17 Thread Daniel Darabos
Spark 1.2.0 is coming in the next 48 hours according to http://apache-spark-developers-list.1001551.n3.nabble.com/RESULT-VOTE-Release-Apache-Spark-1-2-0-RC2-tc9815.html On Wed, Dec 17, 2014 at 10:11 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > > Hi All, > > When will the Spark 1.2

Re: Can spark job have sideeffects (write files to FileSystem)

2014-12-11 Thread Daniel Darabos
Yes, this is perfectly "legal". This is what RDD.foreach() is for! You may be encountering an IO exception while writing, and maybe using() suppresses it. (?) I'd try writing the files with java.nio.file.Files.write() -- I'd expect there is less that can go wrong with that simple call. On Thu, Dec
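
A minimal sketch of that suggestion; the path scheme is made up, and note that each executor writes to its own local filesystem unless the path points at shared storage:

  import java.nio.file.{Files, Paths}

  rdd.foreach { record =>
    val path = Paths.get(s"/tmp/out-${java.util.UUID.randomUUID}.txt")  // hypothetical naming
    Files.write(path, record.toString.getBytes("UTF-8"))
  }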

Re: How can I create an RDD with millions of entries created programmatically

2014-12-09 Thread Daniel Darabos
a) has only one implementation > which takes a List - requiring an in memory representation > > On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos < > daniel.dara...@lynxanalytics.com> wrote: > >> Hi, >> I think you have the right idea. I would not even worry about flatMap. >&

Re: How can I create an RDD with millions of entries created programmatically

2014-12-08 Thread Daniel Darabos
Hi, I think you have the right idea. I would not even worry about flatMap. val rdd = sc.parallelize(1 to 100, numSlices = 1000).map(x => generateRandomObject(x)) Then when you try to evaluate something on this RDD, it will happen partition-by-partition. So 1000 random objects will be generate

Re: Efficient self-joins

2014-12-08 Thread Daniel Darabos
ave incoming few edges. > > What is a good partitioning strategy for a self-join on an RDD with > unbalanced key distributions? > > > > On Mon, Dec 8, 2014 at 3:59 PM, Daniel Darabos < > daniel.dara...@lynxanalytics.com> wrote: > >> I do not see how you hope to

Re: Efficient self-joins

2014-12-08 Thread Daniel Darabos
/best_practices/prefer_reducebykey_over_groupbykey.html > > On Mon, Dec 8, 2014 at 3:47 PM, Daniel Darabos < > daniel.dara...@lynxanalytics.com> wrote: > >> Could you not use a groupByKey instead of the join? I mean something like >> this: >> >> val byDst = rdd.map { case (src, ds

Re: Efficient self-joins

2014-12-08 Thread Daniel Darabos
Could you not use a groupByKey instead of the join? I mean something like this: val byDst = rdd.map { case (src, dst, w) => dst -> (src, w) } byDst.groupByKey.map { case (dst, edges) => for { (src1, w1) <- edges (src2, w2) <- edges } { ??? // Do something. } ??? // Return somet
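
A complete version of that sketch; the "do something" part is filled in with an arbitrary placeholder result:

  val byDst = rdd.map { case (src, dst, w) => dst -> (src, w) }
  val pairs = byDst.groupByKey.flatMap { case (dst, edges) =>
    for {
      (src1, w1) <- edges
      (src2, w2) <- edges
    } yield (dst, src1, src2, w1 * w2)  // placeholder for the per-pair computation
  }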

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-05 Thread Daniel Darabos
On Fri, Dec 5, 2014 at 7:12 AM, Tobias Pfeiffer wrote: > Rahul, > > On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish < > rahul.bindl...@nectechnologies.in> wrote: >> >> I have done so thats why spark is able to load objectfile [e.g. >> person_obj] >> and spark has maintained serialVersionUID [perso

Re: Increasing the number of retry in case of job failure

2014-12-05 Thread Daniel Darabos
It is controlled by "spark.task.maxFailures". See http://spark.apache.org/docs/latest/configuration.html#scheduling. On Fri, Dec 5, 2014 at 11:02 AM, shahab wrote: > Hello, > > By some (unknown) reasons some of my tasks, that fetch data from > Cassandra, are failing so often, and apparently the
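
For reference, it can be raised like any other setting (the value 8 is arbitrary; the default is 4):

  val conf = new org.apache.spark.SparkConf().set("spark.task.maxFailures", "8")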

Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-05 Thread Daniel Darabos
Hi, Alexey, I'm getting the same error on startup with Spark 1.1.0. Everything works fine fortunately. The error is mentioned in the logs in https://issues.apache.org/jira/browse/SPARK-4498, so maybe it will also be fixed in Spark 1.2.0 and 1.1.2. I have no insight into it unfortunately. On Tue,

Re: How to enforce RDD to be cached?

2014-12-03 Thread Daniel Darabos
On Wed, Dec 3, 2014 at 10:52 AM, shahab wrote: > Hi, > > I noticed that rdd.cache() is not happening immediately rather due to lazy > feature of Spark, it is happening just at the moment you perform some > map/reduce actions. Is this true? > Yes, this is correct. If this is the case, how can I
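
The usual way to force this is to follow cache() with a cheap action so the partitions are materialized right away; a sketch:

  rdd.cache()
  rdd.foreach(_ => ())  // or rdd.count(); either action populates the cache immediately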

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Daniel Darabos
> sc.parallelize(1 to 8).map(i => (1, i)).sortBy(t => t._2).foldByKey(0)((a, > b) => b).collect > res0: Array[(Int, Int)] = Array((1,8)) > > The fold always given me last value as 8 which suggests values preserve > sorting earlier defined in stage in DAG? > > On W

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Daniel Darabos
Akhil, I think Aniket uses the word "persisted" in a different way than what you mean. I.e. not in the RDD.persist() way. Aniket asks if running combineByKey on a sorted RDD will result in a sorted RDD. (I.e. the sorting is preserved.) I think the answer is no. combineByKey uses AppendOnlyMap, whi

Task duration graph on Spark stage UI

2014-11-06 Thread Daniel Darabos
Even though the stage UI has min, 25th%, median, 75th%, and max durations, I am often still left clueless about the distribution. For example, 100 out of 200 tasks (started at the same time) have completed in 1 hour. How much longer do I have to wait? I cannot guess well based on the five numbers.

Re: registering Array of CompactBuffer to Kryo

2014-10-02 Thread Daniel Darabos
How about this? Class.forName("[Lorg.apache.spark.util.collection.CompactBuffer;") On Tue, Sep 30, 2014 at 5:33 PM, Andras Barjak wrote: > Hi, > > what is the correct scala code to register an Array of this private spark > class to Kryo? > > "java.lang.IllegalArgumentException: Class is not
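
On Spark 1.2+ that Class.forName trick can be plugged straight into Kryo registration; a sketch:

  val conf = new org.apache.spark.SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(
      Class.forName("[Lorg.apache.spark.util.collection.CompactBuffer;")))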

Re: Spark RDD member of class loses it's value when the class being used as graph attribute

2014-06-30 Thread Daniel Darabos
Can you share some example code of what you are doing? BTW Gmail puts down your mail as spam, saying it cannot verify it came from yahoo.com. Might want to check your mail client settings. (It could be a Gmail or Yahoo bug too of course.) On Fri, Jun 27, 2014 at 4:29 PM, harsh2005_7 wrote: > H

Re: Getting different answers running same line of code

2014-06-19 Thread Daniel Darabos
The easiest explanation would be if some other process is continuously modifying the files. You could make a copy in a new directory and run on that to eliminate this possibility. What do you see if you print "rd1.count()" multiple times? Have you tried the experiment on a smaller set of files? I

Re: join operation is taking too much time

2014-06-17 Thread Daniel Darabos
I've been wondering about this. Is there a difference in performance between these two? val rdd1 = sc.textFile(files.mkString(",")) val rdd2 = sc.union(files.map(sc .textFile(_))) I don't know about your use-case, Meethu, but it may be worth trying to see if reading all the files into one RDD (li

Re: What is the best way to handle transformations or actions that takes forever?

2014-06-17 Thread Daniel Darabos
I think you need to implement a timeout in your code. As far as I know, Spark will not interrupt the execution of your code as long as the driver is connected. Might be an idea though. On Tue, Jun 17, 2014 at 7:54 PM, Peng Cheng wrote: > I've tried enabling the speculative jobs, this seems part
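
A sketch of one way to do that, running the action on a separate thread and cancelling the Spark jobs if it takes too long; the RDD and the 30-minute limit are placeholders:

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration._

  val work = Future { rdd.count() }
  try Await.result(work, 30.minutes)
  catch {
    case _: java.util.concurrent.TimeoutException => sc.cancelAllJobs()  // stop the running jobs
  }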

Re: Is shuffle "stable"?

2014-06-14 Thread Daniel Darabos
> > Matei > > On Jun 14, 2014, at 12:14 PM, Daniel Darabos < > daniel.dara...@lynxanalytics.com> wrote: > > What I mean is, let's say I run this: > > sc.parallelize(Seq(0->3, 0->2, 0->1), > 3).partitionBy(HashPartitioner(3)).collect > > >

Is shuffle "stable"?

2014-06-14 Thread Daniel Darabos
What I mean is, let's say I run this: sc.parallelize(Seq(0->3, 0->2, 0->1), 3).partitionBy(HashPartitioner(3)).collect Will the result always be Array((0,3), (0,2), (0,1))? Or could I possibly get a different order? I'm pretty sure the shuffle files are taken in the order of the source partiti

Re: list of persisted rdds

2014-06-13 Thread Daniel Darabos
Check out SparkContext.getPersistentRDDs! On Fri, Jun 13, 2014 at 1:06 PM, mrm wrote: > Hi, > > How do I check the rdds that I have persisted? I have some code that looks > like: > "rd1.cache() > > rd2.cache() > ... > > rdN.cache()" > > How can I unpersist all rdd's at once? And is it
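
With that, unpersisting everything at once is a one-liner:

  sc.getPersistentRDDs.values.foreach(_.unpersist())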

Re: Information on Spark UI

2014-06-11 Thread Daniel Darabos
About more succeeded tasks than total tasks: - This can happen if you have enabled speculative execution. Some partitions can get processed multiple times. - More commonly, the result of the stage may be used in a later calculation, and has to be recalculated. This happens if some of the results

Re: Hanging Spark jobs

2014-06-11 Thread Daniel Darabos
These stack traces come from the stuck node? Looks like it's waiting on data in BlockFetcherIterator. Waiting for data from another node. But you say all other nodes were done? Very curious. Maybe you could try turning on debug logging, and try to figure out what happens in BlockFetcherIterator (

Re: Better line number hints for logging?

2014-06-05 Thread Daniel Darabos
On Wed, Jun 4, 2014 at 10:39 PM, Matei Zaharia wrote: > That’s a good idea too, maybe we can change CallSiteInfo to do that. > I've filed an issue: https://issues.apache.org/jira/browse/SPARK-2035 Matei > > On Jun 4, 2014, at 8:44 AM, Daniel Darabos < > daniel.dara...@l

Re: Better line number hints for logging?

2014-06-04 Thread Daniel Darabos
Oh, this would be super useful for us too! Actually wouldn't it be best if you could see the whole call stack on the UI, rather than just one line? (Of course you would have to click to expand it.) On Wed, Jun 4, 2014 at 2:38 AM, John Salvatier wrote: > Ok, I will probably open a Jira. > > > O

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-04 Thread Daniel Darabos
On Tue, Jun 3, 2014 at 8:46 PM, Marek Wiewiorka wrote: > Hi All, > I've been experiencing a very strange error after upgrade from Spark 0.9 > to 1.0 - it seems that saveAsTextFile function is throwing > java.lang.UnsupportedOperationException that I have never seen before. > In the stack trace y

Re: Persist and unpersist

2014-05-28 Thread Daniel Darabos
On Wed, May 28, 2014 at 12:08 AM, Ankur Dave wrote: > I think what's desired here is for input to be unpersisted automatically > as soon as result is materialized. I don't think there's currently a way > to do this, but the usual workaround is to force result to be > materialized immediately and
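
The workaround described above, spelled out; the transformation is an arbitrary example:

  input.cache()
  val result = input.filter(_ % 2 == 1)  // any chain of transformations
  result.cache()
  result.foreach(_ => ())                // force result to be materialized now
  input.unpersist()                      // input can now be dropped without result being recomputed later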

Persist and unpersist

2014-05-27 Thread Daniel Darabos
I keep bumping into a problem with persisting RDDs. Consider this (silly) example: def everySecondFromBehind(input: RDD[Int]): RDD[Int] = { val count = input.count if (count % 2 == 0) { return input.filter(_ % 2 == 1) } else { return input.filter(_ % 2 == 0) } } The situation is

Re: My talk on "Spark: The Next Top (Compute) Model"

2014-05-01 Thread Daniel Darabos
Cool intro, thanks! One question. On slide 23 it says "Standalone ("local" mode)". That sounds a bit confusing without hearing the talk. Standalone mode is not local. It just does not depend on a cluster software. I think it's the best mode for EC2/GCE, because they provide a distributed filesyste

Re: GraphX. How to remove vertex or edge?

2014-05-01 Thread Daniel Darabos
Graph.subgraph() allows you to apply a filter to edges and/or vertices. On Thu, May 1, 2014 at 8:52 AM, Николай Кинаш wrote: > Hello. > > How to remove vertex or edges from graph in GraphX? >
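
A sketch of what that looks like; the predicates are made up, and note that dropping a vertex also drops its incident edges:

  val vertexToRemove = 42L  // hypothetical vertex id
  val filtered = graph.subgraph(
    epred = triplet => triplet.attr > 0,         // keep only edges with a positive attribute
    vpred = (id, attr) => id != vertexToRemove)  // drop one vertex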

Re: Shuffle phase is very slow, any help, thx!

2014-04-30 Thread Daniel Darabos
So the problem is that 99 tasks are fast (< 1 second), but 1 task is really slow (5+ hours), is that right? And your operation is graph.vertices.count? That is odd, but it could be that this job includes running previous transformations. How did you construct the graph? On Tue, Apr 29, 2014 at 3:4

Re: something about memory usage

2014-04-30 Thread Daniel Darabos
On Wed, Apr 30, 2014 at 1:52 PM, wxhsdp wrote: > Hi, guys > > i want to do some optimizations of my spark codes. i use VisualVM to > monitor the executor when run the app. > here's the snapshot: > < > http://apache-spark-user-list.1001560.n3.nabble.com/file/n5107/executor.png > > > > from the

Re: Shuffle Spill Issue

2014-04-30 Thread Daniel Darabos
t; > Best Regards, > Raymond Liu > > From: Daniel Darabos [mailto:daniel.dara...@lynxanalytics.com] > > I have no idea why shuffle spill is so large. But this might make it > smaller: > > val addition = (a: Int, b: Int) => a + b > val wordsCount = wordsPair.combineBy

Re: packaging time

2014-04-29 Thread Daniel Darabos
Tips from my experience. Disable scaladoc: sources in doc in Compile := List() Do not package the source: publishArtifact in packageSrc := false And most importantly do not run "sbt assembly". It creates a fat jar. Use "sbt package" or "sbt stage" (from sbt-native-packager). They create a direc

Re: Joining not-pair RDDs in Spark

2014-04-29 Thread Daniel Darabos
Create a key and join on that. val callPricesByHour = callPrices.map(p => ((p.year, p.month, p.day, p.hour), p)) val callsByHour = calls.map(c => ((c.year, c.month, c.day, c.hour), c)) val bills = callPricesByHour.join(callsByHour).mapValues({ case (p, c) => BillRow(c.customer, c.hour, c.minutes *

Re: Shuffle Spill Issue

2014-04-29 Thread Daniel Darabos
I have no idea why shuffle spill is so large. But this might make it smaller: val addition = (a: Int, b: Int) => a + b val wordsCount = wordsPair.combineByKey(identity, addition, addition) This way only one entry per distinct word will end up in the shuffle for each partition, instead of one entr

Re: questions about debugging a spark application

2014-04-28 Thread Daniel Darabos
Good question! I am also new to the JVM and would appreciate some tips. On Sun, Apr 27, 2014 at 5:19 AM, wxhsdp wrote: > Hi, all > i have some questions about debug in spark: > 1) when application finished, application UI is shut down, i can not see > the details about the app, like >

Re: Strange lookup behavior. Possible bug?

2014-04-28 Thread Daniel Darabos
That is quite mysterious, and I do not think we have enough information to answer. JavaPairRDD.lookup() works fine on a remote Spark cluster: $ MASTER=spark://localhost:7077 bin/spark-shell scala> val rdd = org.apache.spark.api.java.JavaPairRDD.fromRDD(sc.makeRDD(0 until 10, 3).map(x => ((x%3).toS

Re: Comparing RDD Items

2014-04-23 Thread Daniel Darabos
Hi! There is RDD.cartesian(), which creates the Cartiesian product of two RDDs. You could do data.cartesian(data) to get an RDD of all pairs of lines. It will be of length data.count * data.count of course. On Wed, Apr 23, 2014 at 4:48 PM, Jared Rodriguez wrote: > Hi there, > > I am new to Spar
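
For example (beware that the result is quadratic in the input size):

  val pairs = data.cartesian(data)                            // every combination of two lines
  val distinctPairs = pairs.filter { case (a, b) => a != b }  // drop self-pairs if unwanted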

Re: GraphX: .edges.distinct().count() is 10?

2014-04-23 Thread Daniel Darabos
This is caused by https://issues.apache.org/jira/browse/SPARK-1188. I think the fix will be in the next release. But until then, do: g.edges.map(_.copy()).distinct.count On Wed, Apr 23, 2014 at 2:26 AM, Ryan Compton wrote: > Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt > >

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-23 Thread Daniel Darabos
With the right program you can always exhaust any amount of memory :). There is no silver bullet. You have to figure out what is happening in your code that causes a high memory use and address that. I spent all of last week doing this for a simple program of my own. Lessons I learned that may or m

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread Daniel Darabos
On Wed, Apr 23, 2014 at 12:06 AM, jaeholee wrote: > How do you determine the number of partitions? For example, I have 16 > workers, and the number of cores and the worker memory set in spark-env.sh > are: > > CORE = 8 > MEMORY = 16g > So you have the capacity to work on 16 * 8 = 128 tasks at a

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread Daniel Darabos
Most likely the data is not "just too big". For most operations the data is processed partition by partition. The partitions may be too big. This is what your last question hints at too: > val numWorkers = 10 > val data = sc.textFile("somedirectory/data.csv", numWorkers) This will work, but not q
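
The second argument is a minimum number of partitions, not a worker count; a sketch of asking for smaller partitions:

  val data = sc.textFile("somedirectory/data.csv", 1000)  // at least 1000 partitions
  // or reshuffle an existing RDD into more partitions:
  val finer = data.repartition(1000)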

Re: stdout in workers

2014-04-22 Thread Daniel Darabos
On Mon, Apr 21, 2014 at 7:59 PM, Jim Carroll wrote: > > I'm experimenting with a few things trying to understand how it's working. > I > took the JavaSparkPi example as a starting point and added a few System.out > lines. > > I added a system.out to the main body of the driver program (not inside

Re: sc.makeRDD bug with NumericRange

2014-04-18 Thread Daniel Darabos
To make up for mocking Scala, I've filed a bug ( https://issues.scala-lang.org/browse/SI-8518) and will try to patch this. On Fri, Apr 18, 2014 at 9:24 PM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Looks like NumericRange in Scala is just a joke. > > scala

Re: sc.makeRDD bug with NumericRange

2014-04-18 Thread Daniel Darabos
Looks like NumericRange in Scala is just a joke. scala> val x = 0.0 to 1.0 by 0.1 x: scala.collection.immutable.NumericRange[Double] = NumericRange(0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7, 0.7999999999999999, 0.8999999999999999, 0.9999999999999999) scala> x.take(3) res1: scala.coll

Re: Continuously running non-streaming jobs

2014-04-17 Thread Daniel Darabos
I'm quite new myself (just subscribed to the mailing list today :)), but this happens to be something we've had success with. So let me know if you hit any problems with this sort of usage. On Thu, Apr 17, 2014 at 9:11 PM, Jim Carroll wrote: > Daniel, > > I'm new to Spark but I thought that thr

Re: Continuously running non-streaming jobs

2014-04-17 Thread Daniel Darabos
The linked thread does a good job answering your question. You should create a SparkContext at startup and re-use it for all of your queries. For example we create a SparkContext in a web server at startup, and are then able to use the Spark cluster for serving Ajax queries with latency of a second
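
A minimal sketch of that setup; the object and application names are made up:

  object SharedSpark {
    // Created once at server startup and reused by every request handler.
    lazy val sc = new org.apache.spark.SparkContext(
      new org.apache.spark.SparkConf().setAppName("ajax-backend"))
  }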

Re: confused by reduceByKey usage

2014-04-17 Thread Daniel Darabos
Here's a way to debug something like this: scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1,v2) => { println("v1: " + v1) println("v2: " + v2) (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString }).collect You get: v1: 1 2 3 4 5 v2: 1 2 3 4 5 v1: 4 v