LynxKite is now open-source

2020-06-24 Thread Daniel Darabos
Hi, LynxKite is a graph analytics application built on Apache Spark. (From very early on, like Spark 0.9.) We have talked about it on occasion at Spark Summits. So I wanted to let you know that it's now open-source! https://github.com/lynxkite/lynxkite You should totally check it out if you

Re: Quick one on evaluation

2017-08-04 Thread Daniel Darabos
On Fri, Aug 4, 2017 at 4:36 PM, Jean Georges Perrin wrote: > Thanks Daniel, > > I like your answer for #1. It makes sense. > > However, I don't get why you say that there are always pending > transformations... After you call an action, you should be "clean" from > pending

Re: Quick one on evaluation

2017-08-03 Thread Daniel Darabos
On Wed, Aug 2, 2017 at 2:16 PM, Jean Georges Perrin wrote: > Hi Sparkians, > > I understand the lazy evaluation mechanism with transformations and > actions. My question is simpler: 1) are show() and/or printSchema() > actions? I would assume so... > show() is an action (it prints

Re: Strongly Connected Components

2016-11-11 Thread Daniel Darabos
immensely. (See the last slide of http://www.slideshare.net/SparkSummit/interactive-graph-analytics-daniel-darabos#33.) The large advantage is due to the lower number of necessary iterations. For why this is failing even with one iteration: I would first check your partitioning. Too many

Re: Spark 2.0 - DataFrames vs Dataset performance

2016-10-24 Thread Daniel Darabos
Hi Antoaneta, I believe the difference is not due to Datasets being slower (DataFrames are just an alias to Datasets now), but rather using a user defined function for filtering vs using Spark builtins. The builtin can use tricks from Project Tungsten, such as only deserializing the "event_type"
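
A minimal sketch of the contrast described above (the Event schema, column name and file path are assumptions, not from the thread): filtering with a Catalyst expression lets Spark prune columns and stay inside Tungsten, while a Scala lambda forces every object to be fully deserialized first.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("filter-comparison").getOrCreate()
    import spark.implicits._

    case class Event(event_type: String, payload: String)        // assumed schema
    val events = spark.read.parquet("events.parquet").as[Event]  // assumed path

    // Built-in expression: only the "event_type" column needs to be read.
    val viaBuiltin = events.filter($"event_type" === "view")
    // Lambda: every Event object must be fully materialized before the predicate runs.
    val viaLambda = events.filter(e => e.event_type == "view")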

Re: how to run local[k] threads on a single core

2016-08-04 Thread Daniel Darabos
You could run the application in a Docker container constrained to one CPU with --cpuset-cpus ( https://docs.docker.com/engine/reference/run/#/cpuset-constraint). On Thu, Aug 4, 2016 at 8:51 AM, Sun Rui wrote: > I don’t think it possible as Spark does not support thread to

Re: Spark 1.6.2 version displayed as 1.6.1

2016-07-25 Thread Daniel Darabos
Another possible explanation is that by accident you are still running Spark 1.6.1. Which download are you using? This is what I see: $ ~/spark-1.6.2-bin-hadoop2.6/bin/spark-shell log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). log4j:WARN

Re: spark.executor.cores

2016-07-15 Thread Daniel Darabos
Mich's invocation is for starting a Spark application against an already running Spark standalone cluster. It will not start the cluster for you. We used to not use "spark-submit", but we started using it when it solved some problem for us. Perhaps that day has also come for you? :) On Fri, Jul

Re: How to Register Permanent User-Defined-Functions (UDFs) in SparkSQL

2016-07-12 Thread Daniel Darabos
Hi Lokesh, There is no way to do that. SqlContext.newSession documentation says: Returns a SQLContext as new session, with separated SQL configurations, temporary tables, registered functions, but sharing the same SparkContext, CacheManager, SQLListener and SQLTab. You have two options:
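
The two options are cut off above; whichever you pick, a common workaround is simply to re-register the UDFs in every session. A rough sketch (the helper and the plusOne UDF are made up for illustration):

    import org.apache.spark.sql.SQLContext

    def registerMyUdfs(sql: SQLContext): Unit = {
      sql.udf.register("plusOne", (x: Int) => x + 1)  // example UDF, not from the thread
    }

    registerMyUdfs(sqlContext)            // sqlContext: your existing context (e.g. from spark-shell)
    val session = sqlContext.newSession()
    registerMyUdfs(session)               // registrations do not carry over, so repeat them per session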

Re: What is the interpretation of Cores in Spark doc

2016-06-12 Thread Daniel Darabos
Spark is a software product. In software a "core" is something that a process can run on. So it's a "virtual core". (Do not call these "threads". A "thread" is not something a process can run on.) local[*] uses java.lang.Runtime.availableProcessors()

Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-07 Thread Daniel Darabos
On Sun, Jun 5, 2016 at 9:51 PM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > If you fill up the cache, 1.6.0+ will suffer performance degradation from > GC thrashing. You can set spark.memory.useLegacyMode to true, or > spark.memory.f

Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-05 Thread Daniel Darabos
If you fill up the cache, 1.6.0+ will suffer performance degradation from GC thrashing. You can set spark.memory.useLegacyMode to true, or spark.memory.fraction to 0.66, or spark.executor.extraJavaOptions to -XX:NewRatio=3 to avoid this issue. I think my colleague filed a ticket for this issue,
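
For reference, a sketch of how those settings could be applied when building the configuration (pick one; all three are shown together only for illustration, and the conf can equally be passed with spark-submit --conf):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.useLegacyMode", "true")                      // fall back to the pre-1.6 memory manager
      // .set("spark.memory.fraction", "0.66")                        // or shrink the unified memory region
      // .set("spark.executor.extraJavaOptions", "-XX:NewRatio=3")    // or enlarge the old generation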

Re: best way to do deep learning on spark ?

2016-03-19 Thread Daniel Darabos
On Thu, Mar 17, 2016 at 3:51 AM, charles li wrote: > Hi, Alexander, > > that's awesome, and when will that feature be released ? Since I want to > know the opportunity cost between waiting for that release and use caffe or > tensorFlow ? > I don't expect MLlib will be

Re: coalesce and executor memory

2016-02-13 Thread Daniel Darabos
On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers wrote: > in spark, every partition needs to fit in the memory available to the core > processing it. > That does not agree with my understanding of how it works. I think you could do

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-29 Thread Daniel Darabos
f things for it still to exist. Link to >> article: >> https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/ >> >> Hopefully a little more stability will come out with the upcoming Spark >> 1.6 release on EMR (I think that is happening sometime soon). >

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-26 Thread Daniel Darabos
Have you tried setting spark.emr.dropCharacters to a lower value? (It defaults to 8.) :) Just joking, sorry! Fantastic bug. What data source do you have for this DataFrame? I could imagine for example that it's a Parquet file and on EMR you are running with the wrong version of the Parquet

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Daniel Darabos
Hi, How do you know it doesn't work? The log looks roughly normal to me. Is Spark not running at the printed address? Can you not start jobs? On Mon, Jan 18, 2016 at 11:51 AM, Oleg Ruchovets wrote: > Hi, > I try to follow the spark 1.6.0 to install spark on EC2. > >

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Daniel Darabos
On Mon, Jan 18, 2016 at 5:24 PM, Oleg Ruchovets wrote: > I thought script tries to install hadoop / hdfs also. And it looks like it > failed. Installation is only standalone spark without hadoop. Is it correct > behaviour? > Yes, it also sets up two HDFS clusters. Are they

Re: Using JDBC clients with "Spark on Hive"

2016-01-15 Thread Daniel Darabos
Does Hive JDBC work if you are not using Spark as a backend? I just had very bad experience with Hive JDBC in general. E.g. half the JDBC protocol is not implemented (https://issues.apache.org/jira/browse/HIVE-3175, filed in 2012). On Fri, Jan 15, 2016 at 2:15 AM, sdevashis

Re: stopping a process usgin an RDD

2016-01-04 Thread Daniel Darabos
You can cause a failure by throwing an exception in the code running on the executors. The task will be retried (if spark.task.maxFailures > 1), and then the stage is failed. No further tasks are processed after that, and an exception is thrown on the driver. You could catch the exception and see
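
A rough sketch of that pattern (the exception class, the shouldStop/process helpers and the message check are all made up; note the failing task is first retried spark.task.maxFailures times):

    class StopJobException(msg: String) extends Exception(msg)

    try {
      rdd.foreach { record =>
        if (shouldStop(record))                    // your own stop condition
          throw new StopJobException("stop requested")
        process(record)                            // your normal per-record work
      }
    } catch {
      case e: org.apache.spark.SparkException if e.getMessage.contains("StopJobException") =>
        // The stage has failed and no further tasks run; handle the early stop here on the driver.
    }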

Re: Is coalesce smart while merging partitions?

2015-10-08 Thread Daniel Darabos
> For example does spark try to merge the small partitions first or the election of partitions to merge is random? It is quite smart as Iulian has pointed out. But it does not try to merge small partitions first. Spark doesn't know the size of partitions. (The partitions are represented as

Re: Meaning of local[2]

2015-08-17 Thread Daniel Darabos
Hi Praveen, On Mon, Aug 17, 2015 at 12:34 PM, praveen S mylogi...@gmail.com wrote: > What does this mean in .setMaster(local[2])? Local mode (executor in the same JVM) with 2 executor threads. > Is this applicable only for standalone Mode? It is not applicable for standalone mode, only for
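
For completeness, a minimal sketch of what that master string looks like in code:

    import org.apache.spark.{SparkConf, SparkContext}

    // Driver and executor share this JVM; the work is done by 2 local threads.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("local-demo"))
    // "local[*]" would instead use one thread per available processor.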

Re: Research ideas using spark

2015-07-14 Thread Daniel Darabos
Hi Shahid, To be honest I think this question is better suited for Stack Overflow than for a PhD thesis. On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf sha...@trialx.com wrote: hi I have a 10 node cluster i loaded the data onto hdfs, so the no. of partitions i get is 9. I am running a spark

Re: S3 vs HDFS

2015-07-09 Thread Daniel Darabos
I recommend testing it for yourself. Even if you have no application, you can just run the spark-ec2 script, log in, run spark-shell and try reading files from an S3 bucket and from hdfs://<master IP>:9000/. (This is the ephemeral HDFS cluster, which uses SSD.) I just tested our application this

Re: Split RDD into two in a single pass

2015-07-06 Thread Daniel Darabos
This comes up so often. I wonder if the documentation or the API could be changed to answer this question. The solution I found is from http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job. You basically write the items into two directories in a single
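
The linked Stack Overflow answer boils down to keying each record by its target output and using a Hadoop output format that routes keys to subdirectories; a sketch of that idea (the class name, the wanted() predicate and the paths are illustrative only):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class KeyAsDirectoryOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Do not write the routing key into the output files themselves.
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
      // Place each record under a subdirectory named after its key.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String] + "/" + name
    }

    rdd.map(x => (if (wanted(x)) "kept" else "dropped", x.toString))  // wanted(x): your own predicate
      .saveAsHadoopFile("split-output", classOf[String], classOf[String],
        classOf[KeyAsDirectoryOutputFormat])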

Re: .NET on Apache Spark?

2015-07-02 Thread Daniel Darabos
Indeed Spark does not have .NET bindings. On Thu, Jul 2, 2015 at 10:33 AM, Zwits daniel.van...@ortec-finance.com wrote: I'm currently looking into a way to run a program/code (DAG) written in .NET on a cluster using Spark. However I ran into problems concerning the coding language, Spark has

Re: map vs mapPartitions

2015-06-25 Thread Daniel Darabos
Spark creates a RecordReader and uses next() on it when you call input.next(). (See https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L215) How the RecordReader works is an HDFS question, but it's safe to say there is no difference between using

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Daniel Darabos
It would be even faster to load the data on the driver and sort it there without using Spark :). Using reduce() is cheating, because it only works as long as the data fits on one machine. That is not the targeted use case of a distributed computation system. You can repeat your test with more data

Re: coGroup on RDDPojos

2015-06-08 Thread Daniel Darabos
I suggest you include your code and the error message! It's not even immediately clear what programming language you mean to ask about. On Mon, Jun 8, 2015 at 2:50 PM, elbehery elbeherymust...@gmail.com wrote: Hi, I have two datasets of customer types, and I would like to apply coGrouping on

Re: Spark error in execution

2015-01-06 Thread Daniel Darabos
Hello! I just had a very similar stack trace. It was caused by an Akka version mismatch. (From trying to use Play 2.3 with Spark 1.1 by accident instead of 1.2.) On Mon, Nov 24, 2014 at 7:15 PM, Blackeye black...@iit.demokritos.gr wrote: I created an application in spark. When I run it with

Re: when will the spark 1.3.0 be released?

2014-12-17 Thread Daniel Darabos
Spark 1.2.0 is coming in the next 48 hours according to http://apache-spark-developers-list.1001551.n3.nabble.com/RESULT-VOTE-Release-Apache-Spark-1-2-0-RC2-tc9815.html On Wed, Dec 17, 2014 at 10:11 AM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi All, When will the Spark 1.2.0 be

Re: Can spark job have sideeffects (write files to FileSystem)

2014-12-11 Thread Daniel Darabos
Yes, this is perfectly legal. This is what RDD.foreach() is for! You may be encountering an IO exception while writing, and maybe using() suppresses it. (?) I'd try writing the files with java.nio.file.Files.write() -- I'd expect there is less that can go wrong with that simple call. On Thu, Dec
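
A minimal sketch of that suggestion (the output path and the id field are made up; note the files end up on the local disk of whichever executor runs each task):

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths}

    rdd.foreach { record =>
      val path = Paths.get(s"/tmp/out-${record.id}.txt")  // hypothetical naming scheme
      Files.write(path, record.toString.getBytes(StandardCharsets.UTF_8))
    }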

Re: How can I create an RDD with millions of entries created programmatically

2014-12-09 Thread Daniel Darabos
only one implementation which takes a List - requiring an in memory representation On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Hi, I think you have the right idea. I would not even worry about flatMap. val rdd = sc.parallelize(1 to 100, numSlices = 1000).map(x => generateRandomObject(x))

Re: Efficient self-joins

2014-12-08 Thread Daniel Darabos
/prefer_reducebykey_over_groupbykey.html On Mon, Dec 8, 2014 at 3:47 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Could you not use a groupByKey instead of the join? I mean something like this: val byDst = rdd.map { case (src, dst, w) => dst -> (src, w) } byDst.groupByKey.map { case (dst, edges
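
The snippet above is cut off; a sketch of where it is heading, emitting every pair of incoming edges per destination without a join (assumes the edges are (src, dst, weight) triples with orderable source ids):

    // edges: RDD[(Long, Long, Double)], assumed shape
    val byDst = edges.map { case (src, dst, w) => dst -> (src, w) }
    val incomingPairs = byDst.groupByKey.flatMap { case (dst, in) =>
      for {
        (src1, w1) <- in
        (src2, w2) <- in
        if src1 < src2                    // keep each unordered pair once
      } yield (dst, (src1, w1), (src2, w2))
    }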

Re: Efficient self-joins

2014-12-08 Thread Daniel Darabos
distributions? On Mon, Dec 8, 2014 at 3:59 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: I do not see how you hope to generate all incoming edge pairs without repartitioning the data by dstID. You need to perform this shuffle for joining too. Otherwise two incoming edges could

Re: How can I create an RDD with millions of entries created programmatically

2014-12-08 Thread Daniel Darabos
Hi, I think you have the right idea. I would not even worry about flatMap. val rdd = sc.parallelize(1 to 100, numSlices = 1000).map(x => generateRandomObject(x)) Then when you try to evaluate something on this RDD, it will happen partition-by-partition. So 1000 random objects will be

Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-05 Thread Daniel Darabos
Hi, Alexey, I'm getting the same error on startup with Spark 1.1.0. Everything works fine fortunately. The error is mentioned in the logs in https://issues.apache.org/jira/browse/SPARK-4498, so maybe it will also be fixed in Spark 1.2.0 and 1.1.2. I have no insight into it unfortunately. On Tue,

Re: Increasing the number of retry in case of job failure

2014-12-05 Thread Daniel Darabos
It is controlled by spark.task.maxFailures. See http://spark.apache.org/docs/latest/configuration.html#scheduling. On Fri, Dec 5, 2014 at 11:02 AM, shahab shahab.mok...@gmail.com wrote: Hello, By some (unknown) reasons some of my tasks, that fetch data from Cassandra, are failing so often,
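
For example, to allow more retries before the whole job is failed (4 is the default):

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.task.maxFailures", "8")
    // or on the command line:  spark-submit --conf spark.task.maxFailures=8 ...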

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-05 Thread Daniel Darabos
On Fri, Dec 5, 2014 at 7:12 AM, Tobias Pfeiffer t...@preferred.jp wrote: Rahul, On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote: I have done so thats why spark is able to load objectfile [e.g. person_obj] and spark has maintained serialVersionUID

Re: How to enforce RDD to be cached?

2014-12-03 Thread Daniel Darabos
On Wed, Dec 3, 2014 at 10:52 AM, shahab shahab.mok...@gmail.com wrote: Hi, I noticed that rdd.cache() is not happening immediately rather due to lazy feature of Spark, it is happening just at the moment you perform some map/reduce actions. Is this true? Yes, this is correct. If this is
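
A tiny sketch of the usual way to force materialization right away: mark the RDD as cached, then run a cheap action over it.

    rdd.cache()   // only marks the RDD as cacheable; nothing is computed yet
    rdd.count()   // an action: computes all partitions, which puts them in the cache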

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Daniel Darabos
Akhil, I think Aniket uses the word persisted in a different way than what you mean. I.e. not in the RDD.persist() way. Aniket asks if running combineByKey on a sorted RDD will result in a sorted RDD. (I.e. the sorting is preserved.) I think the answer is no. combineByKey uses AppendOnlyMap,

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Daniel Darabos
=> (1, i)).sortBy(t => t._2).foldByKey(0)((a, b) => b).collect res0: Array[(Int, Int)] = Array((1,8)) The fold always given me last value as 8 which suggests values preserve sorting earlier defined in stage in DAG? On Wed Nov 19 2014 at 18:10:11 Daniel Darabos daniel.dara...@lynxanalytics.com

Task duration graph on Spark stage UI

2014-11-06 Thread Daniel Darabos
Even though the stage UI has min, 25th%, median, 75th%, and max durations, I am often still left clueless about the distribution. For example, 100 out of 200 tasks (started at the same time) have completed in 1 hour. How much longer do I have to wait? I cannot guess well based on the five numbers.

Re: registering Array of CompactBuffer to Kryo

2014-10-02 Thread Daniel Darabos
How about this? Class.forName("[Lorg.apache.spark.util.collection.CompactBuffer;") On Tue, Sep 30, 2014 at 5:33 PM, Andras Barjak andras.bar...@lynxanalytics.com wrote: Hi, what is the correct scala code to register an Array of this private spark class to Kryo?
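
Spelled out a bit more: on Spark 1.2+ registerKryoClasses makes this a one-liner, while on earlier versions the same Class.forName call would go inside a custom KryoRegistrator.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
    conf.registerKryoClasses(Array(
      // "[L...;" is the JVM name for "array of CompactBuffer"; the class is
      // private to Spark, so it has to be looked up by name.
      Class.forName("[Lorg.apache.spark.util.collection.CompactBuffer;")))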

Re: Spark RDD member of class loses it's value when the class being used as graph attribute

2014-06-30 Thread Daniel Darabos
Can you share some example code of what you are doing? BTW Gmail puts down your mail as spam, saying it cannot verify it came from yahoo.com. Might want to check your mail client settings. (It could be a Gmail or Yahoo bug too of course.) On Fri, Jun 27, 2014 at 4:29 PM, harsh2005_7

Re: What is the best way to handle transformations or actions that takes forever?

2014-06-17 Thread Daniel Darabos
I think you need to implement a timeout in your code. As far as I know, Spark will not interrupt the execution of your code as long as the driver is connected. Might be an idea though. On Tue, Jun 17, 2014 at 7:54 PM, Peng Cheng pc...@uow.edu.au wrote: I've tried enabling the speculative jobs,
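
One way to sketch such a driver-side timeout (the duration and expensiveStep are placeholders; the Future alone does not stop the cluster work, hence the cancelAllJobs call):

    import scala.concurrent.{Await, Future, TimeoutException}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val work = Future { rdd.map(expensiveStep).collect() }  // expensiveStep: your transformation
    try {
      Await.result(work, 30.minutes)
    } catch {
      case _: TimeoutException => sc.cancelAllJobs()        // give up on the still-running stages
    }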

Re: join operation is taking too much time

2014-06-17 Thread Daniel Darabos
I've been wondering about this. Is there a difference in performance between these two? val rdd1 = sc.textFile(files.mkString(",")) val rdd2 = sc.union(files.map(sc.textFile(_))) I don't know about your use-case, Meethu, but it may be worth trying to see if reading all the files into one RDD

Is shuffle stable?

2014-06-14 Thread Daniel Darabos
What I mean is, let's say I run this: sc.parallelize(Seq(0 -> 3, 0 -> 2, 0 -> 1), 3).partitionBy(new HashPartitioner(3)).collect Will the result always be Array((0,3), (0,2), (0,1))? Or could I possibly get a different order? I'm pretty sure the shuffle files are taken in the order of the source

Re: Is shuffle stable?

2014-06-14 Thread Daniel Darabos
, at 12:14 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: What I mean is, let's say I run this: sc.parallelize(Seq(0 -> 3, 0 -> 2, 0 -> 1), 3).partitionBy(new HashPartitioner(3)).collect Will the result always be Array((0,3), (0,2), (0,1))? Or could I possibly get a different order? I'm

Re: list of persisted rdds

2014-06-13 Thread Daniel Darabos
Check out SparkContext.getPersistentRDDs! On Fri, Jun 13, 2014 at 1:06 PM, mrm ma...@skimlinks.com wrote: Hi, How do I check the rdds that I have persisted? I have some code that looks like: rd1.cache() rd2.cache() ... rdN.cache() How can I unpersist all rdd's at once? And
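
So unpersisting everything at once is a one-liner (getPersistentRDDs returns a Map from RDD id to the cached RDD):

    sc.getPersistentRDDs.values.foreach(_.unpersist())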

Re: Hanging Spark jobs

2014-06-11 Thread Daniel Darabos
These stack traces come from the stuck node? Looks like it's waiting on data in BlockFetcherIterator. Waiting for data from another node. But you say all other nodes were done? Very curious. Maybe you could try turning on debug logging, and try to figure out what happens in BlockFetcherIterator (

Re: Information on Spark UI

2014-06-11 Thread Daniel Darabos
About more succeeded tasks than total tasks: - This can happen if you have enabled speculative execution. Some partitions can get processed multiple times. - More commonly, the result of the stage may be used in a later calculation, and has to be recalculated. This happens if some of the results

Re: Better line number hints for logging?

2014-06-05 Thread Daniel Darabos
On Wed, Jun 4, 2014 at 10:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote: That’s a good idea too, maybe we can change CallSiteInfo to do that. I've filed an issue: https://issues.apache.org/jira/browse/SPARK-2035 Matei On Jun 4, 2014, at 8:44 AM, Daniel Darabos daniel.dara

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-04 Thread Daniel Darabos
On Tue, Jun 3, 2014 at 8:46 PM, Marek Wiewiorka marek.wiewio...@gmail.com wrote: Hi All, I've been experiencing a very strange error after upgrade from Spark 0.9 to 1.0 - it seems that the saveAsTextFile function is throwing java.lang.UnsupportedOperationException that I have never seen before.

Re: Better line number hints for logging?

2014-06-04 Thread Daniel Darabos
Oh, this would be super useful for us too! Actually wouldn't it be best if you could see the whole call stack on the UI, rather than just one line? (Of course you would have to click to expand it.) On Wed, Jun 4, 2014 at 2:38 AM, John Salvatier jsalvat...@gmail.com wrote: Ok, I will probably

Persist and unpersist

2014-05-27 Thread Daniel Darabos
I keep bumping into a problem with persisting RDDs. Consider this (silly) example: def everySecondFromBehind(input: RDD[Int]): RDD[Int] = { val count = input.count if (count % 2 == 0) { return input.filter(_ % 2 == 1) } else { return input.filter(_ % 2 == 0) } } The situation is

Re: GraphX. How to remove vertex or edge?

2014-05-01 Thread Daniel Darabos
Graph.subgraph() allows you to apply a filter to edges and/or vertices. On Thu, May 1, 2014 at 8:52 AM, Николай Кинаш peroksi...@gmail.com wrote: Hello. How to remove vertex or edges from graph in GraphX?
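
A sketch of that in use (the vertex id and the edge-attribute condition are made-up examples; note that dropping a vertex this way also drops the edges attached to it):

    // graph: an existing org.apache.spark.graphx.Graph with String edge attributes (example only)
    val pruned = graph.subgraph(
      epred = triplet => triplet.attr != "obsolete",  // keep edges whose attribute passes the test
      vpred = (id, attr) => id != 42L)                // keep every vertex except id 42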

Re: My talk on Spark: The Next Top (Compute) Model

2014-05-01 Thread Daniel Darabos
Cool intro, thanks! One question. On slide 23 it says "Standalone (local mode)". That sounds a bit confusing without hearing the talk. Standalone mode is not local. It just does not depend on cluster software. I think it's the best mode for EC2/GCE, because they provide a distributed filesystem

Re: Shuffle Spill Issue

2014-04-30 Thread Daniel Darabos
From: Daniel Darabos [mailto:daniel.dara...@lynxanalytics.com] I have no idea why shuffle spill is so large. But this might make it smaller: val addition = (a: Int, b: Int) => a + b val wordsCount = wordsPair.combineByKey(identity, addition, addition) This way only one entry per distinct word

Re: Joining not-pair RDDs in Spark

2014-04-29 Thread Daniel Darabos
Create a key and join on that. val callPricesByHour = callPrices.map(p => ((p.year, p.month, p.day, p.hour), p)) val callsByHour = calls.map(c => ((c.year, c.month, c.day, c.hour), c)) val bills = callPricesByHour.join(callsByHour).mapValues({ case (p, c) => BillRow(c.customer, c.hour, c.minutes *

Re: Strange lookup behavior. Possible bug?

2014-04-28 Thread Daniel Darabos
That is quite mysterious, and I do not think we have enough information to answer. JavaPairRDD<String, Tuple2>.lookup() works fine on a remote Spark cluster: $ MASTER=spark://localhost:7077 bin/spark-shell scala> val rdd = org.apache.spark.api.java.JavaPairRDD.fromRDD(sc.makeRDD(0 until 10, 3).map(x

Re: questions about debugging a spark application

2014-04-28 Thread Daniel Darabos
Good question! I am also new to the JVM and would appreciate some tips. On Sun, Apr 27, 2014 at 5:19 AM, wxhsdp wxh...@gmail.com wrote: Hi, all i have some questions about debug in spark: 1) when application finished, application UI is shut down, i can not see the details about the app,

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-23 Thread Daniel Darabos
With the right program you can always exhaust any amount of memory :). There is no silver bullet. You have to figure out what is happening in your code that causes a high memory use and address that. I spent all of last week doing this for a simple program of my own. Lessons I learned that may or

Re: GraphX: .edges.distinct().count() is 10?

2014-04-23 Thread Daniel Darabos
This is caused by https://issues.apache.org/jira/browse/SPARK-1188. I think the fix will be in the next release. But until then, do: g.edges.map(_.copy()).distinct.count On Wed, Apr 23, 2014 at 2:26 AM, Ryan Compton compton.r...@gmail.com wrote: Try this:

Re: stdout in workers

2014-04-22 Thread Daniel Darabos
On Mon, Apr 21, 2014 at 7:59 PM, Jim Carroll jimfcarr...@gmail.com wrote: I'm experimenting with a few things trying to understand how it's working. I took the JavaSparkPi example as a starting point and added a few System.out lines. I added a system.out to the main body of the driver

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread Daniel Darabos
Most likely the data is not just too big. For most operations the data is processed partition by partition. The partitions may be too big. This is what your last question hints at too: val numWorkers = 10 val data = sc.textFile("somedirectory/data.csv", numWorkers) This will work, but not quite

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread Daniel Darabos
On Wed, Apr 23, 2014 at 12:06 AM, jaeholee jho...@lbl.gov wrote: How do you determine the number of partitions? For example, I have 16 workers, and the number of cores and the worker memory set in spark-env.sh are: CORE = 8 MEMORY = 16g So you have the capacity to work on 16 * 8 = 128
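
Continuing that arithmetic as a sketch (the 2x multiplier is a common rule of thumb, not from the thread): aim for a small multiple of the total core count, with partitions small enough that one fits comfortably in an executor's memory.

    val totalCores = 16 * 8                                             // workers * cores per worker = 128
    val data = sc.textFile("somedirectory/data.csv", totalCores * 2)    // ask for ~256 partitions up front
    // or, for an RDD that already exists:
    val repartitioned = data.repartition(totalCores * 2)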

Re: sc.makeRDD bug with NumericRange

2014-04-18 Thread Daniel Darabos
Looks like NumericRange in Scala is just a joke. scala> val x = 0.0 to 1.0 by 0.1 x: scala.collection.immutable.NumericRange[Double] = NumericRange(0.0, 0.1, 0.2, 0.30004, 0.4, 0.5, 0.6, 0.7, 0.7999, 0.8999, 0.) scala> x.take(3) res1:

Re: sc.makeRDD bug with NumericRange

2014-04-18 Thread Daniel Darabos
To make up for mocking Scala, I've filed a bug (https://issues.scala-lang.org/browse/SI-8518) and will try to patch this. On Fri, Apr 18, 2014 at 9:24 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Looks like NumericRange in Scala is just a joke. scala> val x = 0.0 to 1.0 by 0.1

Re: Continuously running non-streaming jobs

2014-04-17 Thread Daniel Darabos
I'm quite new myself (just subscribed to the mailing list today :)), but this happens to be something we've had success with. So let me know if you hit any problems with this sort of usage. On Thu, Apr 17, 2014 at 9:11 PM, Jim Carroll jimfcarr...@gmail.com wrote: Daniel, I'm new to Spark but