RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
This is most likely due to the internal implementation of ALS in MLlib. Probably, for each parallel unit of execution (partition in Spark terms), the implementation allocates and uses a RAM buffer where it keeps interim results during the ALS iterations. If we assume that the size of that

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
of them will be in a suspended mode waiting for a free core (thread contexts also occupy additional RAM). From: Aniruddh Sharma [mailto:asharma...@gmail.com] Sent: Wednesday, July 8, 2015 12:52 PM To: Evo Eftimov Subject: Re: Out of Memory Errors on less number of cores in proportion to Partitions

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
Also try to increase the number of partitions gradually – not in one big jump from 20 to 100, but adding e.g. 10 at a time – and see whether there is a correlation with adding more RAM to the executors. From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Wednesday, July 8, 2015 1:26 PM

RE: foreachRDD vs. foreachPartition?

2015-07-08 Thread Evo Eftimov
That was a) fuzzy, b) insufficient – one can certainly use foreach (only) on DStream RDDs – it works, as an empirical observation. As another empirical observation: foreachPartition results in having one instance of the lambda/closure per partition when e.g. publishing to output systems

RE:

2015-07-07 Thread Evo Eftimov
spark.streaming.unpersist = false // in order for Spark Streaming to not drop the raw RDD data spark.cleaner.ttl = some reasonable value in seconds – why is the above suggested, provided the persist/cache operation on the constantly unionized Batch RDD will have to be invoked anyway (after every

RE:

2015-07-07 Thread Evo Eftimov
... dstream.foreachRDD{ rdd => myRDD = myRDD.union(rdd.filter(myfilter)).cache() } From: Gerard Maas [mailto:gerard.m...@gmail.com] Sent: Tuesday, July 7, 2015 1:55 PM To: Evo Eftimov Cc: Anand Nalya; spark users Subject: Re: Evo, I'd let the OP clarify the question. I'm
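Expanded into a fuller sketch (the socket source, filter predicate and batch interval are invented for illustration; note that `myRDD` must be reassigned on every micro-batch because `union` returns a new RDD):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnionAccumulate {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("union-accumulate"), Seconds(3))
    val dstream = ssc.socketTextStream("localhost", 9999) // illustrative source

    // Grow one Batch RDD out of the filtered contents of every micro-batch.
    var myRDD: RDD[String] = ssc.sparkContext.emptyRDD[String]
    dstream.foreachRDD { rdd =>
      // re-invoke cache() each time: union produces a brand-new RDD
      myRDD = myRDD.union(rdd.filter(_.contains("ERROR"))).cache()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```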

R on spark

2015-06-27 Thread Evo Eftimov
I had a look at the new R on Spark API / feature in Spark 1.4.0. For those skilled in the art (of R and distributed computing) it will be immediately clear that “ON” is a marketing ploy and what it actually is is “TO” – i.e. Spark 1.4.0 offers an INTERFACE from R TO DATA stored in Spark in distributed

RE: Spark Streaming: limit number of nodes

2015-06-24 Thread Evo Eftimov
BUT that limits the number of cores per Executor rather than the total cores for the job and hence will probably not yield the effect you need From: Wojciech Pituła [mailto:w.pit...@gmail.com] Sent: Wednesday, June 24, 2015 10:49 AM To: Evo Eftimov; user@spark.apache.org Subject: Re: Spark

RE: Spark Streaming: limit number of nodes

2015-06-24 Thread Evo Eftimov
There is no direct one-to-one mapping between Executor and Node. Executor is simply the Spark framework term for a JVM instance with some Spark framework system code running in it. A node is a physical server machine. You can have more than one JVM per node. And vice versa you can
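A hypothetical `spark-submit` invocation illustrating the distinction – the two 4-core executors requested here may land on one physical node or on two different nodes, as the cluster manager sees fit (class and jar names are placeholders):

```shell
# Request 2 executors (JVM instances), each with 4 cores and 8 GB RAM.
# Nothing here pins an executor to a particular physical node.
spark-submit \
  --master yarn \
  --num-executors 2 \
  --executor-cores 4 \
  --executor-memory 8g \
  --class com.example.MyApp \
  myapp.jar
```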

RE: Web UI vs History Server Bugs

2015-06-23 Thread Evo Eftimov
Probably your application has crashed or was terminated without invoking the stop method of the spark context - in such cases it doesn't create the empty flag file which apparently tells the history server that it can safely show the log data - simply go to some of the other dirs of the history server

Spark Streaming 1.3.0 ERROR LiveListenerBus

2015-06-19 Thread Evo Eftimov
Spark Streaming 1.3.0 on YARN during Job Execution keeps generating the following error while the application is running: ERROR LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException etc etc Caused by: java.io.IOException: Filesystem closed

RE: Machine Learning on GraphX

2015-06-18 Thread Evo Eftimov
What is GraphX: - It can be viewed as a kind of Distributed, Parallel, Graph Database - It can be viewed as a Graph Data Structure (Data Structures 101 from your CS course) - It features some off-the-shelf algos for Graph Processing and Navigation (Algos and Data

RE: Spark or Storm

2015-06-17 Thread Evo Eftimov
The only thing which doesn't make much sense in Spark Streaming (and I am not saying it is done better in Storm) is the iterative and redundant shipping of essentially the same tasks (closures/lambdas/functions) to the cluster nodes AND re-launching them there again and again. This is a

RE: stop streaming context of job failure

2015-06-16 Thread Evo Eftimov
https://spark.apache.org/docs/latest/monitoring.html also subscribe to various Listeners for various Metrics Types e.g. Job Stats/Statuses - this will allow you (in the driver) to decide when to stop the context gracefully (the listening and stopping can be done from a completely
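A minimal sketch of such a listener, assuming the goal is to stop the StreamingContext when any job fails (the stop-in-a-separate-thread detail avoids blocking the listener bus; registration via `sparkContext.addSparkListener` is shown in the comment):

```scala
import org.apache.spark.scheduler.{JobFailed, SparkListener, SparkListenerJobEnd}
import org.apache.spark.streaming.StreamingContext

// Register with: ssc.sparkContext.addSparkListener(new StopOnFailureListener(ssc))
class StopOnFailureListener(ssc: StreamingContext) extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = jobEnd.jobResult match {
    case JobFailed(e) =>
      // Stop from a separate thread - never block the listener event bus itself.
      new Thread(new Runnable {
        def run(): Unit = ssc.stop(stopSparkContext = true, stopGracefully = true)
      }).start()
    case _ => // job succeeded - nothing to do
  }
}
```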

RE: How does one decide no of executors/cores/memory allocation?

2015-06-16 Thread Evo Eftimov
Best is by measuring and recording how the performance of your solution scales as the workload scales - recording as in data points - and then you can do some time-series stat analysis and visualizations. For example you can start with a single box with e.g. 8 CPU cores. Use e.g. 1 or

RE: Join between DStream and Periodically-Changing-RDD

2015-06-15 Thread Evo Eftimov
“turn (keep turning) your HDFS file (Batch RDD) into a stream of messages (outside spark streaming)” – what I meant by that was “turn the Updates to your HDFS dataset into Messages” and send them as such to spark streaming From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Monday, June

RE: Join between DStream and Periodically-Changing-RDD

2015-06-15 Thread Evo Eftimov
of the app And at the same time you join the DStream RDDs of your actual Streaming Data with the above continuously updated DStream RDD representing your HDFS file From: Ilove Data [mailto:data4...@gmail.com] Sent: Monday, June 15, 2015 5:19 AM To: Tathagata Das Cc: Evo Eftimov; Akhil Das

RE: Join between DStream and Periodically-Changing-RDD

2015-06-10 Thread Evo Eftimov
It depends on how big the Batch RDD requiring reloading is. Reloading it for EVERY single DStream RDD would slow down the stream processing in line with the total time required to reload the Batch RDD ….. But if the Batch RDD is not that big then that might not be an issue, especially in

Re: Determining number of executors within RDD

2015-06-10 Thread Evo Eftimov
Yes, I think it is ONE worker, ONE executor, as an executor is nothing but a JVM instance spawned by the worker. To run more executors i.e. JVM instances on the same physical cluster node, you need to run more than one worker on that node and then allocate only part of the sys resources to that

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
Oops, @Yiannis, sorry to be a party pooper but the Job Server is for Spark Batch Jobs (besides, anyone can put something like that together in 5 min), while I am under the impression that Dmitry is working on a Spark Streaming app. Besides, the Job Server is essentially for sharing the Spark Context

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
once in your Spark Streaming App and then keep joining and then e.g. filtering every incoming DStream RDD with the (big static) Batch RDD From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Friday, June 5, 2015 3:27 PM To: 'Dmitry Goldenberg' Cc: 'Yiannis Gkoufas'; 'Olivier Girardot'; 'user
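A sketch of that load-once-then-keep-joining pattern, with an illustrative HDFS path and socket source standing in for the real datasets:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object JoinWithStatic {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("join-static"), Seconds(3))

    // Load the big static dataset ONCE and keep it cached across micro-batches.
    val staticRdd = ssc.sparkContext
      .textFile("hdfs:///data/reference.csv")            // illustrative path
      .map(line => (line.split(",")(0), line))
      .cache()

    // Join (or filter) every micro-batch RDD against the cached Batch RDD.
    val stream = ssc.socketTextStream("localhost", 9999) // illustrative source
      .map(line => (line.split(",")(0), line))
    val joined = stream.transform(rdd => rdd.join(staticRdd))
    joined.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```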

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
It is called Indexed RDD https://github.com/amplab/spark-indexedrdd From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] Sent: Friday, June 5, 2015 3:15 PM To: Evo Eftimov Cc: Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org Subject: Re: How to share large resources like

RE: Spark Streaming for Each RDD - Exception on Empty

2015-06-05 Thread Evo Eftimov
The foreachPartition callback is provided with an Iterator by the Spark Framework – while iterator.hasNext() …… Also check whether this is not some sort of Python Spark API bug – Python seems to be the foster child here – Scala and Java are the darlings. From: John Omernik

RE: How to increase the number of tasks

2015-06-05 Thread Evo Eftimov
It may be that your system runs out of resources (ie 174 is the ceiling) due to the following:
1. RDD Partition = (Spark) Task
2. RDD Partition != (Spark) Executor
3. (Spark) Task != (Spark) Executor
4. (Spark) Task = JVM Thread
5. (Spark) Executor = JVM

RE: How to increase the number of tasks

2015-06-05 Thread Evo Eftimov
€ρ@Ҝ (๏̯͡๏) Cc: Evo Eftimov; user Subject: Re: How to increase the number of tasks just multiply 2-4 with the cpu core number of the node . 2015-06-05 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com: I did not change spark.default.parallelism, What is recommended value for it. On Fri

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
share data between Jobs, while an RDD is ALWAYS visible within Jobs using the same Spark Context From: Charles Earl [mailto:charles.ce...@gmail.com] Sent: Friday, June 5, 2015 12:10 PM To: Evo Eftimov Cc: Dmitry Goldenberg; Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org Subject: Re

RE: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Evo Eftimov
Dmitry was concerned about the “serialization cost”, NOT the “memory footprint” – hence option a) is still viable, since a Broadcast is performed only ONCE for the lifetime of the Driver instance. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, June 3, 2015 2:44 PM To: Evo Eftimov Cc

RE: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Evo Eftimov
Hmmm, a Spark Streaming app's code doesn't execute in the linear fashion assumed in your previous code snippet - to achieve your objectives you should do something like the following. In terms of your second objective - saving the initialization and serialization of the params - you can: a) broadcast
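Option a) might look like the following sketch – the params map, endpoint and source are invented for illustration; the point is that the broadcast is serialized and shipped once per executor, and `bcParams.value` is then read locally:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BroadcastParams {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("bc-params"), Seconds(3))

    // Broadcast ONCE for the lifetime of the driver, not per task / per batch.
    val params = Map("endpoint" -> "http://example.com", "batchSize" -> "100")
    val bcParams = ssc.sparkContext.broadcast(params)

    ssc.socketTextStream("localhost", 9999).foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val p = bcParams.value // reads the executor-local broadcast copy
        records.foreach(r => println(s"${p("endpoint")} <- $r")) // stand-in for publishing
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```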

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Evo Eftimov
more From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] Sent: Wednesday, June 3, 2015 4:46 PM To: Evo Eftimov Cc: Cody Koeninger; Andrew Or; Gerard Maas; spark users Subject: Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics? Evo

RE: Value for SPARK_EXECUTOR_CORES

2015-05-28 Thread Evo Eftimov
I don’t think the number of CPU cores controls the “number of parallel tasks”. The number of Tasks corresponds first and foremost to the number of (DStream) RDD Partitions. The Spark documentation doesn’t mention what is meant by “Task” in terms of Standard Multithreading Terminology ie a

RE: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
@DG; The key metrics should be:
- Scheduling delay – its ideal state is to remain constant over time and ideally be less than the time of the micro-batch window
- The average job processing time should remain less than the micro-batch window
- Number of Lost Jobs

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
shut down your job gracefully. Besides, managing the offsets explicitly is not a big deal if necessary. Sent from Samsung Mobile. Original message from: Dmitry Goldenberg dgoldenberg...@gmail.com, date: 2015/05/28 13:16 (GMT+00:00), to: Evo Eftimov evo.efti

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
there is no free RAM ) and taking a performance hit from that, BUT only until there is no free RAM From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] Sent: Thursday, May 28, 2015 2:34 PM To: Evo Eftimov Cc: Gerard Maas; spark users Subject: Re: FW: Re: Autoscaling Spark cluster based on topic

RE: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Evo Eftimov
An Executor is a JVM instance spawned and running on a Cluster Node (Server machine). Task is essentially a JVM Thread – you can have as many Threads as you want per JVM. You will also hear about “Executor Slots” – these are essentially the CPU Cores available on the machine and granted for use

RE: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Evo Eftimov
9:30 AM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: How does spark manage the memory of executor with multiple tasks Yes, I know that one task represent a JVM thread. This is what I confused. Usually users want to specify the memory on task level, so how can I do it if task

Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Evo Eftimov
Original message from: Arush Kharbanda ar...@sigmoidanalytics.com, date: 2015/05/26 10:55 (GMT+00:00), to: canan chen ccn...@gmail.com, cc: Evo Eftimov evo.efti...@isecc.com, user@spark.apache.org, subject: Re: How does spark manage the memory of executor

RE: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Evo Eftimov
If the message consumption rate is higher than the time required to process ALL data for a micro batch (ie the next RDD produced for your stream), the following happens – let's say that e.g. your micro batch time is 3 sec: 1. Based on your message streaming and consumption rate, you

RE: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Evo Eftimov
performance in the name of the reliability/integrity of your system ie not losing messages) From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Friday, May 22, 2015 9:39 PM To: 'Tathagata Das'; 'Gautam Bajaj' Cc: 'user' Subject: RE: Storing spark processed output to Database asynchronously

Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Evo Eftimov
A receiver occupies a CPU core; an executor is simply a JVM instance and as such it can be granted any number of cores and RAM. So check how many cores you have per executor. Sent from Samsung Mobile. Original message from: Mike Trienis mike.trie...@orcsol.com

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
OR you can run Drools in a Central Server Mode ie as a common/shared service, but that would slowdown your Spark Streaming job due to the remote network call which will have to be generated for every single message From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Friday, May 22, 2015

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
From: Antonio Giambanco [mailto:antogia...@gmail.com] Sent: Friday, May 22, 2015 11:07 AM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: Spark Streaming and Drools Thanks a lot Evo, do you know where I can find some examples? Have a great one A G 2015-05-22 12:00 GMT+02:00

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
You can deploy and invoke Drools as a Singleton on every Spark Worker Node / Executor / Worker JVM You can invoke it from e.g. map, filter etc and use the result from the Rule to make decision how to transform/filter an event/message From: Antonio Giambanco [mailto:antogia...@gmail.com]
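The singleton aspect can be sketched with a plain Scala `object` – Drools' own API is elided here (the `KieServices` call in the comment is only indicative), since the pattern is what matters: a `lazy val` initializes once per executor JVM, not once per task:

```scala
// One instance per JVM (i.e. per executor): Scala objects are initialized
// lazily, the first time a task running on that executor touches them.
object RuleEngine {
  // Placeholder for the real rule session, e.g. (hypothetically):
  // KieServices.Factory.get().getKieClasspathContainer.newKieSession()
  lazy val session: AnyRef = new AnyRef

  // Called from map/filter closures; synchronized because multiple task
  // threads in the same executor JVM share the singleton.
  def evaluate(msg: String): Boolean = session.synchronized {
    // insert msg as a fact, fire rules, read the verdict (illustrative stand-in)
    msg.nonEmpty
  }
}

// usage inside a transformation:
// dstream.filter(msg => RuleEngine.evaluate(msg))
```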

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
The only “tricky” bit would be when you want to manage/update the Rule Base in your Drools Engines already running as Singletons in Executor JVMs on Worker Nodes. The invocation of Drools from Spark Streaming to evaluate a Rule already loaded in Drools is not a problem. From: Evo Eftimov

RE: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-20 Thread Evo Eftimov
Check whether the name can be resolved in the /etc/hosts file (or DNS) of the worker (the same btw applies for the Node where you run the driver app – all other nodes must be able to resolve its name) From: Stephen Boesch [mailto:java...@gmail.com] Sent: Wednesday, May 20, 2015 10:07

RE: Spark 1.3.1 Performance Tuning/Patterns for Data Generation Heavy/Throughput Jobs

2015-05-19 Thread Evo Eftimov
Is that a Spark or Spark Streaming application? Re the map transformation which is required, you can also try flatMap. Finally, an Executor is essentially a JVM spawned by a Spark Worker Node or YARN – giving 60GB RAM to a single JVM will certainly result in “off the charts” GC. I would

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
, as Evo says, Spark Streaming DOES crash in an “unceremonious way” when the free RAM available for In-Memory Cached RDDs gets exhausted From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Monday, May 18, 2015 2:03 PM To: Evo Eftimov Cc: Dmitry Goldenberg; user@spark.apache.org Subject: Re

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
communication and facts From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Monday, May 18, 2015 2:28 PM To: Evo Eftimov Cc: Dmitry Goldenberg; user@spark.apache.org Subject: Re: Spark Streaming and reducing latency we = Sigmoid back-pressuring mechanism = Stopping the receiver from

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
PS: ultimately, though, remember that none of this stuff is part of Spark Streaming as of yet. Sent from Samsung Mobile. Original message from: Akhil Das ak...@sigmoidanalytics.com, date: 2015/05/18 16:56 (GMT+00:00), to: Evo Eftimov evo.efti...@isecc.com

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
You can use spark.streaming.receiver.maxRate (default: not set) – the maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no
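As a config fragment (the rate of 100 records/second is illustrative):

```scala
import org.apache.spark.SparkConf

// Cap each receiver at 100 records/second; 0 or a negative value = unlimited.
val conf = new SparkConf()
  .setAppName("throttled-stream")
  .set("spark.streaming.receiver.maxRate", "100")
```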

RE: Spark Streaming and reducing latency

2015-05-18 Thread Evo Eftimov
-4-1410542878200 not found From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Monday, May 18, 2015 12:13 PM To: 'Dmitry Goldenberg'; 'Akhil Das' Cc: 'user@spark.apache.org' Subject: RE: Spark Streaming and reducing latency You can use spark.streaming.receiver.maxRate not set

RE: Spark Streaming and reducing latency

2015-05-17 Thread Evo Eftimov
This is the nature of Spark Streaming as a System Architecture:
1. It is a batch processing system architecture (Spark Batch) optimized for Streaming Data
2. In terms of sources of Latency in such System Architecture, bear in mind that besides “batching”, there is also the

RE: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-17 Thread Evo Eftimov
You can make ANY standard receiver sleep by implementing a custom Message Deserializer class with sleep method inside it. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Sunday, May 17, 2015 4:29 PM To: Haopu Wang Cc: user Subject: Re: [SparkStreaming] Is it possible to delay the

RE: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Evo Eftimov
to me From: Tathagata Das [mailto:t...@databricks.com] Sent: Friday, May 15, 2015 6:45 PM To: Evo Eftimov Cc: user Subject: Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond How are you configuring the fair scheduler pools? On Fri, May 15, 2015 at 8:33 AM, Evo Eftimov

RE: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-15 Thread Evo Eftimov
Ok thanks a lot for clarifying that – btw was your application a Spark Streaming App – I am also looking for confirmation that FAIR scheduling is supported for Spark Streaming Apps From: Richard Marscher [mailto:rmarsc...@localytics.com] Sent: Friday, May 15, 2015 7:20 PM To: Evo Eftimov

RE: swap tuple

2015-05-14 Thread Evo Eftimov
Where is the “Tuple” supposed to be in <String, String> - you can refer to a “Tuple” if it was e.g. <String, Tuple2<String, String>> From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of Holden Karau Sent: Thursday, May 14, 2015 5:56 PM To: Yasemin Kaya Cc:
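In other words (plain Scala – no Spark needed to see the point): a two-element pair is itself a `Tuple2`, so swapping key and value needs no nested tuple:

```scala
// A (String, String) value IS already a Tuple2[String, String];
// Tuple2.swap exchanges its elements directly.
val pair: (String, String) = ("key", "value")
val swapped: (String, String) = pair.swap

// on a pair RDD the same idea would be rdd.map(_.swap)
```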

RE: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Evo Eftimov
to be inherent to the “commercial” vendors, but I can confirm as fact it is also in effect in the “open source movement” (because human nature remains the same) From: David Morales [mailto:dmora...@stratio.com] Sent: Thursday, May 14, 2015 4:30 PM To: Paolo Platter Cc: Evo Eftimov; Matei Zaharia

RE: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Evo Eftimov
That has been a really rapid “evaluation” of the “work” and its “direction” From: David Morales [mailto:dmora...@stratio.com] Sent: Thursday, May 14, 2015 4:12 PM To: Matei Zaharia Cc: user@spark.apache.org Subject: Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

RE: DStream Union vs. StreamingContext Union

2015-05-12 Thread Evo Eftimov
I can confirm it does work in Java From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Tuesday, May 12, 2015 5:53 PM To: Evo Eftimov Cc: Saisai Shao; user@spark.apache.org Subject: Re: DStream Union vs. StreamingContext Union Thanks Evo. I tried chaining Dstream unions like

RE: Spark streaming closes with Cassandra Connector

2015-05-10 Thread Evo Eftimov
from Samsung Mobile. Original message from: Evo Eftimov evo.efti...@isecc.com, date: 2015/05/10 12:02 (GMT+00:00), to: 'Gerard Maas' gerard.m...@gmail.com, cc: 'Sergio Jiménez Barrio' drarse.a...@gmail.com, 'spark users' user@spark.apache.org

RE: Spark streaming closes with Cassandra Connector

2015-05-10 Thread Evo Eftimov
. * spark.cassandra.output.throughput_mb_per_sec: maximum write throughput allowed per single core in MB/s limit this on long (+8 hour) runs to 70% of your max throughput as seen on a smaller job for stability From: Sergio Jiménez Barrio [mailto:drarse.a...@gmail.com] Sent: Sunday, May 10, 2015 12:59 PM To: Evo Eftimov

RE: Spark streaming closes with Cassandra Connector

2015-05-10 Thread Evo Eftimov
I think the message that it has written 2 rows is misleading. If you look further down you will see that it could not initialize a connection pool for Cassandra (presumably while trying to write the previously mentioned 2 rows). Another confirmation of this hypothesis is the phrase “error

RE: Spark streaming closes with Cassandra Connector

2015-05-10 Thread Evo Eftimov
and distribution profile which may skip a potential error in the Connection Pool code From: Gerard Maas [mailto:gerard.m...@gmail.com] Sent: Sunday, May 10, 2015 11:56 AM To: Evo Eftimov Cc: Sergio Jiménez Barrio; spark users Subject: Re: Spark streaming closes with Cassandra Conector I'm

RE: Map one RDD into two RDD

2015-05-07 Thread Evo Eftimov
: Bill Q [mailto:bill.q@gmail.com] Sent: Thursday, May 7, 2015 4:55 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: Map one RDD into two RDD Thanks for the replies. We decided to use concurrency in Scala to do the two mappings using the same source RDD in parallel. So far

RE: Creating topology in spark streaming

2015-05-06 Thread Evo Eftimov
What is called Bolt in Storm is essentially a combination of [Transformation/Action and DStream RDD] in Spark – so to achieve a higher parallelism for specific Transformation/Action on specific Dstream RDD simply repartition it to the required number of partitions which directly relates to the

RE: Map one RDD into two RDD

2015-05-06 Thread Evo Eftimov
RDD1 = RDD.filter() RDD2 = RDD.filter() From: Bill Q [mailto:bill.q@gmail.com] Sent: Tuesday, May 5, 2015 10:42 PM To: user@spark.apache.org Subject: Map one RDD into two RDD Hi all, I have a large RDD that I map a function to it. Based on the nature of each record in the input
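A self-contained local-mode sketch of that two-filter split (data and predicates are invented; note the `cache()` so the source RDD is not recomputed once per filter):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SplitByFilter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("split").setMaster("local[2]"))

    // cache() the source so the two filter passes reuse it instead of
    // recomputing the lineage twice
    val source = sc.parallelize(Seq("A:1", "B:2", "A:3")).cache()
    val typeA = source.filter(_.startsWith("A")) // 2 records
    val typeB = source.filter(_.startsWith("B")) // 1 record
    println(s"A=${typeA.count()} B=${typeB.count()}")

    sc.stop()
  }
}
```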

RE: Creating topology in spark streaming

2015-05-06 Thread Evo Eftimov
To: Evo Eftimov Cc: anshu shukla; ayan guha; user@spark.apache.org Subject: Re: Creating topology in spark streaming Hi, I agree with Evo, Spark works at a different abstraction level than Storm, and there is not a direct translation from Storm topologies to Spark Streaming jobs. I think

RE: Receiver Fault Tolerance

2015-05-06 Thread Evo Eftimov
This is about the Kafka Receiver, IF you are using Spark Streaming. PS: that book is now behind the curve in quite a few areas since the release of 1.3.1 – read the documentation and forums. From: James King [mailto:jakwebin...@gmail.com] Sent: Wednesday, May 6, 2015 1:09 PM To: user

spark filestream problem

2015-05-02 Thread Evo Eftimov
it seems that on Spark Streaming 1.2 the filestream API may have a bug - it doesn't detect new files when moving or renaming them on HDFS - only when copying them, but that leads to a well known problem with .tmp files which get removed and make spark streaming filestream throw exception

RE: spark filestream problem

2015-05-02 Thread Evo Eftimov
is to a) copy the required file in a temp location and then b) move it from there to the dir monitored by spark filestream - this will ensure it is with a recent timestamp -Original Message- From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Saturday, May 2, 2015 5:09 PM To: user

RE: Tasks run only on one machine

2015-04-24 Thread Evo Eftimov
# of tasks = # of partitions, hence you can provide the desired number of partitions to the textFile API, which should result in a) a better spatial distribution of the RDD, b) each partition being operated upon by a separate task. You can provide the number of p -Original Message-
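For example (path and partition count are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReadWithPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-partitioned"))

    // Ask for at least 64 partitions up front => 64 tasks spread over the cluster.
    val rdd = sc.textFile("hdfs:///data/big.txt", 64)
    println(rdd.partitions.length)
  }
}
```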

RE: Slower performance when bigger memory?

2015-04-24 Thread Evo Eftimov
You can resort to Serialized storage (still in memory) of your RDDs - this will obviate the need for GC since the RDD elements are stored as serialized objects off the JVM heap (most likely in Tachyon, which is a distributed in-memory file system used by Spark internally). Also review the Object

RE: Custom partitioning of DStream

2015-04-23 Thread Evo Eftimov
it is not Action) applied to an RDD within foreach is distributed across the cluster since it gets applied to an RDD From: davidkl [via Apache Spark User List] [mailto:ml-node+s1001560n22630...@n3.nabble.com] Sent: Thursday, April 23, 2015 10:13 AM To: Evo Eftimov Subject: Re: Custom paritioning

RE: writing to hdfs on master node much faster

2015-04-20 Thread Evo Eftimov
Check whether your partitioning results in balanced partitions ie partitions with similar sizes - one of the reasons for the performance differences observed by you may be that after your explicit repartitioning, the partition on your master node is much smaller than the RDD partitions on the

RE: Equal number of RDD Blocks

2015-04-20 Thread Evo Eftimov
What is meant by “streams” here:
1. Two different DStream Receivers producing two different DStreams consuming from two different kafka topics, each with a different message rate
2. One kafka topic (hence only one message rate to consider) but with two different DStream receivers

RE: Equal number of RDD Blocks

2015-04-20 Thread Evo Eftimov
And what is the message rate of each topic mate – that was the other part of the required clarifications From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com] Sent: Monday, April 20, 2015 3:38 PM To: Evo Eftimov; user@spark.apache.org Subject: Re: Equal number of RDD Blocks Hi, I have two

RE: Equal number of RDD Blocks

2015-04-20 Thread Evo Eftimov
data more evenly you can partition it explicitly Also contact Data Bricks why the Receivers are not being distributed on different cluster nodes From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com] Sent: Monday, April 20, 2015 3:57 PM To: Evo Eftimov; user@spark.apache.org Subject: Re: Equal

RE: Super slow caching in 1.3?

2015-04-20 Thread Evo Eftimov
a detailed description / spec of both From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Thursday, April 16, 2015 7:23 PM To: Evo Eftimov Cc: Christian Perez; user Subject: Re: Super slow caching in 1.3? Here are the types that we specialize, other types will be much slower

Custom partitioning of DStream

2015-04-20 Thread Evo Eftimov
Is the only way to implement custom partitioning of a DStream via the foreach approach, so as to gain access to the actual RDDs comprising the DStream and hence their partitionBy method? DStream has only a repartition method accepting only the number of partitions, BUT not the method of partitioning
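Besides foreach, `DStream.transform` also exposes the underlying RDD of each micro-batch, so a custom Partitioner can be applied there – a sketch with an invented source, key scheme and partition count:

```scala
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CustomPartitionDStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("custom-partition"), Seconds(3))

    val kvStream = ssc.socketTextStream("localhost", 9999) // illustrative source
      .map(line => (line.split(",")(0), line))             // illustrative keying

    // transform gives you the micro-batch RDD, on which partitionBy accepts
    // any Partitioner (here a HashPartitioner with 32 partitions).
    val repartitioned = kvStream.transform(rdd => rdd.partitionBy(new HashPartitioner(32)))
    repartitioned.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```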

RE: Can a map function return null

2015-04-19 Thread Evo Eftimov
In fact you can return “NULL” from your initial map and hence not resort to Optional<String> at all From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Sunday, April 19, 2015 9:48 PM To: 'Steve Lewis' Cc: 'Olivier Girardot'; 'user@spark.apache.org' Subject: RE: Can a map function return

RE: Can a map function return null

2015-04-19 Thread Evo Eftimov
throw Spark exception THEN, as far as I am concerned, checkmate From: Steve Lewis [mailto:lordjoe2...@gmail.com] Sent: Sunday, April 19, 2015 8:16 PM To: Evo Eftimov Cc: Olivier Girardot; user@spark.apache.org Subject: Re: Can a map function return null So you imagine something like

RE: How to do dispatching in Streaming?

2015-04-17 Thread Evo Eftimov
the bottom line / the big picture – in some models, friction can be a huge factor in the equations, in some others it is just part of the landscape From: Gerard Maas [mailto:gerard.m...@gmail.com] Sent: Friday, April 17, 2015 10:12 AM To: Evo Eftimov Cc: Tathagata Das; Jianshi Huang; user; Shao

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-17 Thread Evo Eftimov
not be very appropriate in production because the two resource managers will be competing for cluster resources - but you can use this for performance tests From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Thursday, April 16, 2015 6:28 PM To: 'Manish Gupta 8'; 'user@spark.apache.org

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
And yet another way is to demultiplex at one point which will yield separate DStreams for each message type which you can then process in independent DAG pipelines in the following way: MessageType1DStream = MainDStream.filter(message type1) MessageType2DStream = MainDStream.filter(message
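A sketch of that demultiplexing – the `type1|`/`type2|` tagging convention, the source, and the downstream operations are all invented for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Demultiplex {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("demux"), Seconds(3))
    val mainStream = ssc.socketTextStream("localhost", 9999) // illustrative source

    // One filter per message type yields independent DStreams, each the start
    // of its own DAG pipeline.
    val type1 = mainStream.filter(_.startsWith("type1|"))
    val type2 = mainStream.filter(_.startsWith("type2|"))
    type1.map(_.toUpperCase).print() // pipeline for message type 1
    type2.count().print()            // pipeline for message type 2

    ssc.start()
    ssc.awaitTermination()
  }
}
```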

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
which cannot be done at the same time and has to be processed sequentially is a BAD thing. So the key is whether it is about 1 or 2 and, if it is about 1, whether it leads to e.g. Higher Throughput and Lower Latency or not. Regards, Evo Eftimov From: Gerard Maas [mailto:gerard.m

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
Also you can have each message type in a different topic (needs to be arranged upstream from your Spark Streaming app ie in the publishing systems and the messaging brokers) and then for each topic you can have a dedicated instance of InputReceiverDStream which will be the start of a dedicated
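A sketch of the per-topic receiver arrangement, assuming the receiver-based Kafka API of Spark Streaming 1.x (`KafkaUtils.createStream`); the ZooKeeper quorum, group id, and topic names are placeholders:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// One dedicated receiver DStream per topic, each starting its own pipeline
def perTopicStreams(ssc: StreamingContext): Unit = {
  val zkQuorum = "zk1:2181"   // placeholder ZooKeeper quorum
  val groupId  = "my-group"   // placeholder consumer group

  Seq("topicA", "topicB").foreach { topic =>
    val stream = KafkaUtils.createStream(
      ssc, zkQuorum, groupId, Map(topic -> 1),
      StorageLevel.MEMORY_AND_DISK_SER)

    // Values only (key is dropped); dedicated processing per topic
    stream.map(_._2).print()
  }
}
```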

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
How do you intend to fetch the required data - from within Spark or using an app / code / module outside Spark -Original Message- From: mas [mailto:mas.ha...@gmail.com] Sent: Thursday, April 16, 2015 4:08 PM To: user@spark.apache.org Subject: Data partitioning and node tracking in

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Ningjun, to speed up your current design you can do the following: 1. partition the large doc RDD based on the hash function on the key ie the docid 2. persist the large dataset in memory to be available for subsequent queries without reloading and repartitioning for every search query 3.
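Steps 1 and 2 above can be sketched as follows; the key/value types are assumptions for illustration:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hash-partition the large doc RDD by docid and keep it in memory, so
// repeated lookups/joins avoid reloading and reshuffling the large side.
def prepareDocs(docs: RDD[(String, String)], numPartitions: Int): RDD[(String, String)] = {
  val partitioned = docs.partitionBy(new HashPartitioner(numPartitions))
  partitioned.persist()   // MEMORY_ONLY by default
  partitioned
}

// A subsequent join against a keyed RDD using the same partitioner then
// requires no shuffle of the large, already-partitioned side, e.g.:
//   prepareDocs(docs, 64).join(queryIds.partitionBy(new HashPartitioner(64)))
```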

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
/framework your app code should not be bothered on which physical node exactly, a partition resides Regards Evo Eftimov From: MUHAMMAD AAMIR [mailto:mas.ha...@gmail.com] Sent: Thursday, April 16, 2015 4:20 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: Data partitioning and node

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
, April 16, 2015 4:32 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: Data partitioning and node tracking in Spark-GraphX Thanks a lot for the reply. Indeed it is useful but to be more precise i have 3D data and want to index it using octree. Thus i aim to build a two level indexing

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
because all worker instances run in the memory of a single machine .. Regards, Evo Eftimov From: Manish Gupta 8 [mailto:mgupt...@sapient.com] Sent: Thursday, April 16, 2015 6:03 PM To: user@spark.apache.org Subject: General configurations on CDH5 to achieve maximum Spark Performance Hi

RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
Michael what exactly do you mean by flattened version/structure here e.g.: 1. An Object with only primitive data types as attributes 2. An Object with no more than one level of other Objects as attributes 3. An Array/List of primitive types 4. An Array/List of Objects This question is in

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
-on-yarn.html From: Manish Gupta 8 [mailto:mgupt...@sapient.com] Sent: Thursday, April 16, 2015 6:21 PM To: Evo Eftimov; user@spark.apache.org Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance Thanks Evo. Yes, my concern is only regarding the infrastructure

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
HDFS adapter and invoke it in forEachRDD and foreach Regards Evo Eftimov From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:33 PM To: user@spark.apache.org Subject: saveAsTextFile I am using Spark Streaming where during each micro-batch I

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Nope Sir, it is possible - check my reply earlier -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, April 16, 2015 6:35 PM To: Vadim Bichutskiy Cc: user@spark.apache.org Subject: Re: saveAsTextFile You can't, since that's how it's designed to work. Batches

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Basically you need to unbundle the elements of the RDD and then store them wherever you want - Use foreachPartition and then foreach -Original Message- From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:39 PM To: Sean Owen Cc:
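The foreachRDD → foreachPartition → foreach pattern suggested above can be sketched like this. The socket sink is an assumed stand-in for any custom output adapter (HDFS client, JDBC, message broker); the point is one connection per partition rather than per element:

```scala
import java.io.PrintWriter
import java.net.Socket
import org.apache.spark.streaming.dstream.DStream

// Hypothetical sink: write each element of each micro-batch to a socket,
// opening one connection per partition on the worker that owns it.
def writeOut(lines: DStream[String], host: String, port: Int): Unit = {
  lines.foreachRDD { rdd =>
    rdd.foreachPartition { iter =>
      val out = new PrintWriter(new Socket(host, port).getOutputStream)
      try iter.foreach(out.println)
      finally out.close()
    }
  }
}
```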

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
files and directories From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:45 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: saveAsTextFile Thanks Evo for your detailed explanation. On Apr 16, 2015, at 1:38 PM, Evo Eftimov evo.efti

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
with 16GB and 4 cores). I found something called IndexedRDD on the web https://github.com/amplab/spark-indexedrdd Has anybody used it? Ningjun -Original Message- From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Thursday, April 16, 2015 12:18 PM To: 'Sean Owen'; Wang, Ningjun (LNG-NPV
