This is most likely due to the internal implementation of ALS in MLlib. Probably
for each parallel unit of execution (partition in Spark terms) the
implementation allocates and uses a RAM buffer where it keeps interim results
during the ALS iterations
If we assume that the size of that
of them will be in a suspended mode
waiting for a free core (thread contexts also occupy additional RAM)
From: Aniruddh Sharma [mailto:asharma...@gmail.com]
Sent: Wednesday, July 8, 2015 12:52 PM
To: Evo Eftimov
Subject: Re: Out of Memory Errors on less number of cores in proportion to
Partitions
Also try to increase the number of partitions gradually – not in one big jump
from 20 to 100 but adding e.g. 10 at a time – and see whether there is a
correlation with adding more RAM to the executors
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Wednesday, July 8, 2015 1:26 PM
That was a) fuzzy b) insufficient – one can certainly use foreach (only) on
DStream RDDs – it works, as an empirical observation
As another empirical observation:
foreachPartition results in having one instance of the lambda/closure per
partition when e.g. publishing to output systems
spark.streaming.unpersist = false // in order for SStreaming to not drop the
raw RDD data
spark.cleaner.ttl = some reasonable value in seconds
why is the above suggested provided the persist/cache operation on the
constantly unionized Batch RDD will have to be invoked anyway (after every
...
dstream.foreachRDD { rdd =>
  myRDD = myRDD.union(rdd.filter(myfilter)).cache()
}
From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Tuesday, July 7, 2015 1:55 PM
To: Evo Eftimov
Cc: Anand Nalya; spark users
Subject: Re:
Evo,
I'd let the OP clarify the question. I'm
I had a look at the new R on Spark API / Feature in Spark 1.4.0
For those skilled in the art (of R and distributed computing) it will be
immediately clear that ON is a marketing ploy and what it actually is, is
TO, ie Spark 1.4.0 offers an INTERFACE from R TO DATA stored in Spark in
distributed
BUT that limits the number of cores
per Executor rather than the total cores for the job and hence will probably
not yield the effect you need
From: Wojciech Pituła [mailto:w.pit...@gmail.com]
Sent: Wednesday, June 24, 2015 10:49 AM
To: Evo Eftimov; user@spark.apache.org
Subject: Re: Spark
There is no direct one to one mapping between Executor and Node
Executor is simply the spark framework term for JVM instance with some spark
framework system code running in it
A node is a physical server machine
You can have more than one JVM per node
And vice versa you can
Probably your application has crashed or was terminated without invoking the
stop method of spark context - in such cases it doesn't create the empty
flag file which apparently tells the history server that it can safely show
the log data - simply go to some of the other dirs of the history server
Spark Streaming 1.3.0 on YARN during Job Execution keeps generating the
following error while the application is running:
ERROR LiveListenerBus: Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
etc
etc
Caused by: java.io.IOException: Filesystem closed
What is GraphX:
- It can be viewed as a kind of Distributed, Parallel, Graph Database
- It can be viewed as Graph Data Structure (Data Structures 101 from
your CS course)
- It features some off-the-shelf algos for Graph Processing and
Navigation (Algos and Data
The only thing which doesn't make much sense in Spark Streaming (and I am
not saying it is done better in Storm) is the iterative and redundant
shipping of essentially the same tasks (closures/lambdas/functions) to
the cluster nodes AND re-launching them there again and again
This is a
https://spark.apache.org/docs/latest/monitoring.html
also subscribe to various Listeners for various Metrics Types e.g. Job
Stats/Statuses - this will allow you (in the driver) to decide when to stop
the context gracefully (the listening and stopping can be done from a
completely
Best is by measuring and recording how the performance of your solution
scales as the workload scales - recording as in recording data points - and
then you can do some time-series stat analysis and visualizations
For example you can start with a single box with e.g. 8 CPU cores
Use e.g. 1 or
“turn (keep turning) your HDFS file (Batch RDD) into a stream of messages
(outside spark streaming)” – what I meant by that was “turn the Updates to your
HDFS dataset into Messages” and send them as such to spark streaming
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Monday, June
of the app
And at the same time you join the DStream RDDs of your actual Streaming Data
with the above continuously updated DStream RDD representing your HDFS file
From: Ilove Data [mailto:data4...@gmail.com]
Sent: Monday, June 15, 2015 5:19 AM
To: Tathagata Das
Cc: Evo Eftimov; Akhil Das
It depends on how big the Batch RDD requiring reloading is
Reloading it for EVERY single DStream RDD would slow down the stream processing
inline with the total time required to reload the Batch RDD …..
But if the Batch RDD is not that big then that might not be an issue
especially in
Yes I think it is ONE worker ONE executor, as an executor is nothing but a jvm
instance spawned by the worker
To run more executors ie jvm instances on the same physical cluster node you
need to run more than one worker on that node and then allocate only part of
the sys resources to it
Oops, @Yiannis, sorry to be a party pooper but the Job Server is for Spark
Batch Jobs (besides anyone can put something like that in 5 min), while I am
under the impression that Dmytiy is working on Spark Streaming app
Besides the Job Server is essentially for sharing the Spark Context
once in your Spark Streaming App
and then keep joining and then e.g. filtering every incoming DStream RDD with
the (big static) Batch RDD
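A minimal sketch of that load-once-then-join pattern. Everything here is an illustrative assumption, not from the thread: the HDFS path, the `keyOf` key extractor and the `isRelevant` predicate are hypothetical placeholders.

```scala
// Load the big static Batch RDD once and keep it cached across micro-batches
val staticRdd = ssc.sparkContext
  .textFile("hdfs:///ref/dataset")
  .map(line => (keyOf(line), line))   // keyOf: hypothetical key extractor
  .cache()

// Join every incoming DStream RDD against the cached static RDD, then filter
val enriched = dstream
  .map(msg => (keyOf(msg), msg))
  .transform(rdd => rdd.join(staticRdd))
  .filter { case (_, (msg, ref)) => isRelevant(msg, ref) }
```

The `transform` operation is what gives you per-micro-batch access to the underlying RDD so a plain RDD-to-RDD join can be applied.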
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Friday, June 5, 2015 3:27 PM
To: 'Dmitry Goldenberg'
Cc: 'Yiannis Gkoufas'; 'Olivier Girardot'; 'user
It is called Indexed RDD https://github.com/amplab/spark-indexedrdd
From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Friday, June 5, 2015 3:15 PM
To: Evo Eftimov
Cc: Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org
Subject: Re: How to share large resources like
The foreachPartition callback is provided with an Iterator by the Spark Framework –
while iterator.hasNext() ……
Also check whether this is not some sort of Python Spark API bug – Python seems
to be the foster child here – Scala and Java are the darlings
From: John Omernik
It may be that your system runs out of resources (ie 174 is the ceiling) due to
the following
1. RDD Partition = (Spark) Task
2. RDD Partition != (Spark) Executor
3. (Spark) Task != (Spark) Executor
4. (Spark) Task = JVM Thread
5. (Spark) Executor = JVM
€ρ@Ҝ (๏̯͡๏)
Cc: Evo Eftimov; user
Subject: Re: How to increase the number of tasks
just multiply 2-4 with the cpu core number of the node .
2015-06-05 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:
I did not change spark.default.parallelism,
What is recommended value for it.
On Fri
share data
between Jobs, while an RDD is ALWAYS visible within Jobs using the same Spark
Context
From: Charles Earl [mailto:charles.ce...@gmail.com]
Sent: Friday, June 5, 2015 12:10 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg; Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org
Subject: Re
Dmitry was concerned about the “serialization cost” NOT the “memory footprint” –
hence option a) is still viable since a Broadcast is performed only ONCE for
the lifetime of the Driver instance
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, June 3, 2015 2:44 PM
To: Evo Eftimov
Cc
Hmmm a spark streaming app code doesn't execute in the linear fashion
assumed in your previous code snippet - to achieve your objectives you
should do something like the following
in terms of your second objective - saving the initialization and
serialization of the params you can:
a) broadcast
more
From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Wednesday, June 3, 2015 4:46 PM
To: Evo Eftimov
Cc: Cody Koeninger; Andrew Or; Gerard Maas; spark users
Subject: Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of
growth in Kafka or Spark's metrics?
Evo
I don’t think the number of CPU cores controls the “number of parallel tasks”.
The number of Tasks corresponds first and foremost to the number of (Dstream)
RDD Partitions
The Spark documentation doesn’t mention what is meant by “Task” in terms of
Standard Multithreading Terminology ie a
@DG; The key metrics should be
- Scheduling delay – its ideal state is to remain constant over time
and ideally be less than the time of the microbatch window
- The average job processing time should remain less than the
micro-batch window
- Number of Lost Jobs
shut down your job
gracefully. Besides managing the offsets explicitly is not a big deal if
necessary
Sent from Samsung Mobile
----- Original message -----
From: Dmitry Goldenberg dgoldenberg...@gmail.com
Date: 2015/05/28 13:16 (GMT+00:00)
To: Evo Eftimov evo.efti
there is no free RAM ) and
taking a performance hit from that, BUT only until there is no free RAM
From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Thursday, May 28, 2015 2:34 PM
To: Evo Eftimov
Cc: Gerard Maas; spark users
Subject: Re: FW: Re: Autoscaling Spark cluster based on topic
An Executor is a JVM instance spawned and running on a Cluster Node (Server
machine). Task is essentially a JVM Thread – you can have as many Threads as
you want per JVM. You will also hear about “Executor Slots” – these are
essentially the CPU Cores available on the machine and granted for use
9:30 AM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: How does spark manage the memory of executor with multiple tasks
Yes, I know that one task represents a JVM thread. This is what confused me.
Usually users want to specify the memory at the task level, so how can I do it if
task
----- Original message -----
From: Arush Kharbanda ar...@sigmoidanalytics.com
Date: 2015/05/26 10:55 (GMT+00:00)
To: canan chen ccn...@gmail.com
Cc: Evo Eftimov evo.efti...@isecc.com, user@spark.apache.org
Subject: Re: How does
spark manage the memory of executor
If the message consumption rate is higher than the time required to process ALL
data for a micro batch (ie the next RDD produced for your stream) the
following happens – let's say that e.g. your micro batch time is 3 sec:
1. Based on your message streaming and consumption rate, you
performance in the
name of the reliability/integrity of your system ie not losing messages)
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Friday, May 22, 2015 9:39 PM
To: 'Tathagata Das'; 'Gautam Bajaj'
Cc: 'user'
Subject: RE: Storing spark processed output to Database asynchronously
A receiver occupies a cpu core, an executor is simply a jvm instance and as
such it can be granted any number of cores and ram
So check how many cores you have per executor
Sent from Samsung Mobile
----- Original message -----
From: Mike Trienis mike.trie...@orcsol.com
OR you can run Drools in a Central Server Mode ie as a common/shared service,
but that would slow down your Spark Streaming job due to the remote network call
which will have to be generated for every single message
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Friday, May 22, 2015
From: Antonio Giambanco [mailto:antogia...@gmail.com]
Sent: Friday, May 22, 2015 11:07 AM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Spark Streaming and Drools
Thanks a lot Evo,
do you know where I can find some examples?
Have a great one
A G
2015-05-22 12:00 GMT+02:00
You can deploy and invoke Drools as a Singleton on every Spark Worker Node /
Executor / Worker JVM
You can invoke it from e.g. map, filter etc and use the result from the Rule to
make decision how to transform/filter an event/message
From: Antonio Giambanco [mailto:antogia...@gmail.com]
The only “tricky” bit would be when you want to manage/update the Rule Base in
your Drools Engines already running as Singletons in Executor JVMs on Worker
Nodes. The invocation of Drools from Spark Streaming to evaluate a Rule already
loaded in Drools is not a problem.
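A rough sketch of the singleton deployment described above, assuming the standard kie-api. The session setup and the idea of returning the event after firing the rules are illustrative; real rule I/O would go through facts/globals configured in your knowledge base.

```scala
import org.kie.api.KieServices

// One rule session per Executor JVM: a Scala object is initialized
// exactly once per JVM, which is precisely the Singleton semantics wanted
object DroolsSingleton {
  lazy val session = KieServices.Factory.get()
    .getKieClasspathContainer
    .newKieSession()
}

// Invoke the rules from map/filter on each message; no remote call involved
val evaluated = dstream.map { event =>
  DroolsSingleton.session.insert(event)
  DroolsSingleton.session.fireAllRules()
  event   // transform/route downstream based on what the rules decided
}
```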
From: Evo Eftimov
Check whether the name can be resolved in the /etc/hosts file (or DNS) of the
worker
(the same btw applies for the Node where you run the driver app – all other
nodes must be able to resolve its name)
From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Wednesday, May 20, 2015 10:07
Is that a Spark or Spark Streaming application
Re the map transformation which is required you can also try flatMap
Finally an Executor is essentially a JVM spawned by a Spark Worker Node or YARN –
giving 60GB RAM to a single JVM will certainly result in “off the charts” GC. I
would
, as Evo says
Spark Streaming DOES crash in an “unceremonious way” when the free RAM available
for In-Memory Cached RDDs gets exhausted
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Monday, May 18, 2015 2:03 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg; user@spark.apache.org
Subject: Re
communication and facts
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Monday, May 18, 2015 2:28 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg; user@spark.apache.org
Subject: Re: Spark Streaming and reducing latency
we = Sigmoid
back-pressuring mechanism = Stopping the receiver from
Ps: ultimately though remember that none of this stuff is part of spark
streaming as of yet
Sent from Samsung Mobile
----- Original message -----
From: Akhil Das ak...@sigmoidanalytics.com
Date: 2015/05/18 16:56 (GMT+00:00)
To: Evo Eftimov evo.efti...@isecc.com
You can use
spark.streaming.receiver.maxRate
not set
Maximum rate (number of records per second) at which each receiver will receive
data. Effectively, each stream will consume at most this number of records per
second. Setting this configuration to 0 or a negative number will put no limit
on the rate
-4-1410542878200 not found
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Monday, May 18, 2015 12:13 PM
To: 'Dmitry Goldenberg'; 'Akhil Das'
Cc: 'user@spark.apache.org'
Subject: RE: Spark Streaming and reducing latency
You can use
spark.streaming.receiver.maxRate
not set
This is the nature of Spark Streaming as a System Architecture:
1. It is a batch processing system architecture (Spark Batch) optimized
for Streaming Data
2. In terms of sources of Latency in such System Architecture, bear in
mind that besides “batching”, there is also the
You can make ANY standard receiver sleep by implementing a custom Message
Deserializer class with sleep method inside it.
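As a sketch of that trick, assuming the Kafka 0.8-era `Decoder` interface that Spark Streaming 1.x Kafka receivers use (the delay value is arbitrary and illustrative):

```scala
import kafka.serializer.Decoder

// Throttles any standard receiver by sleeping inside message deserialization:
// every decoded message pays a fixed artificial delay
class SleepingStringDecoder(delayMs: Long = 10) extends Decoder[String] {
  override def fromBytes(bytes: Array[Byte]): String = {
    Thread.sleep(delayMs)            // crude, receiver-side back-pressure
    new String(bytes, "UTF-8")
  }
}
```

The decoder is then passed as the type/class parameter when creating the Kafka input stream, so the sleep happens on the receiver thread itself.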
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Sunday, May 17, 2015 4:29 PM
To: Haopu Wang
Cc: user
Subject: Re: [SparkStreaming] Is it possible to delay the
to me
From: Tathagata Das [mailto:t...@databricks.com]
Sent: Friday, May 15, 2015 6:45 PM
To: Evo Eftimov
Cc: user
Subject: Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond
How are you configuring the fair scheduler pools?
On Fri, May 15, 2015 at 8:33 AM, Evo Eftimov
Ok thanks a lot for clarifying that – btw was your application a Spark
Streaming App – I am also looking for confirmation that FAIR scheduling is
supported for Spark Streaming Apps
From: Richard Marscher [mailto:rmarsc...@localytics.com]
Sent: Friday, May 15, 2015 7:20 PM
To: Evo Eftimov
Where is the “Tuple” supposed to be in (String, String) - you can refer to a
“Tuple” if it was e.g. (String, Tuple2<String, String>)
From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of
Holden Karau
Sent: Thursday, May 14, 2015 5:56 PM
To: Yasemin Kaya
Cc:
to be inherent to the “commercial”
vendors, but I can confirm as fact it is also in effect to the “open source
movement” (because human nature remains the same)
From: David Morales [mailto:dmora...@stratio.com]
Sent: Thursday, May 14, 2015 4:30 PM
To: Paolo Platter
Cc: Evo Eftimov; Matei Zaharia
That has been a really rapid “evaluation” of the “work” and its “direction”
From: David Morales [mailto:dmora...@stratio.com]
Sent: Thursday, May 14, 2015 4:12 PM
To: Matei Zaharia
Cc: user@spark.apache.org
Subject: Re: SPARKTA: a real-time aggregation engine based on Spark Streaming
I can confirm it does work in Java
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
Sent: Tuesday, May 12, 2015 5:53 PM
To: Evo Eftimov
Cc: Saisai Shao; user@spark.apache.org
Subject: Re: DStream Union vs. StreamingContext Union
Thanks Evo. I tried chaining Dstream unions like
from Samsung Mobile
----- Original message -----
From: Evo Eftimov evo.efti...@isecc.com
Date: 2015/05/10 12:02 (GMT+00:00)
To: 'Gerard Maas' gerard.m...@gmail.com
Cc: 'Sergio Jiménez Barrio' drarse.a...@gmail.com, 'spark users' user@spark.apache.org
.
* spark.cassandra.output.throughput_mb_per_sec: maximum write throughput
allowed per single core in MB/s limit this on long (+8 hour) runs to 70% of
your max throughput as seen on a smaller job for stability
From: Sergio Jiménez Barrio [mailto:drarse.a...@gmail.com]
Sent: Sunday, May 10, 2015 12:59 PM
To: Evo Eftimov
I think the message that it has written 2 rows is misleading
If you look further down you will see that it could not initialize a connection
pool for Cassandra (presumably while trying to write the previously mentioned 2
rows)
Another confirmation of this hypothesis is the phrase “error
and distribution profile which may
skip a potential error in the Connection Pool code
From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Sunday, May 10, 2015 11:56 AM
To: Evo Eftimov
Cc: Sergio Jiménez Barrio; spark users
Subject: Re: Spark streaming closes with Cassandra Conector
I'm
: Bill Q [mailto:bill.q@gmail.com]
Sent: Thursday, May 7, 2015 4:55 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Map one RDD into two RDD
Thanks for the replies. We decided to use concurrency in Scala to do the two
mappings using the same source RDD in parallel. So far
What is called Bolt in Storm is essentially a combination of
[Transformation/Action and DStream RDD] in Spark – so to achieve a higher
parallelism for specific Transformation/Action on specific Dstream RDD simply
repartition it to the required number of partitions which directly relates to
the
RDD1 = RDD.filter()
RDD2 = RDD.filter()
From: Bill Q [mailto:bill.q@gmail.com]
Sent: Tuesday, May 5, 2015 10:42 PM
To: user@spark.apache.org
Subject: Map one RDD into two RDD
Hi all,
I have a large RDD that I map a function to it. Based on the nature of each
record in the input
To: Evo Eftimov
Cc: anshu shukla; ayan guha; user@spark.apache.org
Subject: Re: Creating topology in spark streaming
Hi,
I agree with Evo, Spark works at a different abstraction level than Storm, and
there is not a direct translation from Storm topologies to Spark Streaming
jobs. I think
This is about Kafka Receiver IF you are using Spark Streaming
Ps: that book is now behind the curve in quite a few areas since the release
of 1.3.1 – read the documentation and forums
From: James King [mailto:jakwebin...@gmail.com]
Sent: Wednesday, May 6, 2015 1:09 PM
To: user
it seems that on Spark Streaming 1.2 the filestream API may have a bug - it
doesn't detect new files when moving or renaming them on HDFS - only when
copying them but that leads to a well known problem with .tmp files which
get removed and make spark streaming filestream throw an exception
is to a) copy the required file in a temp location and
then b) move it from there to the dir monitored by spark filestream - this
will ensure it is with recent timestamp
-Original Message-
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Saturday, May 2, 2015 5:09 PM
To: user
# of tasks = # of partitions, hence you can provide the desired number of
partitions to the textFile API which should result in a) a better spatial
distribution of the RDD and b) each partition being operated upon by a separate
task
You can provide the number of p
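For example (the path and the figure of 64 are illustrative):

```scala
// Ask textFile for a minimum number of partitions up front:
// more partitions means more tasks and a better spatial spread of the RDD
val rdd = sc.textFile("hdfs:///data/big.txt", minPartitions = 64)
println(rdd.getNumPartitions)   // at least 64; each partition becomes one task
```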
-Original Message-
You can resort to Serialized storage (still in memory) of your RDDs - this
will obviate the need for GC since the RDD elements are stored as serialized
objects off the JVM heap (most likely in Tachyon, which is a distributed
in-memory file system used by Spark internally)
Also review the Object
it is not Action) applied to an
RDD within foreach is distributed across the cluster since it gets applied
to an RDD
From: davidkl [via Apache Spark User List]
[mailto:ml-node+s1001560n22630...@n3.nabble.com]
Sent: Thursday, April 23, 2015 10:13 AM
To: Evo Eftimov
Subject: Re: Custom paritioning
Check whether your partitioning results in balanced partitions ie partitions
with similar sizes - one of the reasons for the performance differences
observed by you may be that after your explicit repartitioning, the partition
on your master node is much smaller than the RDD partitions on the
What is meant by “streams” here:
1. Two different DSTream Receivers producing two different DSTreams
consuming from two different kafka topics, each with different message rate
2. One kafka topic (hence only one message rate to consider) but with two
different DStream receivers
And what is the message rate of each topic mate – that was the other part of
the required clarifications
From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com]
Sent: Monday, April 20, 2015 3:38 PM
To: Evo Eftimov; user@spark.apache.org
Subject: Re: Equal number of RDD Blocks
Hi,
I have two
data more evenly you can partition it explicitly
Also contact Data Bricks why the Receivers are not being distributed on
different cluster nodes
From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com]
Sent: Monday, April 20, 2015 3:57 PM
To: Evo Eftimov; user@spark.apache.org
Subject: Re: Equal
a
detailed description / spec of both
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Thursday, April 16, 2015 7:23 PM
To: Evo Eftimov
Cc: Christian Perez; user
Subject: Re: Super slow caching in 1.3?
Here are the types that we specialize, other types will be much slower
Is the only way to implement custom partitioning of a DStream via the foreach
approach, so as to gain access to the actual RDDs comprising the DStream and
hence their partitionBy method?
DStream has only a repartition method accepting only the number of
partitions, BUT not the method of partitioning
In fact you can return “NULL” from your initial map and hence not resort to
Optional<String> at all
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Sunday, April 19, 2015 9:48 PM
To: 'Steve Lewis'
Cc: 'Olivier Girardot'; 'user@spark.apache.org'
Subject: RE: Can a map function return
throw Spark exception THEN
as far as I am concerned, chess-mate
From: Steve Lewis [mailto:lordjoe2...@gmail.com]
Sent: Sunday, April 19, 2015 8:16 PM
To: Evo Eftimov
Cc: Olivier Girardot; user@spark.apache.org
Subject: Re: Can a map function return null
So you imagine something like
the bottom line / the big picture – in some models, friction
can be a huge factor in the equations in some other it is just part of the
landscape
From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Friday, April 17, 2015 10:12 AM
To: Evo Eftimov
Cc: Tathagata Das; Jianshi Huang; user; Shao
not be very appropriate in production because the two
resource managers will be competing for cluster resources - but you can use
this for performance tests
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 6:28 PM
To: 'Manish Gupta 8'; 'user@spark.apache.org
And yet another way is to demultiplex at one point which will yield separate
DStreams for each message type which you can then process in independent DAG
pipelines in the following way:
MessageType1DStream = MainDStream.filter(message type1)
MessageType2DStream = MainDStream.filter(message type2)
which can not be done at the same time and has to be
processed sequentially is a BAD thing
So the key is whether it is about 1 or 2 and if it is about 1, whether it leads
to e.g. Higher Throughput and Lower Latency or not
Regards,
Evo Eftimov
From: Gerard Maas [mailto:gerard.m
Also you can have each message type in a different topic (needs to be arranged
upstream from your Spark Streaming app ie in the publishing systems and the
messaging brokers) and then for each topic you can have a dedicated instance of
InputReceiverDStream which will be the start of a dedicated
How do you intend to fetch the required data - from within Spark or using
an app / code / module outside Spark
-Original Message-
From: mas [mailto:mas.ha...@gmail.com]
Sent: Thursday, April 16, 2015 4:08 PM
To: user@spark.apache.org
Subject: Data partitioning and node tracking in
Ningjun, to speed up your current design you can do the following:
1.partition the large doc RDD based on the hash function on the key ie the docid
2. persist the large dataset in memory to be available for subsequent queries
without reloading and repartitioning for every search query
3.
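Steps 1 and 2 could look roughly like this (the partition count, path, and key/value types are assumptions for illustration):

```scala
import org.apache.spark.HashPartitioner

// 1. hash-partition the (docId, document) pairs on the key, so equal
//    keys always land in the same partition
// 2. persist in memory so later queries skip the reload + repartition cost
val docs = sc.objectFile[(String, String)]("hdfs:///docs")
  .partitionBy(new HashPartitioner(128))
  .persist()

// a subsequent search query then touches only the partition owning the key
val matches = docs.lookup("doc-42")
```

Because the RDD carries a partitioner, `lookup` can route straight to the single partition that owns the docid instead of scanning all of them.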
/framework your app
code should not be bothered on which physical node exactly, a partition resides
Regards
Evo Eftimov
From: MUHAMMAD AAMIR [mailto:mas.ha...@gmail.com]
Sent: Thursday, April 16, 2015 4:20 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Data partitioning and node
, April 16, 2015 4:32 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Data partitioning and node tracking in Spark-GraphX
Thanks a lot for the reply. Indeed it is useful but to be more precise i have
3D data and want to index it using octree. Thus i aim to build a two level
indexing
because all worker instances run in the memory of a single
machine ..
Regards,
Evo Eftimov
From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Thursday, April 16, 2015 6:03 PM
To: user@spark.apache.org
Subject: General configurations on CDH5 to achieve maximum Spark Performance
Hi
Michael what exactly do you mean by flattened version/structure here e.g.:
1. An Object with only primitive data types as attributes
2. An Object with no more than one level of other Objects as attributes
3. An Array/List of primitive types
4. An Array/List of Objects
This question is in
-on-yarn.html
From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Thursday, April 16, 2015 6:21 PM
To: Evo Eftimov; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark
Performance
Thanks Evo. Yes, my concern is only regarding the infrastructure
HDFS adapter and invoke it in foreachRDD and foreach
Regards
Evo Eftimov
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
Sent: Thursday, April 16, 2015 6:33 PM
To: user@spark.apache.org
Subject: saveAsTextFile
I am using Spark Streaming where during each micro-batch I
Nope Sir, it is possible - check my reply earlier
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, April 16, 2015 6:35 PM
To: Vadim Bichutskiy
Cc: user@spark.apache.org
Subject: Re: saveAsTextFile
You can't, since that's how it's designed to work. Batches
Basically you need to unbundle the elements of the RDD and then store them
wherever you want - use foreachPartition and then foreach
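A sketch of that unbundle-and-store pattern; the sink/connection API here is a hypothetical placeholder for whatever external store you write to:

```scala
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { elements =>
    // hypothetical: open one connection per partition, not per element
    val sink = openSinkConnection()
    elements.foreach(e => sink.write(e))   // store each RDD element
    sink.close()
  }
}
```

The point of the foreachPartition/foreach nesting is amortizing connection setup across all elements of a partition.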
-Original Message-
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
Sent: Thursday, April 16, 2015 6:39 PM
To: Sean Owen
Cc:
files and
directories
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
Sent: Thursday, April 16, 2015 6:45 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: saveAsTextFile
Thanks Evo for your detailed explanation.
On Apr 16, 2015, at 1:38 PM, Evo Eftimov evo.efti
with 16GB and 4 cores).
I found something called IndexedRDD on the web
https://github.com/amplab/spark-indexedrdd
Has anybody use it?
Ningjun
-Original Message-
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 12:18 PM
To: 'Sean Owen'; Wang, Ningjun (LNG-NPV