Unsubscribe

2021-11-17 Thread Edwin Barahona
Unsubscribe


Spark Streaming - Increasing number of executors slows down processing rate

2017-06-19 Thread Mal Edwin
Hi All,
I am struggling with an odd issue and would like your help in addressing it.

Environment
AWS Cluster (40 Spark Nodes & 4 node Kafka cluster)
Spark Kafka Streaming submitted in Yarn cluster mode
Kafka - Single topic, 400 partitions
Spark 2.1 on Cloudera
Kafka 10.0 on Cloudera

We have zero messages in Kafka and start this Spark job with 100 executors, 
each with 14GB of RAM and a single executor core.
The time to process 0 records (i.e. to reach the end of each batch) is 5 seconds.

When we increase the executors to 400 and everything else remains the same, 
except that we reduce executor memory to 11GB, the time to process 0 records 
(end of each batch) increases tenfold to 50 seconds, and in some cases it goes 
up to 103 seconds.

Spark Streaming configs that we are setting are (wired up as sketched below):
batch window = 60 seconds
spark.streaming.backpressure.enabled = true
spark.memory.fraction = 0.3 (we store more data in our own data structures)
spark.streaming.kafka.consumer.poll.ms = 1
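
For reference, this is roughly how those settings are wired up in the job (a 
minimal sketch, not our exact code; it assumes the spark-streaming-kafka-0-10 
integration, and the broker list, group id and topic name are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf()
  .setAppName("kafka-streaming")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.memory.fraction", "0.3")
  .set("spark.streaming.kafka.consumer.poll.ms", "1")

// 60-second batch window
val ssc = new StreamingContext(conf, Seconds(60))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",              // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-consumer-group"                   // placeholder
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  // processing goes here; with an empty topic every batch has 0 records
}

ssc.start()
ssc.awaitTermination()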

We have tried increasing driver memory to 4GB and also increased driver.cores to 4.

If anybody has faced a similar issue, please share some pointers on how to 
address it.

Thanks a lot for your time.

Regards,
Edwin



Re: Spark Streaming from Kafka, deal with initial heavy load.

2017-03-18 Thread Mal Edwin

Hi,
You can enable backpressure to handle this.

spark.streaming.backpressure.enabled
spark.streaming.receiver.maxRate
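
A minimal sketch of setting these (values are illustrative; note that 
spark.streaming.receiver.maxRate only applies to receiver-based streams, while 
for the direct approach used in the quoted job the per-partition cap is 
spark.streaming.kafka.maxRatePerPartition, which is what bounds the first, 
backlog-sized batch before backpressure has any feedback to work with):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Let the rate controller shrink batch sizes once it has feedback.
  .set("spark.streaming.backpressure.enabled", "true")
  // Receiver-based streams only: records per second per receiver.
  .set("spark.streaming.receiver.maxRate", "10000")
  // Direct Kafka streams: hard cap in records per second per partition.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")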

Thanks,
Edwin

On Mar 18, 2017, 12:53 AM -0400, sagarcasual . <sagarcas...@gmail.com>, wrote:
> Hi, we have Spark 1.6.1 streaming from a Kafka (0.10.1) topic using the direct 
> approach. The streaming part works fine, but when we initially start the job 
> we have to deal with a really huge Kafka message backlog, millions of messages, 
> and that first batch runs for over 40 hours; after 12 hours or so it becomes 
> very, very slow. It keeps crunching messages, but at a very low speed. Any 
> idea how to overcome this issue? Once the job is all caught up, subsequent 
> batches are quick, since the load to process is really tiny. So any idea how 
> to avoid this problem?




RE: RE: Fast write datastore...

2017-03-16 Thread Mal Edwin
Hi All,
I believe what we are looking for here is a serving layer where user queries 
can be executed on a subset of the processed data.
In this scenario we are using Impala, as it provides layered caching: in our 
use case it caches some of the data in memory, some more in HDFS, and the full 
set lives on S3.

Our processing layer is Spark Streaming + HBase -> extracts to Parquet on S3 -> 
Impala as the serving layer handling user requests. Impala also has a SQL 
interface. The drawback is that Impala is not managed via Yarn; it has its own 
resource manager, so you would have to figure out a way to make Yarn and Impala 
co-exist.
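
The extract step in that pipeline is essentially each micro-batch being written 
out as Parquet. A minimal sketch of that step (the table name, partition column 
and S3 path are placeholders, not our real layout):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("extract-to-parquet").getOrCreate()

// batchDf stands in for whatever a micro-batch has produced after the HBase
// enrichment step.
val batchDf = spark.table("staging_events")

batchDf.write
  .mode("append")
  .partitionBy("event_date")
  .parquet("s3a://my-bucket/warehouse/events/")

On the Impala side, a REFRESH of the external table pointing at that location 
makes the newly written files visible to user queries.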

Thanks,
Edwin

On Mar 16, 2017, 5:44 AM -0400, yohann jardin <yohannjar...@hotmail.com>, wrote:
> Hello everyone,
>
> I'm also really interested in the answers, as I will be facing the same issue 
> soon.
> Muthu, if you evaluate Apache Ignite again, can you share your results? I 
> also noticed Alluxio, which stores Spark results in memory and which you might 
> want to investigate.
>
> In my case I want to use them for a real-time dashboard (or at least one that 
> takes only a few seconds to refine), and that use case seems similar to yours 
> of filtering/aggregating previously computed Spark results.
>
> Regards,
> Yohann
>
> From: Rick Moritz <rah...@gmail.com>
> Sent: Thursday, March 16, 2017 10:37
> To: user
> Subject: Re: RE: Fast write datastore...
>
> If you have enough RAM/SSDs available, maybe tiered HDFS storage and Parquet 
> might also be an option. Of course, management-wise it has much more overhead 
> than using ES, since you need to manually define partitions and buckets, 
> which is suboptimal. On the other hand, for querying, you can probably get 
> some decent performance by hooking up Impala or Presto or LLAP-Hive, if Spark 
> were too slow/cumbersome.
> Depending on your particular access patterns, this may not be very practical, 
> but as a general approach it might be one way to get intermediate results 
> quicker, and with less of a storage-zoo than some alternatives.
>
> > On Thu, Mar 16, 2017 at 7:57 AM, Shiva Ramagopal <tr.s...@gmail.com> wrote:
> > > I do think Kafka is an overkill in this case. There are no streaming 
> > > use-cases that need a queue to do pub-sub.
> > >
> > > > On 16-Mar-2017 11:47 AM, "vvshvv" <vvs...@gmail.com> wrote:
> > > > > Hi,
> > > > >
> > > > > >> A slightly over-kill solution may be Spark to Kafka to 
> > > > > >> ElasticSearch?
> > > > >
> > > > > I do not think so; in this case you will be able to process Parquet 
> > > > > files as usual, but Kafka will allow your Elasticsearch cluster to 
> > > > > stay stable and survive regardless of the number of rows.
> > > > >
> > > > > Regards,
> > > > > Uladzimir
> > > > >
> > > > >
> > > > >
> > > > > On jasbir.s...@accenture.com, Mar 16, 2017 7:52 AM wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Will MongoDB not fit this solution?
> > > > > >
> > > > > >
> > > > > >
> > > > > > From: Vova Shelgunov [mailto:vvs...@gmail.com]
> > > > > > Sent: Wednesday, March 15, 2017 11:51 PM
> > > > > > To: Muthu Jayakumar <bablo...@gmail.com>
> > > > > > Cc: vincent gromakowski <vincent.gromakow...@gmail.com>; Richard 
> > > > > > Siebeling <rsiebel...@gmail.com>; user <user@spark.apache.org>; 
> > > > > > Shiva Ramagopal <tr.s...@gmail.com>
> > > > > > Subject: Re: Fast write datastore...
> > > > > >
> > > > > > Hi Muthu,
> > > > > >
> > > > > > I did not catch from your message: what performance do you expect 
> > > > > > from subsequent queries?
> > > > > >
> > > > > > Regards,
> > > > > > Uladzimir
> > > > > >
> > > > > > On Mar 15, 2017 9:03 PM, "Muthu Jayakumar" <bablo...@gmail.com> 
> > > > > > wrote:
> > > > > > > Hello Uladzimir / Shiva,
> > > > > > >
> > > > > > > From the Elasticsearch documentation (I have to see the logical 
> > > > > > > plan of a query to confirm), the richness of filters (like 
> > > > > > > regex, ...) is pretty good compared to Cassandra. As for 
> > > > > > > aggregates, i thi

Apache Spark broadcast error: Error sending message as driverActor is null [message = UpdateBlockInfo(BlockManagerId

2015-01-22 Thread Edwin
I'm using Apache Spark 1.1.0 and I'm currently having an issue with the
broadcast method. When I call the broadcast function on a small dataset on a
5-node cluster, I experience the "Error sending message as driverActor is null"
error after broadcasting the variables several times (the apps run under JBoss).
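
For context, a minimal sketch of the pattern described above (names are
illustrative, not the actual application code). Since the error only appears
after several re-broadcasts, two things worth checking are whether the
SparkContext embedded in JBoss is still alive at that point, and whether stale
broadcasts are being unpersisted before new ones are created:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

val sc = new SparkContext(new SparkConf().setAppName("broadcast-repro"))

var lookup: Broadcast[Map[String, Int]] = sc.broadcast(Map("a" -> 1))

// Simulates the app periodically refreshing the broadcast variable.
def refresh(newData: Map[String, Int]): Unit = {
  lookup.unpersist()        // release the previous copy held by the executors
  lookup = sc.broadcast(newData)
}

// Drive a few refresh cycles, using the broadcast in a job each time.
(1 to 5).foreach { i =>
  refresh(Map("a" -> i))
  val b = lookup            // capture a stable reference for the closure
  sc.parallelize(1 to 100).map(x => x * b.value("a")).count()
}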

Any help would be appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-broadcast-error-Error-sending-message-as-driverActor-is-null-message-UpdateBlockInfo-Bld-tp21320.html



Proper way to check SparkContext's status within code

2014-12-11 Thread Edwin
Hi,
Is there a way to check, from within the code, whether the SparkContext is
still alive or not, rather than through the UI or anything else?
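
For what it's worth, a small sketch of two ways this can be done (recent Spark
releases expose sc.isStopped directly; the probe variant below is an
assumption-laden fallback for versions that do not, and is heavier since it
runs a tiny job):

import scala.util.control.NonFatal
import org.apache.spark.SparkContext

// Direct check, where the API provides it.
def isAlive(sc: SparkContext): Boolean = !sc.isStopped

// Version-agnostic probe: run a trivial job and treat any failure as "not
// alive". Don't call this in a tight loop.
def isAliveByProbe(sc: SparkContext): Boolean =
  try {
    sc.parallelize(Seq(1)).count() == 1
  } catch {
    case NonFatal(_) => false
  }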

Thanks 
Edwin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Proper-way-to-check-SparkContext-s-status-within-code-tp20638.html



cache function is not working on RDD from parallelize

2014-11-05 Thread Edwin
Hi,
On a 5-node cluster, I have data on the driver application node; when I call
parallelize on that data, I get an RDD back.
However, when I call cache on the RDD, it does not seem to get cached (I
checked by timing count on the supposedly cached RDD after it was realized,
and it takes as long as it did before). Does anyone have any idea about this?
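
For reference, cache() (like persist()) is lazy: it only marks the RDD for
caching, and the blocks are only materialized the first time an action computes
the RDD. A minimal local sketch of how that is normally observed (nothing here
beyond a plain local job; the numbers are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("cache-check").setMaster("local[*]"))

val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
rdd.cache()        // lazy: nothing is stored yet

rdd.count()        // first action computes the RDD and populates the cache
rdd.count()        // second action should read the cached blocks and be faster

// The Storage tab of the UI, rdd.getStorageLevel, or sc.getRDDStorageInfo
// show whether blocks were actually persisted.
println(rdd.getStorageLevel)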
Thanks
Edwin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/cache-function-is-not-working-on-RDD-from-parallelize-tp18219.html



apache spark union function cause executors disassociate (Lost executor 1 on 172.32.1.12: remote Akka client disassociated)

2014-09-30 Thread Edwin
I have a 3-node EC2 cluster, with 18G assigned to the Spark executor memory on
each node. After I run my Spark batch job, I get two RDDs from different forks,
but with the exact same format. When I perform a union operation on them, I get
the executors disassociated error and the whole Spark job fails and quits.
Memory shouldn't be a problem (I can tell that from the UI). What's worth
mentioning is that one RDD is significantly bigger than the other one (much
bigger). Does anyone have any idea why?
Thanks
Edwin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/apache-spark-union-function-cause-executors-disassociate-Lost-executor-1-on-172-32-1-12-remote-Akka--tp15442.html



Re: apache spark union function cause executors disassociate (Lost executor 1 on 172.32.1.12: remote Akka client disassociated)

2014-09-30 Thread Edwin
Does the union function cause any data shuffling?
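
For reference, union itself does not shuffle: the resulting UnionRDD simply
concatenates the parents' partitions, so its partition count is the sum of the
parents'. A quick local sketch of one way to check that (illustrative sizes):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("union-check").setMaster("local[*]"))

val a = sc.parallelize(1 to 100, 4)
val b = sc.parallelize(101 to 200, 8)

val u = a.union(b)
println(u.partitions.length)   // 12 = 4 + 8, no repartitioning happened
println(u.toDebugString)       // shows a UnionRDD over the parents, no ShuffledRDD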



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/apache-spark-union-function-cause-executors-disassociate-Lost-executor-1-on-172-32-1-12-remote-Akka--tp15442p15444.html



Re: apache spark union function cause executors disassociate (Lost executor 1 on 172.32.1.12: remote Akka client disassociated)

2014-09-30 Thread Edwin


19:02:45,963 INFO  [org.apache.spark.MapOutputTrackerMaster] (spark-akka.actor.default-dispatcher-14) Size of output statuses for shuffle 1 is 216 bytes
19:02:45,964 INFO  [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Got job 5 (getCallSite at null:-1) with 24 output partitions (allowLocal=false)
19:02:45,964 INFO  [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Final stage: Stage 12 (getCallSite at null:-1)
19:02:45,964 INFO  [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Parents of final stage: List(Stage 15, Stage 13)
19:02:45,970 INFO  [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Missing parents: List()
19:02:45,971 INFO  [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Submitting Stage 12 (UnionRDD[31] at getCallSite at null:-1), which has no missing parents
19:02:47,085 INFO  [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Submitting 24 missing tasks from Stage 12 (UnionRDD[31] at getCallSite at null:-1)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/apache-spark-union-function-cause-executors-disassociate-Lost-executor-1-on-172-32-1-12-remote-Akka--tp15442p15445.html



Transient association error on a 3 nodes cluster

2014-09-23 Thread Edwin
I'm running my application on a three-node cluster (8 cores and 12 GB of memory
per node), and I receive the following actor error. Does anyone have any idea?

14:31:18,061 ERROR [akka.remote.EndpointWriter] (spark-akka.actor.default-dispatcher-17) Transient association error (association remains live): akka.remote.OversizedPayloadException: Discarding oversized payload sent to Actor[akka.tcp://sparkExecutor@172.32.1.155:43121/user/Executor#1335254673]: max allowed size 10485760 bytes, actual size of encoded class org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$LaunchTask was 19553539 bytes.
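
For context, the 10485760-byte limit in that message is 10 MB, which matches
the default Akka frame size in Spark 1.x (spark.akka.frameSize, specified in
MB); the serialized LaunchTask here is about 19 MB, so Akka drops it. A hedged
workaround sketch is to raise the frame size, although the better long-term fix
is usually to shrink whatever large object is being captured in the task
closures (for example by broadcasting it instead):

import org.apache.spark.SparkConf

// Raise the Akka frame size above the ~19 MB task payload seen in the log.
// The value is in MB and applies to Spark 1.x, where Akka carries these
// messages between driver and executors.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.akka.frameSize", "64")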


Thanks
Edwin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Transient-association-error-on-a-3-nodes-cluster-tp14914.html