Spark Streaming - Increasing number of executors slows down processing rate
Hi All,

I am struggling with an odd issue and would like your help in addressing it.

Environment:
- AWS cluster: 40 Spark nodes and a 4-node Kafka cluster
- Spark Kafka streaming job submitted in YARN cluster mode
- Kafka: single topic, 400 partitions
- Spark 2.1 on Cloudera
- Kafka 0.10 on Cloudera

We have zero messages in Kafka and start this Spark job with 100 executors, each with 14 GB of RAM and a single executor core. The time to process 0 records (end of each batch) is 5 seconds. When we increase the executors to 400 and keep everything else the same, except reducing executor memory to 11 GB, the time to process 0 records (end of each batch) increases tenfold to 50 seconds, and in some cases reaches 103 seconds.

The Spark Streaming configs we are setting are:
- Batch window = 60 seconds
- spark.streaming.backpressure.enabled = true
- spark.memory.fraction = 0.3 (we store more data in our own data structures)
- spark.streaming.kafka.consumer.poll.ms = 1

We have tried increasing driver memory to 4 GB and also increased driver.cores to 4. If anybody has faced similar issues, please share some pointers on how to address this.

Thanks a lot for your time.

Regards,
Edwin
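For reference, here is roughly how we wire these settings up (a minimal sketch; the app name is a placeholder, and the values mirror the configs listed above):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kafka-streaming-job") // placeholder name
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.memory.fraction", "0.3") // we keep more data in our own structures
      .set("spark.streaming.kafka.consumer.poll.ms", "1")

    // 60-second batch window
    val ssc = new StreamingContext(conf, Seconds(60))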
Re: Spark Streaming from Kafka, deal with initial heavy load.
Hi,

You can enable backpressure to handle this:

- spark.streaming.backpressure.enabled
- spark.streaming.receiver.maxRate

Thanks,
Edwin

On Mar 18, 2017, 12:53 AM -0400, sagarcasual . <sagarcas...@gmail.com>, wrote:
> Hi, we have Spark 1.6.1 streaming from a Kafka (0.10.1) topic using the direct
> approach. The streaming part works fine, but when we initially start the job,
> we have to deal with a really huge Kafka message backlog, millions of messages,
> and that first batch runs for over 40 hours; after 12 hours or so it becomes
> very, very slow. It keeps crunching messages, but at a very low speed.
> Any idea how to overcome this issue? Once the job is all caught up, subsequent
> batches are quick and fast, since the load is really tiny to process. So any
> idea how to avoid this problem?
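A rough sketch of the settings I mean (the rate values are illustrative; note that spark.streaming.receiver.maxRate applies to receiver-based streams, while for the direct approach it is spark.streaming.kafka.maxRatePerPartition that caps the per-partition rate of the first batches):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      // receiver-based streams: records/sec per receiver
      .set("spark.streaming.receiver.maxRate", "10000")
      // direct (receiver-less) Kafka streams: records/sec per partition
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")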
RE: RE: Fast write datastore...
Hi All,

I believe what we are looking for here is a serving layer where user queries can be executed on a subset of processed data. In this scenario we are using Impala, as it provides layered caching: in our use case it caches some of the set in memory, some in HDFS, and the full set is on S3. Our processing layer is Spark Streaming + HBase --> extract to Parquet on S3 --> Impala as the serving layer handling user requests. Impala also has a SQL interface. The drawback is that Impala is not managed via YARN and has its own resource manager, so you would have to figure out a way to make YARN and Impala co-exist.

Thanks,
Edwin

On Mar 16, 2017, 5:44 AM -0400, yohann jardin <yohannjar...@hotmail.com>, wrote:
> Hello everyone,
>
> I'm also really interested in the answers, as I will be facing the same issue soon.
> Muthu, if you evaluate Apache Ignite again, can you share your results? I also
> noticed Alluxio for storing Spark results in memory; you might want to investigate it.
>
> In my case I want to use them for a real-time dashboard (or waiting a very few
> seconds to refine a dashboard), and that use case seems similar to your
> filter/aggregate on previously computed Spark results.
>
> Regards,
> Yohann
>
> From: Rick Moritz <rah...@gmail.com>
> Sent: Thursday, March 16, 2017 10:37
> To: user
> Subject: Re: RE: Fast write datastore...
>
> If you have enough RAM/SSDs available, maybe tiered HDFS storage and Parquet
> might also be an option. Of course, management-wise it has much more overhead
> than using ES, since you need to manually define partitions and buckets, which
> is suboptimal. On the other hand, for querying, you can probably get some
> decent performance by hooking up Impala or Presto or LLAP-Hive, if Spark were
> too slow/cumbersome.
> Depending on your particular access patterns this may not be very practical,
> but as a general approach it might be one way to get intermediate results
> quicker, and with less of a storage zoo than some alternatives.
>
> On Thu, Mar 16, 2017 at 7:57 AM, Shiva Ramagopal <tr.s...@gmail.com> wrote:
> > I do think Kafka is overkill in this case. There are no streaming use cases
> > that need a queue to do pub-sub.
> >
> > On 16-Mar-2017 11:47 AM, "vvshvv" <vvs...@gmail.com> wrote:
> > > Hi,
> > >
> > > >> A slightly over-kill solution may be Spark to Kafka to ElasticSearch?
> > >
> > > I do not think so; in this case you will be able to process Parquet files
> > > as usual, but Kafka will allow your Elasticsearch cluster to be stable and
> > > survive regardless of the number of rows.
> > >
> > > Regards,
> > > Uladzimir
> > >
> > > On jasbir.s...@accenture.com, Mar 16, 2017 7:52 AM wrote:
> > > > Hi,
> > > >
> > > > Will MongoDB not fit this solution?
> > > >
> > > > From: Vova Shelgunov [mailto:vvs...@gmail.com]
> > > > Sent: Wednesday, March 15, 2017 11:51 PM
> > > > To: Muthu Jayakumar <bablo...@gmail.com>
> > > > Cc: vincent gromakowski <vincent.gromakow...@gmail.com>; Richard
> > > > Siebeling <rsiebel...@gmail.com>; user <user@spark.apache.org>;
> > > > Shiva Ramagopal <tr.s...@gmail.com>
> > > > Subject: Re: Fast write datastore...
> > > >
> > > > Hi Muthu,
> > > >
> > > > I did not catch from your message: what performance do you expect from
> > > > subsequent queries?
> > > >
> > > > Regards,
> > > > Uladzimir
> > > >
> > > > On Mar 15, 2017 9:03 PM, "Muthu Jayakumar" <bablo...@gmail.com> wrote:
> > > > > Hello Uladzimir / Shiva,
> > > > >
> > > > > From the Elasticsearch documentation (I have to see the logical plan of
> > > > > a query to confirm), the richness of filters (like regex, ...) is pretty
> > > > > good compared to Cassandra. As for aggregates, i thi
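To make the "extract to Parquet on S3" step above concrete, here is a minimal sketch of what such a write can look like (simplified; the table name, bucket, path, and partition column are illustrative placeholders, not our actual setup):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("extract-to-parquet").getOrCreate()
    val processed = spark.table("processed_events") // placeholder for the processed dataset

    processed.write
      .mode("append")
      .partitionBy("event_date")             // partitioning lets Impala prune files on the serving side
      .parquet("s3a://my-bucket/processed/") // placeholder bucket/path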
Apache Spark broadcast error: Error sending message as driverActor is null [message = UpdateBlockInfo(BlockManagerId
I'm using Apache Spark 1.1.0 and I'm currently having an issue with the broadcast method. When I call the broadcast function on a small dataset in a 5-node cluster, I experience the "Error sending message as driverActor is null" error after broadcasting the variables several times (the apps run under JBoss). Any help would be appreciated.
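To make the pattern concrete, our usage looks roughly like this (a simplified sketch; the lookup data and job are placeholders):

    import org.apache.spark.SparkContext

    // called repeatedly as the lookup data is refreshed
    def broadcastAndUse(sc: SparkContext, lookup: Map[String, Int]): Long = {
      val bc = sc.broadcast(lookup)
      val hits = sc.parallelize(1 to 1000)
        .filter(i => bc.value.contains(i.toString))
        .count()
      bc.unpersist() // drop older copies between refreshes
      hits
    }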
Proper way to check SparkContext's status within code
Hi,

Is there a way to check the status of the SparkContext (whether it is alive or not) through code, rather than through the UI or anything else?

Thanks,
Edwin
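The crude probe I'm falling back on for now is running a trivial job and seeing whether it succeeds (a sketch; the parallelized payload is arbitrary, and newer Spark releases may expose sc.isStopped, which would be cleaner if your version has it):

    import scala.util.Try
    import org.apache.spark.SparkContext

    def contextIsAlive(sc: SparkContext): Boolean =
      Try(sc.parallelize(Seq(1)).count()).isSuccess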
cache function is not working on RDD from parallelize
Hi,

On a 5-node cluster, say I have data on the driver application node and I call parallelize on the data; I get an RDD back. However, when I call cache on that RDD, it is not actually cached: I checked by timing count on the supposedly cached RDD, and it takes as long as it did before the RDD was materialized. Does anyone have any idea about this?

Thanks,
Edwin
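For reference, this is roughly how I'm timing it (a simplified sketch assuming an existing SparkContext sc; the dataset size is illustrative):

    def timeMs[A](f: => A): Long = {
      val start = System.nanoTime()
      f
      (System.nanoTime() - start) / 1000000
    }

    val rdd = sc.parallelize(1 to 10000000).cache() // cache only marks the RDD; nothing is stored yet
    println(s"first count (materializes the cache): ${timeMs(rdd.count())} ms")
    println(s"second count (should read from cache): ${timeMs(rdd.count())} ms")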
apache spark union function cause executors disassociate (Lost executor 1 on 172.32.1.12: remote Akka client disassociated)
I have a 3-node EC2 cluster, each node assigned 18 GB of Spark executor memory. After I run my Spark batch job I get two RDDs from different forks, but with exactly the same format. When I perform a union operation on them, I get the executor-disassociated error and the whole Spark job fails and quits. Memory shouldn't be a problem (I can tell from the UI). Worth mentioning is that one RDD is significantly bigger than the other (much bigger). Does anyone have any idea why?

Thanks,
Edwin
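For reference, the union step looks roughly like this (a simplified sketch assuming an existing SparkContext sc; the two parallelized ranges are placeholders standing in for the two forks):

    val large = sc.parallelize(1 to 10000000) // stands in for the much bigger fork
    val small = sc.parallelize(1 to 100)      // stands in for the smaller fork

    // union is a narrow transformation: it concatenates the partition lists
    // without shuffling, so the result has large.partitions.length +
    // small.partitions.length partitions, preserving any skew between the inputs
    val combined = large.union(small)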
Re: apache spark union function cause executors disassociate (Lost executor 1 on 172.32.1.12: remote Akka client disassociated)
Does the union function cause any data shuffling?
Re: apache spark union function cause executors disassociate (Lost executor 1 on 172.32.1.12: remote Akka client disassociated)
19:02:45,963 INFO [org.apache.spark.MapOutputTrackerMaster] (spark-akka.actor.default-dispatcher-14) Size of output statuses for shuffle 1 is 216 bytes
19:02:45,964 INFO [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Got job 5 (getCallSite at null:-1) with 24 output partitions (allowLocal=false)
19:02:45,964 INFO [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Final stage: Stage 12 (getCallSite at null:-1)
19:02:45,964 INFO [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Parents of final stage: List(Stage 15, Stage 13)
19:02:45,970 INFO [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Missing parents: List()
19:02:45,971 INFO [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Submitting Stage 12 (UnionRDD[31] at getCallSite at null:-1), which has no missing parents
19:02:47,085 INFO [org.apache.spark.scheduler.DAGScheduler] (spark-akka.actor.default-dispatcher-14) Submitting 24 missing tasks from Stage 12 (UnionRDD[31] at getCallSite at null:-1)
Transient association error on a 3 nodes cluster
I'm running my application on a three-node cluster (8 cores and 12 GB of memory each), and I receive the following actor error. Does anyone have any idea?

14:31:18,061 ERROR [akka.remote.EndpointWriter] (spark-akka.actor.default-dispatcher-17) Transient association error (association remains live): akka.remote.OversizedPayloadException: Discarding oversized payload sent to Actor[akka.tcp://sparkExecutor@172.32.1.155:43121/user/Executor#1335254673]: max allowed size 10485760 bytes, actual size of encoded class org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$LaunchTask was 19553539 bytes.

Thanks,
Edwin
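One knob that seems related is the Akka frame size, since the 10485760-byte limit in the error matches Akka's default. A sketch of raising it in Spark 1.x (the value is illustrative, and whether this is the right fix here is an assumption on my part):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.akka.frameSize", "64") // megabytes; must exceed the largest task payload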