Distribution of Spark 3.0.1 with Hive 1.2

2020-11-10 Thread Dmitry
Hi all, I am trying to build a Spark 3.0.1 distribution using ./dev/make-distribution.sh --name spark3-hive12 --pip --tgz -Phive-1.2 -Phadoop-2.7 -Pyarn The problem is that Maven can't find the right profile for Hive, and the build ends without the Hive jars ++ /Users/reireirei/spark/spark/build/mvn help:eval

Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Dmitry
Hello, I spent a lot of time trying to find what I did wrong, but haven't found it. I have a minikube Windows-based cluster (Hyper-V as the hypervisor) and am trying to run examples against Spark 2.3. Tried several Docker image builds: * several builds that I built myself * andrusha/spark-k8s:2.3.0-hadoop2.7 from docke

Re: Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Dmitry
spark-examples_2.11-2.3.0.jar. > > On Tue, Apr 10, 2018 at 1:34 AM, Dmitry wrote: > >> Hello spent a lot of time to find what I did wrong , but not found. >> I have a minikube WIndows based cluster ( Hyper V as hypervisor ) and try >> to run examples against

Implementing FIRST_VALUE, LEAD, LAG in Spark

2015-02-13 Thread Dmitry Tolpeko
data.groupByKey().map{case(a, b)=>(a, firstValue(b))}.collect So I create a new list after groupByKey. Is this the right approach in Spark? Are there any other options? Please point me to any drawbacks. Thanks, Dmitry
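
A minimal sketch (not the poster's exact code) of that emulation in the Java API: keep, per key, the element with the smallest ordering column. Using reduceByKey instead of groupByKey avoids materializing the full per-key list; the (key, (orderingColumn, value)) input shape is an assumption. Since Spark 1.4, DataFrame window functions cover FIRST_VALUE/LEAD/LAG natively.

```java
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class FirstValueExample {
  // FIRST_VALUE per key: keep the row with the smallest ordering column.
  static JavaPairRDD<String, String> firstValue(
      JavaPairRDD<String, Tuple2<Long, String>> data) {
    return data
        .reduceByKey((a, b) -> a._1() <= b._1() ? a : b)  // earliest row wins per key
        .mapValues(t -> t._2());                          // keep only its value
  }
}
```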

Re: Implementing FIRST_VALUE, LEAD, LAG in Spark

2015-02-17 Thread Dmitry Tolpeko
te any feedback. Dmitry On Fri, Feb 13, 2015 at 11:54 AM, Dmitry Tolpeko wrote: > Hello, > > To convert existing Map Reduce jobs to Spark, I need to implement window > functions such as FIRST_VALUE, LEAD, LAG and so on. For example, > FIRST_VALUE function: > > Source (1st column

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Are you proposing I downgrade Solrj's httpclient dependency to be on par with that of Spark/Hadoop? Or upgrade Spark/Hadoop's httpclient to the latest? Solrj has to stay with its selected version. I could try and rebuild Spark with the latest httpclient but I've no idea what effects that may cau

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
I think I'm going to have to rebuild Spark with commons.httpclient.version set to 4.3.1 which looks to be the version chosen by Solrj, rather than the 4.2.6 that Spark's pom mentions. Might work. On Wed, Feb 18, 2015 at 1:37 AM, Arush Kharbanda wrote: > Hi > > Did you try to make maven pick the

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
t 10:50 AM, Emre Sevinc wrote: > Hello Dmitry, > > I had almost the same problem and solved it by using version 4.0.0 of > SolrJ: > > > org.apache.solr > solr-solrj > 4.0.0 > > > In my case, I was lucky that version 4.0.0 of SolrJ had a

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Thanks, Emre! Will definitely try this. On Wed, Feb 18, 2015 at 11:00 AM, Emre Sevinc wrote: > > On Wed, Feb 18, 2015 at 4:54 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would >&g

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
Thanks, Cody. Yes, I originally started off by looking at that but I get a compile error if I try and use that approach: constructor JdbcRDD in class JdbcRDD cannot be applied to given types. Not to mention that JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last argument). Wo

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
a static method JdbcRDD.create, not new JdbcRDD. Is that what > you tried doing? > > On Wed, Feb 18, 2015 at 12:00 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Thanks, Cody. Yes, I originally started off by looking at that but I get >> a compile err

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
I'm not sure what "on the driver" means, but I've tried setting spark.files.userClassPathFirst to true, in $SPARK_HOME/conf/spark-defaults.conf and also in the SparkConf programmatically; it appears to be ignored. The solution was to follow Emre's recommendation and downgrade the selected Solrj dis
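
For reference, a hedged sketch of the newer property names that superseded spark.files.userClassPathFirst (both still marked experimental). The driver-side flag typically needs to be supplied at submit time (spark-defaults.conf or --conf) rather than in code, which may be why programmatic settings appear to be ignored; the app name is a placeholder.

```java
import org.apache.spark.SparkConf;

public class UserClassPathFirstConf {
  static SparkConf build() {
    return new SparkConf()
        .setAppName("solrj-indexer")                        // hypothetical app name
        .set("spark.executor.userClassPathFirst", "true")   // executor-side classloading
        .set("spark.driver.userClassPathFirst", "true");    // driver side; usually must be
                                                            // set at launch, not in code
  }
}
```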

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
Koeninger wrote: > Is sc there a SparkContext or a JavaSparkContext? The compilation error > seems to indicate the former, but JdbcRDD.create expects the latter > > On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I have t

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
e and pass an instance of that to JdbcRDD.create ? > > On Wed, Feb 18, 2015 at 3:48 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Cody, you were right, I had a copy and paste snag where I ended up with a >> vanilla SparkContext rather than a Java

Re: NotSerializableException: org.apache.http.impl.client.DefaultHttpClient when trying to send documents to Solr

2015-02-18 Thread Dmitry Goldenberg
Thank you, Jose. That fixed it. On Wed, Feb 18, 2015 at 6:31 PM, Jose Fernandez wrote: > You need to instantiate the server in the forEachPartition block or Spark > will attempt to serialize it to the task. See the design patterns section > in the Spark Streaming guide. > > > Jose Fernandez | Pr
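
A sketch of the pattern Jose describes, assuming SolrJ 4.x (HttpSolrServer) and a placeholder collection URL; the point is that the non-serializable client is created inside foreachPartition, so it never has to be shipped from the driver.

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.JavaRDD;

public class SolrSink {
  static void index(JavaRDD<SolrInputDocument> docs) {
    docs.foreachPartition(it -> {
      // One client per partition/task, created on the executor.
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
      while (it.hasNext()) {
        solr.add(it.next());
      }
      solr.commit();
      solr.shutdown();
    });
  }
}
```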

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Dmitry Goldenberg
DD[T] = { > > > JFunction here is the interface org.apache.spark.api.java.function.Function, > not scala Function0 > > LIkewise, ConnectionFactory is an interface defined inside JdbcRDD, not > scala Function0 > > On Wed, Feb 18, 2015 at 4:50 PM, Dmitry Goldenberg
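
A sketch of the call shape Cody is describing, assuming the Spark 1.3+ static helper. The JDBC URL, query, and bounds are placeholders, and the query must contain the two '?' placeholders that JdbcRDD binds to the lower/upper bound.

```java
import java.sql.DriverManager;
import java.sql.ResultSet;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.JdbcRDD;

public class JdbcRddExample {
  static JavaRDD<String> load(JavaSparkContext jsc) {
    // JdbcRDD.ConnectionFactory is the interface defined inside JdbcRDD.
    JdbcRDD.ConnectionFactory connFactory = () ->
        DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "pass");

    return JdbcRDD.create(
        jsc,                      // JavaSparkContext, not a plain SparkContext
        connFactory,
        "SELECT name FROM people WHERE id >= ? AND id <= ?",
        1L, 1000L,                // bounds on the numeric key
        4,                        // number of partitions
        (ResultSet rs) -> rs.getString(1));  // org.apache.spark.api.java.function.Function
  }
}
```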

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Dmitry Goldenberg
t that takes 2 numbers to specify the bounds. Of > course, a numeric primary key is going to be the most efficient way to do > that. > > On Thu, Feb 19, 2015 at 8:57 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Yup, I did see that. Good point though, Cody.

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
Sean, How does this model actually work? Let's say we want to run one job as N threads executing one particular task, e.g. streaming data out of Kafka into a search engine. How do we configure our Spark job execution? Right now, I'm seeing this job running as a single thread. And it's quite a bi

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
's nothing do with Spark then. If you run code on your driver, > it's not distributed. If you run Spark over an RDD with 1 partition, > only one task works on it. > > On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg > wrote: > > Sean, > > > > How does t

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
rtitions in a streaming RDD is determined by the block > interval and the batch interval. If you have a batch interval of 10s > and block interval of 1s you'll get 10 partitions of data in the RDD. > > On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg > wrote: > > Under
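
A sketch of the knob being discussed, assuming a receiver-based stream: partitions per batch work out to roughly batchInterval / spark.streaming.blockInterval (default 200 ms), so the block interval is what to tune if local[N] looks single-threaded. Values here are placeholders.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BlockIntervalExample {
  static JavaStreamingContext create() {
    SparkConf conf = new SparkConf()
        .setMaster("local[4]")
        .setAppName("block-interval-demo")
        .set("spark.streaming.blockInterval", "1000");  // 1 s blocks; 10 s batches => ~10 partitions
    return new JavaStreamingContext(conf, Durations.seconds(10));
  }
}
```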

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
batch interval of 10s > and block interval of 1s you'll get 10 partitions of data in the RDD. > > On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg > wrote: > > Understood. We'll use the multi-threaded code we already have.. > > > > How are these execu

Re: Spark and RabbitMQ

2015-05-12 Thread Dmitry Goldenberg
Thanks, Akhil. It looks like in the second example, for Rabbit they're doing this: https://www.rabbitmq.com/mqtt.html. On Tue, May 12, 2015 at 7:37 AM, Akhil Das wrote: > I found two examples Java version >

Re: Spark Streaming and reducing latency

2015-05-18 Thread Dmitry Goldenberg
Thanks, Akhil. So what do folks typically do to increase/contract the capacity? Do you plug in some cluster auto-scaling solution to make this elastic? Does Spark have any hooks for instrumenting auto-scaling? In other words, how do you avoid overwhelming the receivers in a scenario when your sy

Re: Migrate Relational to Distributed

2015-05-23 Thread Dmitry Tolpeko
can help. Thanks, Dmitry On Sat, May 23, 2015 at 1:22 AM, Brant Seibert wrote: > Hi, The healthcare industry can do wonderful things with Apache Spark. > But, > there is already a very large base of data and applications firmly rooted > in > the relational paradigm and they a

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Dmitry Goldenberg
Got it, thank you, Tathagata and Ted. Could you comment on my other question as well? Basically, I'm trying to get a handle on a good approa

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-27 Thread Dmitry Goldenberg
Thanks, Rajesh. I think that acquiring/relinquishing executors is important, but I feel like there are at least two layers for resource allocation and autoscaling. It seems that acquiring and relinquishing executors is a way to optimize resource utilization within a pre-set Spark cluster of machine

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
l of machines, my subsequent questions then are, a) will Spark sense the addition of a new node / is it sufficient that the cluster manager is aware, then work just starts flowing there? and b) what would be a way to gracefully remove a worker node when the load subsides, so that no currently runn

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thanks, Evo. Per the last part of your comment, it sounds like we will need to implement a job manager which will be in control of starting the jobs, monitoring the status of the Kafka topic(s), shutting jobs down and marking them as ones to relaunch, scaling the cluster up/down by adding/removing

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
> > > > > Sent from Samsung Mobile > > ---- Original message > > From: Evo Eftimov > > Date:2015/05/28 13:22 (GMT+00:00) > > To: Dmitry Goldenberg > > Cc: Gerard Maas ,spark users > > Subject: Re: Autoscaling Spark cluster based on top

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
at the optimal patterns... Regards, - Dmitry On Thu, May 28, 2015 at 3:21 PM, Andrew Or wrote: > Hi all, > > As the author of the dynamic allocation feature I can offer a few insights > here. > > Gerard's explanation was both correct and concise: dynamic allocation is > not int

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
>> Probably you should ALWAYS keep the RDD storage policy to MEMORY AND DISK >>> – it will be your insurance policy against sys crashes due to memory leaks. >>> Until there is free RAM, spark streaming (spark) will NOT resort to disk – >>> and of course resorting to d

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Dmitry Goldenberg
Thank you, Tathagata, Cody, Otis. - Dmitry On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic wrote: > I think you can use SPM - http://sematext.com/spm - it will give you all > Spark and all Kafka metrics, including offsets broken down by topic, etc. > out of the box. I see more

Re: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Dmitry Goldenberg
am - ? public class Param { // ==> potentially a very hefty resource to load private Map dictionary = new HashMap(); ... } I'm grokking that Spark will serialize Param right before the call to foreachRDD, if we're to broadcast... On Wed, Jun 3, 2015 at 9:58 AM,
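
A hedged sketch of the broadcast approach under discussion: load the hefty dictionary once on the driver, broadcast it, and dereference it only inside the closure so tasks carry just the broadcast handle. The dictionary type and lookup logic are placeholders.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastParam {
  static void lookupAll(JavaSparkContext jsc, JavaRDD<String> rdd) {
    Map<String, String> dictionary = new HashMap<>();          // loaded once, driver side
    Broadcast<Map<String, String>> dictBc = jsc.broadcast(dictionary);

    rdd.foreachPartition(it -> {
      Map<String, String> dict = dictBc.value();               // fetched once per executor
      while (it.hasNext()) {
        dict.get(it.next());                                   // placeholder lookup
      }
    });
  }
}
```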

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
. >> >> -Andrew >> >> >> >> 2015-05-28 8:02 GMT-07:00 Evo Eftimov : >> >> Probably you should ALWAYS keep the RDD storage policy to MEMORY AND DISK >>> – it will be your insurance policy against sys crashes due to memory leaks. >>

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
e and make them contact > and register with the Master without bringing down the Master (or any of > the currently running worker nodes) > > > > Then just shutdown your currently running spark streaming job/app and > restart it with new params to take advantage of the larg

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
> nodes which allows you to double, triple etc in the background/parallel the > resources of the currently running cluster > > > > I was thinking more about the scenario where you have e.g. 100 boxes and > want to / can add e.g. 20 more > > > > *From:* Dmitry Go

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
es of the currently running cluster > > > > I was thinking more about the scenario where you have e.g. 100 boxes and > want to / can add e.g. 20 more > > > > *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] > *Sent:* Wednesday, June 3, 2015 4:46 PM > *To:* E

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
und/parallel the > resources of the currently running cluster > > > > I was thinking more about the scenario where you have e.g. 100 boxes and > want to / can add e.g. 20 more > > > > *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] > *Sent:* Wednesday,

Re: StreamingListener, anyone?

2015-06-04 Thread Dmitry Goldenberg
blocked in awaitTermination. So what would be a way to trigger the termination in the driver? "context.awaitTermination() allows the current thread to wait for the termination of a context by stop() or by an exception" - presumably, we need to call stop() somewhere or perhaps throw. Chee
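
One common way to trigger termination from the driver, sketched here with a hypothetical marker file as the stop signal: poll with awaitTerminationOrTimeout instead of blocking forever, then call stop(true, true) for a graceful shutdown.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class GracefulStop {
  static void run(JavaStreamingContext jssc) throws InterruptedException {
    jssc.start();
    boolean stopped = false;
    while (!stopped) {
      // Returns true if the context terminated within the timeout.
      stopped = jssc.awaitTerminationOrTimeout(10_000);
      if (!stopped && Files.exists(Paths.get("/tmp/stop-streaming"))) {
        jssc.stop(true, true);   // stopSparkContext = true, stopGracefully = true
        stopped = true;
      }
    }
  }
}
```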

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-04 Thread Dmitry Goldenberg
from Samsung Mobile > > Original message > > From: Evo Eftimov > > Date:2015/05/28 13:22 (GMT+00:00) > > To: Dmitry Goldenberg > > Cc: Gerard Maas ,spark users > > Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth &

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Dmitry Goldenberg
Thanks so much, Yiannis, Olivier, Huang! On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas wrote: > Hi there, > > I would recommend checking out > https://github.com/spark-jobserver/spark-jobserver which I think gives > the functionality you are looking for. > I haven't tested it though. > > BR >

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Dmitry Goldenberg
ads > > > > Re Dmitry's initial question – you can load large data sets as Batch > (Static) RDD from any Spark Streaming App and then join DStream RDDs > against them to emulate “lookups” , you can also try the “Lookup RDD” – > there is a git hub project > > > > *From
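
A sketch of that "static RDD as lookup table" idea, with a placeholder path and a tab-separated key/value file assumed: load the lookup side once, cache it, and join each batch against it via transformToPair.

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

public class StaticLookupJoin {
  static JavaPairDStream<String, Tuple2<String, String>> enrich(
      JavaSparkContext jsc, JavaPairDStream<String, String> events) {
    JavaPairRDD<String, String> lookup = jsc
        .textFile("hdfs:///data/dictionary.tsv")              // placeholder path
        .mapToPair(line -> {
          String[] parts = line.split("\t", 2);
          return new Tuple2<>(parts[0], parts[1]);
        })
        .cache();                                             // keep it resident across batches

    // The transform function runs on the driver each batch; the join itself is distributed.
    return events.transformToPair(batch -> batch.join(lookup));
  }
}
```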

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-09 Thread Dmitry Goldenberg
re > unless you want it to. > > If you want it, just call cache. > > On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> "set the storage policy for the DStream RDDs to MEMORY AND DISK" - it >> appears t
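
For reference, a minimal sketch of the storage level being recommended, using the Java-friendly StorageLevels constants; MEMORY_AND_DISK lets batches spill to disk instead of failing when executor memory runs short.

```java
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.api.java.JavaDStream;

public class StorageLevelExample {
  static <T> JavaDStream<T> spillable(JavaDStream<T> stream) {
    // Memory first, spill to local disk under pressure rather than fail with OOM.
    return stream.persist(StorageLevels.MEMORY_AND_DISK);
  }
}
```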

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
rces of the currently running cluster > > > > I was thinking more about the scenario where you have e.g. 100 boxes and > want to / can add e.g. 20 more > > > > *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] > *Sent:* Wednesday, June 3, 2015 4:46 PM > *T

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
gt; > > > > > > On Thu, Jun 11, 2015 at 7:30 AM, Cody Koeninger > wrote: > >> Depends on what you're reusing multiple times (if anything). >> >> Read >> http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence >> >

Re: Registering custom metrics

2015-06-22 Thread Dmitry Goldenberg
Great, thank you, Silvio. In your experience, is there any way to instrument a callback into Coda Hale or the Spark consumers from the metrics sink? If the sink performs some steps once it has received the metrics, I'd like to be able to make the consumers aware of that via some sort of a callback.

Re: Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-23 Thread Dmitry Goldenberg
Yes, Akhil. We already have an origination timestamp in the body of the message when we send it. But we can't guarantee the network speed nor a precise enough synchronization of clocks across machines. Pulling the timestamp from Kafka itself would be a step forward although the broker is most l

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard, That's exactly the strategy I've been trying, which is a wrapper singleton class. But I was seeing the inner object being created multiple times. I wonder if the problem has to do with the way I'm processing the RDDs. I'm using JavaDStream to stream data (from Kafka). Then I'm processin

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
My singletons do in fact stick around. They're one per worker, looks like. So with 4 workers running on the box, we're creating one singleton per worker process/jvm, which seems OK. Still curious about foreachPartition vs. foreachRDD though... On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher wr
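
The wrapper-singleton pattern being described boils down to a lazily initialized, per-JVM holder; a sketch with a hypothetical heavy client as the payload.

```java
public class ClientHolder {
  // Stand-in for a heavy, non-serializable resource (e.g. a Solr or DB client).
  static class HeavyClient { }

  private static HeavyClient instance;

  // Each executor JVM (worker process) builds the client at most once,
  // no matter how many tasks or partitions it processes.
  public static synchronized HeavyClient get() {
    if (instance == null) {
      instance = new HeavyClient();
    }
    return instance;
  }
}
```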

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
"These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives." Sean, different operations as they are, they can certainly be used on the same data set. In that sense, they are alternatives. Code can be written using on

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Cody. The "good boy" comment wasn't from me :) I was the one asking for help. On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger wrote: > Sean already answered your question. foreachRDD and foreachPartition are > completely different, there's nothing fuzzy or insufficient about that > a

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
ean it > 'works' to call foreachRDD on an RDD? > > @Dmitry are you asking about foreach vs foreachPartition? that's quite > different. foreachPartition does not give more parallelism but lets > you operate on a whole batch of data at once, which is nice if you > need
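
A sketch of how the two operations compose rather than substitute for one another, assuming a Spark version whose Java API accepts a VoidFunction for foreachRDD (older releases took Function<JavaRDD<T>, Void> and required returning null).

```java
import org.apache.spark.streaming.api.java.JavaDStream;

public class ForeachComposition {
  static void process(JavaDStream<String> stream) {
    stream.foreachRDD(rdd ->                  // driver side, once per batch
        rdd.foreachPartition(partition -> {   // executor side, once per partition
          while (partition.hasNext()) {
            partition.next();                 // placeholder per-record work
          }
        }));
  }
}
```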

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
I and DB clients on the executor > side. I think it's the most straightforward approach to dealing with any > non-serializable object you need. > > I don't entirely follow what over-network data shuffling effects you are > alluding to (maybe more specific to streaming?). &g

Re: What is a best practice for passing environment variables to Spark workers?

2015-07-10 Thread Dmitry Goldenberg
Thanks, Akhil. We're trying the conf.setExecutorEnv() approach since we've already got environment variables set. For system properties we'd go the conf.set("spark.") route. We were concerned that doing the below type of thing did not work, which this blog post seems to confirm ( http://proge
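
A hedged sketch of the two routes mentioned (variable names and values are placeholders): setExecutorEnv / spark.executorEnv.<NAME> for executor environment variables, and spark.executor.extraJavaOptions for JVM system properties on the executors.

```java
import org.apache.spark.SparkConf;

public class ExecutorEnvExample {
  static SparkConf build() {
    return new SparkConf()
        .setAppName("env-demo")
        .setExecutorEnv("SOLR_ZK_HOST", "zk1:2181")                       // env var on executors
        .set("spark.executorEnv.INDEX_NAME", "docs")                      // equivalent config form
        .set("spark.executor.extraJavaOptions", "-Dmy.prop=someValue");   // executor system property
  }
}
```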

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-18 Thread Dmitry Goldenberg
Thanks, Ted. Well, so far even there I'm only seeing my post and not, for example, your response. On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu wrote: > Was this one of the threads you participated ? > http://search-hadoop.com/m/JW1q5w0p8x1 > > You should be able to find your posts on search-hadoop.co

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
r/ >>> >>> Nick >>> >>> On Thu, Mar 19, 2015 at 5:18 AM Ted Yu wrote: >>> >>>> There might be some delay: >>>> >>>> >>>> http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responses&subj=Apache+Sp

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
posted? > > On Thu, Mar 19, 2015 at 10:49 AM Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> It seems that those archives are not necessarily easy to find stuff in. >> Is there a search engine on top of them? so as to find e.g. your own posts >> easi

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Yes, and Kafka topics are basically queues. So perhaps what's needed is just KafkaRDD with starting offset being 0 and finish offset being a very large number... Sent from my iPhone > On Apr 29, 2015, at 1:52 AM, ayan guha wrote: > > I guess what you mean is not streaming. If you create a st

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Part of the issue is, when you read messages in a topic, the messages are peeked, not polled, so there'll be no "when the queue is empty", as I understand it. So it would seem I'd want to do KafkaUtils.createRDD, which takes an array of OffsetRanges. Each OffsetRange is characterized by topic, p
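
A sketch of that createRDD call against the spark-streaming-kafka 0.8 API of the time; broker, topic, and end offsets are placeholders (in practice the until-offsets would be fetched from the partition leaders rather than hard-coded).

```java
import java.util.HashMap;
import java.util.Map;

import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

public class KafkaBatchRead {
  static JavaPairRDD<String, String> readOnce(JavaSparkContext jsc) {
    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "broker1:9092");

    OffsetRange[] ranges = new OffsetRange[] {
        OffsetRange.create("mytopic", 0, 0L, 1_000_000L),   // partition 0: [from, until)
        OffsetRange.create("mytopic", 1, 0L, 1_000_000L)    // partition 1
    };

    return KafkaUtils.createRDD(
        jsc, String.class, String.class,
        StringDecoder.class, StringDecoder.class,
        kafkaParams, ranges);
  }
}
```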

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
nning. If you > haven't set any rate limiting parameters, then the first batch will contain > all the messages. You can then kill the job after the first batch. It's > possible you may be able to kill the job from a > StreamingListener.onBatchCompleted, but I've neve

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
contain > all the messages. You can then kill the job after the first batch. It's > possible you may be able to kill the job from a > StreamingListener.onBatchCompleted, but I've never tried and don't know > what the consequences may be. > > On Wed, Apr 29, 2015

Profiling a spark job

2016-04-05 Thread Dmitry Olshansky
nt in EPollWait. However I'm using standalone mode with local master without starting separate daemon (could it be that I should?) --- Dmitry Olshansky - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional c

Re: [discuss] dropping Python 2.6 support

2016-01-10 Thread Dmitry Kniazev
st in the same environment. For example, we use virtualenv to run Spark with Python 2.7 and do not touch system Python 2.6. Thank you, Dmitry 09.01.2016, 06:36, "Sasha Kacanski" : > +1 > Companies that use stock python in redhat 2.6 will need to upgrade or install > fresh vers

Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Dmitry Goldenberg
We have a bunch of Spark jobs deployed and a few large resource files such as e.g. a dictionary for lookups or a statistical model. Right now, these are deployed as part of the Spark jobs which will eventually make the mongo-jars too bloated for deployments. What are some of the best practices to

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
ion could be to store them as blobs in a cache like Redis and then > read + broadcast them from the driver. Or you could store them in HDFS and > read + broadcast from the driver. > > Regards > Sab > >> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg >> wrote:

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Jorn, you said Ignite or ... ? What was the second choice you were thinking of? It seems that got omitted. > On Jan 12, 2016, at 2:44 AM, Jörn Franke wrote: > > You can look at ignite as a HDFS cache or for storing rdds. > >> On 11 Jan 2016, at 21:14, Dmitry Goldenberg

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
I'd guess that if the resources are broadcast Spark would put them into Tachyon... > On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg > wrote: > > Would it make sense to load them into Tachyon and read and broadcast them > from there since Tachyon is already a part of the

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
, Gene Pang wrote: > Hi Dmitry, > > Yes, Tachyon can help with your use case. You can read and write to > Tachyon via the filesystem api ( > http://tachyon-project.org/documentation/File-System-API.html). There is > a native Java API as well as a Hadoop-compatible API. Spark is a

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
since it's pluggable into Spark), Redis, and the like. Ideally, I would think we'd want resources to be loaded into the cluster memory as needed; paged in/out on-demand in an LRU fashion. From this perspective, it's not yet clear to me what the best option(s) would be. Any thoughts / re

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
The other thing from some folks' recommendations on this list was Apache Ignite. Their In-Memory File System ( https://ignite.apache.org/features/igfs.html) looks quite interesting. On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg wrote: > OK so it looks like Tachyon is a cluste

Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
I've instrumented checkpointing per the programming guide and I can tell that Spark Streaming is creating the checkpoint directories but I'm not seeing any content being created in those directories nor am I seeing the effects I'd expect from checkpointing. I'd expect any data that comes into Kafk
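
For context, a sketch of the checkpoint wiring the programming guide prescribes, with placeholder paths and intervals; it assumes the newer Function0-based getOrCreate overload (earlier releases used a JavaStreamingContextFactory). The stream construction itself is elided.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointedApp {
  static final String CHECKPOINT_DIR = "hdfs:///checkpoints/my-consumer";

  static JavaStreamingContext createContext() {
    SparkConf conf = new SparkConf().setAppName("kafka-consumer");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
    jssc.checkpoint(CHECKPOINT_DIR);
    // ... create the direct Kafka stream and wire up the processing here ...
    return jssc;
  }

  public static void main(String[] args) {
    // Recovers from the checkpoint if one exists, otherwise builds a fresh context.
    JavaStreamingContext jssc =
        JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, CheckpointedApp::createContext);
    jssc.start();
    jssc.awaitTermination();
  }
}
```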

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
f it's quite long. > > It sort of seems wrong though since > https://spark.apache.org/docs/latest/streaming-programming-guide.html > suggests it was intended to be a multiple of the batch interval. The > slide duration wouldn't always be relevant anyway. > > On Fri,

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
est/streaming-programming-guide.html > suggests it was intended to be a multiple of the batch interval. The > slide duration wouldn't always be relevant anyway. > > On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg > wrote: > > I've instrumented checkpointing per the

How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
We're getting the below error. Tried increasing spark.executor.memory e.g. from 1g to 2g but the below error still happens. Any recommendations? Something to do with specifying -Xmx in the submit job scripts? Thanks. Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit excee

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
at looks like it's during recovery from a checkpoint, so it'd be driver > memory not executor memory. > > How big is the checkpoint directory that you're trying to restore from? > > On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com&g

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
r we can estimate the > size of the checkpoint and compare with Runtime.getRuntime().freeMemory(). > > If the size of checkpoint is much bigger than free memory, log warning, etc > > Cheers > > On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com&

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
om? Thanks. On Mon, Aug 10, 2015 at 1:07 PM, Cody Koeninger wrote: > You need to keep a certain number of rdds around for checkpointing, based > on e.g. the window size. Those would all need to be loaded at once. > > On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg < > dgold

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
to workers. Hefty file system usage, hefty memory consumption... What can we do to offset some of these costs? On Mon, Aug 10, 2015 at 4:27 PM, Cody Koeninger wrote: > The rdd is indeed defined by mostly just the offsets / topic partitions. > > On Mon, Aug 10, 2015 at 3:24 PM

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
s the interval? that could at least > explain why it's not doing anything, if it's quite long. > > It sort of seems wrong though since > https://spark.apache.org/docs/latest/streaming-programming-guide.html > suggests it was intended to be a multiple of the batch interval. Th

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
er program could terminate as the last batch is being processed... On Fri, Aug 14, 2015 at 6:17 PM, Cody Koeninger wrote: > You'll resume and re-process the rdd that didnt finish > > On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote

Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-03 Thread Dmitry Goldenberg
rkers, and consumers and the value seems "stuck" at 10 seconds. Any ideas? We're running Spark 1.3.0 built for Hadoop 2.4. Thanks. - Dmitry

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
c={}.)...", appName, topic); // ... iterate ... log.info("Finished data partition processing (appName={}, topic={}). Documents processed: {}.", appName, topic, docCount); } Any ideas? Thanks. - Dmitry On Thu, Sep 3, 2015 at 10:45 PM, Tathagata Das wrote: > Are you accidentally recoveri

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
UI says? It should show > the underlying batch duration of the StreamingContext, the details of when > the batch starts, etc. > > BTW, it seems that the 5.6 or 6.8 seconds delay is present only when data > is present (that is, * Documents processed: > 0)* > > On Fri,

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
ontext(sparkConf, params);jssc.start(); jssc.awaitTermination(); jssc.close(); On Fri, Sep 4, 2015 at 8:48 PM, Tathagata Das wrote: > Are you sure you are not accidentally recovering from checkpoint? How are > you using StreamingContext.getOrCreate() in your code? > > TD &

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
ew ProcessPartitionFunction(params); rdd.foreachPartition(func); return null; } }); return jssc; } On Fri, Sep 4, 2015 at 8:57 PM, Dmitry Goldenberg wrote: > I'd think that we wouldn't be "accidentally recovering from checkpoint" > hours or ev

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
or moving the contents of the checkpoint directory > and restarting the job? > > On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Sorry, more relevant code below: >> >> SparkConf sparkConf = createSparkConf(appName, ka

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
what values are actually being read > out of your params object for the various settings. > > > On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I've stopped the jobs, the workers, and the master. Deleted the contents >

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
ger wrote: > Well, I'm not sure why you're checkpointing messages. > > I'd also put in some logging to see what values are actually being read > out of your params object for the various settings. > > > On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg < > dg

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
:23 PM, Tathagata Das wrote: > Why are you checkpointing the direct kafka stream? It serves not purpose. > > TD > > On Tue, Sep 8, 2015 at 9:35 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I just disabled checkpointing in our consumers and

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
ata > in each batch > > (which checkpoint is enabled using > streamingContext.checkpoint(checkpointDir)) and can recover from failure by > reading the exact same data back from Kafka. > > > TD > > On Tue, Sep 8, 2015 at 4:38 PM, Dmitry Goldenberg < > dgoldenbe

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
sing here? On Tue, Sep 8, 2015 at 11:42 PM, Tathagata Das wrote: > Well, you are returning JavaStreamingContext.getOrCreate(params. > getCheckpointDir(), factory); > That is loading the checkpointed context, independent of whether params > .isCheckpointed() is true. > > &
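
A sketch of the fix TD is pointing at: only go through getOrCreate when checkpointing is actually wanted, otherwise build a fresh context so changed settings such as the batch duration are not silently overridden by the recovered checkpoint. The params accessors are hypothetical and the context construction is elided.

```java
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ContextBuilder {
  interface ConsumerParams {           // hypothetical stand-in for the thread's params object
    boolean isCheckpointed();
    String getCheckpointDir();
    long getBatchDurationMillis();
  }

  static JavaStreamingContext build(ConsumerParams params) {
    if (params.isCheckpointed()) {
      // Recovers the old context, including its original batch duration.
      return JavaStreamingContext.getOrCreate(
          params.getCheckpointDir(), () -> createFreshContext(params));
    }
    return createFreshContext(params); // new settings take effect here
  }

  static JavaStreamingContext createFreshContext(ConsumerParams params) {
    // ... build the SparkConf and JavaStreamingContext from params (batch duration,
    // checkpoint dir if enabled, Kafka stream wiring) ...
    throw new UnsupportedOperationException("construction elided in this sketch");
  }
}
```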

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-09 Thread Dmitry Goldenberg
pointing to override its checkpoint duration millis, is there? Is the default there max(batchDurationMillis, 10 seconds)? Is there a way to override this? Thanks. On Wed, Sep 9, 2015 at 2:44 PM, Tathagata Das wrote: > > > See inline. > > On Tue, Sep 8, 2015 at 9:02 PM, Dmitry

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-10 Thread Dmitry Goldenberg
the time of > recovery? Trying to understand your usecase. > > > On Wed, Sep 9, 2015 at 12:03 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> >> when you use getOrCreate, and there exists a valid checkpoint, it will >> always return the c

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Dmitry Goldenberg
>> checkpoints can't be used between controlled restarts Is that true? If so, why? From my testing, checkpoints appear to be working fine, we get the data we've missed between the time the consumer went down and the time we brought it back up. >> If I cannot make checkpoints between code upgrades

A way to kill laggard jobs?

2015-09-11 Thread Dmitry Goldenberg
Is there a way to kill a laggard Spark job manually, and more importantly, is there a way to do it programmatically based on a configurable timeout value? Thanks.

A way to timeout and terminate a laggard 'Stage' ?

2015-09-14 Thread Dmitry Goldenberg
Is there a way in Spark to automatically terminate laggard "stage's", ones that appear to be hanging? In other words, is there a timeout for processing of a given RDD? In the Spark GUI, I see the "kill" function for a given Stage under 'Details for Job <...>". Is there something in Spark that w
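
Spark does not expose a stage-level timeout setting as far as I know, but one programmatic option is to tag the work with a job group and cancel the group from a watchdog; a sketch with an arbitrary timeout and a placeholder action.

```java
import java.util.Timer;
import java.util.TimerTask;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LaggardKiller {
  static void runWithTimeout(JavaSparkContext jsc, JavaRDD<String> rdd, long timeoutMs) {
    String group = "timed-job-" + System.currentTimeMillis();
    jsc.setJobGroup(group, "job with watchdog timeout", true);  // interruptOnCancel = true

    Timer watchdog = new Timer(true);                           // daemon timer thread
    watchdog.schedule(new TimerTask() {
      @Override public void run() {
        jsc.cancelJobGroup(group);                              // kills all jobs in the group
      }
    }, timeoutMs);

    try {
      rdd.count();                                              // placeholder for the real action
    } finally {
      watchdog.cancel();
      jsc.clearJobGroup();
    }
  }
}
```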

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Dmitry Goldenberg
proach, but yes you >> can have a separate program to keep an eye in the webui (possibly parsing >> the content) and make it trigger the kill task/job once it detects a lag. >> (Again you will have to figure out the correct numbers before killing any >> job) >> >>

Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-08 Thread Dmitry Goldenberg
ather be able to not have to pre-create the topics before I start the consumers. Any thoughts/comments would be appreciated. Thanks. - Dmitry Exception in thread "main" org.apache.spark

SparkConf.setExecutorEnv works differently in Spark 2.0.0

2016-10-08 Thread Dmitry Goldenberg
way to pass environment variables or system properties to the executor side, preferably a programmatic way rather than configuration-wise? Thanks, - Dmitry

Re: Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-29 Thread Dmitry Goldenberg
ipt at install time. This seems like a Kafka doc issue potentially, to explain what exactly one can expect from the auto.create.topics.enable flag. -Dmitry On Sat, Oct 8, 2016 at 1:26 PM, Cody Koeninger wrote: > So I just now retested this with 1.5.2, and 2.0.0, and the behavior is > ex
