To add more info: this project is on an older version of Spark, 1.5.0, and
on an older version of Kafka, 0.8.2.1 (2.10-0.8.2.1).
On Tue, Apr 2, 2019 at 11:39 AM Dmitry Goldenberg
wrote:
> Hi,
>
> I've got 3 questions/issues regarding checkpointing, was hoping someone
> could help shed some light on this.
Hi,
I've got 3 questions/issues regarding checkpointing, was hoping someone
could help shed some light on this.
We've got a Spark Streaming consumer consuming data from a Kafka topic;
works fine generally until I switch it to the checkpointing mode by calling
the 'checkpoint' method on the
Hi all,
I'm seeing an odd behavior where if I switch the context from regular to
checkpointed, the system properties are no longer automatically carried
over into the worker / executors and turn out to be null there.
This is in Java, using spark-streaming_2.10, version 1.5.0.
I'm placing
Hi,
I was wondering if folks have some ideas, recommendation for how to fix
this error (full stack trace included below).
We're on Kafka 0.10.0.0 and spark_streaming_2.11 v. 2.0.0.
We've tried a few things as suggested in these sources:
-
Hi,
Any reason why we might be getting this error? The code seems to work fine
in the non-distributed mode but the same code when run from a Spark job is
not able to get to Elastic.
Spark version: 2.0.1 built for Hadoop 2.4, Scala 2.11
Elastic version: 2.3.1
I've verified the Elastic hosts and
er to run with auto.create set to false (because it makes sure the
> topic is actually set up the way you want, reduces the likelihood of
> spurious topics being created, etc).
>
>
>
> On Sat, Oct 8, 2016 at 11:44 AM, Dmitry Goldenberg
> <dgoldenberg...@gmail.com> wrote:
loaded into the cluster memory as needed; paged in/out on-demand in an
LRU fashion. From this perspective, it's not yet clear to me what the best
option(s) would be. Any thoughts / recommendations would be appreciated.
On Tue, Jan 12, 2016 at 3:04 PM, Dmitry Goldenberg <dgoldenberg...@gmail.c
The other thing from some folks' recommendations on this list was Apache
Ignite. Their In-Memory File System (
https://ignite.apache.org/features/igfs.html) looks quite interesting.
On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com
> wrote:
> OK so it l
I'd guess that if the resources are broadcast Spark would put them into
Tachyon...
> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com>
> wrote:
>
> Would it make sense to load them into Tachyon and read and broadcast them
> from there since Tach
Jörn, you said Ignite or ...? What was the second choice you were thinking of?
It seems that got omitted.
> On Jan 12, 2016, at 2:44 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> You can look at ignite as a HDFS cache or for storing rdds.
>
>> On 11 Jan 2016, at
nthan.com> wrote:
>
> One option could be to store them as blobs in a cache like Redis and then
> read + broadcast them from the driver. Or you could store them in HDFS and
> read + broadcast from the driver.
>
> Regards
> Sab
>
>> On Tue, Jan 12, 2016 at 1
. Spark is also able to
> interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read
> input files from Tachyon and write output files to Tachyon.
>
> I hope that helps,
> Gene
>
> On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.c
We have a bunch of Spark jobs deployed and a few large resource files such
as e.g. a dictionary for lookups or a statistical model.
Right now, these are deployed as part of the Spark jobs which will
eventually make the mongo-jars too bloated for deployments.
What are some of the best practices
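One option, sketched here with hypothetical paths and class names, is to stop bundling the resources into the job jar and ship them with spark-submit instead; executors can then resolve the shipped files via SparkFiles:

```shell
# Sketch: ship large resource files alongside the job rather than inside the jar.
# All paths, the class name, and the master URL are illustrative.
spark-submit \
  --class com.example.StreamingConsumer \
  --master spark://master:7077 \
  --files /opt/resources/dictionary.txt,/opt/resources/model.bin \
  streaming-consumer.jar
# On executors the shipped files are available at SparkFiles.get("dictionary.txt").
```
This keeps the jar small; broadcast variables remain the better fit when the resource must be deserialized once and reused across many tasks.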
N/m, these are just profiling snapshots :) Sorry for the wide distribution.
On Tue, Nov 10, 2015 at 9:46 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com
> wrote:
> We're seeing a bunch of .snapshot files being created under
> /home/spark/Snapshots, such as the following
We're seeing a bunch of .snapshot files being created under
/home/spark/Snapshots, such as the following for example:
CoarseGrainedExecutorBackend-2015-08-27-shutdown.snapshot
CoarseGrainedExecutorBackend-2015-08-31-shutdown-1.snapshot
SparkSubmit-2015-08-31-shutdown-1.snapshot
Thanks, Ted, will try it out.
On Wed, Sep 30, 2015 at 9:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> See the tail of this:
> https://bugzilla.redhat.com/show_bug.cgi?id=1005811
>
> FYI
>
> > On Sep 30, 2015, at 5:54 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com>
Is there a way to ensure Spark doesn't write to /tmp directory?
We've got spark.local.dir specified in the spark-defaults.conf file to
point at another directory. But we're seeing many of
these snappy-unknown-***-libsnappyjava.so files being written to /tmp still.
Is there a config setting or
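For what it's worth, the libsnappyjava.so files come from snappy-java extracting its native library into java.io.tmpdir, which spark.local.dir does not govern. A sketch of the relevant settings (directory paths are examples):

```properties
# spark-defaults.conf (paths illustrative)
spark.local.dir                  /data/spark-local
spark.driver.extraJavaOptions    -Djava.io.tmpdir=/data/spark-tmp -Dorg.xerial.snappy.tempdir=/data/spark-tmp
spark.executor.extraJavaOptions  -Djava.io.tmpdir=/data/spark-tmp -Dorg.xerial.snappy.tempdir=/data/spark-tmp
```
org.xerial.snappy.tempdir is snappy-java's own override for where it extracts the native library.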
If the protobuf jar is loaded ahead of hbase-protocol.jar, things start to get
> interesting ...
>
> On Tue, Sep 29, 2015 at 6:12 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Ted, I think I have tried these settings with the hbase protocol jar, to
with fewer partitions/replicas, see if it works.
>
> -adrian
>
> From: Dmitry Goldenberg
> Date: Tuesday, September 29, 2015 at 3:37 PM
> To: Adrian Tanase
> Cc: "user@spark.apache.org"
> Subject: Re: Kafka error "partitions don't have a leader" /
> LeaderNo
> -adrian
>
> From: Dmitry Goldenberg
> Date: Tuesday, September 29, 2015 at 3:26 PM
> To: "user@spark.apache.org"
> Subject: Kafka error "partitions don't have a leader" /
> LeaderNotAvailableException
>
> I apologize for posting this Kafka related issue
I apologize for posting this Kafka related issue into the Spark list. Have
gotten no responses on the Kafka list and was hoping someone on this list
could shed some light on the below.
---
We're running into
, you
> may want to look through the kafka jira, e.g.
>
> https://issues.apache.org/jira/browse/KAFKA-899
>
>
> On Tue, Sep 29, 2015 at 8:05 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> "more partitions and replicas than available brokers" --
release of Spark
> command line for running Spark job
>
> Cheers
>
> On Tue, Sep 29, 2015 at 1:37 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> We're seeing this occasionally. Granted, this was caused by a wrinkle in
>> the Solr schema but
We're seeing this occasionally. Granted, this was caused by a wrinkle in
the Solr schema but this bubbled up all the way in Spark and caused job
failures.
I just checked and SolrException class is actually in the consumer job jar
we use. Is there any reason why Spark cannot find the
I'm actually not sure how either one of these would possibly cause Spark to
find SolrException. Whether the driver or executor class path is first,
should it not matter, if the class is in the consumer job jar?
On Tue, Sep 29, 2015 at 9:12 PM, Dmitry Goldenberg <dgoldenberg...@gmail.
wrote:
> Have you tried the following ?
> --conf spark.driver.userClassPathFirst=true --conf spark.executor.
> userClassPathFirst=true
>
> On Tue, Sep 29, 2015 at 4:38 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Release of Spar
correct numbers before killing any
>> job)
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Sep 14, 2015 at 10:40 PM, Dmitry Goldenberg <
>> dgoldenberg...@gmail.com> wrote:
>>
>>> Is there a way in Spark to automatically terminate
Is there a way in Spark to automatically terminate laggard "stages", ones
that appear to be hanging? In other words, is there a timeout for
processing of a given RDD?
In the Spark GUI, I see the "kill" function for a given Stage under
'Details for Job <...>".
Is there something in Spark that
Is there a way to kill a laggard Spark job manually, and more importantly,
is there a way to do it programmatically based on a configurable timeout
value?
Thanks.
>> checkpoints can't be used between controlled restarts
Is that true? If so, why? From my testing, checkpoints appear to be working
fine, we get the data we've missed between the time the consumer went down
and the time we brought it back up.
>> If I cannot make checkpoints between code
erval at the time of
> recovery? Trying to understand your usecase.
>
>
> On Wed, Sep 9, 2015 at 12:03 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> >> when you use getOrCreate, and there exists a valid checkpoint, it will
>> always return the
inting to override
its checkpoint duration millis, is there? Is the default there
max(batchDurationMillis, 10 seconds)? Is there a way to override this?
Thanks.
On Wed, Sep 9, 2015 at 2:44 PM, Tathagata Das <t...@databricks.com> wrote:
>
>
> See inline.
>
> On Tue, Sep
M, Tathagata Das <t...@databricks.com> wrote:
> Why are you checkpointing the direct kafka stream? It serves no purpose.
>
> TD
>
> On Tue, Sep 8, 2015 at 9:35 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> I just disabled checkpointing in
>
> On Tue, Sep 8, 2015 at 8:28 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> That is good to know. However, that doesn't change the problem I'm
>> seeing. Which is that, even with that piece of code commented out
>> (stream.checkpoint()), th
in each batch
>
> (which checkpoint is enabled using
> streamingContext.checkpoint(checkpointDir)) and can recover from failure by
> reading the exact same data back from Kafka.
>
>
> TD
>
> On Tue, Sep 8, 2015 at 4:38 PM, Dmitry Goldenberg <
> dgoldenberg...@gma
eninger <c...@koeninger.org> wrote:
> Well, I'm not sure why you're checkpointing messages.
>
> I'd also put in some logging to see what values are actually being read
> out of your params object for the various settings.
>
>
> On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenber
logging to see what values are actually being read
> out of your params object for the various settings.
>
>
> On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> I've stopped the jobs, the workers, and the master. Deleted the con
deleting or moving the contents of the checkpoint directory
> and restarting the job?
>
> On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Sorry, more relevant code below:
>>
>> SparkConf sparkConf = createSparkConf(a
15 at 10:45 PM, Tathagata Das <t...@databricks.com> wrote:
> Are you accidentally recovering from checkpoint files which has 10 second
> as the batch interval?
>
>
> On Thu, Sep 3, 2015 at 7:34 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> I'm s
> On Fri, Sep 4, 2015 at 3:38 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Tathagata,
>>
>> Checkpointing is turned on but we were not recovering. I'm looking at the
>> logs now, feeding fresh content hours after the restart. Here's a snippet:
>
>> On Fri, Sep 4, 2015 at 3:50 PM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> Could you see what the streaming tab in the Spark UI says? It should
>>> show the underlying batch duration of the StreamingContext, the details of
>>> when the b
>() {
    @Override
    public Void call(JavaRDD rdd) throws Exception {
        ProcessPartitionFunction func = new ProcessPartitionFunction(params);
        rdd.foreachPartition(func);
        return null;
    }
});
return jssc;
}
On Fri, Sep 4, 2015 at 8:57 PM, Dmitry Goldenberg
I'm seeing an oddity where I initially set the batchDurationMillis to 1
second and it works fine:
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.milliseconds(batchDurationMillis));
Then I tried changing the value to 10 seconds. The change didn't seem to
take. I've
It sort of seems wrong though since
https://spark.apache.org/docs/latest/streaming-programming-guide.html
suggests it was intended to be a multiple of the batch interval. The
slide duration wouldn't always be relevant anyway.
On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg
dgoldenberg
could
terminate as the last batch is being processed...
On Fri, Aug 14, 2015 at 6:17 PM, Cody Koeninger c...@koeninger.org wrote:
You'll resume and re-process the rdd that didn't finish
On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Our additional question
We're getting the below error. Tried increasing spark.executor.memory e.g.
from 1g to 2g but the below error still happens.
Any recommendations? Something to do with specifying -Xmx in the submit job
scripts?
Thanks.
Exception in thread main java.lang.OutOfMemoryError: GC overhead limit
a checkpoint whether we can estimate the
size of the checkpoint and compare with Runtime.getRuntime().freeMemory().
If the size of checkpoint is much bigger than free memory, log warning, etc
Cheers
On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Thanks, Cody
, Aug 10, 2015 at 1:07 PM, Cody Koeninger c...@koeninger.org wrote:
You need to keep a certain number of rdds around for checkpointing, based
on e.g. the window size. Those would all need to be loaded at once.
On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
I've instrumented checkpointing per the programming guide and I can tell
that Spark Streaming is creating the checkpoint directories but I'm not
seeing any content being created in those directories nor am I seeing the
effects I'd expect from checkpointing. I'd expect any data that comes into
Thanks, Akhil.
We're trying the conf.setExecutorEnv() approach since we've already got
environment variables set. For system properties we'd go the
conf.set("spark.*") route.
We were concerned that doing the below type of thing did not work, which
this blog post seems to confirm (
Richard,
That's exactly the strategy I've been trying, which is a wrapper singleton
class. But I was seeing the inner object being created multiple times.
I wonder if the problem has to do with the way I'm processing the RDD's.
I'm using JavaDStream to stream data (from Kafka). Then I'm
My singletons do in fact stick around. They're one per worker, looks like.
So with 4 workers running on the box, we're creating one singleton per
worker process/jvm, which seems OK.
Still curious about foreachPartition vs. foreachRDD though...
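For reference, the one-instance-per-worker-JVM behavior described above can be sketched with a plain lazily-initialized holder (class and method names are illustrative, not from the original code); called from inside foreachPartition, it initializes once per executor JVM and is then reused by every task in that process:

```java
// Minimal sketch of the one-instance-per-worker-JVM pattern (illustrative names).
class SharedResource {
    private static volatile SharedResource instance;
    private static int initCount = 0; // incremented once per JVM; visible for testing

    private SharedResource() {}

    // Lazily create the instance; when called inside foreachPartition this
    // runs on the executor JVM, not on the driver.
    public static SharedResource getInstance() {
        if (instance == null) {
            synchronized (SharedResource.class) {
                if (instance == null) {
                    initCount++;
                    instance = new SharedResource();
                }
            }
        }
        return instance;
    }

    public static int getInitCount() { return initCount; }
}
```
With 4 worker processes on a box, each JVM runs this initialization exactly once, matching the "one singleton per worker process" observation above.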
On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher
These are quite different operations. One operates on RDDs in DStream and
one operates on partitions of an RDD. They are not alternatives.
Sean, different operations as they are, they can certainly be used on the
same data set. In that sense, they are alternatives. Code can be written
using one
the processing.
On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
These are quite different operations. One operates on RDDs in DStream
and
one operates on partitions of an RDD. They are not alternatives.
Sean, different operations as they are, they can certainly
Thanks, Cody. The good boy comment wasn't from me :) I was the one
asking for help.
On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger c...@koeninger.org wrote:
Sean already answered your question. foreachRDD and foreachPartition are
completely different, there's nothing fuzzy or insufficient
Great, thank you, Silvio. In your experience, is there any way to instrument
a callback into Coda Hale or the Spark consumers from the metrics sink? If
the sink performs some steps once it has received the metrics, I'd like to
be able to make the consumers aware of that via some sort of a
of the currently running cluster
I was thinking more about the scenario where you have e.g. 100 boxes and
want to / can add e.g. 20 more
*From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
*Sent:* Wednesday, June 3, 2015 4:46 PM
*To:* Evo Eftimov
*Cc:* Cody Koeninger; Andrew Or; Gerard
On Thu, Jun 11, 2015 at 7:30 AM, Cody Koeninger c...@koeninger.org
wrote:
Depends on what you're reusing multiple times (if anything).
Read
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
On Wed, Jun 10, 2015 at 12:18 AM, Dmitry Goldenberg
dgoldenberg
unless you want it to.
If you want it, just call cache.
On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
set the storage policy for the DStream RDDs to MEMORY AND DISK - it
appears the storage level can be specified in the createStream methods
Dmitry's initial question – you can load large data sets as Batch
(Static) RDDs from any Spark Streaming App and then join DStream RDDs
against them to emulate “lookups”; you can also try the “Lookup RDD” –
there is a GitHub project
*From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
*Sent
Thanks so much, Yiannis, Olivier, Huang!
On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas johngou...@gmail.com
wrote:
Hi there,
I would recommend checking out
https://github.com/spark-jobserver/spark-jobserver which I think gives
the functionality you are looking for.
I haven't tested it
Shixiong,
Thanks, interesting point. So if we want to only process one batch then
terminate the consumer, what's the best way to achieve that? Presumably the
listener could set a flag on the driver notifying it that it can terminate.
But the driver is not in a loop, it's basically blocked in
Date:2015/05/28 13:22 (GMT+00:00)
To: Dmitry Goldenberg
Cc: Gerard Maas ,spark users
Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth
in Kafka or Spark's metrics?
You can always spin new boxes in the background and bring them into the
cluster fold when fully
resources /
nodes which allows you to double, triple etc in the background/parallel the
resources of the currently running cluster
I was thinking more about the scenario where you have e.g. 100 boxes and
want to / can add e.g. 20 more
*From:* Dmitry Goldenberg [mailto:dgoldenberg
So Evo, option b is to singleton the Param, as in your modified snippet,
i.e. instantiate it once per RDD.
But if I understand correctly the a) option is broadcast, meaning
instantiation is in the Driver once before any transformations and actions,
correct? That's where my serialization costs
, spark streaming (spark) will NOT resort to disk –
and of course resorting to disk from time to time (ie when there is no free
RAM ) and taking a performance hit from that, BUT only until there is no
free RAM
*From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
*Sent:* Thursday, May 28
Thank you, Tathagata, Cody, Otis.
- Dmitry
On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic otis.gospodne...@gmail.com
wrote:
I think you can use SPM - http://sematext.com/spm - it will give you all
Spark and all Kafka metrics, including offsets broken down by topic, etc.
out of the box.
Thank you, Gerard.
We're looking at the receiver-less setup with Kafka Spark streaming so I'm
not sure how to apply your comments to that case (not that we have to use
receiver-less but it seems to offer some advantages over the
receiver-based).
As far as the number of Kafka receivers is fixed
Thanks, Evo. Per the last part of your comment, it sounds like we will
need to implement a job manager which will be in control of starting the
jobs, monitoring the status of the Kafka topic(s), shutting jobs down and
marking them as ones to relaunch, scaling the cluster up/down by
Got it, thank you, Tathagata and Ted.
Could you comment on my other question
http://apache-spark-user-list.1001560.n3.nabble.com/Autoscaling-Spark-cluster-based-on-topic-sizes-rate-of-growth-in-Kafka-or-Spark-s-metrics-tt23062.html
as well? Basically, I'm trying to get a handle on a good
Thanks, Akhil. So what do folks typically do to increase/contract the capacity?
Do you plug in some cluster auto-scaling solution to make this elastic?
Does Spark have any hooks for instrumenting auto-scaling?
In other words, how do you avoid overwhelming the receivers in a scenario when
your
Thanks, Akhil. It looks like in the second example, for Rabbit they're
doing this: https://www.rabbitmq.com/mqtt.html.
On Tue, May 12, 2015 at 7:37 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
I found two examples Java version
Sean,
How does this model actually work? Let's say we want to run one job as N
threads executing one particular task, e.g. streaming data out of Kafka
into a search engine. How do we configure our Spark job execution?
Right now, I'm seeing this job running as a single thread. And it's quite a
RDD is determined by the block
interval and the batch interval. If you have a batch interval of 10s
and block interval of 1s you'll get 10 partitions of data in the RDD.
On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Understood. We'll use the multi-threaded
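Sean's rule of thumb above (a 10s batch interval with a 1s block interval yields 10 partitions) can be written out directly; the method below is our own illustration, not a Spark API:

```java
// Sketch: estimated partitions per batch RDD for a receiver-based DStream.
class BlockIntervalMath {
    // batchIntervalMs / blockIntervalMs, e.g. 10s batches with 1s blocks -> 10 partitions.
    public static long partitionsPerBatch(long batchIntervalMs, long blockIntervalMs) {
        if (blockIntervalMs <= 0 || batchIntervalMs < blockIntervalMs) {
            throw new IllegalArgumentException(
                "batch interval should be a positive multiple of block interval");
        }
        return batchIntervalMs / blockIntervalMs;
    }
}
```
Since partitions bound parallelism, shrinking the block interval (spark.streaming.blockInterval) is one lever for getting more tasks per batch.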
then. If you run code on your driver,
it's not distributed. If you run Spark over an RDD with 1 partition,
only one task works on it.
On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Sean,
How does this model actually work? Let's say we want to run one job as N
Understood. We'll use the multi-threaded code we already have..
How are these execution slots filled up? I
Yes, and Kafka topics are basically queues. So perhaps what's needed is just
KafkaRDD with starting offset being 0 and finish offset being a very large
number...
Sent from my iPhone
On Apr 29, 2015, at 1:52 AM, ayan guha guha.a...@gmail.com wrote:
I guess what you mean is not streaming.
Part of the issues is, when you read messages in a topic, the messages are
peeked, not polled, so there's no "when the queue is empty" condition, as I
understand it.
So it would seem I'd want to do KafkaUtils.createRDD, which takes an array
of OffsetRange's. Each OffsetRange is characterized by topic,
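A sketch of that bounded read: the value class below is a stand-in for Spark's org.apache.spark.streaming.kafka.OffsetRange (real usage would pass such ranges to KafkaUtils.createRDD); the latest offset per partition is assumed to have been fetched from Kafka already:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Spark's OffsetRange, to show how a finite batch read is described.
class OffsetRangeSketch {
    public final String topic;
    public final int partition;
    public final long fromOffset;   // inclusive
    public final long untilOffset;  // exclusive

    public OffsetRangeSketch(String topic, int partition, long fromOffset, long untilOffset) {
        this.topic = topic;
        this.partition = partition;
        this.fromOffset = fromOffset;
        this.untilOffset = untilOffset;
    }

    // One range per partition, from offset 0 up to the latest known offset:
    // the whole topic is consumed as a single bounded batch, not a stream.
    public static List<OffsetRangeSketch> wholeTopic(String topic, long[] latestOffsets) {
        List<OffsetRangeSketch> ranges = new ArrayList<>();
        for (int p = 0; p < latestOffsets.length; p++) {
            ranges.add(new OffsetRangeSketch(topic, p, 0L, latestOffsets[p]));
        }
        return ranges;
    }
}
```
The job then terminates naturally once every range is exhausted, which is the non-streaming behavior being asked about.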
. You can then kill the job after the first batch. It's
possible you may be able to kill the job from a
StreamingListener.onBatchCompleted, but I've never tried and don't know
what the consequences may be.
On Wed, Apr 29, 2015 at 8:52 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote
On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Thanks, Ted. Well, so far even there I'm only seeing my post and not,
for example, your response.
On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu
, 2015 at 10:49 AM Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
It seems that those archives are not necessarily easy to find stuff in.
Is there a search engine on top of them? so as to find e.g. your own posts
easily?
On Thu, Mar 19, 2015 at 10:34 AM, Nicholas Chammas
nicholas.cham
, ConnectionFactory is an interface defined inside JdbcRDD, not
scala Function0
On Wed, Feb 18, 2015 at 4:50 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
That's exactly what I was doing. However, I ran into runtime issues with
doing that. For instance, I had a
public class DbConnection
Of
course, a numeric primary key is going to be the most efficient way to do
that.
On Thu, Feb 19, 2015 at 8:57 AM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Yup, I did see that. Good point though, Cody. The mismatch was happening
for me when I was trying to get the 'new JdbcRDD
Thanks, Cody. Yes, I originally started off by looking at that but I get a
compile error if I try and use that approach: constructor JdbcRDD in class
JdbcRDD<T> cannot be applied to given types. Not to mention that
JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last
argument).
/apache/spark/JavaJdbcRDDSuite.java#L90
is calling a static method JdbcRDD.create, not new JdbcRDD. Is that what
you tried doing?
On Wed, Feb 18, 2015 at 12:00 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Thanks, Cody. Yes, I originally started off by looking at that but I get
I'm not sure what on the driver means but I've tried
setting spark.files.userClassPathFirst to true,
in $SPARK-HOME/conf/spark-defaults.conf and also in the SparkConf
programmatically; it appears to be ignored. The solution was to follow
Emre's recommendation and downgrade the selected Solrj
...@koeninger.org wrote:
Is sc there a SparkContext or a JavaSparkContext? The compilation error
seems to indicate the former, but JdbcRDD.create expects the latter
On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
I have tried that as well, I get a compile
Are you proposing I downgrade Solrj's httpclient dependency to be on par with
that of Spark/Hadoop? Or upgrade Spark/Hadoop's httpclient to the latest?
Solrj has to stay with its selected version. I could try and rebuild Spark with
the latest httpclient but I've no idea what effects that may
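An alternative to rebuilding Spark or pinning versions, assuming a Maven build (the plugin config below is a sketch, not taken from the original project), is to shade and relocate httpclient inside the job jar so SolrJ's copy cannot collide with the versions Spark/Hadoop put on the classpath:

```xml
<!-- pom.xml sketch: relocate SolrJ's httpclient packages in the uber-jar -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>org.apache.http</pattern>
        <shadedPattern>shaded.org.apache.http</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```
The job jar then loads SolrJ against the relocated classes regardless of what httpclient version Spark itself ships.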
that version 4.0.0 of SolrJ had all the
functionality I needed.
--
Emre Sevinç
http://www.bigindustries.be/
On Wed, Feb 18, 2015 at 4:39 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
I think I'm going to have to rebuild Spark with
commons.httpclient.version set to 4.3.1 which
Thanks, Emre! Will definitely try this.
On Wed, Feb 18, 2015 at 11:00 AM, Emre Sevinc emre.sev...@gmail.com wrote:
On Wed, Feb 18, 2015 at 4:54 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would
that not collide
I think I'm going to have to rebuild Spark with commons.httpclient.version
set to 4.3.1 which looks to be the version chosen by Solrj, rather than the
4.2.6 that Spark's pom mentions. Might work.
On Wed, Feb 18, 2015 at 1:37 AM, Arush Kharbanda ar...@sigmoidanalytics.com
wrote:
Hi
Did you
the
org.apache.spark.api.java.function.Function
interface and pass an instance of that to JdbcRDD.create ?
On Wed, Feb 18, 2015 at 3:48 PM, Dmitry Goldenberg
dgoldenberg...@gmail.com wrote:
Cody, you were right, I had a copy and paste snag where I ended up with a
vanilla SparkContext rather than a Java one. I also had