Hard to say without seeing the code, but if you do multiple actions on an
RDD without caching, the RDD will be computed multiple times.
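For example, a minimal sketch (assuming an existing SparkContext named
sc; the input path is a placeholder):

val rdd = sc.textFile("hdfs:///some/path").map(_.length)
rdd.cache()             // mark for caching before the first action
val total = rdd.count() // first action computes the RDD and fills the cache
val max = rdd.max()     // second action is served from the cache, no recompute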
On Sep 10, 2016 2:43 AM, "Cheng Yi" wrote:
After some investigation, the problem I see is likely caused by a filter and
union of the
Does the same thing happen if you're only using direct stream plus back
pressure, not the receiver stream?
On Sep 9, 2016 6:41 PM, "Jeff Nadler" wrote:
> Maybe this is a pretty esoteric implementation, but I'm seeing some bad
> behavior with backpressure plus multiple Kafka
- If you're seeing repeated attempts to process the same message, you
should be able to look in the UI or logs and see that a task has
failed. Figure out why that task failed before chasing other things
- You're not using the latest version; the latest version is for Spark
2.0. There are two
To be clear, "safe" has very little to do with this.
It's pretty clear that there's very little risk of the spark module
for kinesis being considered a derivative work, much less all of
spark.
The use limitation in 3.3 that caused the amazon license to be put on
the apache X list also doesn't
ask how big an overhead is that?
>
> It happens intermittently and each time it happens retry is successful.
>
> Srikanth
>
> On Wed, Sep 7, 2016 at 3:55 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> That's not what I would have expected to happen with a
k.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecuto
dy. Setting poll timeout helped.
>> Our network is fine but brokers are not fully provisioned in test cluster.
>> But there isn't enough load to max out on broker capacity.
>> Curious that kafkacat running on the same node doesn't have any issues.
>>
>> Srikanth
>>
I don't see a reason to remove the non-assembly artifact, why would
you? You're not distributing copies of Amazon licensed code, and the
Amazon license goes out of its way not to over-reach regarding
derivative works.
This seems pretty clearly to fall in the spirit of
The restart doesn't have to be all that
> silent. It requires us to set a flag. I thought auto.offset.reset is that
> flag.
> But there isn't much I can do at this point given that retention has cleaned
> things up.
> The app has to start. Let admins address the data loss on the side.
>
tion instead of honoring
> auto.offset.reset.
> It isn't a corner case where retention expired after driver created a batch.
> It's easily reproducible and consistent.
>
> On Tue, Sep 6, 2016 at 3:34 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> You don't
In general, see the material linked from
https://github.com/koeninger/kafka-exactly-once if you want a better
understanding of the direct stream.
For spark-streaming-kafka-0-8, the direct stream doesn't really care
about consumer group, since it uses the simple consumer. For the 0.10
version,
on was why I got the exception instead of it using
> auto.offset.reset on restart?
>
> On Tue, Sep 6, 2016 at 10:48 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> If you leave enable.auto.commit set to true, it will commit offsets to
>>
> enable.auto.commit = true
> auto.offset.reset = latest
>
> Srikanth
>
> On Sat, Sep 3, 2016 at 8:59 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Seems like you're confused about the purpose of that line of code, it
>> applies to executors, not the driver.
The Kafka commit api isn't transactional, you aren't going to get
exactly once behavior out of it even if you were committing offsets on
a per-partition basis. This doesn't really have anything to do with
Spark; the old code you posted was already inherently broken.
Make your outputs idempotent.
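One toy sketch of the idempotent-output idea (the map here is just an
in-memory stand-in; a real job would upsert into an external store):

import java.util.concurrent.ConcurrentHashMap

// stand-in for an external store with upsert-by-key semantics
val store = new ConcurrentHashMap[String, String]()

def save(topic: String, partition: Int, offset: Long, value: String): Unit = {
  val key = s"$topic-$partition-$offset" // deterministic key per message
  store.put(key, value) // a replay overwrites the same key instead of duplicating
}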
r
> temporarily excluding partitions is there any way I can supply
> topic-partition info on the fly at the beginning of every pull dynamically.
> Will streaminglistener be of any help?
>
> On Fri, Sep 2, 2016 at 10:37 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>
If you just want to pause the whole stream, just stop the app and then
restart it when you're ready.
If you want to do some type of per-partition manipulation, you're
going to need to write some code. The 0.10 integration makes the
underlying kafka consumer pluggable, so you may be able to wrap
Why not just use different files for Kafka? Nothing else in Spark
should be using those Kafka configuration parameters.
On Thu, Sep 1, 2016 at 3:26 AM, Eric Ho wrote:
> I'm interested in what I should put into the trustStore file, not just for
> Spark but also for Kafka
[
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453789#comment-15453789
]
Cody Koeninger edited comment on SPARK-15406 at 9/1/16 2:26 AM:
There's
[
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453789#comment-15453789
]
Cody Koeninger commented on SPARK-15406:
There's a big difference between continuing to publish
[
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453324#comment-15453324
]
Cody Koeninger commented on SPARK-15406:
If people want to use older versions of kafka, why
[
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453246#comment-15453246
]
Cody Koeninger commented on SPARK-15406:
Yes.
> Structured streaming support for consuming f
[
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453124#comment-15453124
]
Cody Koeninger commented on SPARK-15406:
I don't think it makes sense to try and support multiple
; Warden Ave
> Markham, ON L6G 1C7
> Canada
>
>
>
> - Original message -
> From: Cody Koeninger <c...@koeninger.org>
> To: Eric Ho <e...@analyticsmd.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Spark to
Encryption is only available in spark-streaming-kafka-0-10, not 0-8.
You enable it the same way you enable it for the Kafka project's new
consumer, by setting kafka configuration parameters appropriately.
http://kafka.apache.org/documentation.html#security_ssl
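A minimal sketch of the relevant consumer parameters (paths and
passwords are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9093",
  "security.protocol" -> "SSL",
  "ssl.truststore.location" -> "/path/to/truststore.jks",
  "ssl.truststore.password" -> "changeit",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer]
)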
On Wed, Aug 31, 2016 at 2:03 AM,
http://blog.originate.com/blog/2014/02/27/types-inside-types-in-scala/
On Wed, Aug 31, 2016 at 2:19 AM, Sean Owen wrote:
> Weird, I recompiled Spark with a similar change to Model and it seemed
> to work but maybe I missed a step in there.
>
> On Wed, Aug 31, 2016 at 6:33 AM,
> wrote:
>>>
>>> thats great
>>>
>>> is this effort happening anywhere that is publicly visible? github?
>>>
>>> On Tue, Aug 16, 2016 at 2:04 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>> We (the team at
[
https://issues.apache.org/jira/browse/SPARK-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15441981#comment-15441981
]
Cody Koeninger commented on SPARK-17280:
I can take a look but there's not a lot to go
http://spark.apache.org/docs/latest/api/java/index.html
messageHandler receives a kafka MessageAndMetadata object.
Alternatively, if you just need metadata information on a
per-partition basis, you can use HasOffsetRanges
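A minimal sketch (assuming a direct stream named stream; the cast has
to happen on the first RDD in the chain, before any shuffle):

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}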
stable scenarios (e.g.
> advertised hostname failures on EMR).
>
> Maelstrom will work I believe even for Spark 1.3 and Kafka 0.8.2.1 (and of
> course with the latest Kafka 0.10 as well)
>
>
> On Wed, Aug 24, 2016 at 9:49 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>>
You can set that poll timeout higher with
spark.streaming.kafka.consumer.poll.ms
but half a second is fairly generous. I'd try to take a look at
what's going on with your network or kafka broker during that time.
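For example (a sketch; the 10 second value is just an illustration):

val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.kafka.consumer.poll.ms", "10000")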
On Tue, Aug 23, 2016 at 4:44 PM, Srikanth wrote:
> Hello,
Were you aware that the spark 2.0 / kafka 0.10 integration also reuses
kafka consumer instances on the executors?
On Tue, Aug 23, 2016 at 3:19 PM, Jeoffrey Lim wrote:
> Hi,
>
> I have released the first version of a new Kafka integration with Spark
> that we use in the
See
https://github.com/koeninger/kafka-exactly-once
On Aug 23, 2016 10:30 AM, "KhajaAsmath Mohammed"
wrote:
> Hi Experts,
>
> I am looking for some information on how to achieve zero data loss while
> working with kafka and Spark. I have searched online and blogs have
>
[
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430872#comment-15430872
]
Cody Koeninger commented on SPARK-17147:
My point is more that this probably isn't just two lines
[
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429742#comment-15429742
]
Cody Koeninger commented on SPARK-17147:
Have you successfully used the 0.8 consumer
both fault tolerance and
> application code/config changes without checkpointing? Is there anything
> else which checkpointing gives? I might be missing something.
>
>
> Regards,
> Chandan
>
>
> On Thu, Aug 18, 2016 at 8:27 PM, Cody Koeninger <c...@koeninger.org>
Checkpointing is not kafka-specific. It encompasses metadata about the
application. You can't re-use a checkpoint if your application has changed.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code
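The usual pattern looks roughly like this (checkpoint directory and app
name are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/myapp"

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("myapp"), Seconds(10))
  ssc.checkpoint(checkpointDir)
  // define the DStream graph here; it is serialized into the checkpoint,
  // which is why a changed application can't reuse an old one
  ssc
}

val context = StreamingContext.getOrCreate(checkpointDir, createContext _)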
On Thu, Aug 18, 2016 at 4:39 AM, chandan prakash
streaming-kafka-0-10-assembly??
>
> Srikanth
>
> On Fri, Aug 12, 2016 at 5:15 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Hrrm, that's interesting. Did you try with subscribe pattern, out of
>> curiosity?
>>
>> I haven't tested repartitioning
https://issues.apache.org/jira/browse/SPARK-15406
I'm not working on it (yet?), never got an answer to the question of
who was planning to work on it.
On Mon, Aug 15, 2016 at 9:12 PM, Guo, Chenzhao wrote:
> Hi all,
>
>
>
> I’m trying to write Structured Streaming test
No, you really shouldn't rely on checkpoints if you can't afford to
reprocess from the beginning of your retention (or lose data and start
from the latest messages).
If you're in a real bind, you might be able to get something out of
the serialized data in the checkpoint, but it'd probably be
logger.info(s"rdd has ${rdd.getNumPartitions} partitions.")
>
>
> Should I be setting some parameter/config? Is the doc for new integ
> available?
>
> Thanks,
> Srikanth
>
> On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>
e same consumer group name. But this is not working though. Somehow
> createstream is picking the offset from somewhere other than
> /consumers/ from zookeeper
>
>
> Sent from Samsung Mobile.
>
> Original message
[
https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418747#comment-15418747
]
Cody Koeninger commented on SPARK-16917:
It sounds to me like the documentation is clear, because
16/08/10 18:16:44 INFO JobScheduler: Added jobs for time 1470833204000 ms
> 16/08/10 18:16:45 INFO JobScheduler: Added jobs for time 1470833205000 ms
> 16/08/10 18:16:46 INFO JobScheduler: Added jobs for time 1470833206000 ms
> 16/08/10 18:16:47 INFO JobScheduler: Added jobs for time 1470833
Those logs you're posting are from right after your failure; they don't
include what actually went wrong when attempting to read json. Look at your
logs more carefully.
On Aug 10, 2016 2:07 AM, "Diwakar Dhanuskodi"
wrote:
> Hi Siva,
>
> With below code, it is stuck
;dataFrame.foreach(println)
>}
>else
>{
> println("Empty DStream ")
>}*/
> })
>
> On Wed, Aug 10, 2016 at 2:35 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Take out the conditional and the sqlcontext and just do
>>
Take out the conditional and the sqlcontext and just do

rdd => {
  rdd.foreach(println)
}

as a baseline to see if you're reading the data you expect
On Tue, Aug 9, 2016 at 3:47 PM, Diwakar Dhanuskodi
wrote:
> Hi,
>
> I am reading json messages from kafka . Topics
The Kafka 0.10 support in spark 2.0 allows for pattern based topic
subscription
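A minimal sketch with the 0.10 integration (assuming an existing
StreamingContext named ssc; broker, group id, and pattern are
placeholders):

import java.util.regex.Pattern
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "group.id" -> "example",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer]
)

// topics matching the pattern are picked up as they appear,
// without restarting the streaming context
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](
    Pattern.compile("events-.*"), kafkaParams))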
On Aug 8, 2016 1:12 AM, "r7raul1...@163.com" wrote:
> How to add new topic to kafka without requiring restart of the streaming
> context?
>
> --
> r7raul1...@163.com
>
[
https://issues.apache.org/jira/browse/SPARK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410623#comment-15410623
]
Cody Koeninger commented on SPARK-16917:
I think the doc changes I submitted make it pretty clear
Are you using KafkaUtils.createDirectStream?
On Wed, Aug 3, 2016 at 9:42 AM, Soumitra Johri
wrote:
> Hi,
>
> I am running a streaming job with 4 executors and 16 cores so that each
> executor has two cores to work with. The input Kafka topic has 4 partitions.
> With
MatrixFactorizationModel is serializable. Instantiate it on the
driver, not on the executors.
On Wed, Aug 3, 2016 at 2:01 AM, wrote:
> hello guys:
> I have an app which consumes json messages from kafka and recommend
> movies for the users in those messages ,the
leq...@gmail.com> wrote:
> How to do that? if I put the queue inside .transform operation, it doesn't
> work.
>
> On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Can you keep a queue per executor in memory?
>>
>> On Mon
roblem in detail here:
> https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
>
> Could you please give me some suggestions or advice to fix this problem?
>
> Thanks
>
> On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
[
https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401208#comment-15401208
]
Cody Koeninger commented on SPARK-16534:
It's on the PR. Yes, one comitter veto is generally
[
https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401128#comment-15401128
]
Cody Koeninger commented on SPARK-16534:
This idea got a -1 from Reynold, so unless anyone's
Yeah, and the without-hadoop package was even more confusing... because if you
weren't using hdfs at all, you still needed to download one of the
hadoop-x packages in order to get hadoop io classes used by almost
everything. :)
On Fri, Jul 29, 2016 at 3:06 PM, Marcelo Vanzin wrote:
Most stream systems you're still going to incur the cost of reading
each message... I suppose you could rotate among reading just the
latest messages from a single partition of a Kafka topic if they were
evenly balanced.
But once you've read the messages, nothing's stopping you from
filtering
[
https://issues.apache.org/jira/browse/SPARK-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397520#comment-15397520
]
Cody Koeninger commented on SPARK-16746:
From conversation on mailing list, it wasn't cl
[
https://issues.apache.org/jira/browse/SPARK-16762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15397516#comment-15397516
]
Cody Koeninger commented on SPARK-16762:
Couple things
- probably better to bring this kind
>
> clickDF = cDF.filter(cDF['request.clientIP'].isNotNull())
>
> It fails for some cases and errors our with below message
>
> AnalysisException: u'No such struct field clientIP in cookies, nscClientIP1,
> nscClientIP2, uAgent;'
>
>
> On Tue, Jul 26, 2016 at 12:05 PM,
Have you tried filtering out corrupt records with something along the lines of
df.filter(df("_corrupt_record").isNull)
On Tue, Jul 26, 2016 at 1:53 PM, vr spark wrote:
> i am reading data from kafka using spark streaming.
>
> I am reading json and creating dataframe.
> I
Can you go ahead and open a Jira ticket with that explanation?
Is there a reason you need to use receivers instead of the direct stream?
On Tue, Jul 26, 2016 at 4:45 AM, Andy Zhao wrote:
> Hi guys,
>
> I wrote a spark streaming program which consumes 1000 messages from
This seems really low risk to me. In order to be impacted, it'd have
to be someone who was using the kafka integration in spark 2.0, which
isn't even officially released yet.
On Mon, Jul 25, 2016 at 7:23 PM, Vahid S Hashemian
wrote:
> Sorry, meant to ask if any Apache
For 2.0, the kafka dstream support is in two separate subprojects
depending on which version of Kafka you are using
spark-streaming-kafka-0-10
or
spark-streaming-kafka-0-8
corresponding to brokers that are version 0.10+ or 0.8+
On Mon, Jul 25, 2016 at 12:29 PM, Reynold Xin
s example.
> Looks simple but it's not very obvious how it works :-)
> I'll watch out for the docs and ScalaDoc.
>
> Srikanth
>
> On Fri, Jul 22, 2016 at 2:15 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> No, restarting from a checkpoint won't do it, you
-0.10
On Fri, Jul 22, 2016 at 1:05 PM, Srikanth <srikanth...@gmail.com> wrote:
> In Spark 1.x, if we restart from a checkpoint, will it read from new
> partitions?
>
> If you can, pls point us to some doc/link that talks about Kafka 0.10 integ
> in Spark 2.0.
>
> On Fri, Ju
For the integration for kafka 0.8, you are literally starting a
streaming job against a fixed set of topic-partitions. It will not
change throughout the job, so you'll need to restart the spark job if
you change kafka partitions.
For the integration for kafka 0.10 / spark 2.0, if you use
[
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389797#comment-15389797
]
Cody Koeninger commented on SPARK-12177:
This has already been merged for the upcoming Spark 2.0
the data per key sorted with timestamp, I will always
> get the latest 4 ts data on get(key). Spark streaming will get the ID from
> Kafka, then read the data from HBASE using get(ID). This will eliminate
> usage of Windowing from Spark-Streaming . Is it good to use ?
>
> Regards,
Unless you're using only 1 partition per topic, there's no reasonable
way of doing this. Offsets for one topicpartition do not necessarily
have anything to do with offsets for another topicpartition. You
could do the last (200 / number of partitions) messages per
topicpartition, but you have no
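If you did want the last N messages per topicpartition, a sketch using
the 0.8 batch api (assuming you've already looked up the latest offset
for each partition; the values here are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val n = 100L
val latest0 = 5000L // placeholder: latest offset of partition 0
val latest1 = 4800L // placeholder: latest offset of partition 1
val offsetRanges = Array(
  OffsetRange("mytopic", 0, latest0 - n, latest0),
  OffsetRange("mytopic", 1, latest1 - n, latest1)
)
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)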
Yes, if you need more parallelism, you need to either add more kafka
partitions or shuffle in spark.
Do you actually need the dataframe api, or are you just using it as a
way to infer the json schema? Inferring the schema is going to
require reading through the RDD once before doing any other
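If the dataframe api is only there for schema inference, supplying the
schema up front avoids that extra pass; a sketch (assuming spark 2.0's
SparkSession named spark, an RDD[String] of json named jsonRdd, and
placeholder fields):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)
))
val df = spark.read.schema(schema).json(jsonRdd)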
The bottom line short answer for this is that if you actually care
about data integrity, you need to store your offsets transactionally
alongside your results in the same data store.
If you're ok with double-counting in the event of failures, saving
offsets _after_ saving your results, using
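A sketch of the transactional pattern (assuming a 0.8 direct stream
named stream whose elements are key/value string pairs, and
hypothetical results / offsets tables in a relational store):

import java.sql.DriverManager
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val o = offsetRanges(TaskContext.get.partitionId)
    val conn = DriverManager.getConnection("jdbc:postgresql://db/example")
    conn.setAutoCommit(false) // results and offsets commit together
    try {
      val ins = conn.prepareStatement("insert into results (value) values (?)")
      iter.foreach { case (_, v) => ins.setString(1, v); ins.executeUpdate() }
      val up = conn.prepareStatement(
        "update offsets set until_offset = ? where topic = ? and part = ?")
      up.setLong(1, o.untilOffset)
      up.setString(2, o.topic)
      up.setInt(3, o.partition)
      up.executeUpdate()
      conn.commit() // either both results and offsets are stored, or neither
    } finally {
      conn.close()
    }
  }
}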
We've been running direct stream jobs in production for over a year,
with uptimes in the range of months.
I'm pretty slammed with work right now, but when I get time to submit
a PR for the 0.10 docs I'll remove the experimental note from 0.8
On Mon, Jul 11, 2016 at 4:35 PM, Tathagata Das
[
https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15378240#comment-15378240
]
Cody Koeninger commented on SPARK-16534:
[~jerryshao] if you want to work on this, go
Maybe obvious, but what happens when you change the s3 write to a
println of all the data? That should identify whether it's the issue.
count() and read.json() will involve additional tasks (run through the
items in the rdd to count them, likewise to infer the schema) but for
300 records that
Yeah, it's a reasonable lowest common denominator between java and scala,
and what's passed to that convenience constructor is actually what's used
to construct the class.
FWIW, in the 0.10 direct stream api when there's unavoidable wrapping /
conversion anyway (since the underlying class takes a
Just as an offhand guess, are you doing something like
updateStateByKey without expiring old keys?
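For example, a sketch of expiring idle keys (assuming a
DStream[(String, Long)] named pairs; the one hour threshold is
arbitrary):

// state holds (runningSum, lastSeenMillis); returning None drops the
// key, which is what keeps state from growing without bound
val counts = pairs.updateStateByKey[(Long, Long)] {
  (values: Seq[Long], state: Option[(Long, Long)]) =>
    val now = System.currentTimeMillis()
    val (sum, lastSeen) = state.getOrElse((0L, now))
    if (values.nonEmpty) Some((sum + values.sum, now))
    else if (now - lastSeen > 3600 * 1000L) None // expire idle keys
    else Some((sum, lastSeen))
}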
On Fri, Jul 8, 2016 at 2:44 AM, Jörn Franke wrote:
> Memory fragmentation? Quiet common with in-memory systems.
>
>> On 08 Jul 2016, at 08:56, aasish.kumar
In any case thanks, now I understand how to use Spark.
>
> PS: I will continue working with Spark, but to minimize the email stream I plan to
> unsubscribe from this mail list
>
> 2016-07-06 18:55 GMT+02:00 Cody Koeninger <c...@koeninger.org>:
>>
>> If you aren't proce
s ok. If load peaks exceed the capacity of the
> computational system, or the calculation time depends on the data, Spark is
> not able to provide a periodically stable output. Sometimes this is
> acceptable and sometimes it is not.
>
> 2016-0
, it works but throughput is much lower than without limitations,
> because this is an absolute upper limit. And the processing time is half
> of what is available.
>
> Regarding Spark 2.0 structured streaming I will look it some later. Now I
> don't know how to strictly measure th
t; that Flink will strictly terminate processing of messages by time. Deviation
> of the time window from 10 seconds to several minutes is impossible.
>
> PS: I prepared this example to make it easy to observe the problem and
> fix it if it is a bug. For me it is obvious. May I ask y
g speed in Spark's app is near the speed of data generation, all
> is ok.
> I added delayFactor in
> https://github.com/rssdev10/spark-kafka-streaming/blob/master/src/main/java/SparkStreamingConsumer.java
> to emulate slow processing. And streaming process is in degradation. When
>
I know some usages of the 0.10 kafka connector will be broken until
https://github.com/apache/spark/pull/14026 is merged, but the 0.10
connector is a new feature, so not blocking.
Sean I'm assuming the DirectKafkaStreamSuite failure you saw was for
0.8? I'll take another look at it.
On Wed,
I don't think that's a scala compiler bug.
println is a valid expression that returns Unit.
Unit is not a single-argument function, and does not match any of the
overloads of foreachPartition
You may be used to a conversion taking place when println is passed to
method expecting a function, but
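A minimal illustration (assuming an RDD[String] named rdd):

// foreachPartition expects an Iterator[String] => Unit function,
// so wrap the println in an explicit lambda over the iterator
rdd.foreachPartition(it => it.foreach(println))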
. But in any case I need the first response after 10 seconds, not minutes
> or hours after.
>
> Thanks.
>
>
>
> 2016-07-05 17:12 GMT+02:00 Cody Koeninger <c...@koeninger.org>:
>>
>> If you're talking about limiting the number of messages per batch to
>
If you're talking about limiting the number of messages per batch to
try and keep from exceeding batch time, see
http://spark.apache.org/docs/latest/configuration.html
look for backpressure and maxRatePerPartition (see the sketch below)
But if you're only seeing zeros after your job runs for a minute, it
sounds like
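For reference, a minimal sketch of those two settings (the rate value
is a placeholder):

val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000") // messages/sec per partition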
If it's a batch job, don't use a stream.
You have to store the offsets reliably somewhere regardless. So it sounds
like your only issue is with identifying offsets per partition? Look at
KafkaCluster.scala, methods getEarliestLeaderOffsets /
getLatestLeaderOffsets.
On Tue, Jul 5, 2016 at 7:40
Cody Koeninger created SPARK-16359:
--
Summary: unidoc workaround for multiple kafka versions
Key: SPARK-16359
URL: https://issues.apache.org/jira/browse/SPARK-16359
Project: Spark
Issue Type
e're on bare metal.
>
> the test launch code executes this for each build:
> # Generate random port for Zinc
> export ZINC_PORT
> ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
>
> On Fri, Jul 1, 2016 at 6:02 AM, Cody Koeninger &l
Can someone familiar with amplab's jenkins setup clarify whether all tests
running at a given time are competing for network ports, or whether there's
some sort of containerization being done?
Based on the use of Utils.startServiceOnPort in the tests, I'd assume the
former.
Cody Koeninger created SPARK-16312:
--
Summary: Docs for Kafka 0.10 consumer integration
Key: SPARK-16312
URL: https://issues.apache.org/jira/browse/SPARK-16312
Project: Spark
Issue Type: Sub
Cody Koeninger created SPARK-16212:
--
Summary: code cleanup of kafka-0-8 to match review feedback on 0-10
Key: SPARK-16212
URL: https://issues.apache.org/jira/browse/SPARK-16212
Project: Spark
ile an issue?
>>
>> On Tue, Jun 21, 2016 at 9:04 PM, Colin Kincaid Williams <disc...@uw.edu>
>> wrote:
>>> Thanks @Cody, I will try that out. In the interm, I tried to validate
>>> my Hbase cluster by running a random write test and see 30-40K writes
>>>
That looks like a classpath problem. You should not have to include
the kafka_2.10 artifact in your pom, spark-streaming-kafka_2.10
already has a transitive dependency on it. That being said, 0.8.2.1
is the correct version, so that's a little strange.
How are you building and submitting your
On Wed, Jun 22, 2016 at 7:46 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> As far as I know the only thing blocking it at this point is lack of
>> committer review / approval.
>>
>> It's technically adding a new feature after spark code-freeze, but it