Hi all, just wanted to give a heads up that we're seeing a reproducible
deadlock with Spark 1.0.1 on Hadoop 2.3.0-mr1-cdh5.0.2
If jira is a better place for this, apologies in advance - figured talking
about it on the mailing list was friendlier than randomly (re)opening jira
tickets.
I know Gary had
I'm going to be on a plane wed 23, return flight monday 28, so will miss
daily call those days. I'll be pushing forward on projects as I can, but
skype availability may be limited, so email if you need something from me.
Wendell pwend...@gmail.com
wrote:
Cody - did you mean to send this to the spark dev list?
On Tue, Jul 15, 2014 at 7:15 AM, Cody Koeninger
cody.koenin...@mediacrossing.com wrote:
I'm going to be on a plane wed 23, return flight monday 28, so will miss
daily call those days. I'll be pushing
We tested that patch from aarondav's branch, and are no longer seeing that
deadlock. Seems to have solved the problem, at least for us.
On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell pwend...@gmail.com wrote:
Andrew and Gary,
Would you guys be able to test
We were previously using SPARK_JAVA_OPTS to set java system properties via
-D.
This was used for properties that varied on a per-deployment-environment
basis, but needed to be available in the spark shell and workers.
On upgrading to 1.0, we saw that SPARK_JAVA_OPTS had been deprecated, and
options... the current code should work fine in cluster mode
though, since the driver is a different process. :-)
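For comparison, a minimal sketch of the two styles; the property name
my.deploy.env is made up for illustration:
// pre-1.0 style: a JVM system property pushed in through the environment:
//   export SPARK_JAVA_OPTS="-Dmy.deploy.env=staging"
// 1.0+ style: the same -D flag via the extraJavaOptions settings
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-Dmy.deploy.env=staging")
  .set("spark.executor.extraJavaOptions", "-Dmy.deploy.env=staging")
The same two settings could equally go in conf/spark-defaults.conf on a
per-deployment basis.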
On Wed, Jul 30, 2014 at 1:12 PM, Cody Koeninger c...@koeninger.org
wrote:
We were previously using SPARK_JAVA_OPTS to set java system properties
via
-D.
This was used
.mxstg,null), (dn-02.mxstg,null),
(dn-02.mxstg,null), ...
Note that this is a mesos deployment, although I wouldn't expect that to
affect the availability of spark.driver.extraJavaOptions in a local spark
shell.
On Wed, Jul 30, 2014 at 4:18 PM, Cody Koeninger c...@koeninger.org wrote:
Either
(this might be because of the
other issues, or a separate bug).
- Patrick
On Wed, Jul 30, 2014 at 3:10 PM, Cody Koeninger c...@koeninger.org
wrote:
In addition, spark.executor.extraJavaOptions does not seem to behave as
I
would expect; java arguments don't seem to be propagated
The stmt.isClosed just looks like stupidity on my part, no secret
motivation :) Thanks for noticing it.
As for the leaking in the case of malformed statements, isn't that
addressed by
context.addOnCompleteCallback { () => closeIfNeeded() }
or am I misunderstanding?
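For reference, a sketch of the cleanup pattern being referred to, assuming
conn is the JDBC connection opened inside compute():
import java.sql.Connection
import org.apache.spark.TaskContext

def registerCleanup(context: TaskContext, conn: Connection): Unit = {
  val stmt = conn.createStatement()
  // runs when the task completes, whether it succeeded or failed
  context.addOnCompleteCallback { () =>
    if (!stmt.isClosed) stmt.close()
    if (!conn.isClosed) conn.close()
  }
}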
On Tue, Aug 5, 2014 at 3:15
Just wanted to check in on this, see if I should file a bug report
regarding the mesos argument propagation.
On Thu, Jul 31, 2014 at 8:35 AM, Cody Koeninger c...@koeninger.org wrote:
1. I've tried with and without escaping the equals sign; it doesn't affect the
results.
2. Yeah, exporting
I've been looking at performance differences between spark sql queries
against single parquet tables, vs a unionAll of two tables. It's a
significant difference, like 5 to 10x
Is there a reason in general not to push projections and predicates down
into the individual ParquetTableScans in a
Opened
https://issues.apache.org/jira/browse/SPARK-3462
I'll take a look at ColumnPruning and see what I can do
On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust mich...@databricks.com
wrote:
On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger c...@koeninger.org
wrote:
Is there a reason
by day/week in the HDFS directory structure?
On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust mich...@databricks.com
wrote:
Thanks!
On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger c...@koeninger.org
wrote:
Opened
https://issues.apache.org/jira/browse/SPARK-3462
I'll take a look
, 2014 at 12:01 PM, Cody Koeninger c...@koeninger.org
wrote:
Maybe I'm missing something, I thought parquet was generally a
write-once
format and the sqlContext interface to it seems that way as well.
d1.saveAsParquetFile("/foo/d1")
// another day, another table, with same schema
d2
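Spelled out as a sketch (Spark 1.1-era SchemaRDD API; assuming d1 and d2 are
SchemaRDDs with the same schema, and that the paths and query are illustrative):
import org.apache.spark.sql.SQLContext

// assuming an existing SparkContext sc
val sqlContext = new SQLContext(sc)

// write each day's data once
d1.saveAsParquetFile("/foo/d1")
d2.saveAsParquetFile("/foo/d2")

// query across both days via unionAll
val both = sqlContext.parquetFile("/foo/d1")
  .unionAll(sqlContext.parquetFile("/foo/d2"))
both.registerTempTable("both_days")
sqlContext.sql("SELECT COUNT(*) FROM both_days WHERE some_column = 'x'")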
performance against some actual data sets.
On Tue, Sep 9, 2014 at 6:09 PM, Cody Koeninger c...@koeninger.org wrote:
Ok, so looking at the optimizer code for the first time and trying the
simplest rule that could possibly work,
object UnionPushdown extends Rule[LogicalPlan] {
def apply(plan
it. I'll see
about testing performance against some actual data sets.
On Tue, Sep 9, 2014 at 6:09 PM, Cody Koeninger c...@koeninger.org wrote:
Ok, so looking at the optimizer code for the first time and trying the
simplest rule that could possibly work,
object UnionPushdown extends Rule
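Filled out, the kind of rule being described looks roughly like this (Spark
1.1-era Catalyst; this sketch ignores rebinding the pushed expressions to the
right child's output attributes, which a correct rule has to handle):
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project, Union}
import org.apache.spark.sql.catalyst.rules.Rule

object UnionPushdown extends Rule[LogicalPlan] {
  // push projections and filters below a Union, so each side's
  // ParquetTableScan can do column pruning / predicate pushdown
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Project(projectList, Union(left, right)) =>
      Union(Project(projectList, left), Project(projectList, right))
    case Filter(condition, Union(left, right)) =>
      Union(Filter(condition, left), Filter(condition, right))
  }
}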
, Sep 10, 2014 at 9:31 AM, Cody Koeninger c...@koeninger.org
wrote:
Tested the patch against a cluster with some real data. Initial results
seem like going from one table to a union of 2 tables is now closer to a
doubling of query time as expected, instead of 5 to 10x.
Let me know if you see
I noticed that the release notes for 1.1.0 said that spark doesn't support
Hive buckets yet. I didn't notice any jira issues related to adding
support.
Broadly speaking, what would be involved in supporting buckets, especially
the bucketmapjoin and sortedmerge optimizations?
After the recent spark project changes to guava shading, I'm seeing issues
with the datastax spark cassandra connector (which depends on guava 15.0)
and the datastax cql driver (which depends on guava 16.0.1)
Building an assembly for a job (with spark marked as provided) that
includes either
file.
On Mon, Sep 22, 2014 at 10:54 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:
Thanks for the heads up Cody. Any indication of what was going wrong?
On Mon, Sep 22, 2014 at 7:16 AM, Cody Koeninger c...@koeninger.org
wrote:
Just as a heads up, we deployed 471e6a3a of master (in order
.
On Fri, Sep 19, 2014 at 10:30 PM, Cody Koeninger c...@koeninger.org
wrote:
After the recent spark project changes to guava shading, I'm seeing
issues
with the datastax spark cassandra connector (which depends on guava 15.0)
and the datastax cql driver (which depends on guava 16.0.1
file
from the Spark assembly you're using.
On Mon, Sep 22, 2014 at 12:46 PM, Cody Koeninger c...@koeninger.org
wrote:
We're using Mesos, is there a reasonable expectation that
spark.files.userClassPathFirst will actually work?
On Mon, Sep 22, 2014 at 1:42 PM, Marcelo Vanzin van
After commit 8856c3d8 switched from gzip to snappy as the default parquet
compression codec, I'm seeing the following when trying to read parquet
files saved using the new default (same schema and roughly same size as
files that were previously working):
java.lang.OutOfMemoryError: Direct buffer
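One possible stopgap while investigating (assuming an existing SQLContext) is
switching writes back to the previous default codec:
// valid values include uncompressed, snappy, gzip, lzo
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")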
3 quick questions, then some background:
1. Is there a reason not to document the fact that spark.hadoop.* is
copied from spark config into hadoop config?
2. Is there a reason StreamingContext.getOrCreate defaults to a blank
hadoop configuration rather than
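For concreteness, a sketch of passing the hadoop configuration to getOrCreate
explicitly (the checkpoint path and fs.defaultFS value are illustrative):
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/example-checkpoint"
val hadoopConf = new Configuration()
hadoopConf.set("fs.defaultFS", "hdfs://namenode:8020")   // per-deployment setting

def createContext(): StreamingContext =
  new StreamingContext(new SparkConf(), Seconds(30))

// the third argument is the hadoop Configuration used to read the checkpoint;
// if omitted it defaults to a blank new Configuration()
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _, hadoopConf)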
Opened
https://issues.apache.org/jira/browse/SPARK-4229
Sent a PR
https://github.com/apache/spark/pull/3102
On Tue, Nov 4, 2014 at 11:48 AM, Marcelo Vanzin van...@cloudera.com wrote:
On Tue, Nov 4, 2014 at 9:34 AM, Cody Koeninger c...@koeninger.org wrote:
2. Is there a reason
My 2 cents:
Spark since pre-Apache days has been the most friendly and welcoming open
source project I've seen, and that's reflected in its success.
It seems pretty obvious to me that, for example, Michael should be looking
at major changes to the SQL codebase. I trust him to do that in a way
I'm wondering why
https://issues.apache.org/jira/browse/SPARK-3638
only updated the version of http client for the kinesis-asl profile and
left the base dependencies unchanged.
Spark built without that profile still has the same
java.lang.NoSuchMethodError:
For an alternative take on a similar idea, see
https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka
An advantage of the approach I'm taking is that the lower and upper offsets
of the RDD are known in advance, so it's deterministic.
I
Now that 1.2 is finalized... who are the go-to people to get some
long-standing Kafka related issues resolved?
The existing api is not sufficiently safe nor flexible for our production
use. I don't think we're alone in this viewpoint, because I've seen
several different patches and libraries to
18, 2014 at 7:07 AM, Cody Koeninger c...@koeninger.org
wrote:
Now that 1.2 is finalized... who are the go-to people to get some
long-standing Kafka related issues resolved?
The existing api is not sufficiently safe nor flexible for our
production
use. I don't think we're alone
to guarantee - though I really would
love to have that!
Thanks,
Hari
On Thu, Dec 18, 2014 at 12:26 PM, Cody Koeninger c...@koeninger.org
wrote:
Thanks for the replies.
Regarding skipping WAL, it's not just about optimization. If you
actually want exactly-once semantics, you need control of kafka
-
From: Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com]
Sent: Friday, December 19, 2014 5:57 AM
To: Cody Koeninger
Cc: Hari Shreedharan; Patrick Wendell; dev@spark.apache.org
Subject: Re: Which committers care about Kafka?
But idempotency is not that easy to achieve sometimes
.
Thanks,
Hari
On Fri, Dec 19, 2014 at 1:48 PM, Cody Koeninger c...@koeninger.org
wrote:
The problems you guys are discussing come from trying to store state in
spark, so don't do that. Spark isn't a distributed database.
Just map kafka partitions directly to rdds, let user code specify
Is there a reason not to go ahead and move the _cache and _lock files
created by Utils.fetchFiles into the work directory, so they can be cleaned
up more easily? I saw comments to that effect in the discussion of the PR
for 2713, but it doesn't look like it got done.
And no, I didn't just have a
...@tresata.com wrote:
yup, we at tresata do the idempotent store the same way. very simple
approach.
On Fri, Dec 19, 2014 at 5:32 PM, Cody Koeninger c...@koeninger.org
wrote:
That KafkaRDD code is dead simple.
Given a user specified map
(topic1, partition0) -> (startingOffset, endingOffset
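Roughly the shape this eventually took in Spark 1.3's KafkaUtils.createRDD
(topic, partition and offset values below are made up):
import kafka.serializer.StringDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

def kafkaBatch(sc: SparkContext, kafkaParams: Map[String, String]) = {
  // one explicit (topic, partition) -> (fromOffset, untilOffset) range per partition
  val offsetRanges = Array(
    OffsetRange("topic1", 0, 100L, 200L),
    OffsetRange("topic1", 1, 100L, 200L))
  KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
    sc, kafkaParams, offsetRanges)
}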
hshreedha...@cloudera.com
wrote:
In general such discussions happen or is posted on the dev lists. Could
you please post a summary? Thanks.
Thanks,
Hari
On Wed, Dec 24, 2014 at 11:46 PM, Cody Koeninger c...@koeninger.org
wrote:
After a long talk with Patrick and TD (thanks guys), I
The spark context reference is transient.
On Fri, Dec 26, 2014 at 6:11 PM, Alessandro Baretta alexbare...@gmail.com
wrote:
How, O how can this be? Doesn't the SQLContext hold a reference to the
SparkContext?
Alex
://github.com/apache/spark/pull/3798 . I
am reviewing it. Please follow the PR if you are interested.
TD
On Wed, Dec 24, 2014 at 11:59 PM, Cody Koeninger c...@koeninger.org
wrote:
The conversation was mostly getting TD up to speed on this thread since
he
had just gotten back from his trip
will further delay the ongoing job, and
finally lead to failure.
Thanks
Jerry
*From:* Cody Koeninger [mailto:c...@koeninger.org]
*Sent:* Tuesday, December 30, 2014 6:50 AM
*To:* Tathagata Das
*Cc:* Hari Shreedharan; Shao, Saisai; Sean McNamara; Patrick Wendell;
Luis Ángel Vicente Sánchez
It looks like taskContext.attemptId doesn't mean what one thinks it might
mean, based on
http://apache-spark-developers-list.1001551.n3.nabble.com/Get-attempt-number-in-a-closure-td8853.html
and the unresolved
https://issues.apache.org/jira/browse/SPARK-4014
Is there any alternative way to
this be an issue??
If you guys want to have a look at the code I've just uploaded it to my
github account: big-brother https://github.com/ardlema/big-brother (see
DirectKafkaWordCountTest.scala).
Thank you again!!
2015-03-19 22:13 GMT+01:00 Cody Koeninger c...@koeninger.org:
What
is working fine!
Do you think that I should open an issue to warn that the kafka topic must
contain at least one message before the DirectStream creation?
Thank you very much! You've just made my day ;)
2015-03-19 23:08 GMT+01:00 Cody Koeninger c...@koeninger.org:
Yeah, I wouldn't be shocked
So when building 1.3.0-rc1 I see the following warning:
[WARNING] spark-streaming-kafka_2.10-1.3.0.jar, unused-1.0.0.jar define 1
overlappping classes:
[WARNING] - org.apache.spark.unused.UnusedStubClass
and when trying to build an assembly of a project that was previously using
1.3
(otherwise, some work we do that requires the shade
plugin does not happen). However, now there are other things there. If
you just comment out the line in the root pom.xml adding this
dependency, does it work?
- Patrick
On Wed, Feb 25, 2015 at 7:53 AM, Cody Koeninger c
My 2 cents - I'd rather see design docs in github pull requests (using
plain text / markdown). That doesn't require changing access or adding
people, and github PRs already allow for conversation / email notifications.
Conversation is already split between jira and github PRs. Having a third
Why can't pull requests be used for design docs in Git if people who aren't
committers want to contribute changes (as opposed to just comments)?
On Fri, Apr 24, 2015 at 2:57 PM, Sean Owen so...@cloudera.com wrote:
Only catch there is it requires commit access to the repo. We need a
way for
What's your schema for the offset table, and what's the definition of
writeOffset ?
What key are you reducing on? Maybe I'm misreading the code, but it looks
like the per-partition offset is part of the key. If that's true then you
could just do your reduction on each partition, rather than
In fact, you're using the 2 arg form of reduce by key to shrink it down to
1 partition
reduceByKey(sumFunc, 1)
But you started with 4 kafka partitions? So they're definitely no longer
1:1
On Thu, Apr 30, 2015 at 1:58 PM, Cody Koeninger c...@koeninger.org wrote:
This is what I'm suggesting
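One way the per-partition approach can look (a sketch against the Spark 1.3
direct stream; stream, the summing logic and the store are illustrative):
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // the direct stream's RDD partitions are 1:1 with kafka partitions
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val osr = offsetRanges(TaskContext.get.partitionId)
    val partitionSum = iter.map(_._2.toLong).sum   // reduce within this kafka partition
    // store partitionSum together with osr.untilOffset in a single idempotent or
    // transactional write, keyed by (osr.topic, osr.partition)
  }
}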
having. Thank you kindly, Mr.
Koeninger.
On Thu, Apr 30, 2015 at 3:06 PM, Cody Koeninger c...@koeninger.org
wrote:
In fact, you're using the 2 arg form of reduce by key to shrink it down
to 1 partition
reduceByKey(sumFunc, 1)
But you started with 4 kafka partitions? So they're definitely
Since that's an internal class used only for unit testing, what would the
benefit be?
On Tue, May 5, 2015 at 3:19 PM, BenFradet benjamin.fra...@gmail.com wrote:
Hi,
Since we're now supporting Kafka 0.8.2.1
https://github.com/apache/spark/pull/4537 , and that there is a new
Producer API
Regarding performance, keep in mind we'd probably have to turn all those
async calls into blocking calls for the unit tests
On Tue, May 5, 2015 at 3:44 PM, BenFradet benjamin.fra...@gmail.com wrote:
Even if it's only used for testing and the examples, why not move ahead of
the deprecation and
to the others in the RDD, the executor gets done
with it and gets scheduled another one to work on. With long running
receivers spark acts like the receiver takes up a core even if it isn't
doing much. Look at the CPU graph on slide 13 of the link i posted.
On Thu, May 14, 2015 at 4:21 PM, Cody
The scala api has 2 ways of calling createDirectStream. One of them allows
you to pass a message handler that gets full access to the kafka
MessageAndMetadata, including offset.
I don't know why the python api was developed with only one way to call
createDirectStream, but the first thing I'd
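For reference, a sketch of the scala message-handler variant (ssc, kafkaParams
and fromOffsets are assumed to exist already):
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def withOffsets(ssc: StreamingContext,
                kafkaParams: Map[String, String],
                fromOffsets: Map[TopicAndPartition, Long]) =
  KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (Long, String)](
    ssc, kafkaParams, fromOffsets,
    // the message handler sees the full MessageAndMetadata, including offset
    (mmd: MessageAndMetadata[String, String]) => (mmd.offset, mmd.message))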
There are already private methods in the code for interacting with Kafka's
offset management api.
There's a jira for making those methods public, but TD has been reluctant
to merge it
https://issues.apache.org/jira/browse/SPARK-10963
I think adding any ZK specific behavior to spark is a bad
I wrote some code for this a while back, pretty sure it didn't need access
to anything private in the decision tree / random forest model. If people
want it added to the api I can put together a PR.
I think it's important to have separately parseable operators / operands
though. E.g
This is a user list question not a dev list question.
Looks like your driver is having trouble communicating to the kafka
brokers. Make sure the broker host and port is available from the driver
host (using nc or telnet); make sure that you're providing the _broker_
host and port to
rs is questionable.
> <<
>
> I agree and i was more thinking maybe there is a way to support both for a
> period of time (of course means some more code to maintain :-)).
>
>
> thanks
> Mario
>
Honestly my feeling on any new API is to wait for a point release before
taking it seriously :)
Auth and encryption seem like the only compelling reason to move, but
forcing people on Kafka 0.8.x to upgrade their brokers is questionable.
On Thu, Dec 3, 2015 at 11:30 AM, Mario Ds Briggs
Have you seen
SPARK-12177
On Wed, Dec 23, 2015 at 3:27 PM, eugene miretsky
wrote:
> Hi,
>
> The Kafka connector currently uses the older Kafka Scala consumer. Kafka
> 0.9 came out with a new Java Kafka consumer.
>
> One of the main differences is that the Scala
I don't have a vote, but I'd just like to reiterate that I think kafka
0.10 support should be added to a 2.0 release candidate; if not now,
then well before release.
- it's a completely standalone jar, so shouldn't break anyone who's
using the existing 0.8 support
- it's like the 5th highest
, Sean Owen <so...@cloudera.com> wrote:
> I profess ignorance again though I really should know by now, but,
> what's opposing that? I personally thought this was going to be in 2.0
> and didn't kind of notice it wasn't ...
>
> On Wed, Jun 22, 2016 at 3:29 PM, Cody Koeninger <c
On Wed, Jun 22, 2016 at 7:46 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> As far as I know the only thing blocking it at this point is lack of
>> committer review / approval.
>>
>> It's technically adding a new feature after spark code-freeze, but it
Any word on Kafka 0.10 support / SPARK-12177
I understand the hesitation, but is having nothing better than having
a standalone subproject marked as experimental?
On Wed, Jun 15, 2016 at 2:01 PM, Reynold Xin wrote:
> It's been a while and we have accumulated quite a few bug
essed as Rows, everything uses
>> > DataFrames. Type classes used in a Dataset are internally converted to
>> > rows for processing. Therefore internally DataFrame is like the "main"
>> > type that is used.
>> >
>
e is like "main" type that is
> used.
>
> On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Sorry, meant DataFrame vs Dataset
>>
>> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <c...@koe
Sorry, meant DataFrame vs Dataset
On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <c...@koeninger.org> wrote:
> Is there a principled reason why sql.streaming.* and
> sql.execution.streaming.* are making extensive use of DataFrame
> instead of Datasource?
>
> Or is that just
I saw this slide:
http://image.slidesharecdn.com/east2016v2matei-160217154412/95/2016-spark-summit-east-keynote-matei-zaharia-5-638.jpg?cb=1455724433
Didn't see the talk - was this just referring to the existing work on the
spark-streaming-kafka subproject, or is someone actually working on
Spark streaming by default will not start processing a batch until the
current batch is finished. So if your processing time is larger than
your batch time, delays will build up.
On Wed, Mar 9, 2016 at 11:09 AM, Sachin Aggarwal
wrote:
> Hi All,
>
> we have batchTime
The central problem with doing anything like this is that you break
one of the basic guarantees of kafka, which is in-order processing on
a per-topicpartition basis.
As far as PRs go, because of the new consumer interface for kafka 0.9
and 0.10, there's a lot of potential change already underway.
like when they use the existing DStream.repartition where original
> per-topicpartition in-order processing is also not observed any more.
>
> Do you agree?
>
> On Thu, Mar 10, 2016 at 12:12 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> The central problem wit
>
> On Wed, Mar 9, 2016 at 10:46 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Spark streaming by default will not start processing a batch until the
>> current batch is finished. So if your processing time is larger than
>> your batch time, delays will buil
e relevance to what
> you're talking about.
>
> Perhaps if both functions (the one with partitions arg and the one without)
> returned just ConsumerRecord, I would like that more.
>
> - Alan
>
> On Tue, Mar 8, 2016 at 6:49 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
arty external repositories - not owned by Apache.
>>
>> At a minimum, dev@ discussion (like this one) should be initiated.
>> As PMC is responsible for the project assets (including code), signoff
>> is required for it IMO.
>>
>> More experienced Apache members mi
Why would a PMC vote be necessary on every code deletion?
There was a Jira and pull request discussion about the submodules that
have been removed so far.
https://issues.apache.org/jira/browse/SPARK-13843
There's another ongoing one about Kafka specifically
I'm in favor of everything in /extras and /external being removed, but
I'm more in favor of making a decision and moving on.
On Tue, Mar 22, 2016 at 12:20 PM, Marcelo Vanzin wrote:
> +1 for getting flume back.
>
> On Tue, Mar 22, 2016 at 12:27 AM, Kostas Sakellis
minimum, dev@ discussion (like this one) should be initiated.
>> As PMC is responsible for the project assets (including code), signoff
>> is required for it IMO.
>>
>> More experienced Apache members might be opine better in case I got it
>> wrong
No, looks like you'd have to catch them in the serializer and have the
serializer return option or something. The new consumer builds a buffer
full of records, not one at a time.
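A sketch of the "return option" idea against the new consumer's Deserializer
interface (the class name is made up):
import java.util
import org.apache.kafka.common.serialization.Deserializer

class TolerantStringDeserializer extends Deserializer[Option[String]] {
  override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = {}
  // swallow malformed payloads instead of letting the consumer's fetch loop throw
  override def deserialize(topic: String, data: Array[Byte]): Option[String] =
    try Option(data).map(new String(_, "UTF-8"))
    catch { case _: Exception => None }
  override def close(): Unit = {}
}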
On Mar 8, 2016 4:43 AM, "Marius Soutier" <mps@gmail.com> wrote:
>
> > On 04.03.2016, at
I need getPreferredLocations to choose a consistent executor for a
given partition in a stream. In order to do that, I need to know what
the current executors are.
I'm currently grabbing them from the block manager master .getPeers(),
which works, but I don't know if that's the most reasonable
> What do you mean by consistent? Throughout the life cycle of an app, the
>> executors can come and go and as a result really has no consistency. Do you
>> just need it for a specific job?
>>
>>
>>
>> On Thu, Mar 3, 2016 at 3:08 PM, Cody Koeninger <c...@koeni
Wanted to survey what people are using the direct stream
messageHandler for, besides just extracting key / value / offset.
Would your use case still work if that argument was removed, and the
stream just contained ConsumerRecord objects
Jay, thanks for the response.
Regarding the new consumer API for 0.9, I've been reading through the code
for it and thinking about how it fits in to the existing Spark integration.
So far I've seen some interesting challenges, and if you (or anyone else on
the dev list) have time to provide some
I agree with Mark in that I don't see how supporting scala 2.10 for
spark 2.0 implies supporting it for all of spark 2.x
Regarding Koert's comment on akka, I thought all akka dependencies
have been removed from spark after SPARK-7997 and the recent removal
of external/akka
On Wed, Mar 30, 2016
I really think the only thing that should have to change is the maven
group and identifier, not the java namespace.
There are compatibility problems with the java namespace changing
(e.g. access to private[spark]), and I don't think that someone who
takes the time to change their build file to
Are you talking about group/identifier name, or contained classes?
Because there are plenty of org.apache.* classes distributed via maven
with non-apache group / identifiers.
On Fri, Mar 25, 2016 at 6:54 PM, David Nalley wrote:
>
>> As far as group / artifact name
If you want to refer back to Kafka based on offset ranges, why not use
createDirectStream?
On Fri, Apr 22, 2016 at 11:49 PM, Renyi Xiong wrote:
> Hi,
>
> Is it possible for Kafka receiver generated WriteAheadLogBackedBlockRDD to
> hold corresponded Kafka offset range so
Given that not all of the connectors were removed, I think this
creates a weird / confusing three tier system
1. connectors in the official project's spark/extras or spark/external
2. connectors in "Spark Extras"
3. connectors in some random organization's github
On Fri, Apr 15, 2016 at 11:18
100% agree with Sean & Reynold's comments on this.
Adding this as a TLP would just cause more confusion as to "official"
endorsement.
On Fri, Apr 15, 2016 at 11:50 AM, Sean Owen wrote:
> On Fri, Apr 15, 2016 at 5:34 PM, Luciano Resende wrote:
>> I
For what it's worth, I have definitely had PRs that sat inactive for
more than 30 days due to committers not having time to look at them,
but did eventually end up successfully being merged.
I guess if this just ends up being a committer ping and reopening the
PR, it's fine, but I don't know if
This seems really low risk to me. In order to be impacted, it'd have
to be someone who was using the kafka integration in spark 2.0, which
isn't even officially released yet.
On Mon, Jul 25, 2016 at 7:23 PM, Vahid S Hashemian
wrote:
> Sorry, meant to ask if any Apache
For 2.0, the kafka dstream support is in two separate subprojects
depending on which version of Kafka you are using
spark-streaming-kafka-0-10
or
spark-streaming-kafka-0-8
corresponding to brokers that are version 0.10+ or 0.8+
On Mon, Jul 25, 2016 at 12:29 PM, Reynold Xin
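The corresponding build coordinates (sbt shown; use the 0-8 artifact instead
for 0.8.x brokers):
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"
// or
// libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"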
Yeah, and the "without hadoop" package was even more confusing... because if you
weren't using hdfs at all, you still needed to download one of the
hadoop-x packages in order to get hadoop io classes used by almost
everything. :)
On Fri, Jul 29, 2016 at 3:06 PM, Marcelo Vanzin wrote:
roblem in detail here:
> https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
>
> Could you please give me some suggestions or advice to fix this problem?
>
> Thanks
>
> On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <c...@koeninger.org> wrote:
>&
leq...@gmail.com> wrote:
> How to do that? if I put the queue inside .transform operation, it doesn't
> work.
>
> On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Can you keep a queue per executor in memory?
>>
>> On Mon
Are you using KafkaUtils.createDirectStream?
On Wed, Aug 3, 2016 at 9:42 AM, Soumitra Johri
wrote:
> Hi,
>
> I am running a steaming job with 4 executors and 16 cores so that each
> executor has two cores to work with. The input Kafka topic has 4 partitions.
> With
I don't think that's a scala compiler bug.
println is a valid expression that returns unit.
Unit is not a single-argument function, and does not match any of the
overloads of foreachPartition
You may be used to a conversion taking place when println is passed to
method expecting a function, but
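A minimal illustration (the original snippet isn't shown here, so the failing
line is hypothetical):
val rdd = sc.parallelize(1 to 10)              // assuming an existing SparkContext sc
// rdd.foreachPartition(println("done"))       // won't compile: println("done") evaluates
//                                             // to (), not an Iterator[Int] => Unit
rdd.foreachPartition(iter => println(iter.size))   // an actual one-argument function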
I know some usages of the 0.10 kafka connector will be broken until
https://github.com/apache/spark/pull/14026 is merged, but the 0.10
connector is a new feature, so not blocking.
Sean I'm assuming the DirectKafkaStreamSuite failure you saw was for
0.8? I'll take another look at it.
On Wed,
The Kafka 0.10 support in spark 2.0 allows for pattern based topic
subscription
On Aug 8, 2016 1:12 AM, "r7raul1...@163.com" wrote:
> How to add new topic to kafka without requiring restart of the streaming
> context?
>
> --
> r7raul1...@163.com
>
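For example, with the 0-10 integration the subscription can be a regex, so
topics created later that match the pattern are picked up without a restart
(ssc is assumed; broker address, group id and pattern are illustrative):
import java.util.regex.Pattern
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  SubscribePattern[String, String](Pattern.compile("events-.*"), kafkaParams))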
Most stream systems you're still going to incur the cost of reading
each message... I suppose you could rotate among reading just the
latest messages from a single partition of a Kafka topic if they were
evenly balanced.
But once you've read the messages, nothing's stopping you from
filtering
Can someone familiar with amplab's jenkins setup clarify whether all tests
running at a given time are competing for network ports, or whether there's
some sort of containerization being done?
Based on the use of Utils.startServiceOnPort in the tests, I'd assume the
former.
e're on bare metal.
>
> the test launch code executes this for each build:
> # Generate random point for Zinc
> export ZINC_PORT
> ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
>
> On Fri, Jul 1, 2016 at 6:02 AM, Cody Koeninger &l