Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Cody Koeninger
Hi all, just wanted to give a heads up that we're seeing a reproducible deadlock with spark 1.0.1 on hadoop 2.3.0-mr1-cdh5.0.2. If jira is a better place for this, apologies in advance - figured talking about it on the mailing list was friendlier than randomly (re)opening jira tickets. I know Gary had

traveling next week

2014-07-15 Thread Cody Koeninger
I'm going to be on a plane wed 23, return flight monday 28, so will miss daily call those days. I'll be pushing forward on projects as I can, but skype availability may be limited, so email if you need something from me.

Re: traveling next week

2014-07-15 Thread Cody Koeninger
Wendell pwend...@gmail.com wrote: Cody - did you mean to send this to the spark dev list? On Tue, Jul 15, 2014 at 7:15 AM, Cody Koeninger cody.koenin...@mediacrossing.com wrote: I'm going to be on a plane wed 23, return flight monday 28, so will miss daily call those days. I'll be pushing

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-15 Thread Cody Koeninger
We tested that patch from aarondav's branch, and are no longer seeing that deadlock. Seems to have solved the problem, at least for us. On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell pwend...@gmail.com wrote: Andrew and Gary, Would you guys be able to test

replacement for SPARK_JAVA_OPTS

2014-07-30 Thread Cody Koeninger
We were previously using SPARK_JAVA_OPTS to set java system properties via -D. This was used for properties that varied on a per-deployment-environment basis, but needed to be available in the spark shell and workers. On upgrading to 1.0, we saw that SPARK_JAVA_OPTS had been deprecated, and
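
A minimal sketch of the replacement path discussed in this thread, assuming a per-deployment property named foo.bar (the name is a placeholder):

    import org.apache.spark.SparkConf

    // spark.{driver,executor}.extraJavaOptions replace SPARK_JAVA_OPTS for -D
    // system properties. Note the driver setting only takes effect when the
    // driver JVM is launched by spark-submit; an already-running local shell
    // won't see it, which is the behavior debated later in this thread.
    val conf = new SparkConf()
      .set("spark.driver.extraJavaOptions", "-Dfoo.bar=baz")
      .set("spark.executor.extraJavaOptions", "-Dfoo.bar=baz")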

Re: replacement for SPARK_JAVA_OPTS

2014-07-30 Thread Cody Koeninger
options... the current code should work fine in cluster mode though, since the driver is a different process. :-) On Wed, Jul 30, 2014 at 1:12 PM, Cody Koeninger c...@koeninger.org wrote: We were previously using SPARK_JAVA_OPTS to set java system properties via -D. This was used

Re: replacement for SPARK_JAVA_OPTS

2014-07-30 Thread Cody Koeninger
.mxstg,null), (dn-02.mxstg,null), (dn-02.mxstg,null), ... Note that this is a mesos deployment, although I wouldn't expect that to affect the availability of spark.driver.extraJavaOptions in a local spark shell. On Wed, Jul 30, 2014 at 4:18 PM, Cody Koeninger c...@koeninger.org wrote: Either

Re: replacement for SPARK_JAVA_OPTS

2014-07-31 Thread Cody Koeninger
(this might be because of the other issues, or a separate bug). - Patrick On Wed, Jul 30, 2014 at 3:10 PM, Cody Koeninger c...@koeninger.org wrote: In addition, spark.executor.extraJavaOptions does not seem to behave as I would expect; java arguments don't seem to be propagated

Re: Tiny curiosity question on closing the jdbc connection

2014-08-05 Thread Cody Koeninger
The stmt.isClosed just looks like stupidity on my part, no secret motivation :) Thanks for noticing it. As for the leaking in the case of malformed statements, isn't that addressed by context.addOnCompleteCallback{ () => closeIfNeeded() } or am I misunderstanding? On Tue, Aug 5, 2014 at 3:15

Re: replacement for SPARK_JAVA_OPTS

2014-08-07 Thread Cody Koeninger
Just wanted to check in on this, see if I should file a bug report regarding the mesos argument propagation. On Thu, Jul 31, 2014 at 8:35 AM, Cody Koeninger c...@koeninger.org wrote: 1. I've tried with and without escaping equals sign, it doesn't affect the results. 2. Yeah, exporting

parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Cody Koeninger
I've been looking at performance differences between spark sql queries against single parquet tables, vs a unionAll of two tables. It's a significant difference, like 5 to 10x. Is there a reason in general not to push projections and predicates down into the individual ParquetTableScans in a

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Cody Koeninger
Opened https://issues.apache.org/jira/browse/SPARK-3462 I'll take a look at ColumnPruning and see what I can do On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger c...@koeninger.org wrote: Is there a reason

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Cody Koeninger
by day/week in the HDFS directory structure? On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust mich...@databricks.com wrote: Thanks! On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger c...@koeninger.org wrote: Opened https://issues.apache.org/jira/browse/SPARK-3462 I'll take a look

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Cody Koeninger
, 2014 at 12:01 PM, Cody Koeninger c...@koeninger.org wrote: Maybe I'm missing something, I thought parquet was generally a write-once format and the sqlContext interface to it seems that way as well. d1.saveAsParquetFile("/foo/d1") // another day, another table, with same schema d2

Re: parquet predicate / projection pushdown into unionAll

2014-09-10 Thread Cody Koeninger
performance against some actual data sets. On Tue, Sep 9, 2014 at 6:09 PM, Cody Koeninger c...@koeninger.org wrote: Ok, so looking at the optimizer code for the first time and trying the simplest rule that could possibly work, object UnionPushdown extends Rule[LogicalPlan] { def apply(plan
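
The quoted rule is cut off above; a minimal sketch of the idea against the 1.1-era catalyst API, omitting the attribute rewriting a correct rule also needs, looks roughly like:

    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project, Union}
    import org.apache.spark.sql.catalyst.rules.Rule

    // Push projections and filters below a Union so each child
    // ParquetTableScan can prune columns and skip row groups itself.
    object UnionPushdown extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case Project(projectList, Union(left, right)) =>
          Union(Project(projectList, left), Project(projectList, right))
        case Filter(condition, Union(left, right)) =>
          Union(Filter(condition, left), Filter(condition, right))
      }
    }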

Re: parquet predicate / projection pushdown into unionAll

2014-09-10 Thread Cody Koeninger
it. I'll see about testing performance against some actual data sets. On Tue, Sep 9, 2014 at 6:09 PM, Cody Koeninger c...@koeninger.org wrote: Ok, so looking at the optimizer code for the first time and trying the simplest rule that could possibly work, object UnionPushdown extends Rule

Re: parquet predicate / projection pushdown into unionAll

2014-09-12 Thread Cody Koeninger
, Sep 10, 2014 at 9:31 AM, Cody Koeninger c...@koeninger.org wrote: Tested the patch against a cluster with some real data. Initial results seem like going from one table to a union of 2 tables is now closer to a doubling of query time as expected, instead of 5 to 10x. Let me know if you see

Support for Hive buckets

2014-09-14 Thread Cody Koeninger
I noticed that the release notes for 1.1.0 said that spark doesn't support Hive buckets yet. I didn't notice any jira issues related to adding support. Broadly speaking, what would be involved in supporting buckets, especially the bucketmapjoin and sortedmerge optimizations?

guava version conflicts

2014-09-19 Thread Cody Koeninger
After the recent spark project changes to guava shading, I'm seeing issues with the datastax spark cassandra connector (which depends on guava 15.0) and the datastax cql driver (which depends on guava 16.0.1) Building an assembly for a job (with spark marked as provided) that includes either
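
One common mitigation for this kind of clash, sketched here with sbt-assembly's shade rules (available in later sbt-assembly releases; the relocated package name is arbitrary), is to relocate guava inside the job assembly:

    // build.sbt fragment: rename guava's packages in the fat jar so the job's
    // guava 15/16 classes can't collide with whatever Spark ships or shades.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.google.common.**" -> "shadedguava.@1").inAll
    )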

Re: hash vs sort shuffle

2014-09-22 Thread Cody Koeninger
file. On Mon, Sep 22, 2014 at 10:54 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Thanks for the heads up Cody. Any indication of what was going wrong? On Mon, Sep 22, 2014 at 7:16 AM, Cody Koeninger c...@koeninger.org wrote: Just as a heads up, we deployed 471e6a3a of master (in order

Re: guava version conflicts

2014-09-22 Thread Cody Koeninger
. On Fri, Sep 19, 2014 at 10:30 PM, Cody Koeninger c...@koeninger.org wrote: After the recent spark project changes to guava shading, I'm seeing issues with the datastax spark cassandra connector (which depends on guava 15.0) and the datastax cql driver (which depends on guava 16.0.1

Re: guava version conflicts

2014-09-22 Thread Cody Koeninger
file from the Spark assembly you're using. On Mon, Sep 22, 2014 at 12:46 PM, Cody Koeninger c...@koeninger.org wrote: We're using Mesos, is there a reasonable expectation that spark.files.userClassPathFirst will actually work? On Mon, Sep 22, 2014 at 1:42 PM, Marcelo Vanzin van

OutOfMemoryError on parquet SnappyDecompressor

2014-09-22 Thread Cody Koeninger
After commit 8856c3d8 switched from gzip to snappy as default parquet compression codec, I'm seeing the following when trying to read parquet files saved using the new default (same schema and roughly same size as files that were previously working): java.lang.OutOfMemoryError: Direct buffer
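
A workaround sketch consistent with this report: switch the codec back to the previous default via the Spark SQL setting, assuming a 1.1-era sqlContext in scope:

    // gzip avoids the direct ByteBuffer allocation path that snappy's
    // SnappyDecompressor uses, sidestepping the direct-memory OOM.
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")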

Hadoop configuration for checkpointing

2014-11-04 Thread Cody Koeninger
3 quick questions, then some background: 1. Is there a reason not to document the fact that spark.hadoop.* is copied from spark config into hadoop config? 2. Is there a reason StreamingContext.getOrCreate defaults to a blank hadoop configuration rather than
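
A small sketch of the behavior in question 1, assuming a Hadoop key fs.custom.setting your job needs (the key is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Any "spark.hadoop."-prefixed key is copied, prefix stripped, into the
    // Hadoop Configuration Spark builds, so it reaches HDFS and input formats.
    val conf = new SparkConf()
      .setAppName("hadoop-conf-demo")
      .set("spark.hadoop.fs.custom.setting", "value")
    val sc = new SparkContext(conf)
    assert(sc.hadoopConfiguration.get("fs.custom.setting") == "value")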

Re: Hadoop configuration for checkpointing

2014-11-04 Thread Cody Koeninger
Opened https://issues.apache.org/jira/browse/SPARK-4229 Sent a PR https://github.com/apache/spark/pull/3102 On Tue, Nov 4, 2014 at 11:48 AM, Marcelo Vanzin van...@cloudera.com wrote: On Tue, Nov 4, 2014 at 9:34 AM, Cody Koeninger c...@koeninger.org wrote: 2. Is there a reason

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Cody Koeninger
My 2 cents: Spark since pre-Apache days has been the most friendly and welcoming open source project I've seen, and that's reflected in its success. It seems pretty obvious to me that, for example, Michael should be looking at major changes to the SQL codebase. I trust him to do that in a way

Http client dependency conflict when using AWS

2014-11-10 Thread Cody Koeninger
I'm wondering why https://issues.apache.org/jira/browse/SPARK-3638 only updated the version of http client for the kinesis-asl profile and left the base dependencies unchanged. Spark built without that profile still has the same java.lang.NoSuchMethodError:

Re: spark kafka batch integration

2014-12-15 Thread Cody Koeninger
For an alternative take on a similar idea, see https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka An advantage of the approach I'm taking is that the lower and upper offsets of the RDD are known in advance, so it's deterministic. I

Which committers care about Kafka?

2014-12-18 Thread Cody Koeninger
Now that 1.2 is finalized... who are the go-to people to get some long-standing Kafka related issues resolved? The existing api is not sufficiently safe nor flexible for our production use. I don't think we're alone in this viewpoint, because I've seen several different patches and libraries to

Re: Which committers care about Kafka?

2014-12-18 Thread Cody Koeninger
18, 2014 at 7:07 AM, Cody Koeninger c...@koeninger.org wrote: Now that 1.2 is finalized... who are the go-to people to get some long-standing Kafka related issues resolved? The existing api is not sufficiently safe nor flexible for our production use. I don't think we're alone

Re: Which committers care about Kafka?

2014-12-18 Thread Cody Koeninger
to guarantee - though I really would love to have that! Thanks, Hari On Thu, Dec 18, 2014 at 12:26 PM, Cody Koeninger c...@koeninger.org wrote: Thanks for the replies. Regarding skipping WAL, it's not just about optimization. If you actually want exactly-once semantics, you need control of kafka

Re: Which committers care about Kafka?

2014-12-19 Thread Cody Koeninger
- From: Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com] Sent: Friday, December 19, 2014 5:57 AM To: Cody Koeninger Cc: Hari Shreedharan; Patrick Wendell; dev@spark.apache.org Subject: Re: Which committers care about Kafka? But idempotency is not that easy to achieve sometimes

Re: Which committers care about Kafka?

2014-12-19 Thread Cody Koeninger
. Thanks, Hari On Fri, Dec 19, 2014 at 1:48 PM, Cody Koeninger c...@koeninger.org wrote: The problems you guys are discussing come from trying to store state in spark, so don't do that. Spark isn't a distributed database. Just map kafka partitions directly to rdds, let user code specify

cleaning up cache files left by SPARK-2713

2014-12-22 Thread Cody Koeninger
Is there a reason not to go ahead and move the _cache and _lock files created by Utils.fetchFiles into the work directory, so they can be cleaned up more easily? I saw comments to that effect in the discussion of the PR for 2713, but it doesn't look like it got done. And no, I didn't just have a

Re: Which committers care about Kafka?

2014-12-24 Thread Cody Koeninger
...@tresata.com wrote: yup, we at tresata do the idempotent store the same way. very simple approach. On Fri, Dec 19, 2014 at 5:32 PM, Cody Koeninger c...@koeninger.org wrote: That KafkaRDD code is dead simple. Given a user specified map (topic1, partition0) - (startingOffset, endingOffset
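
For reference, a sketch of the batch pattern being described, using the createRDD API this work fed into (Spark 1.3+ Kafka 0.8 integration; broker, topic, and offsets are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Offsets are fixed up front, so the RDD's contents are deterministic and
    // a replayed write stores exactly the same data, keeping the store idempotent.
    def kafkaBatch(sc: SparkContext) = {
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
      val offsetRanges = Array(
        OffsetRange("topic1", 0, 100L, 200L), // partition 0: [100, 200)
        OffsetRange("topic1", 1, 100L, 150L)  // partition 1: [100, 150)
      )
      KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
        sc, kafkaParams, offsetRanges)
    }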

Re: Which committers care about Kafka?

2014-12-25 Thread Cody Koeninger
hshreedha...@cloudera.com wrote: In general such discussions happen or is posted on the dev lists. Could you please post a summary? Thanks. Thanks, Hari On Wed, Dec 24, 2014 at 11:46 PM, Cody Koeninger c...@koeninger.org wrote: After a long talk with Patrick and TD (thanks guys), I

Re: SQLContext is Serializable, SparkContext is not

2014-12-26 Thread Cody Koeninger
The spark context reference is transient. On Fri, Dec 26, 2014 at 6:11 PM, Alessandro Baretta alexbare...@gmail.com wrote: How, O how can this be? Doesn't the SQLContext hold a reference to the SparkContext? Alex
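
A minimal illustration of that pattern (the class is hypothetical, not Spark's actual SQLContext source):

    import org.apache.spark.SparkContext

    // @transient keeps the non-serializable SparkContext out of the serialized
    // form, so the enclosing object can itself be Serializable.
    class ContextHolder(@transient val sparkContext: SparkContext) extends Serializable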

Re: Which committers care about Kafka?

2014-12-29 Thread Cody Koeninger
://github.com/apache/spark/pull/3798 . I am reviewing it. Please follow the PR if you are interested. TD On Wed, Dec 24, 2014 at 11:59 PM, Cody Koeninger c...@koeninger.org wrote: The conversation was mostly getting TD up to speed on this thread since he had just gotten back from his trip

Re: Which committers care about Kafka?

2014-12-29 Thread Cody Koeninger
will further delay the ongoing job, and finally lead to failure. Thanks Jerry *From:* Cody Koeninger [mailto:c...@koeninger.org] *Sent:* Tuesday, December 30, 2014 6:50 AM *To:* Tathagata Das *Cc:* Hari Shreedharan; Shao, Saisai; Sean McNamara; Patrick Wendell; Luis Ángel Vicente Sánchez

Is there any way to tell if compute is being called from a retry?

2014-12-30 Thread Cody Koeninger
It looks like taskContext.attemptId doesn't mean what one thinks it might mean, based on http://apache-spark-developers-list.1001551.n3.nabble.com/Get-attempt-number-in-a-closure-td8853.html and the unresolved https://issues.apache.org/jira/browse/SPARK-4014 Is there any alternative way to

Re: Exception using the new createDirectStream util method

2015-03-19 Thread Cody Koeninger
this be an issue?? If you guys want to have a look at the code I've just uploaded it to my github account: big-brother https://github.com/ardlema/big-brother (see DirectKafkaWordCountTest.scala). Thank you again!! 2015-03-19 22:13 GMT+01:00 Cody Koeninger c...@koeninger.org: What

Re: Exception using the new createDirectStream util method

2015-03-20 Thread Cody Koeninger
is working fine! Do you think that I should open an issue to warn that the kafka topic must contain at least one message before the DirectStream creation? Thank you very much! You've just made my day ;) 2015-03-19 23:08 GMT+01:00 Cody Koeninger c...@koeninger.org: Yeah, I wouldn't be shocked

UnusedStubClass in 1.3.0-rc1

2015-02-25 Thread Cody Koeninger
So when building 1.3.0-rc1 I see the following warning: [WARNING] spark-streaming-kafka_2.10-1.3.0.jar, unused-1.0.0.jar define 1 overlappping classes: [WARNING] - org.apache.spark.unused.UnusedStubClass and when trying to build an assembly of a project that was previously using 1.3

Re: UnusedStubClass in 1.3.0-rc1

2015-02-25 Thread Cody Koeninger
(otherwise, some work we do that requires the shade plugin does not happen). However, now there are other things there. If you just comment out the line in the root pom.xml adding this dependency, does it work? - Patrick On Wed, Feb 25, 2015 at 7:53 AM, Cody Koeninger c

Re: Design docs: consolidation and discoverability

2015-04-24 Thread Cody Koeninger
My 2 cents - I'd rather see design docs in github pull requests (using plain text / markdown). That doesn't require changing access or adding people, and github PRs already allow for conversation / email notifications. Conversation is already split between jira and github PRs. Having a third

Re: Design docs: consolidation and discoverability

2015-04-24 Thread Cody Koeninger
Why can't pull requests be used for design docs in Git if people who aren't committers want to contribute changes (as opposed to just comments)? On Fri, Apr 24, 2015 at 2:57 PM, Sean Owen so...@cloudera.com wrote: Only catch there is it requires commit access to the repo. We need a way for

Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread Cody Koeninger
What's your schema for the offset table, and what's the definition of writeOffset? What key are you reducing on? Maybe I'm misreading the code, but it looks like the per-partition offset is part of the key. If that's true then you could just do your reduction on each partition, rather than

Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread Cody Koeninger
In fact, you're using the 2 arg form of reduceByKey to shrink it down to 1 partition: reduceByKey(sumFunc, 1). But you started with 4 kafka partitions? So they're definitely no longer 1:1. On Thu, Apr 30, 2015 at 1:58 PM, Cody Koeninger c...@koeninger.org wrote: This is what I'm suggesting
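
The two-argument form referenced here, in a minimal sketch (sample data illustrative, sc assumed in scope):

    // The second argument sets the number of result partitions: 4 input
    // partitions (matching 4 Kafka partitions) collapse to 1, so the result
    // is no longer 1:1 with the Kafka partitioning.
    val pairs = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)), 4)
    val sumFunc = (a: Long, b: Long) => a + b
    val reduced = pairs.reduceByKey(sumFunc, 1)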

Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-05-05 Thread Cody Koeninger
having. Thank you kindly, Mr. Koeninger. On Thu, Apr 30, 2015 at 3:06 PM, Cody Koeninger c...@koeninger.org wrote: In fact, you're using the 2 arg form of reduce by key to shrink it down to 1 partition reduceByKey(sumFunc, 1) But you started with 4 kafka partitions? So they're definitely

Re: New Kafka producer API

2015-05-05 Thread Cody Koeninger
Since that's an internal class used only for unit testing, what would the benefit be? On Tue, May 5, 2015 at 3:19 PM, BenFradet benjamin.fra...@gmail.com wrote: Hi, Since we're now supporting Kafka 0.8.2.1 https://github.com/apache/spark/pull/4537 , and that there is a new Producer API

Re: New Kafka producer API

2015-05-05 Thread Cody Koeninger
Regarding performance, keep in mind we'd probably have to turn all those async calls into blocking calls for the unit tests On Tue, May 5, 2015 at 3:44 PM, BenFradet benjamin.fra...@gmail.com wrote: Even if it's only used for testing and the examples, why not move ahead of the deprecation and

Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-05-14 Thread Cody Koeninger
to the others in the RDD, the executor gets done with it and gets scheduled another one to work on. With long running receivers spark acts like the receiver takes up a core even if it isn't doing much. Look at the CPU graph on slide 13 of the link I posted. On Thu, May 14, 2015 at 4:21 PM, Cody

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Cody Koeninger
The scala api has 2 ways of calling createDirectStream. One of them allows you to pass a message handler that gets full access to the kafka MessageAndMetadata, including offset. I don't know why the python api was developed with only one way to call createDirectStream, but the first thing I'd
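
The Scala call shape being referenced, as a sketch (Spark 1.3+ direct stream for Kafka 0.8; ssc, kafkaParams, and the offset values are assumed or placeholders):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val fromOffsets = Map(TopicAndPartition("topic1", 0) -> 0L)
    // The messageHandler sees the full MessageAndMetadata, offset included.
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (Long, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.offset, mmd.message()))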

Re: Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Cody Koeninger
There are already private methods in the code for interacting with Kafka's offset management api. There's a jira for making those methods public, but TD has been reluctant to merge it https://issues.apache.org/jira/browse/SPARK-10963 I think adding any ZK specific behavior to spark is a bad

Re: Feedback: Feature request

2015-08-28 Thread Cody Koeninger
I wrote some code for this a while back, pretty sure it didn't need access to anything private in the decision tree / random forest model. If people want it added to the api I can put together a PR. I think it's important to have separately parseable operators / operands though. E.g

Re: Spark-Kafka Connector issue

2015-09-28 Thread Cody Koeninger
This is a user list question not a dev list question. Looks like your driver is having trouble communicating to the kafka brokers. Make sure the broker host and port is available from the driver host (using nc or telnet); make sure that you're providing the _broker_ host and port to

Re: Spark Streaming Kafka - DirectKafkaInputDStream: Using the new Kafka Consumer API

2015-12-04 Thread Cody Koeninger
rs is questionable. << I agree and I was more thinking maybe there is a way to support both for a period of time (of course means some more code to maintain :-)). thanks, Mario

Re: Spark Streaming Kafka - DirectKafkaInputDStream: Using the new Kafka Consumer API

2015-12-03 Thread Cody Koeninger
Honestly my feeling on any new API is to wait for a point release before taking it seriously :) Auth and encryption seem like the only compelling reason to move, but forcing people on kafka 8.x to upgrade their brokers is questionable. On Thu, Dec 3, 2015 at 11:30 AM, Mario Ds Briggs

Re: Kafka consumer: Upgrading to use the the new Java Consumer

2015-12-27 Thread Cody Koeninger
Have you seen SPARK-12177? On Wed, Dec 23, 2015 at 3:27 PM, eugene miretsky wrote: > Hi, The Kafka connector currently uses the older Kafka Scala consumer. Kafka 0.9 came out with a new Java Kafka consumer. One of the main differences is that the Scala

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Cody Koeninger
I don't have a vote, but I'd just like to reiterate that I think kafka 0.10 support should be added to a 2.0 release candidate; if not now, then well before release. - it's a completely standalone jar, so shouldn't break anyone who's using the existing 0.8 support - it's like the 5th highest

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Cody Koeninger
, Sean Owen <so...@cloudera.com> wrote: > I profess ignorance again though I really should know by now, but, what's opposing that? I personally thought this was going to be in 2.0 and didn't kind of notice it wasn't ... On Wed, Jun 22, 2016 at 3:29 PM, Cody Koeninger <c

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Cody Koeninger
On Wed, Jun 22, 2016 at 7:46 AM, Cody Koeninger <c...@koeninger.org> wrote: >> As far as I know the only thing blocking it at this point is lack of committer review / approval. It's technically adding a new feature after spark code-freeze, but it

Re: cutting 1.6.2 rc and 2.0.0 rc this week?

2016-06-15 Thread Cody Koeninger
Any word on Kafka 0.10 support / SPARK-12177 I understand the hesitation, but is having nothing better than having a standalone subproject marked as experimental? On Wed, Jun 15, 2016 at 2:01 PM, Reynold Xin wrote: > It's been a while and we have accumulated quite a few bug

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
essed as Rows, everything uses DataFrames. Type classes used in a Dataset are internally converted to rows for processing. Therefore internally DataFrame is like the "main" type that is used.

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
e is like "main" type that is > used. > > On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger <c...@koeninger.org> wrote: >> >> Sorry, meant DataFrame vs Dataset >> >> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <c...@koe

Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Cody Koeninger
Sorry, meant DataFrame vs Dataset On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <c...@koeninger.org> wrote: > Is there a principled reason why sql.streaming.* and sql.execution.streaming.* are making extensive use of DataFrame instead of Datasource? Or is that just

Kafka connector mention in Matei's keynote

2016-02-18 Thread Cody Koeninger
I saw this slide: http://image.slidesharecdn.com/east2016v2matei-160217154412/95/2016-spark-summit-east-keynote-matei-zaharia-5-638.jpg?cb=1455724433 Didn't see the talk - was this just referring to the existing work on the spark-streaming-kafka subproject, or is someone actually working on

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Cody Koeninger
Spark streaming by default will not start processing a batch until the current batch is finished. So if your processing time is larger than your batch time, delays will build up. On Wed, Mar 9, 2016 at 11:09 AM, Sachin Aggarwal wrote: > Hi All, > > we have batchTime

Re: DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-10 Thread Cody Koeninger
The central problem with doing anything like this is that you break one of the basic guarantees of kafka, which is in-order processing on a per-topicpartition basis. As far as PRs go, because of the new consumer interface for kafka 0.9 and 0.10, there's a lot of potential change already underway.

Re: DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-15 Thread Cody Koeninger
like when they use the existing DStream.repartition where original per-topicpartition in-order processing is also not observed any more. Do you agree? On Thu, Mar 10, 2016 at 12:12 PM, Cody Koeninger <c...@koeninger.org> wrote: >> The central problem wit

Re: submissionTime vs batchTime, DirectKafka

2016-03-09 Thread Cody Koeninger
On Wed, Mar 9, 2016 at 10:46 PM, Cody Koeninger <c...@koeninger.org> wrote: >> Spark streaming by default will not start processing a batch until the current batch is finished. So if your processing time is larger than your batch time, delays will buil

Re: Use cases for kafka direct stream messageHandler

2016-03-09 Thread Cody Koeninger
e relevance to what you're talking about. Perhaps if both functions (the one with partitions arg and the one without) returned just ConsumerRecord, I would like that more. - Alan On Tue, Mar 8, 2016 at 6:49 AM, Cody Koeninger <c...@koeninger.org> wrote:

Re: SPARK-13843 and future of streaming backends

2016-03-18 Thread Cody Koeninger
arty external repositories - not owned by Apache. At a minimum, dev@ discussion (like this one) should be initiated. As PMC is responsible for the project assets (including code), signoff is required for it IMO. More experienced Apache members mi

Re: SPARK-13843 and future of streaming backends

2016-03-18 Thread Cody Koeninger
Why would a PMC vote be necessary on every code deletion? There was a Jira and pull request discussion about the submodules that have been removed so far. https://issues.apache.org/jira/browse/SPARK-13843 There's another ongoing one about Kafka specifically

Re: SPARK-13843 Next steps

2016-03-22 Thread Cody Koeninger
I'm in favor of everything in /extras and /external being removed, but I'm more in favor of making a decision and moving on. On Tue, Mar 22, 2016 at 12:20 PM, Marcelo Vanzin wrote: > +1 for getting flume back. > > On Tue, Mar 22, 2016 at 12:27 AM, Kostas Sakellis

Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Cody Koeninger
minimum, dev@ discussion (like this one) should be initiated. As PMC is responsible for the project assets (including code), signoff is required for it IMO. More experienced Apache members might opine better in case I got it wrong

Re: Use cases for kafka direct stream messageHandler

2016-03-08 Thread Cody Koeninger
No, looks like you'd have to catch them in the serializer and have the serializer return option or something. The new consumer builds a buffer full of records, not one at a time. On Mar 8, 2016 4:43 AM, "Marius Soutier" <mps@gmail.com> wrote: > On 04.03.2016, at

getting a list of executors for use in getPreferredLocations

2016-03-03 Thread Cody Koeninger
I need getPreferredLocations to choose a consistent executor for a given partition in a stream. In order to do that, I need to know what the current executors are. I'm currently grabbing them from the block manager master .getPeers(), which works, but I don't know if that's the most reasonable
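
A sketch of the shape being described, assuming the executor host list was already fetched (e.g. via the block manager master's getPeers):

    import scala.reflect.ClassTag
    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // Pin each partition to a host chosen deterministically from a fixed list,
    // so a given partition index always prefers the same executor host.
    class PinnedRDD[T: ClassTag](prev: RDD[T], hosts: IndexedSeq[String])
        extends RDD[T](prev) {
      override def getPartitions: Array[Partition] = prev.partitions
      override def compute(split: Partition, ctx: TaskContext): Iterator[T] =
        prev.iterator(split, ctx)
      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq(hosts(split.index % hosts.size))
    }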

Re: getting a list of executors for use in getPreferredLocations

2016-03-03 Thread Cody Koeninger
> What do you mean by consistent? Throughout the life cycle of an app, the executors can come and go and as a result really has no consistency. Do you just need it for a specific job? On Thu, Mar 3, 2016 at 3:08 PM, Cody Koeninger <c...@koeni

Use cases for kafka direct stream messageHandler

2016-03-04 Thread Cody Koeninger
Wanted to survey what people are using the direct stream messageHandler for, besides just extracting key / value / offset. Would your use case still work if that argument was removed, and the stream just contained ConsumerRecord objects

Re: Upgrading to Kafka 0.9.x

2016-03-02 Thread Cody Koeninger
Jay, thanks for the response. Regarding the new consumer API for 0.9, I've been reading through the code for it and thinking about how it fits in to the existing Spark integration. So far I've seen some interesting challenges, and if you (or anyone else on the dev list) have time to provide some

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Cody Koeninger
I agree with Mark in that I don't see how supporting scala 2.10 for spark 2.0 implies supporting it for all of spark 2.x Regarding Koert's comment on akka, I thought all akka dependencies have been removed from spark after SPARK-7997 and the recent removal of external/akka On Wed, Mar 30, 2016

Re: SPARK-13843 Next steps

2016-03-28 Thread Cody Koeninger
I really think the only thing that should have to change is the maven group and identifier, not the java namespace. There are compatibility problems with the java namespace changing (e.g. access to private[spark]), and I don't think that someone who takes the time to change their build file to

Re: SPARK-13843 and future of streaming backends

2016-03-28 Thread Cody Koeninger
Are you talking about group/identifier name, or contained classes? Because there are plenty of org.apache.* classes distributed via maven with non-apache group / identifiers. On Fri, Mar 25, 2016 at 6:54 PM, David Nalley wrote: > >> As far as group / artifact name

Re: Spark streaming Kafka receiver WriteAheadLog question

2016-04-25 Thread Cody Koeninger
If you want to refer back to Kafka based on offset ranges, why not use createDirectStream? On Fri, Apr 22, 2016 at 11:49 PM, Renyi Xiong wrote: > Hi, > > Is it possible for Kafka receiver generated WriteAheadLogBackedBlockRDD to > hold corresponded Kafka offset range so

Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Cody Koeninger
Given that not all of the connectors were removed, I think this creates a weird / confusing three tier system 1. connectors in the official project's spark/extras or spark/external 2. connectors in "Spark Extras" 3. connectors in some random organization's github On Fri, Apr 15, 2016 at 11:18

Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Cody Koeninger
100% agree with Sean & Reynold's comments on this. Adding this as a TLP would just cause more confusion as to "official" endorsement. On Fri, Apr 15, 2016 at 11:50 AM, Sean Owen wrote: > On Fri, Apr 15, 2016 at 5:34 PM, Luciano Resende wrote: >> I

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Cody Koeninger
For what it's worth, I have definitely had PRs that sat inactive for more than 30 days due to committers not having time to look at them, but did eventually end up successfully being merged. I guess if this just ends up being a committer ping and reopening the PR, it's fine, but I don't know if

Re: Potential Change in Kafka's Partition Assignment Semantics when Subscription Changes

2016-07-25 Thread Cody Koeninger
This seems really low risk to me. In order to be impacted, it'd have to be someone who was using the kafka integration in spark 2.0, which isn't even officially released yet. On Mon, Jul 25, 2016 at 7:23 PM, Vahid S Hashemian wrote: > Sorry, meant to ask if any Apache

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Cody Koeninger
For 2.0, the kafka dstream support is in two separate subprojects depending on which version of Kafka you are using spark-streaming-kafka-0-10 or spark-streaming-kafka-0-8 corresponding to brokers that are version 0.10+ or 0.8+ On Mon, Jul 25, 2016 at 12:29 PM, Reynold Xin
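
Build coordinates as published for 2.0.0, in sbt syntax (%% resolves the scala version):

    // Brokers 0.10+: the new-consumer based integration
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"
    // Brokers 0.8.2+: the classic direct / receiver integration
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"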

Re: Clarifying that spark-x.x.x-bin-hadoopx.x.tgz doesn't include Hadoop itself

2016-07-29 Thread Cody Koeninger
Yeah, and the without hadoop was even more confusing... because if you weren't using hdfs at all, you still needed to download one of the hadoop-x packages in order to get hadoop io classes used by almost everything. :) On Fri, Jul 29, 2016 at 3:06 PM, Marcelo Vanzin wrote:

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
roblem in detail here: https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok Could you please give me some suggestions or advice to fix this problem? Thanks On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <c...@koeninger.org> wrote:

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
leq...@gmail.com> wrote: > How to do that? if I put the queue inside .transform operation, it doesn't work. On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger <c...@koeninger.org> wrote: >> Can you keep a queue per executor in memory? On Mon

Re: How does MapWithStateRDD distribute the data

2016-08-03 Thread Cody Koeninger
Are you using KafkaUtils.createDirectStream? On Wed, Aug 3, 2016 at 9:42 AM, Soumitra Johri wrote: > Hi, I am running a streaming job with 4 executors and 16 cores so that each executor has two cores to work with. The input Kafka topic has 4 partitions. With

Re: Why's ds.foreachPartition(println) not possible?

2016-07-05 Thread Cody Koeninger
I don't think that's a scala compiler bug. println is a valid expression that returns unit. Unit is not a single-argument function, and does not match any of the overloads of foreachPartition. You may be used to a conversion taking place when println is passed to a method expecting a function, but
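
A minimal sketch of the distinction, assuming a Dataset[String] named ds:

    // `println` by itself doesn't match any overload of foreachPartition;
    // an explicit function over the partition's iterator does.
    ds.foreachPartition((it: Iterator[String]) => it.foreach(println))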

Re: [VOTE] Release Apache Spark 2.0.0 (RC2)

2016-07-06 Thread Cody Koeninger
I know some usages of the 0.10 kafka connector will be broken until https://github.com/apache/spark/pull/14026 is merged, but the 0.10 connector is a new feature, so not blocking. Sean I'm assuming the DirectKafkaStreamSuite failure you saw was for 0.8? I'll take another look at it. On Wed,

Re: Kafka Support new topic subscriptions without requiring restart of the streaming context

2016-08-08 Thread Cody Koeninger
The Kafka 0.10 support in spark 2.0 allows for pattern based topic subscription. On Aug 8, 2016 1:12 AM, "r7raul1...@163.com" wrote: > How to add new topic to kafka without requiring restart of the streaming context?
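
A sketch of that pattern subscription (spark-streaming-kafka-0-10 API; the regex is a placeholder, and ssc plus a Map[String, Object] kafkaParams are assumed in scope):

    import java.util.regex.Pattern
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    // Topics matching the pattern are picked up as they appear,
    // without restarting the StreamingContext.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.SubscribePattern[String, String](
        Pattern.compile("events-.*"), kafkaParams))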

Re: sampling operation for DStream

2016-07-29 Thread Cody Koeninger
Most stream systems you're still going to incur the cost of reading each message... I suppose you could rotate among reading just the latest messages from a single partition of a Kafka topic if they were evenly balanced. But once you've read the messages, nothing's stopping you from filtering

Jenkins networking / port contention

2016-07-01 Thread Cody Koeninger
Can someone familiar with amplab's jenkins setup clarify whether all tests running at a given time are competing for network ports, or whether there's some sort of containerization being done? Based on the use of Utils.startServiceOnPort in the tests, I'd assume the former.

Re: Jenkins networking / port contention

2016-07-01 Thread Cody Koeninger
e're on bare metal. > the test launch code executes this for each build: # Generate random port for Zinc export ZINC_PORT ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)") On Fri, Jul 1, 2016 at 6:02 AM, Cody Koeninger
