Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-22 Thread Sean McNamara
+1

On Jun 22, 2016, at 1:14 PM, Michael Armbrust wrote:

+1

On Wed, Jun 22, 2016 at 11:33 AM, Jonathan Kelly wrote:
+1

On Wed, Jun 22, 2016 at 10:41 AM Tim Hunter wrote:
+1 This release passes all tests on the graphframes and tensorframes packages.

On Wed, Jun 22, 2016 at 7:19 AM, Cody Koeninger wrote:
If we're considering backporting changes for the 0.8 kafka
integration, I am sure there are people who would like to get

https://issues.apache.org/jira/browse/SPARK-10963

into 1.6.x as well

On Wed, Jun 22, 2016 at 7:41 AM, Sean Owen wrote:
> Good call, probably worth back-porting, I'll try to do that. I don't
> think it blocks a release, but would be good to get into a next RC if
> any.
>
> On Wed, Jun 22, 2016 at 11:38 AM, Pete Robbins wrote:
>> This has failed on our 1.6 stream builds regularly.
>> (https://issues.apache.org/jira/browse/SPARK-6005) looks fixed in 2.0?
>>
>> On Wed, 22 Jun 2016 at 11:15 Sean Owen wrote:
>>>
>>> Oops, one more in the "does anybody else see this" department:
>>>
>>> - offset recovery *** FAILED ***
>>>   recoveredOffsetRanges.forall(((or: (org.apache.spark.streaming.Time, Array[org.apache.spark.streaming.kafka.OffsetRange])) => earlierOffsetRangesAsSets.contains(scala.Tuple2.apply[org.apache.spark.streaming.Time, scala.collection.immutable.Set[org.apache.spark.streaming.kafka.OffsetRange]](or._1, scala.this.Predef.refArrayOps[org.apache.spark.streaming.kafka.OffsetRange](or._2).toSet[org.apache.spark.streaming.kafka.OffsetRange]))) was false
>>>   Recovered ranges are not the same as the ones generated (DirectKafkaStreamSuite.scala:301)
>>>
>>> This actually fails consistently for me too in the Kafka integration
>>> code. Not timezone related, I think.
>






Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-08 Thread Sean McNamara
+1

Sean


On Nov 3, 2015, at 4:28 PM, Reynold Xin wrote:

Please vote on releasing the following candidate as Apache Spark version 1.5.2. 
The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a majority of 
at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.2
[ ] -1 Do not release this package because ...


The release fixes 59 known issues in Spark 1.5.1, listed here:
http://s.apache.org/spark-1.5.2

The tag to be voted on is v1.5.2-rc2:
https://github.com/apache/spark/releases/tag/v1.5.2-rc2

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
- as version 1.5.2-rc2: 
https://repository.apache.org/content/repositories/orgapachespark-1153
- as version 1.5.2: 
https://repository.apache.org/content/repositories/orgapachespark-1152

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-docs/


===
How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.


What justifies a -1 vote for this release?

A -1 vote should occur for regressions from Spark 1.5.1. Bugs already present in 
1.5.1 will not block this release.





Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Sean McNamara
Ran tests + built/ran an internal spark streaming app /w 1.5.1 artifacts.

+1

Cheers,

Sean


On Sep 24, 2015, at 1:28 AM, Reynold Xin wrote:

Please vote on releasing the following candidate as Apache Spark version 1.5.1. 
The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if a majority 
of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.1
[ ] -1 Do not release this package because ...


The release fixes 81 known issues in Spark 1.5.0, listed here:
http://s.apache.org/spark-1.5.1

The tag to be voted on is v1.5.1-rc1:
https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release (1.5.1) can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1148/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/


===
How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.


What justifies a -1 vote for this release?

A -1 vote should occur for regressions from Spark 1.5.0. Bugs already present in 
1.5.0 will not block this release.

===
What should happen to JIRA tickets still targeting 1.5.1?
===
Please target 1.5.2 or 1.6.0.






Re: [VOTE] Release Apache Spark 1.4.1 (RC4)

2015-07-10 Thread Sean McNamara
+1

Sean

 On Jul 8, 2015, at 11:55 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 Please vote on releasing the following candidate as Apache Spark version 
 1.4.1!
 
 This release fixes a handful of known issues in Spark 1.4.0, listed here:
 http://s.apache.org/spark-1.4.1
 
 The tag to be voted on is v1.4.1-rc4 (commit dbaa5c2):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dbaa5c294eb565f84d7032e387e4b8c1a56e4cd2
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc4-bin/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 The staging repository for this release can be found at:
 [published as version: 1.4.1]
 https://repository.apache.org/content/repositories/orgapachespark-1125/
 [published as version: 1.4.1-rc4]
 https://repository.apache.org/content/repositories/orgapachespark-1126/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc4-docs/
 
 Please vote on releasing this package as Apache Spark 1.4.1!
 
 The vote is open until Sunday, July 12, at 06:55 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.4.1
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 



Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-09 Thread Sean McNamara
+1

tested /w OS X + deployed one of our streaming apps onto a staging yarn cluster.

Sean

 On Jun 2, 2015, at 9:54 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 Please vote on releasing the following candidate as Apache Spark version 
 1.4.0!
 
 The tag to be voted on is v1.4.0-rc3 (commit 22596c5):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=22596c534a38cfdda91aef18aa9037ab101e4251
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 The staging repository for this release can be found at:
 [published as version: 1.4.0]
 https://repository.apache.org/content/repositories/orgapachespark-/
 [published as version: 1.4.0-rc4]
 https://repository.apache.org/content/repositories/orgapachespark-1112/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/
 
 Please vote on releasing this package as Apache Spark 1.4.0!
 
 The vote is open until Saturday, June 06, at 05:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.4.0
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 == What has changed since RC3 ==
 In addition to many smaller fixes, three blocker issues were fixed:
 4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make
 metadataHive get constructed too early
 6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
 78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton
 
 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.3 workload and running on this release candidate,
 then reporting any regressions.
 
 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.4 QA period,
 so -1 votes should only occur for significant regressions from 1.3.1.
 Bugs already present in 1.3.X, minor regressions, or bugs related
 to new features will not block this release.
 



Re: [VOTE] Release Apache Spark 1.2.2

2015-04-15 Thread Sean McNamara
Ran tests on OS X

+1

Sean


 On Apr 14, 2015, at 10:59 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 I'd like to close this vote to coincide with the 1.3.1 release,
 however, it would be great to have more people test this release
 first. I'll leave it open for a bit longer and see if others can give
 a +1.
 
 On Tue, Apr 14, 2015 at 9:55 PM, Patrick Wendell pwend...@gmail.com wrote:
 +1 from me as well.
 
 On Tue, Apr 7, 2015 at 4:36 AM, Sean Owen so...@cloudera.com wrote:
 I think that's close enough for a +1:
 
 Signatures and hashes are good.
 LICENSE, NOTICE still check out.
 Compiles for a Hadoop 2.6 + YARN + Hive profile.
 
 JIRAs with target version = 1.2.x look legitimate; no blockers.
 
 I still observe several Hive test failures with:
 mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 -DskipTests clean package;
 mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 test
 .. though again I think these are not regressions but known issues in
 older branches.
 
 FYI there are 16 Critical issues still open for 1.2.x:
 
 SPARK-6209,ExecutorClassLoader can leak connections after failing to
 load classes from the REPL class server,Josh Rosen,In Progress,4/5/15
 SPARK-5098,Number of running tasks become negative after tasks
 lost,,Open,1/14/15
 SPARK-4888,Spark EC2 doesn't mount local disks for i2.8xlarge
 instances,,Open,1/27/15
 SPARK-4879,Missing output partitions after job completes with
 speculative execution,Josh Rosen,Open,3/5/15
 SPARK-4568,Publish release candidates under $VERSION-RCX instead of
 $VERSION,Patrick Wendell,Open,11/24/14
 SPARK-4520,SparkSQL exception when reading certain columns from a
 parquet file,sadhan sood,Open,1/21/15
 SPARK-4514,SparkContext localProperties does not inherit property
 updates across thread reuse,Josh Rosen,Open,3/31/15
 SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15
 SPARK-4452,Shuffle data structures can starve others on the same
 thread for memory,Tianshuo Deng,Open,1/24/15
 SPARK-4356,Test Scala 2.11 on Jenkins,Patrick Wendell,Open,11/12/14
 SPARK-4258,NPE with new Parquet Filters,Cheng Lian,Reopened,4/3/15
 SPARK-4194,Exceptions thrown during SparkContext or SparkEnv
 construction might lead to resource leaks or corrupted global
 state,,In Progress,4/2/15
 SPARK-4159,Maven build doesn't run JUnit test suites,Sean 
 Owen,Open,1/11/15
 SPARK-4106,Shuffle write and spill to disk metrics are 
 incorrect,,Open,10/28/14
 SPARK-3492,Clean up Yarn integration code,Andrew Or,Open,9/12/14
 SPARK-3461,Support external groupByKey using
 repartitionAndSortWithinPartitions,Sandy Ryza,Open,11/10/14
 SPARK-2984,FileNotFoundException on _temporary directory,,Open,12/11/14
 SPARK-2532,Fix issues with consolidated shuffle,,Open,3/26/15
 SPARK-1312,Batch should read based on the batch interval provided in
 the StreamingContext,Tathagata Das,Open,12/24/14
 
 On Sun, Apr 5, 2015 at 7:24 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.2.2!
 
 The tag to be voted on is v1.2.2-rc1 (commit 7531b50):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7531b50e406ee2e3301b009ceea7c684272b2e27
 
 The list of fixes present in this release can be found at:
 http://bit.ly/1DCNddt
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.2-rc1/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1082/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.2-rc1-docs/
 
 Please vote on releasing this package as Apache Spark 1.2.2!
 
 The vote is open until Thursday, April 08, at 00:30 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.2.2
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 



Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-09 Thread Sean McNamara
+1 tested on OS X

Sean

 On Apr 7, 2015, at 11:46 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.1!
 
 The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5
 
 The list of fixes present in this release can be found at:
 http://bit.ly/1C2nVPY
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc2/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1083/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/
 
 The patches on top of RC1 are:
 
 [SPARK-6737] Fix memory leak in OutputCommitCoordinator
 https://github.com/apache/spark/pull/5397
 
 [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
 https://github.com/apache/spark/pull/5302
 
 [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with
 NoClassDefFoundError
 https://github.com/apache/spark/pull/4933
 
 Please vote on releasing this package as Apache Spark 1.3.1!
 
 The vote is open until Saturday, April 11, at 07:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.3.1
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 



Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Sean McNamara
+1

 On Apr 4, 2015, at 6:11 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.1!
 
 The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
 
 The list of fixes present in this release can be found at:
 http://bit.ly/1C2nVPY
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc1/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1080
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/
 
 Please vote on releasing this package as Apache Spark 1.3.1!
 
 The vote is open until Wednesday, April 08, at 01:10 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.3.1
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 - Patrick
 



Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Sean McNamara
Sounds good, that makes sense.

Cheers,

Sean

 On Jan 27, 2015, at 11:35 AM, Patrick Wendell pwend...@gmail.com wrote:
 
 Hey Sean,
 
 Right now we don't publish every 2.11 binary to avoid combinatorial
 explosion of the number of build artifacts we publish (there are other
 parameters such as whether hive is included, etc). We can revisit this
 in future feature releases, but .1 releases like this are reserved for
 bug fixes.
 
 - Patrick
 
 On Tue, Jan 27, 2015 at 10:31 AM, Sean McNamara
 sean.mcnam...@webtrends.com wrote:
 We're using spark on scala 2.11 /w hadoop2.4.  Would it be practical / make 
 sense to build a bin version of spark against scala 2.11 for versions other 
 than just hadoop1 at this time?
 
 Cheers,
 
 Sean
 
 
 On Jan 27, 2015, at 12:04 AM, Patrick Wendell pwend...@gmail.com wrote:
 
 Please vote on releasing the following candidate as Apache Spark version 
 1.2.1!
 
 The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc1/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1061/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
 
 Please vote on releasing this package as Apache Spark 1.2.1!
 
 The vote is open until Friday, January 30, at 07:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.2.1
 [ ] -1 Do not release this package because ...
 
 For a list of fixes in this release, see http://s.apache.org/Mpn.
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 - Patrick
 



Re: Which committers care about Kafka?

2014-12-19 Thread Sean McNamara
Please feel free to correct me if I’m wrong, but I think the exactly once spark 
streaming semantics can easily be solved using updateStateByKey. Make the key 
going into updateStateByKey be a hash of the event, or pluck off some uuid from 
the message.  The updateFunc would only emit the message if the key did not 
exist, and the user has complete control over the window of time / state 
lifecycle for detecting duplicates.  It also makes it really easy to detect and 
take action (alert?) when you DO see a duplicate, or make memory tradeoffs 
within an error bound using a sketch algorithm.  The kafka simple consumer is 
insanely complex, if possible I think it would be better (and vastly more 
flexible) to get reliability using the primitives that spark so elegantly 
provides.
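
A minimal sketch of that pattern, with a hypothetical Event type whose id stands in for the event hash/uuid (updateStateByKey requires a checkpointed StreamingContext, and this sketch never expires state, which a real job would control via the state lifecycle mentioned above):

    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical event type; id stands in for a hash of the event or a uuid plucked off the message.
    case class Event(id: String, payload: String)

    // Per-key state: the first event seen under this key, plus whether it has already been emitted.
    def updateFunc(newEvents: Seq[Event], state: Option[(Event, Boolean)]): Option[(Event, Boolean)] =
      state match {
        case Some((e, _))               => Some((e, true))               // key seen before: later batches treat it as a duplicate
        case None if newEvents.nonEmpty => Some((newEvents.head, false)) // first sighting: remember it, not yet emitted
        case None                       => None
      }

    def dedup(events: DStream[Event]): DStream[Event] =
      events
        .map(e => (e.id, e))                                          // key each message by its hash/uuid
        .updateStateByKey(updateFunc _)
        .filter { case (_, (_, alreadyEmitted)) => !alreadyEmitted }  // emit only on first sighting
        .map { case (_, (event, _)) => event }

The Some branch of updateFunc is also the natural place to count or alert on duplicates when they do show up.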

Cheers,

Sean


 On Dec 19, 2014, at 12:06 PM, Hari Shreedharan hshreedha...@cloudera.com 
 wrote:
 
 Hi Dibyendu,
 
 Thanks for the details on the implementation. But I still do not believe
 that it guarantees no duplicates - what they achieve is that the same batch is
 processed exactly the same way every time (though it may be processed more
 than once) - so it depends on the operation being idempotent. I believe
 Trident uses ZK to keep track of the transactions - a batch can be
 processed multiple times in failure scenarios (for example, the transaction
 is processed but before ZK is updated the machine fails, causing a new
 node to process it again).
 
 I don't think it is impossible to do this in Spark Streaming as well and
 I'd be really interested in working on it at some point in the near future.
 
 On Fri, Dec 19, 2014 at 1:44 AM, Dibyendu Bhattacharya 
 dibyendu.bhattach...@gmail.com wrote:
 
 Hi,
 
 Thanks to Jerry for mentioning the Kafka Spout for Trident. The Storm
 Trident achieves the exactly-once guarantee by processing the tuples in a
 batch and assigning the same transaction-id to a given batch. The replay of
 a given batch with a transaction-id will have the exact same set of tuples, and
 replay of batches happens in the exact same order as before the failure.
 
 With this paradigm, if the downstream system processes data for a given batch
 under a given transaction-id, and if during a failure the same batch is
 emitted again, you can check whether that transaction-id has already been
 processed and hence guarantee exactly-once semantics.
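
 A minimal sketch of that check, with a hypothetical TxStore standing in for whatever ZK- or database-backed store tracks processed transaction-ids:

     // Hypothetical downstream-store interface for the transaction-id check described above.
     trait TxStore {
       def alreadyProcessed(txId: Long): Boolean                  // e.g. backed by ZK or a database table
       def writeAtomically(txId: Long, batch: Seq[String]): Unit
     }

     def processBatch(store: TxStore, txId: Long, batch: Seq[String]): Unit =
       // On replay the same batch carries the same transaction-id, so a second delivery is a no-op.
       if (!store.alreadyProcessed(txId)) {
         store.writeAtomically(txId, batch)
       }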
 
 And this can only be achieved in Spark if we use the low-level Kafka consumer
 API to process the offsets. This low-level Kafka consumer
 (https://github.com/dibbhatt/kafka-spark-consumer) implements a
 Spark Kafka consumer on top of the Kafka low-level APIs. All of the Kafka-related
 logic has been taken from the Storm-Kafka spout, which manages all the
 Kafka re-balance and fault-tolerance aspects and Kafka metadata management.
 
 Presently this consumer guarantees that during a receiver failure it will
 re-emit the exact same block with the same set of messages. Every message carries
 its partition, offset, and topic details, which can address SPARK-3146.
 
 As this low-level consumer has complete control over the Kafka offsets,
 we can implement a Trident-like feature on top of it, such as assigning a
 transaction-id to a given block, and re-emitting the same block with the same
 set of messages during a driver failure.
 
 Regards,
 Dibyendu
 
 
 On Fri, Dec 19, 2014 at 7:33 AM, Shao, Saisai saisai.s...@intel.com
 wrote:
 
 Hi all,
 
 I agree with Hari that strong exactly-once semantics are very hard to
 guarantee, especially in failure situations. From my understanding, even the
 current implementation of ReliableKafkaReceiver cannot fully guarantee
 exactly-once semantics after a failure. First is the ordering of data replayed
 from the last checkpoint, which is hard to guarantee when multiple partitions
 are involved; second is the design complexity of achieving this: you can
 refer to the Kafka Spout in Trident, where we have to dig into the very details
 of the Kafka metadata management system to achieve this, not to mention
 rebalancing and fault tolerance.
 
 Thanks
 Jerry
 
 -Original Message-
 From: Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com]
 Sent: Friday, December 19, 2014 5:57 AM
 To: Cody Koeninger
 Cc: Hari Shreedharan; Patrick Wendell; dev@spark.apache.org
 Subject: Re: Which committers care about Kafka?
 
 But idempotency is not that easy to achieve sometimes. A strong only-once
 semantic through a proper API would be super useful; but I'm not implying
 this is easy to achieve.
 On 18 Dec 2014 21:52, Cody Koeninger c...@koeninger.org wrote:
 
 If the downstream store for the output data is idempotent or
 transactional, and that downstream store also is the system of record
 for kafka offsets, then you have exactly-once semantics.  Commit
 offsets with / after the data is stored.  On any failure, restart from
 the last committed offsets.
 
 Yes, this approach is biased towards the etl-like use cases rather
 than near-realtime-analytics use cases.
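
 A sketch of that pattern against a hypothetical relational store (a results table plus an offsets table in the same database), using plain JDBC; nothing here is Spark-specific:

     import java.sql.Connection

     // Write the batch and the Kafka offsets it covers in one transaction.
     // On restart, read the offsets table and resume from there.
     def storeBatchWithOffsets(conn: Connection,
                               rows: Seq[(String, Long)],
                               topic: String, partition: Int, untilOffset: Long): Unit = {
       conn.setAutoCommit(false)
       try {
         val insert = conn.prepareStatement("INSERT INTO results(k, v) VALUES (?, ?)")
         rows.foreach { case (k, v) =>
           insert.setString(1, k); insert.setLong(2, v); insert.addBatch()
         }
         insert.executeBatch()

         // Offsets live next to the data, so they become visible together or not at all.
         val upsert = conn.prepareStatement(
           "UPDATE offsets SET until_offset = ? WHERE topic = ? AND part = ?")
         upsert.setLong(1, untilOffset); upsert.setString(2, topic); upsert.setInt(3, partition)
         upsert.executeUpdate()

         conn.commit()
       } catch {
         case e: Exception => conn.rollback(); throw e
       }
     }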
 
 On Thu, Dec 18, 2014 at 

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-13 Thread Sean McNamara
+1 tested on OS X and deployed+tested our apps via YARN into our staging 
cluster.

Sean


 On Dec 11, 2014, at 10:40 AM, Reynold Xin r...@databricks.com wrote:
 
 +1
 
 Tested on OS X.
 
 On Wednesday, December 10, 2014, Patrick Wendell pwend...@gmail.com wrote:
 
 Please vote on releasing the following candidate as Apache Spark version
 1.2.0!
 
 The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.0-rc2/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1055/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
 
 Please vote on releasing this package as Apache Spark 1.2.0!
 
 The vote is open until Saturday, December 13, at 21:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.2.0
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 == What justifies a -1 vote for this release? ==
 This vote is happening relatively late into the QA period, so
 -1 votes should only occur for significant regressions from
 1.0.2. Bugs already present in 1.1.X, minor
 regressions, or bugs related to new features will not block this
 release.
 
 == What default changes should I be aware of? ==
 1. The default value of spark.shuffle.blockTransferService has been
 changed to netty
 -- Old behavior can be restored by switching to nio
 
 2. The default value of spark.shuffle.manager has been changed to sort.
 -- Old behavior can be restored by setting spark.shuffle.manager to
 hash.
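
  For reference, a sketch of opting back into the old defaults from application code, using the keys listed above (the app name is hypothetical):

      import org.apache.spark.SparkConf

      // Sketch: restore the pre-1.2.0 shuffle defaults named above.
      val conf = new SparkConf()
        .setAppName("legacy-shuffle-settings")               // hypothetical app name
        .set("spark.shuffle.blockTransferService", "nio")    // 1.2.0 default is "netty"
        .set("spark.shuffle.manager", "hash")                // 1.2.0 default is "sort"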
 
 == How does this differ from RC1 ==
 This has fixes for a handful of issues identified - some of the
 notable fixes are:
 
 [Core]
 SPARK-4498: Standalone Master can fail to recognize completed/failed
 applications
 
 [SQL]
 SPARK-4552: Query for empty parquet table in spark sql hive get
 IllegalArgumentException
 SPARK-4753: Parquet2 does not prune based on OR filters on partition
 columns
 SPARK-4761: With JDBC server, set Kryo as default serializer and
 disable reference tracking
 SPARK-4785: When called with arguments referring column fields, PMOD
 throws NPE
 
 - Patrick
 



Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Sean McNamara
+1

Sean

On Nov 5, 2014, at 6:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Hi all,
 
 I wanted to share a discussion we've been having on the PMC list, as well as 
 call for an official vote on it on a public list. Basically, as the Spark 
 project scales up, we need to define a model to make sure there is still 
 great oversight of key components (in particular internal architecture and 
 public APIs), and to this end I've proposed implementing a maintainer model 
 for some of these components, similar to other large projects.
 
 As background on this, Spark has grown a lot since joining Apache. We've had 
 over 80 contributors/month for the past 3 months, which I believe makes us 
 the most active project in contributors/month at Apache, as well as over 500 
 patches/month. The codebase has also grown significantly, with new libraries 
 for SQL, ML, graphs and more.
 
 In this kind of large project, one common way to scale development is to 
 assign maintainers to oversee key components, where each patch to that 
 component needs to get sign-off from at least one of its maintainers. Most 
 existing large projects do this -- at Apache, some large ones with this model 
 are CloudStack (the second-most active project overall), Subversion, and 
 Kafka, and other examples include Linux and Python. This is also by-and-large 
 how Spark operates today -- most components have a de-facto maintainer.
 
 IMO, adopting this model would have two benefits:
 
 1) Consistent oversight of design for that component, especially regarding 
 architecture and API. This process would ensure that the component's 
 maintainers see all proposed changes and consider them to fit together in a 
 good way.
 
 2) More structure for new contributors and committers -- in particular, it 
 would be easy to look up who’s responsible for each module and ask them for 
 reviews, etc, rather than having patches slip between the cracks.
 
  We'd like to start with this in a light-weight manner, where the model only 
 applies to certain key components (e.g. scheduler, shuffle) and user-facing 
 APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it 
 if we deem it useful. The specific mechanics would be as follows:
 
 - Some components in Spark will have maintainers assigned to them, where one 
 of the maintainers needs to sign off on each patch to the component.
 - Each component with maintainers will have at least 2 maintainers.
 - Maintainers will be assigned from the most active and knowledgeable 
 committers on that component by the PMC. The PMC can vote to add / remove 
 maintainers, and maintained components, through consensus.
 - Maintainers are expected to be active in responding to patches for their 
 components, though they do not need to be the main reviewers for them (e.g. 
 they might just sign off on architecture / API). To prevent inactive 
 maintainers from blocking the project, if a maintainer isn't responding in a 
 reasonable time period (say 2 weeks), other committers can merge the patch, 
 and the PMC will want to discuss adding another maintainer.
 
 If you'd like to see examples for this model, check out the following 
 projects:
  - CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
  - Subversion: https://subversion.apache.org/docs/community-guide/roles.html
 
 Finally, I wanted to list our current proposal for initial components and 
 maintainers. It would be good to get feedback on other components we might 
 add, but please note that personnel discussions (e.g. I don't think Matei 
 should maintain *that* component) should only happen on the private list. The 
 initial components were chosen to include all public APIs and the main core 
 components, and the maintainers were chosen from the most active contributors 
 to those modules.
 
 - Spark core public API: Matei, Patrick, Reynold
 - Job scheduler: Matei, Kay, Patrick
 - Shuffle and network: Reynold, Aaron, Matei
 - Block manager: Reynold, Aaron
 - YARN: Tom, Andrew Or
 - Python: Josh, Matei
 - MLlib: Xiangrui, Matei
 - SQL: Michael, Reynold
 - Streaming: TD, Matei
 - GraphX: Ankur, Joey, Reynold
 
 I'd like to formally call a [VOTE] on this model, to last 72 hours. The 
 [VOTE] will end on Nov 8, 2014 at 6 PM PST.
 
 Matei





accumulators

2014-10-16 Thread Sean McNamara
Accumulators on the stage info page show the rolling lifetime value of 
accumulators as well as per-task values, which is handy.  I think it would be useful to 
add another field to the “Accumulators” table that also shows the total for the 
stage you are looking at (basically just a merge of the accumulators for the tasks 
in that stage).  This would be useful for any job that is iterative (e.g. 
basically every Spark Streaming job).

Does this idea make sense?


Separate but related question-  From the operational side I think it could be 
very useful to have an accumulators summary page.  For example we have a spark 
streaming job with many different stages.  It is difficult to navigate into 
each stage to pick out a trend.  An accumulators page that allowed one to 
filter by stage description and/or accumulator name would be very useful.

Thoughts?
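
For concreteness, the kind of accumulator this is about - a named accumulator in a hypothetical streaming job (ssc and stream are assumed to exist):

    // The stage page currently shows this accumulator's lifetime value and per-task deltas;
    // the proposal above would also show the merged total for the stage being viewed.
    val parsedRecords = ssc.sparkContext.accumulator(0L, "parsedRecords")

    stream.foreachRDD { rdd =>
      rdd.foreach { _ =>
        parsedRecords += 1L   // accumulates across every micro-batch of the job
      }
    }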


Thanks,

Sean



RE: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Sean McNamara
+1

From: Patrick Wendell [pwend...@gmail.com]
Sent: Saturday, August 30, 2014 4:08 PM
To: dev@spark.apache.org
Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)

Please vote on releasing the following candidate as Apache Spark version 1.1.0!

The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1030/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/

Please vote on releasing this package as Apache Spark 1.1.0!

The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== Regressions fixed since RC1 ==
- Build issue for SQL support: https://issues.apache.org/jira/browse/SPARK-3234
- EC2 script version bump to 1.1.0.

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with
previous votes, so -1 votes should only occur for significant
regressions from 1.0.2. Bugs already present in 1.0.X will not block
this release.

== What default changes should I be aware of? ==
1. The default value of spark.io.compression.codec is now snappy
-- Old behavior can be restored by switching to lzf

2. PySpark now performs external spilling during aggregations.
-- Old behavior can be restored by setting spark.shuffle.spill to false.




Re: balancing RDDs

2014-06-25 Thread Sean McNamara
Yep exactly!  I’m not sure how complicated it would be to pull off.  If someone 
wouldn’t mind helping to get me pointed in the right direction I would be happy 
to look into and contribute this functionality.  I imagine this would be 
implemented in the scheduler codebase and there would be some sort of rebalance 
configuration property to enable it possibly?

Does anyone else have any thoughts on this?

Cheers,

Sean


On Jun 24, 2014, at 4:41 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 This would be really useful. Especially for Shark where shift of
 partitioning affects all subsequent queries unless task scheduling time
 beats spark.locality.wait. Can cause overall low performance for all
 subsequent tasks.
 
 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi
 
 
 
 On Tue, Jun 24, 2014 at 4:10 AM, Sean McNamara sean.mcnam...@webtrends.com
 wrote:
 
 We have a use case where we’d like something to execute once on each node
 and I thought it would be good to ask here.
 
 Currently we achieve this by setting the parallelism to the number of
 nodes and use a mod partitioner:
 
 val balancedRdd = sc.parallelize(
     (0 until Settings.parallelism)
       .map(id => id -> Settings.settings)
   ).partitionBy(new ModPartitioner(Settings.parallelism))
    .cache()
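
 The ModPartitioner itself is not shown here; a minimal sketch consistent with the snippet above (integer keys, hypothetical reconstruction) could be:

     import org.apache.spark.Partitioner

     // Hypothetical reconstruction of the ModPartitioner referenced above:
     // routes integer key `id` to partition id % numPartitions, so with parallelism
     // equal to the node count each node ends up with exactly one partition.
     class ModPartitioner(val numPartitions: Int) extends Partitioner {
       def getPartition(key: Any): Int = key match {
         case id: Int => ((id % numPartitions) + numPartitions) % numPartitions
         case other   => throw new IllegalArgumentException(s"Unexpected key: $other")
       }
       override def equals(other: Any): Boolean = other match {
         case m: ModPartitioner => m.numPartitions == numPartitions
         case _                 => false
       }
       override def hashCode: Int = numPartitions
     }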
 
 
 This works great except in two instances where it can become unbalanced:
 
 1. if a worker is restarted or dies, the partition will move to a
 different node (one of the nodes will run two tasks).  When the worker
 rejoins, is there a way to have a partition move back over to the newly
 restarted worker so that it’s balanced again?
 
 2. drivers need to be started in a staggered fashion, otherwise one driver
 can launch two tasks on one set of workers, and the other driver will do
 the same with the other set.  Are there any scheduler/config semantics so
 that each driver will take one (and only one) core from *each* node?
 
 
 Thanks
 
 Sean
 
 
 
 
 
 
 



Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Sean McNamara
Pulled down, compiled, and tested examples on OS X and ubuntu.
Deployed app we are building on spark and poured data through it.

+1

Sean


On May 26, 2014, at 8:39 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version 
 1.0.0!
 
 This has a few important bug fixes on top of rc10:
 SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
 SPARK-1870: https://github.com/apache/spark/pull/848
 SPARK-1897: https://github.com/apache/spark/pull/849
 
 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc11/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1019/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
 
 Please vote on releasing this package as Apache Spark 1.0.0!
 
 The vote is open until Thursday, May 29, at 16:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.
 
 Changes to ML vector specification:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
 
 Changes to the Java API:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
 
 Changes to the streaming API:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
 
 Changes to the GraphX API:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
 
 Other changes:
 coGroup and related functions now return Iterable[T] instead of Seq[T]
 == Call toSeq on the result to restore the old behavior
 
 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 == Call toSeq on the result to restore old behavior