Re: UnusedStubClass in 1.3.0-rc1

2015-02-25 Thread Patrick Wendell
Hey Cody,

What build command are you using? In any case, we can actually comment
out the "unused" thing now in the root pom.xml. It existed just to
ensure that at least one dependency was listed in the shade plugin
configuration (otherwise, some work we do that requires the shade
plugin does not happen). However, now there are other things there. If
you just comment out the line in the root pom.xml adding this
dependency, does it work?

- Patrick

On Wed, Feb 25, 2015 at 7:53 AM, Cody Koeninger  wrote:
> So when building 1.3.0-rc1 I see the following warning:
>
> [WARNING] spark-streaming-kafka_2.10-1.3.0.jar, unused-1.0.0.jar define 1
> overlappping classes:
>
> [WARNING]   - org.apache.spark.unused.UnusedStubClass
>
>
> and when trying to build an assembly of a project that was previously using
> 1.3 snapshots without difficulty, I see the following errors:
>
>
> [error] (*:assembly) deduplicate: different file contents found in the
> following:
>
> [error]
> /Users/cody/.m2/repository/org/apache/spark/spark-streaming-kafka_2.10/1.3.0/spark-streaming-kafka_2.10-1.3.0.jar:org/apache/spark/unused/UnusedStubClass.class
>
> [error]
> /Users/cody/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar:org/apache/spark/unused/UnusedStubClass.class
>
>
> This persists even after a clean / rebuild of both 1.3.0-rc1 and the
> project using it.
>
>
> I can just exclude that jar in the assembly definition, but is anyone else
> seeing similar issues?  If so, might be worth resolving rather than make
> users mess with assembly exclusions.
>
> I see that this class was introduced a while ago, related to SPARK-3812 but
> the jira issue doesn't have much detail.
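
As a stopgap along the lines Cody describes, an sbt-assembly merge strategy can resolve the duplicate stub class. The fragment below is only a sketch, not an official fix: it assumes a recent sbt-assembly release whose keys (assemblyMergeStrategy, MergeStrategy, PathList) are auto-imported in build.sbt, and it simply keeps one copy of the class rather than removing the underlying dependency.

~~~
assemblyMergeStrategy in assembly := {
  // Both spark-streaming-kafka_2.10-1.3.0.jar and unused-1.0.0.jar ship
  // org/apache/spark/unused/UnusedStubClass.class; keep whichever copy comes first.
  case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") =>
    MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
~~~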




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Patrick Wendell
It's only been reported on this thread by Tom, so far.

On Mon, Feb 23, 2015 at 10:29 AM, Marcelo Vanzin  wrote:
> Hey Patrick,
>
> Do you have a link to the bug related to Python and Yarn? I looked at
> the blockers in Jira but couldn't find it.
>
> On Mon, Feb 23, 2015 at 10:18 AM, Patrick Wendell  wrote:
>> So actually, the list of blockers on JIRA is a bit outdated. These
>> days I won't cut RC1 unless there are no known issues that I'm aware
>> of that would actually block the release (that's what the snapshot
>> ones are for). I'm going to clean those up and push others to do so
>> also.
>>
>> The main issues I'm aware of that came about post RC1 are:
>> 1. Python submission broken on YARN
>> 2. The license issue in MLlib [now fixed].
>> 3. Varargs broken for Java Dataframes [now fixed]
>>
>> Re: Corey - yeah, as it stands now I try to wait if there are things
>> that look like implicit -1 votes.
>
> --
> Marcelo




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Patrick Wendell
So actually, the list of blockers on JIRA is a bit outdated. These
days I won't cut RC1 unless there are no known issues that I'm aware
of that would actually block the release (that's what the snapshot
ones are for). I'm going to clean those up and push others to do so
also.

The main issues I'm aware of that came about post RC1 are:
1. Python submission broken on YARN
2. The license issue in MLlib [now fixed].
3. Varargs broken for Java Dataframes [now fixed]

Re: Corey - yeah, as it stands now I try to wait if there are things
that look like implicit -1 votes.

On Mon, Feb 23, 2015 at 6:13 AM, Corey Nolet  wrote:
> Thanks Sean. I glossed over the comment about SPARK-5669.
>
> On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen  wrote:
>>
>> Yes my understanding from Patrick's comment is that this RC will not
>> be released, but, to keep testing. There's an implicit -1 out of the
>> gates there, I believe, and so the vote won't pass, so perhaps that's
>> why there weren't further binding votes. I'm sure that will be
>> formalized shortly.
>>
>> FWIW here are 10 issues still listed as blockers for 1.3.0:
>>
>> SPARK-5910 DataFrame.selectExpr("col as newName") does not work
>> SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java
>> SPARK-5873 Can't see partially analyzed plans
>> SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API
>> SPARK-5517 SPARK-5166 Add input types for Java UDFs
>> SPARK-5463 Fix Parquet filter push-down
>> SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3
>> SPARK-5183 SPARK-5180 Document data source API
>> SPARK-3650 Triangle Count handles reverse edges incorrectly
>> SPARK-3511 Create a RELEASE-NOTES.txt file in the repo
>>
>>
>> On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet  wrote:
>> > This vote was supposed to close on Saturday but it looks like no PMCs
>> > voted
>> > (other than the implicit vote from Patrick). Was there a discussion
>> > offline
>> > to cut an RC2? Was the vote extended?
>
>




Re: [Performance] Possible regression in rdd.take()?

2015-02-18 Thread Patrick Wendell
I believe the heuristic governing the way that take() decides to fetch
partitions changed between these versions. It could be that in certain
cases the new heuristic is worse, but it might be good to just look at
the source code and see, for your number of elements taken and number
of partitions, if there was any effective change in how aggressively
spark fetched partitions.
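
To make "how aggressively Spark fetches partitions" concrete, here is a toy model in plain Scala -- not Spark's actual code -- of the kind of heuristic take() uses: scan a first batch of partitions, and if that does not yield enough rows, grow the number of partitions scanned on each round. A different starting batch size or growth factor between releases would directly change how many partition fetches a small take() triggers.

~~~
// Toy model only: "partitions" stands in for an RDD's partitions, and the
// initial batch size and growth factor are made-up knobs for illustration.
def toyTake[T](partitions: Seq[Seq[T]], n: Int, growthFactor: Int = 4): Seq[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var scanned = 0
  var batch = 1                          // start by scanning a single partition
  while (buf.size < n && scanned < partitions.size) {
    val upTo = math.min(scanned + batch, partitions.size)
    partitions.slice(scanned, upTo).foreach(p => buf ++= p.take(n - buf.size))
    scanned = upTo
    batch *= growthFactor                // scan more partitions on the next round
  }
  buf.take(n).toSeq
}
~~~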

This was quite a while ago, but I think the change was made because in
many cases the newer code works more efficiently.

- Patrick

On Wed, Feb 18, 2015 at 4:47 PM, Matt Cheah  wrote:
> Hi everyone,
>
> Between Spark 1.0.2 and Spark 1.1.1, I have noticed that rdd.take()
> consistently has a slower execution time on the later release. I was
> wondering if anyone else has had similar observations.
>
> I have two setups where this reproduces. The first is a local test. I
> launched a spark cluster with 4 worker JVMs on my Mac, and launched a
> Spark-Shell. I retrieved the text file and immediately called rdd.take(N) on
> it, where N varied. The RDD is a plaintext CSV, 4GB in size, split over 8
> files, which ends up having 128 partitions, and a total of 8000 rows.
> The numbers I discovered between Spark 1.0.2 and Spark 1.1.1 are, with all
> numbers being in seconds:
>
> 1 items
>
> Spark 1.0.2: 0.069281, 0.012261, 0.011083
>
> Spark 1.1.1: 0.11577, 0.097636, 0.11321
>
>
> 4 items
>
> Spark 1.0.2: 0.023751, 0.069365, 0.023603
>
> Spark 1.1.1: 0.224287, 0.229651, 0.158431
>
>
> 10 items
>
> Spark 1.0.2: 0.047019, 0.049056, 0.042568
>
> Spark 1.1.1: 0.353277, 0.288965, 0.281751
>
>
> 40 items
>
> Spark 1.0.2: 0.216048, 0.198049, 0.796037
>
> Spark 1.1.1: 1.865622, 2.224424, 2.037672
>
> This small test suite indicates a consistently reproducible performance
> regression.
>
>
> I also notice this on a larger scale test. The cluster used is on EC2:
>
> ec2 instance type: m2.4xlarge
> 10 slaves, 1 master
> ephemeral storage
> 70 cores, 50 GB/box
>
> In this case, I have a 100GB dataset split into 78 files totaling 350 million
> items, and I take the first 50,000 items from the RDD. In this case, I have
> tested this on different formats of the raw data.
>
> With plaintext files:
>
> Spark 1.0.2: 0.422s, 0.363s, 0.382s
>
> Spark 1.1.1: 4.54s, 1.28s, 1.221s, 1.13s
>
>
> With snappy-compressed Avro files:
>
> Spark 1.0.2: 0.73s, 0.395s, 0.426s
>
> Spark 1.1.1: 4.618s, 1.81s, 1.158s, 1.333s
>
> Again demonstrating a reproducible performance regression.
>
> I was wondering if anyone else observed this regression, and if so, if
> anyone would have any idea what could possibly have caused it between Spark
> 1.0.2 and Spark 1.1.1?
>
> Thanks,
>
> -Matt Cheah




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-18 Thread Patrick Wendell
> UISeleniumSuite:
> *** RUN ABORTED ***
>   java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
> ...

This is a newer test suite. There is something flaky about it; we
should definitely fix it, but IMO it's not a blocker.

>
> Patrick this link gives a 404:
> https://people.apache.org/keys/committer/pwendell.asc

Works for me. Maybe it's some ephemeral issue?

> Finally, I already realized I failed to get the fix for
> https://issues.apache.org/jira/browse/SPARK-5669 correct, and that has
> to be correct for the release. I'll patch that up straight away,
> sorry. I believe the result of the intended fix is still as I
> described in SPARK-5669, so there is no bad news there. A local test
> seems to confirm it and I'm waiting on Jenkins. If it's all good I'll
> merge that fix. So, that much will need a new release, I apologize.

Thanks for finding this. I'm going to leave this open for continued testing...




Merging code into branch 1.3

2015-02-18 Thread Patrick Wendell
Hey Committers,

Now that Spark 1.3 rc1 is cut, please restrict branch-1.3 merges to
the following:

1. Fixes for issues blocking the 1.3 release (i.e. 1.2.X regressions)
2. Documentation and tests.
3. Fixes for non-blocker issues that are surgical, low-risk, and/or
outside of the core.

If there is a lower priority bug fix (a non-blocker) that requires
nontrivial code changes, do not merge it into 1.3. If something seems
borderline, feel free to reach out to me and we can work through it
together.

This is what we've done for the last few releases to make sure RCs
become progressively more stable, and it is important for helping
us cut timely releases.

Thanks!

- Patrick




[VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-18 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.3.0!

The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1069/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/

Please vote on releasing this package as Apache Spark 1.3.0!

The vote is open until Saturday, February 21, at 08:03 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== How can I help test this release? ==
If you are a Spark user, you can help us test this release by
taking a Spark 1.2 workload and running on this release candidate,
then reporting any regressions.

== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.3 QA period,
so -1 votes should only occur for significant regressions from 1.2.1.
Bugs already present in 1.2.X, minor regressions, or bugs related
to new features will not block this release.

- Patrick




Re: Replacing Jetty with TomCat

2015-02-17 Thread Patrick Wendell
Hey Niranda,

It seems to me it would be a lot of effort to support multiple server
libraries inside of Spark like this, so I'm not sure that's a great solution.

If you are building an application that embeds Spark, is it not
possible for you to continue to use Jetty for Spark's internal servers
and use Tomcat for your own servers? I would guess that many complex
applications end up embedding multiple server libraries in various
places (Spark itself has different transport mechanisms, etc.).

- Patrick

On Tue, Feb 17, 2015 at 7:14 PM, Niranda Perera
 wrote:
> Hi Sean,
> The main issue we have is running two web servers in a single product; we
> think it would not be an elegant solution.
>
> Could you please point me to the main areas where the Jetty server is tightly
> coupled, or to extension points where I could plug in Tomcat instead of Jetty?
> If successful, I could contribute it to the Spark project. :-)
>
> cheers
>
>
>
> On Mon, Feb 16, 2015 at 4:51 PM, Sean Owen  wrote:
>
>> There's no particular reason you have to remove the embedded Jetty
>> server, right? it doesn't prevent you from using it inside another app
>> that happens to run in Tomcat. You won't be able to switch it out
>> without rewriting a fair bit of code, no, but you don't need to.
>>
>> On Mon, Feb 16, 2015 at 5:08 AM, Niranda Perera
>>  wrote:
>> > Hi,
>> >
>> > We are thinking of integrating Spark server inside a product. Our current
>> > product uses Tomcat as its webserver.
>> >
>> > Is it possible to switch the Jetty webserver in Spark to Tomcat
>> > off-the-shelf?
>> >
>> > Cheers
>> >
>> > --
>> > Niranda
>>
>
>
>
> --
> Niranda




Re: How to track issues that must wait for Spark 2.x in JIRA?

2015-02-12 Thread Patrick Wendell
Yeah my preferred is also having a more open ended "2+" for issues
that are clearly desirable but blocked by compatibility concerns.

What I would really want to avoid is major feature proposals sitting
around in our JIRA and tagged under some 2.X version. IMO JIRA isn't
the place for thoughts about very-long-term things. When we get these,
I'd be inclined to either close them as "won't fix" or "later".

On Thu, Feb 12, 2015 at 12:47 AM, Reynold Xin  wrote:
> It seems to me having a version that is 2+ is good for that? Once we move
> to 2.0, we can retag those that are not going to be fixed in 2.0 as 2.0.1
> or 2.1.0 .
>
> On Thu, Feb 12, 2015 at 12:42 AM, Sean Owen  wrote:
>
>> Patrick and I were chatting about how to handle several issues which
>> clearly need a fix, and are easy, but can't be implemented until a
>> next major release like Spark 2.x since it would change APIs.
>> Examples:
>>
>> https://issues.apache.org/jira/browse/SPARK-3266
>> https://issues.apache.org/jira/browse/SPARK-3369
>> https://issues.apache.org/jira/browse/SPARK-4819
>>
>> We could simply make version 2.0.0 in JIRA. Although straightforward,
>> it might imply that release planning has begun for 2.0.0.
>>
>> The version could be called "2+" for now to better indicate its status.
>>
>> There is also a "Later" JIRA resolution. Although resolving the above
>> seems a little wrong, it might be reasonable if we're sure to revisit
>> "Later", well, at some well defined later. The three issues above risk
>> getting lost in the shuffle.
>>
>> We also wondered whether using "Later" is good or bad. It takes items
>> off the radar that aren't going to be acted on anytime soon -- and
>> there are lots of those right now. It might send a message that these
>> will be revisited when they are even less likely to if resolved.
>>
>> Any opinions?
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>




Re: driver fail-over in Spark streaming 1.2.0

2015-02-12 Thread Patrick Wendell
It will create and connect to new executors. The executors are mostly
stateless, so the program can resume with new executors.
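
The user-facing half of that recovery -- rebuilding the driver-side context from the checkpoint -- follows the StreamingContext.getOrCreate pattern from the streaming programming guide. A minimal sketch (the checkpoint path, app name, and batch interval are placeholders, and the actual DStream logic is omitted):

~~~
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("MyStreamingApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  // ... define DStreams and output operations here ...
  ssc.checkpoint("hdfs:///checkpoints/myApp")   // placeholder checkpoint directory
  ssc
}

// On a fresh start this builds a new context; after a driver failure it
// recovers the context (and job metadata) from the checkpoint directory.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/myApp", createContext _)
ssc.start()
ssc.awaitTermination()
~~~

The executor re-registration itself happens inside the cluster manager / scheduler backend code, not in anything the application has to write.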

On Wed, Feb 11, 2015 at 11:24 PM, lin  wrote:
> Hi, all
>
> In Spark Streaming 1.2.0, when the driver fails and a new driver starts
> with the most updated checkpointed data, will the former executors
> connect to the new driver, or will the new driver start its own set
> of new executors? In which piece of code is that done?
>
> Any reply will be appreciated :)
>
> regards,
>
> lin




Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-12 Thread Patrick Wendell
The map will start with a capacity of 64, but will grow to accommodate
new data. Are you using the groupBy operator in Spark or are you using
Spark SQL's group by? This usually happens if you are grouping or
aggregating in a way that doesn't sufficiently condense the data
created from each input partition.
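
As a concrete illustration of "condensing the data created from each input partition": an aggregation that combines values map-side (reduceByKey / aggregateByKey) keeps each partition's map small, while groupByKey has to buffer every value for a key. A rough sketch -- the key/value layout of `rows` is hypothetical, just to contrast the two styles:

~~~
import org.apache.spark.SparkContext._   // pair-RDD functions on Spark 1.2.x

// rows: RDD[(String, Double)] built from the HBase scan (hypothetical layout).

// Buffers all values per key before summarizing; per-key entries can grow huge.
val viaGroup = rows.groupByKey().mapValues(vs => (vs.size, vs.sum))

// Combines within each input partition first, so the in-memory map and the
// shuffle stay small: only a running (count, sum) is kept per key.
val viaAggregate = rows.aggregateByKey((0L, 0.0))(
  (acc, v) => (acc._1 + 1, acc._2 + v),      // fold one value into the partial result
  (a, b)   => (a._1 + b._1, a._2 + b._2))    // merge partial results across partitions
~~~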

- Patrick

On Wed, Feb 11, 2015 at 9:37 PM, fightf...@163.com  wrote:
> Hi,
>
> We really have not found an adequate solution for this issue yet. Any
> analysis hints or rules of thumb would be appreciated.
>
> Thanks,
> Sun.
>
> 
> fightf...@163.com
>
>
> From: fightf...@163.com
> Date: 2015-02-09 11:56
> To: user; dev
> Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for
> large data sets
> Hi,
> The problem still exists. Would any experts take a look at this?
>
> Thanks,
> Sun.
>
> 
> fightf...@163.com
>
>
> From: fightf...@163.com
> Date: 2015-02-06 17:54
> To: user; dev
> Subject: Sort Shuffle performance issues about using AppendOnlyMap for large
> data sets
> Hi, all
> Recently we hit performance issues when using Spark 1.2.0 to read
> data from HBase and do some summary work.
> Our scenario is: read large data sets from HBase (maybe a 100G+ file),
> form an HBase RDD, transform it to a SchemaRDD,
> group by and aggregate the data into a few new, smaller summary data sets,
> and load that data into HBase (Phoenix).
>
> Our major issue: aggregating the large datasets into summary data sets
> takes too long (1 hour+), while the performance
> should not be that bad. We have the dump file attached and a
> stacktrace from jstack like the following:
>
> From the stacktrace and dump file we can see that processing large
> datasets causes the AppendOnlyMap to grow frequently,
> leading to a huge map entry size. We looked at the source code of
> org.apache.spark.util.collection.AppendOnlyMap and found that
> the map is initialized with a capacity of 64. That seems too small
> for our use case.
>
> So the question is: has anyone encountered such issues before? How were
> they resolved? I cannot find any JIRA issues for such problems;
> if someone has seen them, please kindly let us know.
>
> A more specific question: is there any way for the user to
> configure the map capacity in Spark? If so, please
> tell us how to achieve that.
>
> Best Thanks,
> Sun.
>
>Thread 22432: (state = IN_JAVA)
> - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87,
> line=224 (Compiled frame; information may be imprecise)
> - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable()
> @bci=1, line=38 (Interpreted frame)
> - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22,
> line=198 (Compiled frame)
> -
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object,
> scala.Function2) @bci=201, line=145 (Compiled frame)
> -
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
> scala.Function2) @bci=3, line=32 (Compiled frame)
> -
> org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
> @bci=141, line=205 (Compiled frame)
> -
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
> @bci=74, line=58 (Interpreted frame)
> -
> org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
> @bci=169, line=68 (Interpreted frame)
> -
> org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
> @bci=2, line=41 (Interpreted frame)
> - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted
> frame)
> - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196
> (Interpreted frame)
> -
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
> @bci=95, line=1145 (Interpreted frame)
> - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615
> (Interpreted frame)
> - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
>
>
> Thread 22431: (state = IN_JAVA)
> - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87,
> line=224 (Compiled frame; information may be imprecise)
> - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable()
> @bci=1, line=38 (Interpreted frame)
> - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22,
> line=198 (Compiled frame)
> -
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object,
> scala.Function2) @bci=201, line=145 (Compiled frame)
> -
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
> scala.Function2) @bci=3, line=32 (Compiled frame)
> -
> org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
> @bci=141, line=205 (Compiled frame)
> -
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
> @bci=74, line=58 (Interp

[ANNOUNCE] Spark 1.3.0 Snapshot 1

2015-02-11 Thread Patrick Wendell
Hey All,

I've posted Spark 1.3.0 snapshot 1. At this point the 1.3 branch is
ready for community testing and we are strictly merging fixes and
documentation across all components.

The release files, including signatures, digests, etc can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-snapshot1/

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1068/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/

Please report any issues with the release to this thread and/or to our
project JIRA. Thanks!

- Patrick




Re: Powered by Spark: Concur

2015-02-10 Thread Patrick Wendell
Thanks Paolo - I've fixed it.

On Mon, Feb 9, 2015 at 11:10 PM, Paolo Platter
 wrote:
> Hi,
>
> I checked the powered by wiki too and Agile Labs should be Agile Lab. The 
> link is wrong too, it should be www.agilelab.it.
> The description is correct.
>
> Thanks a lot
>
> Paolo
>
> Sent from my Windows Phone
> 
> From: Denny Lee
> Sent: 10/02/2015 07:41
> To: Matei Zaharia
> Cc: dev@spark.apache.org
> Subject: Re: Powered by Spark: Concur
>
> Thanks Matei - much appreciated!
>
> On Mon Feb 09 2015 at 10:23:57 PM Matei Zaharia 
> wrote:
>
>> Thanks Denny; added you.
>>
>> Matei
>>
>> > On Feb 9, 2015, at 10:11 PM, Denny Lee  wrote:
>> >
>> > Forgot to add Concur to the "Powered by Spark" wiki:
>> >
>> > Concur
>> > https://www.concur.com
>> > Spark SQL, MLLib
>> > Using Spark for travel and expenses analytics and personalization
>> >
>> > Thanks!
>> > Denny
>>
>>




Re: New Metrics Sink class not packaged in spark-assembly jar

2015-02-09 Thread Patrick Wendell
Hi Judy,

If you have added source files in the sink/ source folder, they should
appear in the assembly jar when you build. One thing I noticed is that you
are looking inside the "/dist" folder. That only gets populated if you run
"make-distribution". The normal development process is just to do "mvn
package" and then look at the assembly jar that is contained in core/target.

- Patrick

On Mon, Feb 9, 2015 at 10:02 PM, Judy Nash 
wrote:

>  Hello,
>
>
>
> Working on SPARK-5708 
> - Add Slf4jSink to Spark Metrics Sink.
>
>
>
> Wrote a new Slf4jSink class (see patch attached), but the new class is not
> packaged as part of spark-assembly jar.
>
>
>
> Do I need to update build config somewhere to have this packaged?
>
>
>
> Current packaged class:
>
>
>
> Thought I must have missed something basic but can't figure out why.
>
>
>
> Thanks!
>
> Judy
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>


Re: New Metrics Sink class not packaged in spark-assembly jar

2015-02-09 Thread Patrick Wendell
Actually, to correct myself, the assembly jar is in
assembly/target/scala-2.11 (I think).

On Mon, Feb 9, 2015 at 10:42 PM, Patrick Wendell  wrote:

> Hi Judy,
>
> If you have added source files in the sink/ source folder, they should
> appear in the assembly jar when you build. One thing I noticed is that you
> are looking inside the "/dist" folder. That only gets populated if you run
> "make-distribution". The normal development process is just to do "mvn
> package" and then look at the assembly jar that is contained in core/target.
>
> - Patrick
>
> On Mon, Feb 9, 2015 at 10:02 PM, Judy Nash <
> judyn...@exchange.microsoft.com> wrote:
>
>>  Hello,
>>
>>
>>
>> Working on SPARK-5708 <https://issues.apache.org/jira/browse/SPARK-5708>
>> - Add Slf4jSink to Spark Metrics Sink.
>>
>>
>>
>> Wrote a new Slf4jSink class (see patch attached), but the new class is
>> not packaged as part of spark-assembly jar.
>>
>>
>>
>> Do I need to update build config somewhere to have this packaged?
>>
>>
>>
>> Current packaged class:
>>
>>
>>
>> Thought I must have missed something basic but can't figure out why.
>>
>>
>>
>> Thanks!
>>
>> Judy
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
>


Re: Keep or remove Debian packaging in Spark?

2015-02-09 Thread Patrick Wendell
Mark was involved in adding this code (IIRC) and has also been the
most active in maintaining it. So I'd be interested in hearing his
thoughts on that proposal. Mark - would you be okay deprecating this
and having Spark instead work with the upstream projects that focus on
packaging?

My feeling is that it's better to just have nothing than to have
something not usable out-of-the-box (which to your point, is a lot
more work).

On Mon, Feb 9, 2015 at 4:10 PM,   wrote:
> This could be something where, if the Spark community wanted to not maintain
> debs/rpms directly, the project could direct interested efforts towards
> Apache Bigtop. Right now debs/rpms of Bigtop components, as well as related
> tests, are a focus.
>
> Something that would be great is if at least one Spark committer with
> interests in config/packaging/testing could be a liaison and point of contact
> for Bigtop efforts.
>
> Right now the focus is on Bigtop 0.9, which currently includes Spark 1.2. The JIRA for
> items included in 0.9 can be found here:
>
> https://issues.apache.org/jira/browse/BIGTOP-1480
>
>
>
> -Original Message-
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Monday, February 9, 2015 3:52 PM
> To: Nicholas Chammas
> Cc: Patrick Wendell; Mark Hamstra; dev
> Subject: Re: Keep or remove Debian packaging in Spark?
>
> What about this straw man proposal: deprecate in 1.3 with some kind of 
> message in the build, and remove for 1.4? And add a pointer to any 
> third-party packaging that might provide similar functionality?
>
> On Mon, Feb 9, 2015 at 6:47 PM, Nicholas Chammas  
> wrote:
>> +1 to an "official" deprecation + redirecting users to some other
>> +project
>> that will or already is taking this on.
>>
>> Nate?
>>
>>
>>
>> On Mon Feb 09 2015 at 10:08:27 AM Patrick Wendell 
>> wrote:
>>>
>>> I have wondered whether we should sort of deprecate it more
>>> officially, since otherwise I think people have the reasonable
>>> expectation based on the current code that Spark intends to support
>>> "complete" Debian packaging as part of the upstream build. Having
>>> something that's sort-of maintained but no one is helping review and
>>> merge patches on it or make it fully functional, IMO that doesn't
>>> benefit us or our users. There are a bunch of other projects that are
>>> specifically devoted to packaging, so it seems like there is a clear
>>> separation of concerns here.
>>>
>>> On Mon, Feb 9, 2015 at 7:31 AM, Mark Hamstra
>>> 
>>> wrote:
>>> >>
>>> >> it sounds like nobody intends these to be used to actually deploy
>>> >> Spark
>>> >
>>> >
>>> > I wouldn't go quite that far.  What we have now can serve as useful
>>> > input to a deployment tool like Chef, but the user is then going to
>>> > need to add some customization or configuration within the context
>>> > of that tooling to get Spark installed just the way they want.  So
>>> > it is not so much that the current Debian packaging can't be used
>>> > as that it has never really been intended to be a completely
>>> > finished product that a newcomer could, for example, use to install
>>> > Spark completely and quickly to Ubuntu and have a fully-functional
>>> > environment in which they could then run all of the examples,
>>> > tutorials, etc.
>>> >
>>> > Getting to that level of packaging (and maintenance) is something
>>> > that I'm not sure we want to do since that is a better fit with
>>> > Bigtop and the efforts of Cloudera, Horton Works, MapR, etc. to
>>> > distribute Spark.
>>> >
>>> > On Mon, Feb 9, 2015 at 2:41 AM, Sean Owen  wrote:
>>> >
>>> >> This is a straw poll to assess whether there is support to keep
>>> >> and fix, or remove, the Debian packaging-related config in Spark.
>>> >>
>>> >> I see several oldish outstanding JIRAs relating to problems in the
>>> >> packaging:
>>> >>
>>> >> https://issues.apache.org/jira/browse/SPARK-1799
>>> >> https://issues.apache.org/jira/browse/SPARK-2614
>>> >> https://issues.apache.org/jira/browse/SPARK-3624
>>> >> https://issues.apache.org/jira/browse/SPARK-4436
>>> >> (and a similar idea about making RPMs)
>>> >> https://issues.apache.org/jira/browse/SPARK-665
>>> >>
>>> >> The original motivation seems related to Chef:
>

Re: Mail to u...@spark.apache.org failing

2015-02-09 Thread Patrick Wendell
Ah - we should update it to suggest mailing the dev@ list (and if
there is enough traffic maybe do something else).

I'm happy to add you if you can give an organization name, URL, a list
of which Spark components you are using, and a short description of
your use case.

On Mon, Feb 9, 2015 at 9:00 PM, Meethu Mathew  wrote:
> Hi,
>
> The mail id given in
> https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark seems to
> be failing. Can anyone tell me how to get added to Powered By Spark list?
>
> --
>
> Regards,
>
> *Meethu*




Re: multi-line comment style

2015-02-09 Thread Patrick Wendell
Clearly there isn't a strictly optimal commenting format (pros and
cons for both '//' and '/*'). My thought is that for consistency we should
just choose one and put it in the style guide.

On Mon, Feb 9, 2015 at 12:25 PM, Xiangrui Meng  wrote:
> Btw, I think allowing `/* ... */` without the leading `*` in lines is
> also useful. Check this line:
> https://github.com/apache/spark/pull/4259/files#diff-e9dcb3b5f3de77fc31b3aff7831110eaR55,
> where we put the R commands that can reproduce the test result. It is
> easier if we write in the following style:
>
> ~~~
> /*
>  Using the following R code to load the data and train the model using
> glmnet package.
>
>  library("glmnet")
>  data <- read.csv("path", header=FALSE, stringsAsFactors=FALSE)
>  features <- as.matrix(data.frame(as.numeric(data$V2), as.numeric(data$V3)))
>  label <- as.numeric(data$V1)
>  weights <- coef(glmnet(features, label, family="gaussian", alpha = 0,
> lambda = 0))
>  */
> ~~~
>
> So people can copy & paste the R commands directly.
>
> Xiangrui
>
> On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng  wrote:
>> I like the `/* .. */` style more. Because it is easier for IDEs to
>> recognize it as a block comment. If you press enter in the comment
>> block with the `//` style, IDEs won't add `//` for you. -Xiangrui
>>
>> On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin  wrote:
>>> We should update the style doc to reflect what we have in most places
>>> (which I think is //).
>>>
>>>
>>>
>>> On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
>>>> FWIW I like the multi-line // over /* */ from a purely style standpoint.
>>>> The Google Java style guide[1] has some comment about code formatting tools
>>>> working better with /* */ but there doesn't seem to be any strong arguments
>>>> for one over the other I can find
>>>>
>>>> Thanks
>>>> Shivaram
>>>>
>>>> [1]
>>>>
>>>> https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style
>>>>
>>>> On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell 
>>>> wrote:
>>>>
>>>> > Personally I have no opinion, but agree it would be nice to standardize.
>>>> >
>>>> > - Patrick
>>>> >
>>>> > On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen  wrote:
>>>> > > One thing Marcelo pointed out to me is that the // style does not
>>>> > > interfere with commenting out blocks of code with /* */, which is a
>>>> > > small good thing. I am also accustomed to // style for multiline, and
>>>> > > reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */ style
>>>> > > inline always looks a little funny to me.
>>>> > >
>>>> > > On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout <
>>>> kayousterh...@gmail.com>
>>>> > wrote:
>>>> > >> Hi all,
>>>> > >>
>>>> > >> The Spark Style Guide
>>>> > >> <
>>>> > https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
>>>> >
>>>> > >> says multi-line comments should formatted as:
>>>> > >>
>>>> > >> /*
>>>> > >>  * This is a
>>>> > >>  * very
>>>> > >>  * long comment.
>>>> > >>  */
>>>> > >>
>>>> > >> But in my experience, we almost always use "//" for multi-line
>>>> comments:
>>>> > >>
>>>> > >> // This is a
>>>> > >> // very
>>>> > >> // long comment.
>>>> > >>
>>>> > >> Here are some examples:
>>>> > >>
>>>> > >>- Recent commit by Reynold, king of style:
>>>> > >>
>>>> >
>>>> https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58
>>>> > >>- RDD.scala:
>>>> > >>
>>>> >
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L361
>>>> > >>- DAGScheduler.scala:
>>>> > >>
>>>> >
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L281
>>>> > >>
>>>> > >>
>>>> > >> Any objections to me updating the style guide to reflect this?  As
>>>> with
>>>> > >> other style issues, I think consistency here is helpful (and
>>>> formatting
>>>> > >> multi-line comments as "//" does nicely visually distinguish code
>>>> > comments
>>>> > >> from doc comments).
>>>> > >>
>>>> > >> -Kay
>>>> > >
>>>> > > -
>>>> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> > > For additional commands, e-mail: dev-h...@spark.apache.org
>>>> > >
>>>> >
>>>> > -
>>>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> > For additional commands, e-mail: dev-h...@spark.apache.org
>>>> >
>>>> >
>>>>




[ANNOUNCE] Apache Spark 1.2.1 Released

2015-02-09 Thread Patrick Wendell
Hi All,

I've just posted the 1.2.1 maintenance release of Apache Spark. We
recommend all 1.2.0 users upgrade to this release, as this release
includes stability fixes across all components of Spark.

- Download this release: http://spark.apache.org/downloads.html
- View the release notes:
http://spark.apache.org/releases/spark-release-1-2-1.html
- Full list of JIRA issues resolved in this release: http://s.apache.org/Mpn

Thanks to everyone who helped work on this release!

- Patrick




Re: Keep or remove Debian packaging in Spark?

2015-02-09 Thread Patrick Wendell
I have wondered whether we should sort of deprecate it more
officially, since otherwise I think people have the reasonable
expectation based on the current code that Spark intends to support
"complete" Debian packaging as part of the upstream build. Having
something that's sort-of maintained but no one is helping review and
merge patches on it or make it fully functional, IMO that doesn't
benefit us or our users. There are a bunch of other projects that are
specifically devoted to packaging, so it seems like there is a clear
separation of concerns here.

On Mon, Feb 9, 2015 at 7:31 AM, Mark Hamstra  wrote:
>>
>> it sounds like nobody intends these to be used to actually deploy Spark
>
>
> I wouldn't go quite that far.  What we have now can serve as useful input
> to a deployment tool like Chef, but the user is then going to need to add
> some customization or configuration within the context of that tooling to
> get Spark installed just the way they want.  So it is not so much that the
> current Debian packaging can't be used as that it has never really been
> intended to be a completely finished product that a newcomer could, for
> example, use to install Spark completely and quickly to Ubuntu and have a
> fully-functional environment in which they could then run all of the
> examples, tutorials, etc.
>
> Getting to that level of packaging (and maintenance) is something that I'm
> not sure we want to do since that is a better fit with Bigtop and the
> efforts of Cloudera, Horton Works, MapR, etc. to distribute Spark.
>
> On Mon, Feb 9, 2015 at 2:41 AM, Sean Owen  wrote:
>
>> This is a straw poll to assess whether there is support to keep and
>> fix, or remove, the Debian packaging-related config in Spark.
>>
>> I see several oldish outstanding JIRAs relating to problems in the
>> packaging:
>>
>> https://issues.apache.org/jira/browse/SPARK-1799
>> https://issues.apache.org/jira/browse/SPARK-2614
>> https://issues.apache.org/jira/browse/SPARK-3624
>> https://issues.apache.org/jira/browse/SPARK-4436
>> (and a similar idea about making RPMs)
>> https://issues.apache.org/jira/browse/SPARK-665
>>
>> The original motivation seems related to Chef:
>>
>>
>> https://issues.apache.org/jira/browse/SPARK-2614?focusedCommentId=14070908&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14070908
>>
>> Mark's recent comments cast some doubt on whether it is essential:
>>
>> https://github.com/apache/spark/pull/4277#issuecomment-72114226
>>
>> and in recent conversations I didn't hear dissent to the idea of removing
>> this.
>>
>> Is this still useful enough to fix up? All else equal I'd like to
>> start to walk back some of the complexity of the build, but I don't
>> know how all-else-equal it is. Certainly, it sounds like nobody
>> intends these to be used to actually deploy Spark.
>>
>> I don't doubt it's useful to someone, but can they maintain the
>> packaging logic elsewhere?
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>




Unit tests

2015-02-08 Thread Patrick Wendell
Hey All,

The tests are in a not-amazing state right now due to a few compounding factors:

1. We've merged a large volume of patches recently.
2. The load on jenkins has been relatively high, exposing races and
other behavior not seen at lower load.

For those not familiar, the main issue is flaky (non deterministic)
test failures. Right now I'm trying to prioritize keeping the
PullRequestBuilder in good shape since it will block development if it
is down.

For other tests, let's try to keep filing JIRAs when we see issues
and use the flaky-test label (see http://bit.ly/1yRif9S).

I may contact people regarding specific tests. This is a very high
priority to get in good shape. This kind of thing is no one's "fault"
but just the result of a lot of concurrent development, and everyone
needs to pitch in to get back in a good place.

- Patrick




Re: Improving metadata in Spark JIRA

2015-02-08 Thread Patrick Wendell
I think we already have a YARN component.

https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20YARN

I don't think JIRA allows it to be mandatory, but if it does, that
would be useful.

On Sat, Feb 7, 2015 at 5:08 PM, Nicholas Chammas
 wrote:
> By the way, isn't it possible to make the "Component" field mandatory when
> people open new issues? Shouldn't we do that?
>
> Btw Patrick, don't we need a YARN component? I think our JIRA components
> should roughly match the components on the PR dashboard.
>
> Nick
>
> On Fri Feb 06 2015 at 12:25:52 PM Patrick Wendell 
> wrote:
>>
>> Per Nick's suggestion I added two components:
>>
>> 1. Spark Submit
>> 2. Spark Scheduler
>>
>> I figured I would just add these since if we decide later we don't
>> want them, we can simply merge them into Spark Core.
>>
>> On Fri, Feb 6, 2015 at 11:53 AM, Nicholas Chammas
>>  wrote:
>> > Do we need some new components to be added to the JIRA project?
>> >
>> > Like:
>> >
>> >-
>> >
>> >scheduler
>> > -
>> >
>> >YARN
>> > - spark-submit
>> >- ...?
>> >
>> > Nick
>> >
>> >
>> > On Fri Feb 06 2015 at 10:50:41 AM Nicholas Chammas <
>> > nicholas.cham...@gmail.com> wrote:
>> >
>> >> +9000 on cleaning up JIRA.
>> >>
>> >> Thank you Sean for laying out some specific things to tackle. I will
>> >> assist with this.
>> >>
>> >> Regarding email, I think Sandy is right. I only get JIRA email for
>> >> issues
>> >> I'm watching.
>> >>
>> >> Nick
>> >>
>> >> On Fri Feb 06 2015 at 9:52:58 AM Sandy Ryza 
>> >> wrote:
>> >>
>> >>> JIRA updates don't go to this list, they go to
>> >>> iss...@spark.apache.org.
>> >>> I
>> >>> don't think many are signed up for that list, and those that are
>> >>> probably
>> >>> have a flood of emails anyway.
>> >>>
>> >>> So I'd definitely be in favor of any JIRA cleanup that you're up for.
>> >>>
>> >>> -Sandy
>> >>>
>> >>> On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen  wrote:
>> >>>
>> >>> > I've wasted no time in wielding the commit bit to complete a number
>> >>> > of
>> >>> > small, uncontroversial changes. I wouldn't commit anything that
>> >>> > didn't
>> >>> > already appear to have review, consensus and little risk, but please
>> >>> > let me know if anything looked a little too bold, so I can
>> >>> > calibrate.
>> >>> >
>> >>> >
>> >>> > Anyway, I'd like to continue some small house-cleaning by improving
>> >>> > the state of JIRA's metadata, in order to let it give us a little
>> >>> > clearer view on what's happening in the project:
>> >>> >
>> >>> > a. Add Component to every (open) issue that's missing one
>> >>> > b. Review all Critical / Blocker issues to de-escalate ones that
>> >>> > seem
>> >>> > obviously neither
>> >>> > c. Correct open issues that list a Fix version that has already been
>> >>> > released
>> >>> > d. Close all issues Resolved for a release that has already been
>> >>> released
>> >>> >
>> >>> > The problem with doing so is that it will create a tremendous amount
>> >>> > of email to the list, like, several hundred. It's possible to make
>> >>> > bulk changes and suppress e-mail though, which could be done for all
>> >>> > but b.
>> >>> >
>> >>> > Better to suppress the emails when making such changes? or just not
>> >>> > bother on some of these?
>> >>> >
>> >>> >
>> >>> > -
>> >>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >>> >
>> >>> >
>> >>>
>> >>




[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-08 Thread Patrick Wendell
This vote passes with 5 +1 votes (3 binding) and no 0 or -1 votes.

+1 Votes:
Krishna Sankar
Sean Owen*
Chip Senkbeil
Matei Zaharia*
Patrick Wendell*

0 Votes:
(none)

-1 Votes:
(none)

On Fri, Feb 6, 2015 at 5:12 PM, Patrick Wendell  wrote:
> I'll add a +1 as well.
>
> On Fri, Feb 6, 2015 at 2:38 PM, Matei Zaharia  wrote:
>> +1
>>
>> Tested on Mac OS X.
>>
>> Matei
>>
>>
>>> On Feb 2, 2015, at 8:57 PM, Patrick Wendell  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version 
>>> 1.2.1!
>>>
>>> The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.2.1-rc3/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1065/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/
>>>
>>> Changes from rc2:
>>> A single patch fixing a windows issue.
>>>
>>> Please vote on releasing this package as Apache Spark 1.2.1!
>>>
>>> The vote is open until Friday, February 06, at 05:00 UTC and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.2.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> For a list of fixes in this release, see http://s.apache.org/Mpn.
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>




Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-08 Thread Patrick Wendell
I'll add a +1 as well.

On Fri, Feb 6, 2015 at 2:38 PM, Matei Zaharia  wrote:
> +1
>
> Tested on Mac OS X.
>
> Matei
>
>
>> On Feb 2, 2015, at 8:57 PM, Patrick Wendell  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.2.1!
>>
>> The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.1-rc3/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1065/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/
>>
>> Changes from rc2:
>> A single patch fixing a windows issue.
>>
>> Please vote on releasing this package as Apache Spark 1.2.1!
>>
>> The vote is open until Friday, February 06, at 05:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.2.1
>> [ ] -1 Do not release this package because ...
>>
>> For a list of fixes in this release, see http://s.apache.org/Mpn.
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>




Re: Improving metadata in Spark JIRA

2015-02-06 Thread Patrick Wendell
Per Nick's suggestion I added two components:

1. Spark Submit
2. Spark Scheduler

I figured I would just add these since if we decide later we don't
want them, we can simply merge them into Spark Core.

On Fri, Feb 6, 2015 at 11:53 AM, Nicholas Chammas
 wrote:
> Do we need some new components to be added to the JIRA project?
>
> Like:
>
>-
>
>scheduler
> -
>
>YARN
> - spark-submit
>- ...?
>
> Nick
>
>
> On Fri Feb 06 2015 at 10:50:41 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> +9000 on cleaning up JIRA.
>>
>> Thank you Sean for laying out some specific things to tackle. I will
>> assist with this.
>>
>> Regarding email, I think Sandy is right. I only get JIRA email for issues
>> I'm watching.
>>
>> Nick
>>
>> On Fri Feb 06 2015 at 9:52:58 AM Sandy Ryza 
>> wrote:
>>
>>> JIRA updates don't go to this list, they go to iss...@spark.apache.org.
>>> I
>>> don't think many are signed up for that list, and those that are probably
>>> have a flood of emails anyway.
>>>
>>> So I'd definitely be in favor of any JIRA cleanup that you're up for.
>>>
>>> -Sandy
>>>
>>> On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen  wrote:
>>>
>>> > I've wasted no time in wielding the commit bit to complete a number of
>>> > small, uncontroversial changes. I wouldn't commit anything that didn't
>>> > already appear to have review, consensus and little risk, but please
>>> > let me know if anything looked a little too bold, so I can calibrate.
>>> >
>>> >
>>> > Anyway, I'd like to continue some small house-cleaning by improving
>>> > the state of JIRA's metadata, in order to let it give us a little
>>> > clearer view on what's happening in the project:
>>> >
>>> > a. Add Component to every (open) issue that's missing one
>>> > b. Review all Critical / Blocker issues to de-escalate ones that seem
>>> > obviously neither
>>> > c. Correct open issues that list a Fix version that has already been
>>> > released
>>> > d. Close all issues Resolved for a release that has already been
>>> released
>>> >
>>> > The problem with doing so is that it will create a tremendous amount
>>> > of email to the list, like, several hundred. It's possible to make
>>> > bulk changes and suppress e-mail though, which could be done for all
>>> > but b.
>>> >
>>> > Better to suppress the emails when making such changes? or just not
>>> > bother on some of these?
>>> >
>>> > -
>>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: dev-h...@spark.apache.org
>>> >
>>> >
>>>
>>




Re: PSA: Maven supports parallel builds

2015-02-05 Thread Patrick Wendell
I've done this in the past, but back when I wasn't using Zinc it
didn't make a big difference. It's worth doing this in our jenkins
environment though.

- Patrick

On Thu, Feb 5, 2015 at 4:52 PM, Dirceu Semighini Filho
 wrote:
> Thanks Nicholas, I didn't know this.
>
> 2015-02-05 22:16 GMT-02:00 Nicholas Chammas :
>
>> Y'all may already know this, but I haven't seen it mentioned anywhere in
>> our docs on here and it's a pretty easy win.
>>
>> Maven supports parallel builds
>> <
>> https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3
>> >
>> with the -T command line option.
>>
>> For example:
>>
>> ./build/mvn -T 1C -Dhadoop.version=1.2.1 -DskipTests clean package
>>
>> This will have Maven use 1 thread per core on your machine to build Spark.
>>
>> On my little MacBook air, this cuts the build time from 14 minutes to 10.5
>> minutes. A machine with more cores should see a bigger improvement.
>>
>> Note though that the docs mark this as experimental, so I wouldn't change
>> our reference build to use this. But it should be useful, for example, in
>> Jenkins or when working locally.
>>
>> Nick
>>
>>




Re: multi-line comment style

2015-02-04 Thread Patrick Wendell
Personally I have no opinion, but agree it would be nice to standardize.

- Patrick

On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen  wrote:
> One thing Marcelo pointed out to me is that the // style does not
> interfere with commenting out blocks of code with /* */, which is a
> small good thing. I am also accustomed to // style for multiline, and
> reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */ style
> inline always looks a little funny to me.
>
> On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout  
> wrote:
>> Hi all,
>>
>> The Spark Style Guide
>> 
>> says multi-line comments should formatted as:
>>
>> /*
>>  * This is a
>>  * very
>>  * long comment.
>>  */
>>
>> But in my experience, we almost always use "//" for multi-line comments:
>>
>> // This is a
>> // very
>> // long comment.
>>
>> Here are some examples:
>>
>>- Recent commit by Reynold, king of style:
>>
>> https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58
>>- RDD.scala:
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L361
>>- DAGScheduler.scala:
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L281
>>
>>
>> Any objections to me updating the style guide to reflect this?  As with
>> other style issues, I think consistency here is helpful (and formatting
>> multi-line comments as "//" does nicely visually distinguish code comments
>> from doc comments).
>>
>> -Kay
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>




Re: 1.2.1-rc3 - Avro input format for Hadoop 2 broken/fix?

2015-02-04 Thread Patrick Wendell
Hi Markus,

That won't be included in 1.2.1 most likely because the release votes
have already started, and at that point we don't hold the release
except for major regression issues from 1.2.0. However, if this goes
through we can backport it into the 1.2 branch and it will end up in a
future maintenance release, or you can just build spark from that
branch as soon as it's in there.

- Patrick

On Wed, Feb 4, 2015 at 7:30 AM, M. Dale  wrote:
> SPARK-3039 "Spark assembly for new hadoop API (hadoop 2) contains
> avro-mapred for hadoop 1 API" was reopened
> and prevents v.1.2.1-rc3 from using Avro Input format for Hadoop 2
> API/instances (it includes the hadoop1 avro-mapred library files).
>
> What are the chances of getting the fix outlined here
> (https://github.com/medale/spark/compare/apache:v1.2.1-rc3...avro-hadoop2-v1.2.1-rc2)
> included in 1.2.1? My apologies, I do not know how to generate a pull
> request against a tag version.
>
> I did add pull request https://github.com/apache/spark/pull/4315 for the
> current 1.3.0-SNAPSHOT master on this issue. Even though the 1.3.0 build already
> does not include avro-mapred in the spark assembly jar, this minor change
> improves dependency convergence.
>
> Thanks,
> Markus
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>




[ANNOUNCE] branch-1.3 has been cut

2015-02-03 Thread Patrick Wendell
Hey All,

Just wanted to announce that we've cut the 1.3 branch which will
become the 1.3 release after community testing.

There are still some features that will go in (in higher level
libraries, and some stragglers in spark core), but overall this
indicates the end of major feature development for Spark 1.3 and a
transition into testing.

Within a few days I'll cut a snapshot package release for this so that
people can begin testing.

https://git-wip-us.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-1.3

- Patrick




[VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-02 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1!

The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1065/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/

Changes from rc2:
A single patch fixing a windows issue.

Please vote on releasing this package as Apache Spark 1.2.1!

The vote is open until Friday, February 06, at 05:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn.

To learn more about Apache Spark, please see
http://spark.apache.org/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-02-02 Thread Patrick Wendell
This is cancelled in favor of RC3.

On Mon, Feb 2, 2015 at 8:50 PM, Patrick Wendell  wrote:
> The windows issue reported only affects actually running Spark on
> Windows (not job submission). However, I agree it's worth cutting a
> new RC. I'm going to cancel this vote and propose RC3 with a single
> additional patch. Let's try to vote that through so we can ship Spark
> 1.2.1.
>
> - Patrick
>
> On Sat, Jan 31, 2015 at 7:36 PM, Matei Zaharia  
> wrote:
>> This looks like a pretty serious problem, thanks! Glad people are testing on 
>> Windows.
>>
>> Matei
>>
>>> On Jan 31, 2015, at 11:57 AM, MartinWeindel  
>>> wrote:
>>>
>>> FYI: Spark 1.2.1rc2 does not work on Windows!
>>>
>>> On creating a Spark context you get following log output on my Windows
>>> machine:
>>> INFO  org.apache.spark.SparkEnv:59 - Registering BlockManagerMaster
>>> ERROR org.apache.spark.util.Utils:75 - Failed to create local root dir in
>>> C:\Users\mweindel\AppData\Local\Temp\. Ignoring this directory.
>>> ERROR org.apache.spark.storage.DiskBlockManager:75 - Failed to create any
>>> local dir.
>>>
>>> I have already located the cause. A newly added function chmod700() in
>>> org.apache.util.Utils uses functionality which only works on a Unix file
>>> system.
>>>
>>> See also pull request [https://github.com/apache/spark/pull/4299] for my
>>> suggestion how to resolve the issue.
>>>
>>> Best regards,
>>>
>>> Martin Weindel
>>>
>>>
>>>
>>> --
>>> View this message in context: 
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC2-tp10317p10370.html
>>> Sent from the Apache Spark Developers List mailing list archive at 
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-02-02 Thread Patrick Wendell
The windows issue reported only affects actually running Spark on
Windows (not job submission). However, I agree it's worth cutting a
new RC. I'm going to cancel this vote and propose RC3 with a single
additional patch. Let's try to vote that through so we can ship Spark
1.2.1.

- Patrick

On Sat, Jan 31, 2015 at 7:36 PM, Matei Zaharia  wrote:
> This looks like a pretty serious problem, thanks! Glad people are testing on 
> Windows.
>
> Matei
>
>> On Jan 31, 2015, at 11:57 AM, MartinWeindel  wrote:
>>
>> FYI: Spark 1.2.1rc2 does not work on Windows!
>>
>> On creating a Spark context you get following log output on my Windows
>> machine:
>> INFO  org.apache.spark.SparkEnv:59 - Registering BlockManagerMaster
>> ERROR org.apache.spark.util.Utils:75 - Failed to create local root dir in
>> C:\Users\mweindel\AppData\Local\Temp\. Ignoring this directory.
>> ERROR org.apache.spark.storage.DiskBlockManager:75 - Failed to create any
>> local dir.
>>
>> I have already located the cause. A newly added function chmod700() in
>> org.apache.util.Utils uses functionality which only works on a Unix file
>> system.
>>
>> See also pull request [https://github.com/apache/spark/pull/4299] for my
>> suggestion how to resolve the issue.
>>
>> Best regards,
>>
>> Martin Weindel
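
(For context, a minimal sketch of an owner-only permission helper built purely on java.io.File setters, which avoids POSIX-only filesystem calls; this illustrates the general approach, not the contents of the pull request above:)

    import java.io.File

    object PortablePermissions {
      // Restrict a file or directory to its owner using only java.io.File,
      // which exists on every platform; each setter returns false rather than
      // throwing when the underlying filesystem cannot honor the change.
      def chmod700(file: File): Boolean = {
        file.setReadable(false, false) &&   // clear read for everybody
        file.setReadable(true, true) &&     // re-grant read to the owner only
        file.setWritable(false, false) &&
        file.setWritable(true, true) &&
        file.setExecutable(false, false) &&
        file.setExecutable(true, true)
      }
    }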
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC2-tp10317p10370.html
>> Sent from the Apache Spark Developers List mailing list archive at 
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Temporary jenkins issue

2015-02-02 Thread Patrick Wendell
Hey All,

I made a change to the Jenkins configuration that caused most builds
to fail (attempting to enable a new plugin), I've reverted the change
effective about 10 minutes ago.

If you've seen recent build failures like below, this was caused by
that change. Sorry about that.


ERROR: Publisher
com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver
aborted due to exception
java.lang.NoSuchMethodError:
hudson.model.AbstractBuild.getTestResultAction()Lhudson/tasks/test/AbstractTestResultAction;
at 
com.google.jenkins.flakyTestHandler.plugin.FlakyTestResultAction.<init>(FlakyTestResultAction.java:78)
at 
com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver.perform(JUnitFlakyResultArchiver.java:89)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:734)
at hudson.model.Build$BuildExecution.post2(Build.java:183)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:683)
at hudson.model.Run.execute(Run.java:1784)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:89)
at hudson.model.Executor.run(Executor.java:240)


- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark Master Maven with YARN build is broken

2015-02-02 Thread Patrick Wendell
It's my fault, I'm sending a hot fix now.

On Mon, Feb 2, 2015 at 1:44 PM, Nicholas Chammas
 wrote:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/
>
> Is this is a known issue? It seems to have been broken since last night.
>
> Here's a snippet from the build output of one of the builds:
>
> [error] bad symbolic reference. A signature in WebUI.class refers to
> term eclipse
> [error] in package org which is not available.
> [error] It may be completely missing from the current classpath, or
> the version on
> [error] the classpath might be incompatible with the version used when
> compiling WebUI.class.
> [error] bad symbolic reference. A signature in WebUI.class refers to term 
> jetty
> [error] in value org.eclipse which is not available.
> [error] It may be completely missing from the current classpath, or
> the version on
> [error] the classpath might be incompatible with the version used when
> compiling WebUI.class.
> [error]
> [error]  while compiling:
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/centos/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
> [error] during phase: erasure
> [error]  library version: version 2.10.4
> [error] compiler version: version 2.10.4
>
> Nick
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Questions about Spark standalone resource scheduler

2015-02-02 Thread Patrick Wendell
Hey Jerry,

I think standalone mode will still add more features over time, but
the goal isn't really for it to become equivalent to what Mesos/YARN
are today. Or at least, I doubt Spark Standalone will ever attempt to
manage _other_ frameworks outside of Spark and become a general
purpose resource manager.

In terms of having better support for multi tenancy, meaning multiple
*Spark* instances, this is something I think could be in scope in the
future. For instance, we added H/A to the standalone scheduler a while
back, because it let us support H/A streaming apps in a totally native
way. It's a trade off of adding new features and keeping the scheduler
very simple and easy to use. We've tended to bias towards simplicity
as the main goal, since this is something we want to be really easy
"out of the box".

One thing to point out, a lot of people use the standalone mode with
some coarser grained scheduler, such as running in a cloud service. In
this case they really just want a simple "inner" cluster manager. This
may even be the majority of all Spark installations. This is slightly
different than Hadoop environments, where they might just want nice
integration into the existing Hadoop stack via something like YARN.

- Patrick

On Mon, Feb 2, 2015 at 12:24 AM, Shao, Saisai  wrote:
> Hi all,
>
>
>
> I have some questions about the future development of Spark's standalone
> resource scheduler. We've heard some users have the requirements to have
> multi-tenant support in standalone mode, like multi-user management,
> resource management and isolation, whitelist of users. Seems current Spark
> standalone do not support such kind of functionalities, while resource
> schedulers like Yarn offers such kind of advanced managements, I'm not sure
> what's the future target of standalone resource scheduler, will it only
> target on simple implementation, and for advanced usage shift to YARN? Or
> will it plan to add some simple multi-tenant related functionalities?
>
>
>
> Thanks a lot for your comments.
>
>
>
> BR
>
> Jerry

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: spark akka fork : is the source anywhere?

2015-01-28 Thread Patrick Wendell
It's maintained here:

https://github.com/pwendell/akka/tree/2.2.3-shaded-proto

Over time, this is something that would be great to get rid of, per rxin

On Wed, Jan 28, 2015 at 3:33 PM, Reynold Xin  wrote:
> Hopefully problems like this will go away entirely in the next couple of
> releases. https://issues.apache.org/jira/browse/SPARK-5293
>
>
>
> On Wed, Jan 28, 2015 at 3:12 PM, jay vyas 
> wrote:
>
>> Hi spark. Where is akka coming from in spark ?
>>
>> I see the distribution referenced is a spark artifact... but not in the
>> apache namespace.
>>
>>  <groupId>org.spark-project.akka</groupId>
>>  <version>2.3.4-spark</version>
>>
>> Clearly this is a deliberate thought out change (See SPARK-1812), but its
>> not clear where 2.3.4 spark is coming from and who is maintaining its
>> release?
>>
>> --
>> jay vyas
>>
>> PS
>>
>> I've had some conversations with will benton as well about this, and its
>> clear that some modifications to akka are needed, or else a protobuf error
>> occurs, which amounts to serialization incompatibilities; hence if one wants
>> to build spark from sources, the patched akka is required (or else, manual
>> patching needs to be done)...
>>
>> 15/01/28 22:58:10 ERROR ActorSystemImpl: Uncaught fatal error from thread
>> [sparkWorker-akka.remote.default-remote-dispatcher-6] shutting down
>> ActorSystem [sparkWorker] java.lang.VerifyError: class
>> akka.remote.WireFormats$AkkaControlMessage overrides final method
>> getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Patrick Wendell
Yes - it fixes that issue.

On Wed, Jan 28, 2015 at 2:17 AM, Aniket  wrote:
> Hi Patrick,
>
> I am wondering if this version will address issues around certain artifacts
> not getting published in 1.2 which are gating people to migrate to 1.2. One
> such issue is https://issues.apache.org/jira/browse/SPARK-5144
>
> Thanks,
> Aniket
>
> On Wed Jan 28 2015 at 15:39:43 Patrick Wendell [via Apache Spark Developers
> List]  wrote:
>
>> Minor typo in the above e-mail - the tag is named v1.2.1-rc2 (not
>> v1.2.1-rc1).
>>
>> On Wed, Jan 28, 2015 at 2:06 AM, Patrick Wendell <[hidden email]> wrote:
>>
>> > Please vote on releasing the following candidate as Apache Spark version
>> 1.2.1!
>> >
>> > The tag to be voted on is v1.2.1-rc1 (commit b77f876):
>> >
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.2.1-rc2/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1062/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
>> >
>> > Changes from rc1:
>> > This has no code changes from RC1. Only minor changes to the release
>> script.
>> >
>> > Please vote on releasing this package as Apache Spark 1.2.1!
>> >
>> > The vote is open until  Saturday, January 31, at 10:04 UTC and passes
>> > if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.2.1
>> > [ ] -1 Do not release this package because ...
>> >
>> > For a list of fixes in this release, see http://s.apache.org/Mpn.
>> >
>> > To learn more about Apache Spark, please see
>> > http://spark.apache.org/
>>
>> -
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>
>> --
>>  If you reply to this email, your message will be added to the discussion
>> below:
>>
>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC2-tp10317p10318.html
>>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC2-tp10317p10320.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Patrick Wendell
Minor typo in the above e-mail - the tag is named v1.2.1-rc2 (not v1.2.1-rc1).

On Wed, Jan 28, 2015 at 2:06 AM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.2.1!
>
> The tag to be voted on is v1.2.1-rc1 (commit b77f876):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc2/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1062/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
>
> Changes from rc1:
> This has no code changes from RC1. Only minor changes to the release script.
>
> Please vote on releasing this package as Apache Spark 1.2.1!
>
> The vote is open until  Saturday, January 31, at 10:04 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.1
> [ ] -1 Do not release this package because ...
>
> For a list of fixes in this release, see http://s.apache.org/Mpn.
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1!

The tag to be voted on is v1.2.1-rc1 (commit b77f876):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1062/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/

Changes from rc1:
This has no code changes from RC1. Only minor changes to the release script.

Please vote on releasing this package as Apache Spark 1.2.1!

The vote is open until  Saturday, January 31, at 10:04 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn.

To learn more about Apache Spark, please see
http://spark.apache.org/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-28 Thread Patrick Wendell
This vote is cancelled in favor of RC2.

On Tue, Jan 27, 2015 at 4:20 PM, Reynold Xin  wrote:
> +1
>
> Tested on Mac OS X
>
> On Tue, Jan 27, 2015 at 12:35 PM, Krishna Sankar 
> wrote:
>>
>> +1
>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:55 min
>>  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
>> -Dhadoop.version=2.6.0 -Phive -DskipTests
>> 2. Tested pyspark, mllib - running as well as comparing results with 1.1.x &
>> 1.2.0
>> 2.1. statistics OK
>> 2.2. Linear/Ridge/Lasso Regression OK
>> 2.3. Decision Tree, Naive Bayes OK
>> 2.4. KMeans OK
>>Center And Scale OK
>>Fixed : org.apache.spark.SparkException in zip !
>> 2.5. rdd operations OK
>>State of the Union Texts - MapReduce, Filter,sortByKey (word count)
>> 2.6. recommendation OK
>>
>> Cheers
>> 
>>
>> On Mon, Jan 26, 2015 at 11:02 PM, Patrick Wendell 
>> wrote:
>>
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.2.1!
>> >
>> > The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
>> >
>> >
>> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.2.1-rc1/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1061/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
>> >
>> > Please vote on releasing this package as Apache Spark 1.2.1!
>> >
>> > The vote is open until Friday, January 30, at 07:00 UTC and passes
>> > if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.2.1
>> > [ ] -1 Do not release this package because ...
>> >
>> > For a list of fixes in this release, see http://s.apache.org/Mpn.
>> >
>> > To learn more about Apache Spark, please see
>> > http://spark.apache.org/
>> >
>> > - Patrick
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>> >
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Friendly reminder/request to help with reviews!

2015-01-27 Thread Patrick Wendell
Hey All,

Just a reminder, as always around release time we have a very large
volume of patches show up near the deadline.

One thing that can help us maximize the number of patches we get in is
to have community involvement in performing code reviews. And in
particular, doing a thorough review and signing off on a patch with
LGTM can substantially increase the odds we can merge a patch
confidently.

If you are newer to Spark, finding a single area of the codebase to
focus on can still provide a lot of value to the project in the
reviewing process.

Cheers and good luck with everyone on work for this release.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Okay - we've resolved all issues with the signatures and keys.
However, I'll leave the current vote open for a bit to solicit
additional feedback.

On Tue, Jan 27, 2015 at 10:43 AM, Sean McNamara
 wrote:
> Sounds good, that makes sense.
>
> Cheers,
>
> Sean
>
>> On Jan 27, 2015, at 11:35 AM, Patrick Wendell  wrote:
>>
>> Hey Sean,
>>
>> Right now we don't publish every 2.11 binary to avoid combinatorial
>> explosion of the number of build artifacts we publish (there are other
>> parameters such as whether hive is included, etc). We can revisit this
>> in future feature releases, but .1 releases like this are reserved for
>> bug fixes.
>>
>> - Patrick
>>
>> On Tue, Jan 27, 2015 at 10:31 AM, Sean McNamara
>>  wrote:
>>> We're using spark on scala 2.11 /w hadoop2.4.  Would it be practical / make 
>>> sense to build a bin version of spark against scala 2.11 for versions other 
>>> than just hadoop1 at this time?
>>>
>>> Cheers,
>>>
>>> Sean
>>>
>>>
>>>> On Jan 27, 2015, at 12:04 AM, Patrick Wendell  wrote:
>>>>
>>>> Please vote on releasing the following candidate as Apache Spark version 
>>>> 1.2.1!
>>>>
>>>> The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
>>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://people.apache.org/~pwendell/spark-1.2.1-rc1/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1061/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
>>>>
>>>> Please vote on releasing this package as Apache Spark 1.2.1!
>>>>
>>>> The vote is open until Friday, January 30, at 07:00 UTC and passes
>>>> if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 1.2.1
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> For a list of fixes in this release, see http://s.apache.org/Mpn.
>>>>
>>>> To learn more about Apache Spark, please see
>>>> http://spark.apache.org/
>>>>
>>>> - Patrick
>>>>
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>>
>>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Hey Sean,

Right now we don't publish every 2.11 binary to avoid combinatorial
explosion of the number of build artifacts we publish (there are other
parameters such as whether hive is included, etc). We can revisit this
in future feature releases, but .1 releases like this are reserved for
bug fixes.

- Patrick

On Tue, Jan 27, 2015 at 10:31 AM, Sean McNamara
 wrote:
> We're using spark on scala 2.11 /w hadoop2.4.  Would it be practical / make 
> sense to build a bin version of spark against scala 2.11 for versions other 
> than just hadoop1 at this time?
>
> Cheers,
>
> Sean
>
>
>> On Jan 27, 2015, at 12:04 AM, Patrick Wendell  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.2.1!
>>
>> The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.1-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1061/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.2.1!
>>
>> The vote is open until Friday, January 30, at 07:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.2.1
>> [ ] -1 Do not release this package because ...
>>
>> For a list of fixes in this release, see http://s.apache.org/Mpn.
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Yes - the key issue is just due to me creating new keys this time
around. Anyways let's take another stab at this. In the mean time,
please don't hesitate to test the release itself.

- Patrick

On Tue, Jan 27, 2015 at 10:00 AM, Sean Owen  wrote:
> Got it. Ignore the SHA512 issue since these aren't somehow expected by
> a policy or Maven to be in a certain format. Just wondered if the
> difference was intended.
>
> The Maven way of generating the SHA1 hashes is to set this on the
> install plugin, AFAIK, although I'm not sure if the intent was to hash
> files that Maven didn't create:
>
> <configuration>
>   <createChecksum>true</createChecksum>
> </configuration>
>
> As for the key issue, I think it's just a matter of uploading the new
> key in both places.
>
> We should all of course test the release anyway.
>
> On Tue, Jan 27, 2015 at 5:55 PM, Patrick Wendell  wrote:
>> Hey Sean,
>>
>> The release script generates hashes in two places (take a look a bit
>> further down in the script), one for the published artifacts and the
>> other for the binaries. In the case of the binaries we use SHA512
>> because, AFAIK, the ASF does not require you to use SHA1 and SHA512 is
>> better. In the case of the published Maven artifacts we use SHA1
>> because my understanding is this is what Maven requires. However, it
>> does appear that the format is now one that maven cannot parse.
>>
>> Anyways, it seems fine to just change the format of the hash per your PR.
>>
>> - Patrick
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Hey Sean,

The release script generates hashes in two places (take a look a bit
further down in the script), one for the published artifacts and the
other for the binaries. In the case of the binaries we use SHA512
because, AFAIK, the ASF does not require you to use SHA1 and SHA512 is
better. In the case of the published Maven artifacts we use SHA1
because my understanding is this is what Maven requires. However, it
does appear that the format is now one that maven cannot parse.

Anyways, it seems fine to just change the format of the hash per your PR.

- Patrick
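
(As an aside, a bare hex digest in the layout Maven-oriented tools expect -- just the digest, no filename or spaces -- can be produced with nothing but the JDK; a rough Scala sketch, not the actual release script:)

    import java.nio.file.{Files, Paths}
    import java.security.MessageDigest

    object BareSha1 {
      // Writes <artifact>.sha1 containing only the lowercase hex digest,
      // with no filename and no trailing whitespace.
      def write(artifactPath: String): Unit = {
        val bytes  = Files.readAllBytes(Paths.get(artifactPath))
        val digest = MessageDigest.getInstance("SHA-1").digest(bytes)
        val hex    = digest.map(b => f"$b%02x").mkString
        Files.write(Paths.get(artifactPath + ".sha1"), hex.getBytes("UTF-8"))
      }
    }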

On Tue, Jan 27, 2015 at 5:00 AM, Sean Owen  wrote:
> I think there are several signing / hash issues that should be fixed
> before this release.
>
> Hashes:
>
> http://issues.apache.org/jira/browse/SPARK-5308
> https://github.com/apache/spark/pull/4161
>
> The hashes here are correct, but have two issues:
>
> As noted in the JIRA, the format of the hash file is "nonstandard" --
> at least, doesn't match what Maven outputs, and apparently which tools
> like Leiningen expect, which is just the hash with no file name or
> spaces. There are two ways to fix that: different command-line tools
> (see PR), or, just ask Maven to generate these hashes (a different,
> easy PR).
>
> However, is the script I modified above used to generate these hashes?
> It's generating SHA1 sums, but the output in this release candidate
> has (correct) SHA512 sums.
>
> This may be more than a nuisance, since last time for some reason
> Maven Central did not register the project hashes.
>
> http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-core_2.10%7C1.2.0%7Cjar
> does not show them but they exist:
> http://www.us.apache.org/dist/spark/spark-1.2.0/
>
> It may add up to a problem worth rooting out before this release.
>
>
> Signing:
>
> As noted in https://issues.apache.org/jira/browse/SPARK-5299 there are
> two signing keys in
> https://people.apache.org/keys/committer/pwendell.asc (9E4FE3AF,
> 00799F7E) but only one is in http://www.apache.org/dist/spark/KEYS
>
> However, these artifacts seem to be signed by FC8ED089 which isn't in either.
>
> Details details, but I'd say non-binding -1 at the moment.
>
>
> On Tue, Jan 27, 2015 at 7:02 AM, Patrick Wendell  wrote:
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.2.1!
>>
>> The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.1-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1061/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.2.1!
>>
>> The vote is open until Friday, January 30, at 07:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.2.1
>> [ ] -1 Do not release this package because ...
>>
>> For a list of fixes in this release, see http://s.apache.org/Mpn.
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-26 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1!

The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1061/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/

Please vote on releasing this package as Apache Spark 1.2.1!

The vote is open until Friday, January 30, at 07:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn.

To learn more about Apache Spark, please see
http://spark.apache.org/

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Patrick Wendell
One thing potentially not clear from this e-mail: there will be a 1:1
correspondence, so you can get an RDD to/from a DataFrame.

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin  wrote:
> Hi,
>
> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> get the community's opinion.
>
> The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API. We also expect more and
> more users to be programming directly against SchemaRDD API rather than the
> core RDD API. SchemaRDD, through its less commonly used DSL originally
> designed for writing test cases, has always had a data-frame-like API. In
> 1.3, we are redesigning that API to make it usable for end users.
>
>
> There are two motivations for the renaming:
>
> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>
> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> though it would contain some RDD functions like map, flatMap, etc), and
> calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
> DataFrame.rdd will return the underlying RDD for all RDD methods.
>
>
> My understanding is that very few users program directly against the
> SchemaRDD API at the moment, because it is not well documented. However,
> to maintain backward compatibility, we can create a type alias DataFrame
> that is still named SchemaRDD. This will maintain source compatibility for
> Scala. That said, we will have to update all existing materials to use
> DataFrame rather than SchemaRDD.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Upcoming Spark 1.2.1 RC

2015-01-21 Thread Patrick Wendell
Hey All,

I am planning to cut a 1.2.1 RC soon and wanted to notify people.

There are a handful of important fixes in the 1.2.1 branch
(http://s.apache.org/Mpn) particularly for Spark SQL. There was also
an issue publishing some of our artifacts with 1.2.0 and this release
would fix it for downstream projects.

You can track outstanding 1.2.1 blocker issues here at
http://s.apache.org/2v2 - I'm guessing all remaining blocker issues
will be fixed today.

I think we have a good handle on the remaining outstanding fixes, but
please let me know if you think there are severe outstanding fixes
that need to be backported into this branch or are not tracked above.

Thanks!
- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Standardized Spark dev environment

2015-01-21 Thread Patrick Wendell
Yep,

I think it's only useful (and likely to be maintained) if we actually
use this on Jenkins. So that was my proposal. Basically give people a
docker file so they can understand exactly what versions of everything
we use for our reference build. And if they don't want to use docker
directly, this will at least serve as an up-to-date list of
packages/versions they should try to install locally in whatever
environment they have.

- Patrick

On Wed, Jan 21, 2015 at 5:42 AM, Will Benton  wrote:
> - Original Message -----
>> From: "Patrick Wendell" 
>> To: "Sean Owen" 
>> Cc: "dev" , "jay vyas" , 
>> "Paolo Platter"
>> , "Nicholas Chammas" 
>> , "Will Benton" 
>> Sent: Wednesday, January 21, 2015 2:09:35 AM
>> Subject: Re: Standardized Spark dev environment
>
>> But the issue is when users can't reproduce Jenkins failures.
>
> Yeah, to answer Sean's question, this was part of the problem I was trying to 
> solve.  The other part was teasing out differences between the Fedora Java 
> environment and a more conventional Java environment.  I agree with Sean (and 
> I think this is your suggestion as well, Patrick) that making the environment 
> Jenkins runs a standard image that is available for public consumption would 
> be useful in general.
>
>
>
> best,
> wb

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Standardized Spark dev environment

2015-01-21 Thread Patrick Wendell
> If the goal is a reproducible test environment then I think that is what
> Jenkins is. Granted you can only ask it for a test. But presumably you get
> the same result if you start from the same VM image as Jenkins and run the
> same steps.

But the issue is when users can't reproduce Jenkins failures. We don't
publish anywhere what the exact set of packages and versions is that
is installed on Jenkins. And it can change since it's a shared
infrastructure with other projects. So why not publish this manifest
as a docker file and then have it run on jenkins using that image? My
point is that this "VM image + steps" is not public anywhere.

> I bet it is not hard to set up and maintain. I bet it is easier than a VM.
> But unless Jenkins is using it aren't we just making another different
> standard build env in an effort to standardize? If it is not the same then
> it loses value as being exactly the same as the reference build env. Has a
> problem come up that this solves?

Right now the reference build env is an AMI I created and keep adding
stuff to when Spark gets new dependencies (e.g. the version of ruby we
need to create the docs, new python stats libraries, etc). So if we
had a docker image, then I would use that for making the RC's as well
and it could serve as a definitive reference for people who want to
understand exactly what set of things they need to build Spark.

>
> If the goal is just easing developer set up then what does a Docker image do
> - what does it set up for me? I don't know of stuff I need set up on OS X
> for me beyond the IDE.

There are actually a good number of packages you need to do a full
build of Spark including a compliant python version, Java version,
certain python packages, ruby and jekyll stuff for the docs, etc
(mentioned a bit earlier).

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Standardized Spark dev environment

2015-01-20 Thread Patrick Wendell
To respond to the original suggestion by Nick. I always thought it
would be useful to have a Docker image on which we run the tests and
build releases, so that we could have a consistent environment that
other packagers or people trying to exhaustively run Spark tests could
replicate (or at least look at) to understand exactly how we recommend
building Spark. Sean - do you think that is too high of overhead?

In terms of providing images that we encourage as standard deployment
images of Spark and want to make portable across environments, that's
a much larger project and one with higher associated maintenance
overhead. So I'd be interested in seeing that evolve as its own
project (spark-deploy) or something associated with bigtop, etc.

- Patrick

On Tue, Jan 20, 2015 at 10:30 PM, Paolo Platter
 wrote:
> Hi all,
> I also tried the docker way and it works well.
> I suggest to look at sequenceiq/spark dockers, they are very active on that 
> field.
>
> Paolo
>
> Sent from my Windows Phone
> 
> From: jay vyas
> Sent: 21/01/2015 04:45
> To: Nicholas Chammas
> Cc: Will Benton; Spark dev list
> Subject: Re: Standardized Spark dev environment
>
> I can comment on both...  hi will and nate :)
>
> 1) Will's Dockerfile solution is the most simple, direct solution to the
> dev environment question: it's an efficient way to build and develop spark
> environments for dev/test.  It would be cool to put that Dockerfile
> (and/or maybe a shell script which uses it) in the top level of spark as
> the build entry point.  For total platform portability, you could wrap it in a
> vagrantfile to launch a lightweight vm, so that windows worked equally
> well.
>
> 2) However, since nate mentioned  vagrant and bigtop, i have to chime in :)
> the vagrant recipes in bigtop are a nice reference deployment of how to
> deploy spark in a heterogeneous hadoop-style environment, and tighter
> integration testing w/ bigtop for spark releases would be lovely!  The
> vagrant stuff uses puppet to deploy an n-node VM or docker-based cluster, in
> which users can easily select components (including
> spark, yarn, hbase, hadoop, etc...) by simply editing a YAML file:
> https://github.com/apache/bigtop/blob/master/bigtop-deploy/vm/vagrant-puppet/vagrantconfig.yaml
> As nate said, it would be a lot of fun to get more cross collaboration
> between the spark and bigtop communities.  Input on how we can better
> integrate spark (whether it's spork, hbase integration, smoke tests around
> the mllib stuff, or whatever) is always welcome
>
>
>
>
>
>
> On Tue, Jan 20, 2015 at 10:21 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> How many profiles (hadoop / hive /scala) would this development environment
>> support ?
>>
>> As many as we want. We probably want to cover a good chunk of the build
>> matrix  that Spark
>> officially supports.
>>
>> What does this provide, concretely?
>>
>> It provides a reliable way to create a "good" Spark development
>> environment. Roughly speaking, this probably should mean an environment
>> that matches Jenkins, since that's where we run "official" testing and
>> builds.
>>
>> For example, Spark has to run on Java 6 and Python 2.6. When devs build and
>> run Spark locally, we can make sure they're doing it on these versions of
>> the languages with a simple vagrant up.
>>
>> Nate, could you comment on how something like this would relate to the
>> Bigtop effort?
>>
>> http://chapeau.freevariable.com/2014/08/jvm-test-docker.html
>>
>> Will, that's pretty sweet. I tried something similar a few months ago as an
>> experiment to try building/testing Spark within a container. Here's the
>> shell script I used
>> against the base CentOS Docker image to setup an environment ready to build
>> and test Spark.
>>
>> We want to run Spark unit tests within containers on Jenkins, so it might
>> make sense to develop a single Docker image that can be used as both a "dev
>> environment" as well as execution container on Jenkins.
>>
>> Perhaps that's the approach to take instead of looking into Vagrant.
>>
>> Nick
>>
>> On Tue Jan 20 2015 at 8:22:41 PM Will Benton  wrote:
>>
>> Hey Nick,
>> >
>> > I did something similar with a Docker image last summer; I haven't
>> updated
>> > the images to cache the dependencies for the current Spark master, but it
>> > would be trivial to do so:
>> >
>> > http://chapeau.freevariable.com/2014/08/jvm-test-docker.html
>> >
>> >
>> > best,
>> > wb
>> >
>> >
>> > - Original Message -
>> > > From: "Nicholas Chammas" 
>> > > To: "Spark dev list" 
>> > > Sent: Tuesday, January 20, 2015 6:13:31 PM
>> > > Subject: Standardized Spark dev environment
>> > >
>> > > What do y'all think of creating a standardized Spark dev

Re: Semantics of LGTM

2015-01-19 Thread Patrick Wendell
The wiki does not seem to be operational ATM, but I will do this when
it is back up.

On Mon, Jan 19, 2015 at 12:00 PM, Patrick Wendell  wrote:
> Okay - so given all this I was going to put the following on the wiki
> tentatively:
>
> ## Reviewing Code
> Community code review is Spark's fundamental quality assurance
> process. When reviewing a patch, your goal should be to help
> streamline the committing process by giving committers confidence this
> patch has been verified by an additional party. It's encouraged to
> (politely) submit technical feedback to the author to identify areas
> for improvement or potential bugs.
>
> If you feel a patch is ready for inclusion in Spark, indicate this to
> committers with a comment: "I think this patch looks good". Spark uses
> the LGTM convention for indicating the highest level of technical
> sign-off on a patch: simply comment with the word "LGTM". An LGTM is a
> strong statement; it should be interpreted as follows: "I've
> looked at this thoroughly and take as much ownership as if I wrote the
> patch myself". If you comment LGTM you will be expected to help with
> bugs or follow-up issues on the patch. Judicious use of LGTM's is a
> great way to gain credibility as a reviewer with the broader
> community.
>
> It's also welcome for reviewers to argue against the inclusion of a
> feature or patch. Simply indicate this in the comments.
>
> - Patrick
>
> On Mon, Jan 19, 2015 at 2:40 AM, Prashant Sharma  wrote:
> Patrick's original proposal LGTM :).  However until now, I have been under the
> impression of LGTM with special emphasis on the TM part. That said, I will be
> okay/happy (or responsible) for the patch, if it goes in.
>>
>> Prashant Sharma
>>
>>
>>
>> On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin  wrote:
>>>
>>> Maybe just to avoid LGTM as a single token when it is not actually
>>> according to Patrick's definition, but anybody can still leave comments
>>> like:
>>>
>>> "The direction of the PR looks good to me." or "+1 on the direction"
>>>
>>> "The build part looks good to me"
>>>
>>> ...
>>>
>>>
>>> On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout 
>>> wrote:
>>>
>>> > +1 to Patrick's proposal of strong LGTM semantics.  On past projects,
>>> > I've
>>> > heard the semantics of "LGTM" expressed as "I've looked at this
>>> > thoroughly
>>> > and take as much ownership as if I wrote the patch myself".  My
>>> > understanding is that this is the level of review we expect for all
>>> > patches
>>> > that ultimately go into Spark, so it's important to have a way to
>>> > concisely
>>> > describe when this has been done.
>>> >
>>> > Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
>>> > cases I've seen, if someone else says "I looked at this very quickly and
>>> > didn't see any glaring problems", it doesn't add any value for
>>> > subsequent
>>> > reviewers (someone still needs to take a thorough look).
>>> >
>>> > -Kay
>>> >
>>> > On Sat, Jan 17, 2015 at 8:04 PM,  wrote:
>>> >
>>> > > Yeah, the ASF +1 has become partly overloaded to mean both "I would
>>> > > like
>>> > > to see this feature" and "this patch should be committed", although,
>>> > > at
>>> > > least in Hadoop, using +1 on JIRA (as opposed to, say, in a release
>>> > > vote)
>>> > > should unambiguously mean the latter unless qualified in some other
>>> > > way.
>>> > >
>>> > > I don't have any opinion on the specific characters, but I agree with
>>> > > Aaron that it would be nice to have some sort of abbreviation for both
>>> > the
>>> > > strong and weak forms of approval.
>>> > >
>>> > > -Sandy
>>> > >
>>> > > > On Jan 17, 2015, at 7:25 PM, Patrick Wendell 
>>> > wrote:
>>> > > >
>>> > > > I think the ASF +1 is *slightly* different than Google's LGTM,
>>> > > > because
>>> > > > it might convey wanting the patch/feature to be merged but not
>>> > > > necessarily saying you did a thorough review and stand behind

Re: Semantics of LGTM

2015-01-19 Thread Patrick Wendell
Okay - so given all this I was going to put the following on the wiki
tentatively:

## Reviewing Code
Community code review is Spark's fundamental quality assurance
process. When reviewing a patch, your goal should be to help
streamline the committing process by giving committers confidence this
patch has been verified by an additional party. It's encouraged to
(politely) submit technical feedback to the author to identify areas
for improvement or potential bugs.

If you feel a patch is ready for inclusion in Spark, indicate this to
committers with a comment: "I think this patch looks good". Spark uses
the LGTM convention for indicating the highest level of technical
sign-off on a patch: simply comment with the word "LGTM". An LGTM is a
strong statement; it should be interpreted as follows: "I've
looked at this thoroughly and take as much ownership as if I wrote the
patch myself". If you comment LGTM you will be expected to help with
bugs or follow-up issues on the patch. Judicious use of LGTM's is a
great way to gain credibility as a reviewer with the broader
community.

It's also welcome for reviewers to argue against the inclusion of a
feature or patch. Simply indicate this in the comments.

- Patrick

On Mon, Jan 19, 2015 at 2:40 AM, Prashant Sharma  wrote:
> Patrick's original proposal LGTM :).  However until now, I have been under the
> impression of LGTM with special emphasis on the TM part. That said, I will be
> okay/happy (or responsible) for the patch, if it goes in.
>
> Prashant Sharma
>
>
>
> On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin  wrote:
>>
>> Maybe just to avoid LGTM as a single token when it is not actually
>> according to Patrick's definition, but anybody can still leave comments
>> like:
>>
>> "The direction of the PR looks good to me." or "+1 on the direction"
>>
>> "The build part looks good to me"
>>
>> ...
>>
>>
>> On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout 
>> wrote:
>>
>> > +1 to Patrick's proposal of strong LGTM semantics.  On past projects,
>> > I've
>> > heard the semantics of "LGTM" expressed as "I've looked at this
>> > thoroughly
>> > and take as much ownership as if I wrote the patch myself".  My
>> > understanding is that this is the level of review we expect for all
>> > patches
>> > that ultimately go into Spark, so it's important to have a way to
>> > concisely
>> > describe when this has been done.
>> >
>> > Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
>> > cases I've seen, if someone else says "I looked at this very quickly and
>> > didn't see any glaring problems", it doesn't add any value for
>> > subsequent
>> > reviewers (someone still needs to take a thorough look).
>> >
>> > -Kay
>> >
>> > On Sat, Jan 17, 2015 at 8:04 PM,  wrote:
>> >
>> > > Yeah, the ASF +1 has become partly overloaded to mean both "I would
>> > > like
>> > > to see this feature" and "this patch should be committed", although,
>> > > at
>> > > least in Hadoop, using +1 on JIRA (as opposed to, say, in a release
>> > > vote)
>> > > should unambiguously mean the latter unless qualified in some other
>> > > way.
>> > >
>> > > I don't have any opinion on the specific characters, but I agree with
>> > > Aaron that it would be nice to have some sort of abbreviation for both
>> > the
>> > > strong and weak forms of approval.
>> > >
>> > > -Sandy
>> > >
>> > > > On Jan 17, 2015, at 7:25 PM, Patrick Wendell 
>> > wrote:
>> > > >
>> > > > I think the ASF +1 is *slightly* different than Google's LGTM,
>> > > > because
>> > > > it might convey wanting the patch/feature to be merged but not
>> > > > necessarily saying you did a thorough review and stand behind it's
>> > > > technical contents. For instance, I've seen people pile on +1's to
>> > > > try
>> > > > and indicate support for a feature or patch in some projects, even
>> > > > though they didn't do a thorough technical review. This +1 is
>> > > > definitely a useful mechanism.
>> > > >
>> > > > There is definitely much overlap in the meaning, though, and
>> > > > it's largely because Spark

Re: Semantics of LGTM

2015-01-17 Thread Patrick Wendell
I think the ASF +1 is *slightly* different than Google's LGTM, because
it might convey wanting the patch/feature to be merged but not
necessarily saying you did a thorough review and stand behind its
technical contents. For instance, I've seen people pile on +1's to try
and indicate support for a feature or patch in some projects, even
though they didn't do a thorough technical review. This +1 is
definitely a useful mechanism.

There is definitely much overlap in the meaning, though, and
it's largely because Spark had its own culture around reviews before
it was donated to the ASF, so there is a mix of two styles.

Nonetheless, I'd prefer to stick with the stronger LGTM semantics I
proposed originally (unlike the one Sandy proposed, e.g.). This is
what I've seen every project using the LGTM convention do (Google, and
some open source projects such as Impala) to indicate technical
sign-off.

- Patrick

On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson  wrote:
> I think I've seen something like +2 = "strong LGTM" and +1 = "weak LGTM;
> someone else should review" before. It's nice to have a shortcut which isn't
> a sentence when talking about weaker forms of LGTM.
>
> On Sat, Jan 17, 2015 at 6:59 PM,  wrote:
>>
>> I think clarifying these semantics is definitely worthwhile. Maybe this
>> complicates the process with additional terminology, but the way I've used
>> these has been:
>>
>> +1 - I think this is safe to merge and, barring objections from others,
>> would merge it immediately.
>>
>> LGTM - I have no concerns about this patch, but I don't necessarily feel
>> qualified to make a final call about it.  The TM part acknowledges the
>> judgment as a little more subjective.
>>
>> I think having some concise way to express both of these is useful.
>>
>> -Sandy
>>
>> > On Jan 17, 2015, at 5:40 PM, Patrick Wendell  wrote:
>> >
>> > Hey All,
>> >
>> > Just wanted to ping about a minor issue - but one that ends up having
>> > consequence given Spark's volume of reviews and commits. As much as
>> > possible, I think that we should try and gear towards "Google Style"
>> > LGTM on reviews. What I mean by this is that LGTM has the following
>> > semantics:
>> >
>> > "I know this code well, or I've looked at it close enough to feel
>> > confident it should be merged. If there are issues/bugs with this code
>> > later on, I feel confident I can help with them."
>> >
>> > Here is an alternative semantic:
>> >
>> > "Based on what I know about this part of the code, I don't see any
>> > show-stopper problems with this patch".
>> >
>> > The issue with the latter is that it ultimately erodes the
>> > significance of LGTM, since subsequent reviewers need to reason about
>> > what the person meant by saying LGTM. In contrast, having strong
>> > semantics around LGTM can help streamline reviews a lot, especially as
>> > reviewers get more experienced and gain trust from the committership.
>> >
>> > There are several easy ways to give a more limited endorsement of a
>> > patch:
>> > - "I'm not familiar with this code, but style, etc look good" (general
>> > endorsement)
>> > - "The build changes in this code LGTM, but I haven't reviewed the
>> > rest" (limited LGTM)
>> >
>> > If people are okay with this, I might add a short note on the wiki.
>> > I'm sending this e-mail first, though, to see whether anyone wants to
>> > express agreement or disagreement with this approach.
>> >
>> > - Patrick
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Semantics of LGTM

2015-01-17 Thread Patrick Wendell
Hey All,

Just wanted to ping about a minor issue - but one that ends up having
consequence given Spark's volume of reviews and commits. As much as
possible, I think that we should try and gear towards "Google Style"
LGTM on reviews. What I mean by this is that LGTM has the following
semantics:

"I know this code well, or I've looked at it close enough to feel
confident it should be merged. If there are issues/bugs with this code
later on, I feel confident I can help with them."

Here is an alternative semantic:

"Based on what I know about this part of the code, I don't see any
show-stopper problems with this patch".

The issue with the latter is that it ultimately erodes the
significance of LGTM, since subsequent reviewers need to reason about
what the person meant by saying LGTM. In contrast, having strong
semantics around LGTM can help streamline reviews a lot, especially as
reviewers get more experienced and gain trust from the committership.

There are several easy ways to give a more limited endorsement of a patch:
- "I'm not familiar with this code, but style, etc look good" (general
endorsement)
- "The build changes in this code LGTM, but I haven't reviewed the
rest" (limited LGTM)

If people are okay with this, I might add a short note on the wiki.
I'm sending this e-mail first, though, to see whether anyone wants to
express agreement or disagreement with this approach.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Bouncing Mails

2015-01-17 Thread Patrick Wendell
Akhil,

Those are handled by ASF infrastructure, not anyone in the Spark
project. So this list is not the appropriate place to ask for help.

- Patrick

On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das  wrote:
> My mails to the mailing list are getting rejected, have opened a Jira issue,
> can someone take a look at it?
>
> https://issues.apache.org/jira/browse/INFRA-9032
>
>
>
>
>
>
> Thanks
> Best Regards

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Fwd: [ NOTICE ] Service Downtime Notification - R/W git repos

2015-01-13 Thread Patrick Wendell
FYI our git repo may be down for a few hours today.
-- Forwarded message --
From: "Tony Stevenson" 
Date: Jan 13, 2015 6:49 AM
Subject: [ NOTICE ] Service Downtime Notification - R/W git repos
To:
Cc:

Folks,

Please note that on Thursday 15th at 20:00 UTC the Infrastructure team
will be taking the read/write git repositories offline.  We expect
this migration to last about 4 hours.

During the outage the service will be migrated from an old host to a
new one.   We intend to keep the URL the same for access to the repos
after the migration, but an alternate name is already in place in case
DNS updates take too long.   Please be aware it might take some hours
after the completion of the downtime for github to update and reflect
any changes.

The Infrastructure team have been trialling the new host for about a
week now, and [touch wood] have not had any problems with it.

The service is currently available by accessing repos via:
https://git-wip-us.apache.org

If you have any questions please address them to infrastruct...@apache.org




--
Cheers,
Tony

On behalf of the Apache Infrastructure Team

--
Tony Stevenson

t...@pc-tony.com
pct...@apache.org

http://www.pc-tony.com

GPG - 1024D/51047D66
--


Re: Job priority

2015-01-11 Thread Patrick Wendell
Priority scheduling isn't something we've supported in Spark. We've
opted to support FIFO and fair scheduling instead, and asked users to
try to fit these to the needs of their applications.

From what I've seen of priority schedulers, such as the Linux CPU
scheduler, strict priority scheduling is never used in practice
because of starvation and other issues. So you end up with a second
tier of heuristics to deal with problems like starvation and priority
inversion, and these become very complex over time.

That said, I looked at this a bit with @kayousterhout and I don't think
it would be very hard to implement a simple priority scheduler in the
current architecture. My main concern would be additional complexity
that would develop over time, based on looking at previous
implementations in the wild.

Alessandro, would you be able to open a JIRA and list some of your
requirements there? That way we could hear whether other people have
similar needs.
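
In the meantime, here is a rough sketch of the pool-based workaround
Cody describes below. The pool names, weights, and file path are purely
illustrative, not a recommendation:

  import org.apache.spark.{SparkConf, SparkContext}

  // conf/fairscheduler.xml would define pools such as "high", "default"
  // and "low" with very different weights (e.g. 1000, 10, 1).
  val conf = new SparkConf()
    .setAppName("priority-sketch")
    .set("spark.scheduler.mode", "FAIR")
    .set("spark.scheduler.allocation.file", "conf/fairscheduler.xml")
  val sc = new SparkContext(conf)

  // Jobs submitted from this thread land in the heavily weighted pool,
  // so they get to launch tasks first whenever they are active.
  sc.setLocalProperty("spark.scheduler.pool", "high")
  sc.parallelize(1 to 1000000).count()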

- Patrick

On Sun, Jan 11, 2015 at 10:07 AM, Mark Hamstra  wrote:
> Yes, if you are asking about developing a new priority queue job scheduling
> feature and not just about how job scheduling currently works in Spark, then
> that's a dev list issue.  The current job scheduling priority is at the
> granularity of pools containing jobs, not the jobs themselves; so if you
> require strictly job-level priority queuing, that would require a new
> development effort -- and one that I expect will involve a lot of tricky
> corner cases.
>
> Sorry for misreading the nature of your initial inquiry.
>
> On Sun, Jan 11, 2015 at 7:36 AM, Alessandro Baretta 
> wrote:
>
>> Cody,
>>
>> I might be able to improve the scheduling of my jobs by using a few
>> different pools with weights equal to, say, 1, 1e3 and 1e6, effectively
>> getting a small handful of priority classes. Still, this is really not
>> quite what I am describing. This is why my original post was on the dev
>> list. Let me then ask if there is any interest in having priority queue job
>> scheduling in Spark. This is something I might be able to pull off.
>>
>> Alex
>>
>> On Sun, Jan 11, 2015 at 6:21 AM, Cody Koeninger 
>> wrote:
>>
>>> If you set up a number of pools equal to the number of different priority
>>> levels you want, make the relative weights of those pools very different,
>>> and submit a job to the pool representing its priority, I think youll get
>>> behavior equivalent to a priority queue. Try it and see.
>>>
>>> If I'm misunderstandng what youre trying to do, then I don't know.
>>>
>>>
>>> On Sunday, January 11, 2015, Alessandro Baretta 
>>> wrote:
>>>
 Cody,

 Maybe I'm not getting this, but it doesn't look like this page is
 describing a priority queue scheduling policy. What this section discusses
 is how resources are shared between queues. A weight-1000 pool will get
 1000 times more resources allocated to it than a priority 1 queue. Great,
 but not what I want. I want to be able to define an Ordering on my
 tasks representing their priority, and have Spark allocate all resources to
 the job that has the highest priority.

 Alex

 On Sat, Jan 10, 2015 at 10:11 PM, Cody Koeninger 
 wrote:

>
> http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
>
> "Setting a high weight such as 1000 also makes it possible to
> implement *priority* between pools--in essence, the weight-1000 pool
> will always get to launch tasks first whenever it has jobs active."
>
> On Sat, Jan 10, 2015 at 11:57 PM, Alessandro Baretta <
> alexbare...@gmail.com> wrote:
>
>> Mark,
>>
>> Thanks, but I don't see how this documentation solves my problem. You
>> are referring me to documentation of fair scheduling; whereas, I am 
>> asking
>> about as unfair a scheduling policy as can be: a priority queue.
>>
>> Alex
>>
>> On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra > > wrote:
>>
>>> -dev, +user
>>>
>>> http://spark.apache.org/docs/latest/job-scheduling.html
>>>
>>>
>>> On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta <
>>> alexbare...@gmail.com> wrote:
>>>
 Is it possible to specify a priority level for a job, such that the
 active
 jobs might be scheduled in order of priority?

 Alex

>>>
>>>
>>
>

>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark development with IntelliJ

2015-01-08 Thread Patrick Wendell
Actually I went ahead and did it.

On Thu, Jan 8, 2015 at 10:25 PM, Patrick Wendell  wrote:
> Nick - yes. Do you mind moving it? I should have put it in the
> "Contributing to Spark" page.
>
> On Thu, Jan 8, 2015 at 3:22 PM, Nicholas Chammas
>  wrote:
>> Side question: Should this section
>> <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IDESetup>
>> in
>> the wiki link to Useful Developer Tools
>> <https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools>?
>>
>> On Thu Jan 08 2015 at 6:19:55 PM Sean Owen  wrote:
>>
>>> I remember seeing this too, but it seemed to be transient. Try
>>> compiling again. In my case I recall that IJ was still reimporting
>>> some modules when I tried to build. I don't see this error in general.
>>>
>>> On Thu, Jan 8, 2015 at 10:38 PM, Bill Bejeck  wrote:
>>> > I was having the same issue and that helped.  But now I get the following
>>> > compilation error when trying to run a test from within Intellij (v 14)
>>> >
>>> > /Users/bbejeck/dev/github_clones/bbejeck-spark/sql/
>>> catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
>>> > Error:(308, 109) polymorphic expression cannot be instantiated to
>>> expected
>>> > type;
>>> >  found   : [T(in method
>>> > apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method
>>> apply)]
>>> >  required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(
>>> in
>>> > method functionToUdfBuilder)]
>>> >   implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]):
>>> > ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
>>> >
>>> > Any thoughts?
>>> >
>>> > ^
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark development with IntelliJ

2015-01-08 Thread Patrick Wendell
Nick - yes. Do you mind moving it? I should have put it in the
"Contributing to Spark" page.

On Thu, Jan 8, 2015 at 3:22 PM, Nicholas Chammas
 wrote:
> Side question: Should this section
> 
> in
> the wiki link to Useful Developer Tools
> ?
>
> On Thu Jan 08 2015 at 6:19:55 PM Sean Owen  wrote:
>
>> I remember seeing this too, but it seemed to be transient. Try
>> compiling again. In my case I recall that IJ was still reimporting
>> some modules when I tried to build. I don't see this error in general.
>>
>> On Thu, Jan 8, 2015 at 10:38 PM, Bill Bejeck  wrote:
>> > I was having the same issue and that helped.  But now I get the following
>> > compilation error when trying to run a test from within Intellij (v 14)
>> >
>> > /Users/bbejeck/dev/github_clones/bbejeck-spark/sql/
>> catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
>> > Error:(308, 109) polymorphic expression cannot be instantiated to
>> expected
>> > type;
>> >  found   : [T(in method
>> > apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method
>> apply)]
>> >  required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(
>> in
>> > method functionToUdfBuilder)]
>> >   implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]):
>> > ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
>> >
>> > Any thoughts?
>> >
>> > ^
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: When will spark support "push" style shuffle?

2015-01-07 Thread Patrick Wendell
This question is conflating a few different concepts. I think the main
question is whether Spark will have a shuffle implementation that
streams data rather than persisting it to disk/cache as a buffer.
Spark currently decouples the shuffle write from the read using
disk/OS cache as a buffer. The two benefits of this approach are
that it allows intra-query fault tolerance and it makes it easier to
elastically scale and reschedule work within a job. We consider these
to be design requirements (think about jobs that run for several hours
on hundreds of machines). Impala, and similar systems like Dremel and
F1, do not offer fault tolerance within a query at present. They also
require gang scheduling the entire set of resources that will exist
for the duration of a query.

A secondary question is whether our shuffle should have a barrier or
not. Spark's shuffle currently has a hard barrier between map and
reduce stages. We haven't seen really strong evidence that removing
the barrier is a net win. It can help the performance of a single job
(modestly), but in a multi-tenant workload, it leads to poor
utilization since you have a lot of reduce tasks that are taking up
slots waiting for mappers to finish. Many large scale users of
Map/Reduce disable this feature in production clusters for that
reason. Thus, we haven't seen compelling evidence for removing the
barrier at this point, given the complexity of doing so.

It is possible that future versions of Spark will support push-based
shuffles, potentially in a mode that removes some of Spark's fault
tolerance properties. But there are many other things we can still
optimize about the shuffle that would likely come before this.

- Patrick

On Wed, Jan 7, 2015 at 6:01 PM, 曹雪林  wrote:
> Hi,
>
>   I've heard a lot of complaints about Spark's "pull" style shuffle. Is
> there any plan to support "push" style shuffle in the near future?
>
>   Currently, the shuffle phase must be completed before the next stage
> starts. Meanwhile, it is said that in Impala the shuffled data is "streamed" to
> the next stage's handler, which greatly saves time. Will Spark support this
> mechanism one day?
>
> Thanks

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Hang on Executor classloader lookup for the remote REPL URL classloader

2015-01-07 Thread Patrick Wendell
Hey Andrew,

So the executors in Spark fetch classes defined in the REPL from an
HTTP server running on the driver node. Is this happening in the
context of a REPL session? Also, is it deterministic or does it happen
only periodically?

The reason all of the other threads are hanging is that there is a
global lock around classloading, so they all queue up.

Could you attach the full stack trace from the driver? Is it possible
that something in the network is blocking the transfer of bytes
between these two processes? Based on the stack trace it looks like it
sent an HTTP request and is waiting on the result back from the
driver.

One thing to check is to verify that the TCP connection between them
used for the repl class server is still alive from the vantage point
of both the executor and driver nodes. Another thing to try would be
to temporarily open up any firewalls that are on the nodes or in the
network and see if this makes the problem go away (to isolate it to an
exogenous-to-Spark network issue).

- Patrick

On Wed, Aug 20, 2014 at 11:35 PM, Andrew Ash  wrote:
> Hi Spark devs,
>
> I'm seeing a stacktrace where the classloader that reads from the REPL is
> hung, and blocking all progress on that executor.  Below is that hung
> thread's stacktrace, and also the stacktrace of another hung thread.
>
> I thought maybe there was an issue with the REPL's JVM on the other side,
> but didn't see anything useful in that stacktrace either.
>
> Any ideas what I should be looking for?
>
> Thanks!
> Andrew
>
>
> "Executor task launch worker-0" daemon prio=10 tid=0x7f780c208000
> nid=0x6ae9 runnable [0x7f78c2eeb000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> - locked <0x7f7e13ea9560> (a java.io.BufferedInputStream)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
> at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
> - locked <0x7f7e13e9eeb0> (a
> sun.net.www.protocol.http.HttpURLConnection)
> at java.net.URL.openStream(URL.java:1037)
> at
> org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:86)
> at
> org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:63)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> - locked <0x7f7fc9018980> (a
> org.apache.spark.repl.ExecutorClassLoader)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at org.apache.avro.util.ClassUtils.forName(ClassUtils.java:102)
> at org.apache.avro.util.ClassUtils.forName(ClassUtils.java:82)
> at
> org.apache.avro.specific.SpecificData.getClass(SpecificData.java:132)
> at
> org.apache.avro.specific.SpecificDatumReader.setSchema(SpecificDatumReader.java:69)
> at
> org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:126)
> at
> org.apache.avro.file.DataFileReader.(DataFileReader.java:97)
> at
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:59)
> at
> org.apache.avro.mapred.AvroRecordReader.(AvroRecordReader.java:41)
> at
> org.apache.avro.mapred.AvroInputFormat.getRecordReader(AvroInputFormat.java:71)
> at
> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:193)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:184)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:93)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
>
> And the other threads are stuck on the Class.forName0() method too:
>
> "Executor task launch worker-4" daemon prio=10 tid=0x7f780c20f000
> nid=0x6aed waiting for monitor entry [0x7f78c2ae8000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
>

Re: Spark UI history job duration is wrong

2015-01-05 Thread Patrick Wendell
Thanks for reporting this - it definitely sounds like a bug. Please
open a JIRA for it. My guess is that we define the start or end time
of the job based on the current time instead of looking at data
encoded in the underlying event stream. That would cause it to not
work properly when loading from historical data.

- Patrick

On Mon, Jan 5, 2015 at 12:25 PM, Olivier Toupin
 wrote:
> Hello,
>
> I'm using Spark 1.2.0 and when running an application, if I go into the UI
> and then in the job tab ("/jobs/") the jobs duration are relevant and the
> posted durations looks ok.
>
> However when I open the history ("history/app-/jobs/") for that job,
> the durations are wrong, showing milliseconds instead of the relevant job
> time. The submitted time for each job (except maybe the first) is different
> also.
>
> The stages tab is unaffected and shows the correct duration for each stage in
> both modes.
>
> Should I open a bug?
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-UI-history-job-duration-is-wrong-tp10010.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark driver main thread hanging after SQL insert

2015-01-02 Thread Patrick Wendell
Hi Alessandro,

Can you create a JIRA for this rather than reporting it on the dev
list? That's where we track issues like this. Thanks!.

- Patrick

On Wed, Dec 31, 2014 at 8:48 PM, Alessandro Baretta
 wrote:
> Here's what the console shows:
>
> 15/01/01 01:12:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 58.0,
> whose tasks have all completed, from pool
> 15/01/01 01:12:29 INFO scheduler.DAGScheduler: Stage 58 (runJob at
> ParquetTableOperations.scala:326) finished in 5493.549 s
> 15/01/01 01:12:29 INFO scheduler.DAGScheduler: Job 41 finished: runJob at
> ParquetTableOperations.scala:326, took 5493.747061 s
>
> It is now 01:40:03, so the driver has been hanging for the last 28 minutes.
> The web UI on the other hand shows that all tasks completed successfully,
> and the output directory has been populated--although the _SUCCESS file is
> missing.
>
> It is worth noting that my code started this job as its own thread. The
> actual code looks like the following snippet, modulo some simplifications.
>
>   def save_to_parquet(allowExisting : Boolean = false) = {
> val threads = tables.map(table => {
>   val thread = new Thread {
> override def run {
>   table.insertInto(table.table_name)
> }
>   }
>   thread.start
>   thread
> })
> threads.foreach(_.join)
>   }
>
> As far as I can see the insertInto call never returns. Any idea why?
>
> Alex

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



ANNOUNCE: New build script ./build/mvn

2014-12-27 Thread Patrick Wendell
Hi All,

A consistent piece of feedback from Spark developers has been that the
Maven build is very slow. Typesafe provides a tool called Zinc which
improves Scala compilation speed substantially with Maven, but is
difficult to install and configure, especially for platforms other
than Mac OS.

I've just merged a patch (authored by Brennon York) that provides an
automatically configured Maven instance with Zinc embedded in Spark.
E.g.:

./build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.3 package

It is hard to test changes like this across all environments, so
please give this a spin and report any issues on the Spark JIRA. It is
working correctly if you see the following message during compilation:

[INFO] Using zinc server for incremental compilation

Note that developers preferring their own Maven installation are
unaffected by this and can just ignore this new feature.

Cheers,
- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
So the behavior of overwriting existing directories IMO is something
we don't want to encourage. The reason why the Hadoop client has these
checks is that it's very easy for users to do unsafe things without
them. For instance, a user could overwrite an RDD that had 100
partitions with an RDD that has 10 partitions... and if they read back
the RDD they would get a corrupted RDD that has a combination of data
from the old and new RDD.

If users want to circumvent these safety checks, we need to make them
explicitly disable them. Given this, I think a config option is as
reasonable as any alternatives. This is already pretty easy IMO.
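
For reference, a minimal sketch of opting out (the app name and output
path are just placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  // Skip Hadoop's check that the output directory does not already exist.
  // Note: part files from old and new runs can then end up mixed together,
  // which is exactly the hazard described above.
  val conf = new SparkConf()
    .setAppName("overwrite-sketch")
    .set("spark.hadoop.validateOutputSpecs", "false")
  val sc = new SparkContext(conf)

  sc.parallelize(Seq("a", "b", "c")).saveAsTextFile("hdfs:///tmp/output")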

- Patrick

On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao  wrote:
> I am wondering if we can provide a more friendly API, rather than a
> configuration option, for this purpose. What do you think, Patrick?
>
> Cheng Hao
>
> -Original Message-
> From: Patrick Wendell [mailto:pwend...@gmail.com]
> Sent: Thursday, December 25, 2014 3:22 PM
> To: Shao, Saisai
> Cc: u...@spark.apache.org; dev@spark.apache.org
> Subject: Re: Question on saveAsTextFile with overwrite option
>
> Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false?
>
> http://spark.apache.org/docs/latest/configuration.html
>
> - Patrick
>
> On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai  wrote:
>> Hi,
>>
>>
>>
>> We have a requirement to save RDD output to HDFS with a
>> saveAsTextFile-like API, but we need to overwrite the data if it already
>> exists. I'm not sure whether current Spark supports this kind of operation,
>> or whether I need to check this manually.
>>
>>
>>
>> There's a thread in mailing list discussed about this
>> (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Sp
>> ark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
>> I'm not sure this feature is enabled or not, or with some configurations?
>>
>>
>>
>> Appreciate your suggestions.
>>
>>
>>
>> Thanks a lot
>>
>> Jerry
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
> commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false?

http://spark.apache.org/docs/latest/configuration.html

- Patrick

On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai  wrote:
> Hi,
>
>
>
> We have a requirement to save RDD output to HDFS with a saveAsTextFile-like
> API, but we need to overwrite the data if it already exists. I'm not sure whether
> current Spark supports this kind of operation, or whether I need to check this manually.
>
>
>
> There's a thread in mailing list discussed about this
> (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
> I'm not sure this feature is enabled or not, or with some configurations?
>
>
>
> Appreciate your suggestions.
>
>
>
> Thanks a lot
>
> Jerry

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Problems with large dataset using collect() and broadcast()

2014-12-24 Thread Patrick Wendell
Hi Will,

When you call collect() the item you are collecting needs to fit in
memory on the driver. Is it possible your driver program does not have
enough memory?
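
For reference -- the memory figure and application names below are only
placeholders -- driver memory usually has to be raised at launch time:

  ./bin/spark-submit --driver-memory 20g --class com.example.MyApp my-app.jar

since setting spark.driver.memory programmatically has no effect once
the driver JVM is already running.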

- Patrick

On Wed, Dec 24, 2014 at 9:34 PM, Will Yang  wrote:
> Hi all,
> In my case, I have a huge HashMap[(Int, Long), (Double, Double,
> Double)], say several GB to tens of GB. After each iteration, I need to
> collect() this HashMap and perform some calculation, and then broadcast()
> it to every node. Now I have 20GB for each executor, and after it
> performs the collect(), it gets stuck at "Added rdd_xx_xx", with no further
> response shown on the Application UI.
>
> I've tried to lower the spark.shuffle.memoryFraction and
> spark.storage.memoryFraction, but it seems that it can only deal with a
> HashMap of up to about 2GB. What should I optimize for such conditions?
>
> (ps: sorry for my bad English & Grammar)
>
>
> Thanks

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [ANNOUNCE] Requiring JIRA for inclusion in release credits

2014-12-22 Thread Patrick Wendell
s/Josh/Nick/ - sorry!

On Mon, Dec 22, 2014 at 10:52 PM, Patrick Wendell  wrote:
> Hey Josh,
>
> We don't explicitly track contributions to spark-ec2 in the Apache
> Spark release notes. The main reason is that usually updates to
> spark-ec2 include a corresponding update to spark so we get it there.
> This may not always be the case though, so let me know if you think
> there is something missing we should add.
>
> - Patrick
>
> On Mon, Dec 22, 2014 at 6:17 PM, Nicholas Chammas
>  wrote:
>> Does this include contributions made against the spark-ec2 repo?
>>
>> On Wed Dec 17 2014 at 12:29:19 AM Patrick Wendell 
>> wrote:
>>>
>>> Hey All,
>>>
>>> Due to the very high volume of contributions, we're switching to an
>>> automated process for generating release credits. This process relies
>>> on JIRA for categorizing contributions, so it's not possible for us to
>>> provide credits in the case where users submit pull requests with no
>>> associated JIRA.
>>>
>>> This needed to be automated because, with more than 1000 commits per
>>> release, finding proper names for every commit and summarizing
>>> contributions was taking on the order of days of time.
>>>
>>> For 1.2.0 there were around 100 commits that did not have JIRA's. I'll
>>> try to manually merge these into the credits, but please e-mail me
>>> directly if you are not credited once the release notes are posted.
>>> The notes should be posted within 48 hours of right now.
>>>
>>> We already ask that users include a JIRA for pull requests, but now it
>>> will be required for proper attribution. I've updated the contributing
>>> guide on the wiki to reflect this.
>>>
>>> - Patrick
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [ANNOUNCE] Requiring JIRA for inclusion in release credits

2014-12-22 Thread Patrick Wendell
Hey Josh,

We don't explicitly track contributions to spark-ec2 in the Apache
Spark release notes. The main reason is that usually updates to
spark-ec2 include a corresponding update to spark so we get it there.
This may not always be the case though, so let me know if you think
there is something missing we should add.

- Patrick

On Mon, Dec 22, 2014 at 6:17 PM, Nicholas Chammas
 wrote:
> Does this include contributions made against the spark-ec2 repo?
>
> On Wed Dec 17 2014 at 12:29:19 AM Patrick Wendell 
> wrote:
>>
>> Hey All,
>>
>> Due to the very high volume of contributions, we're switching to an
>> automated process for generating release credits. This process relies
>> on JIRA for categorizing contributions, so it's not possible for us to
>> provide credits in the case where users submit pull requests with no
>> associated JIRA.
>>
>> This needed to be automated because, with more than 1000 commits per
>> release, finding proper names for every commit and summarizing
>> contributions was taking on the order of days of time.
>>
>> For 1.2.0 there were around 100 commits that did not have JIRA's. I'll
>> try to manually merge these into the credits, but please e-mail me
>> directly if you are not credited once the release notes are posted.
>> The notes should be posted within 48 hours of right now.
>>
>> We already ask that users include a JIRA for pull requests, but now it
>> will be required for proper attribution. I've updated the contributing
>> guide on the wiki to reflect this.
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Announcing Spark Packages

2014-12-22 Thread Patrick Wendell
Hey Nick,

I think Hitesh was just trying to be helpful and point out the policy
- not necessarily saying there was an issue. We've taken a close look
at this and I think we're in good shape here vis-a-vis this policy.

- Patrick

On Mon, Dec 22, 2014 at 5:29 PM, Nicholas Chammas
 wrote:
> Hitesh,
>
> From your link:
>
> You may not use ASF trademarks such as "Apache" or "ApacheFoo" or "Foo" in
> your own domain names if that use would be likely to confuse a relevant
> consumer about the source of software or services provided through your
> website, without written approval of the VP, Apache Brand Management or
> designee.
>
> The title on the packages website is "A community index of packages for
> Apache Spark." Furthermore, the footnote of the website reads "Spark
> Packages is a community site hosting modules that are not part of Apache
> Spark."
>
> I think there's nothing on there that would "confuse a relevant consumer
> about the source of software". It's pretty clear that the Spark Packages
> name is well within the ASF's guidelines.
>
> Have I misunderstood the ASF's policy?
>
> Nick
>
>
> On Mon Dec 22 2014 at 6:40:10 PM Hitesh Shah  wrote:
>>
>> Hello Xiangrui,
>>
>> If you have not already done so, you should look at
>> http://www.apache.org/foundation/marks/#domains for the policy on use of ASF
>> trademarked terms in domain names.
>>
>> thanks
>> -- Hitesh
>>
>> On Dec 22, 2014, at 12:37 PM, Xiangrui Meng  wrote:
>>
>> > Dear Spark users and developers,
>> >
>> > I'm happy to announce Spark Packages (http://spark-packages.org), a
>> > community package index to track the growing number of open source
>> > packages and libraries that work with Apache Spark. Spark Packages
>> > makes it easy for users to find, discuss, rate, and install packages
>> > for any version of Spark, and makes it easy for developers to
>> > contribute packages.
>> >
>> > Spark Packages will feature integrations with various data sources,
>> > management tools, higher level domain-specific libraries, machine
>> > learning algorithms, code samples, and other Spark content. Thanks to
>> > the package authors, the initial listing of packages includes
>> > scientific computing libraries, a job execution server, a connector
>> > for importing Avro data, tools for launching Spark on Google Compute
>> > Engine, and many others.
>> >
>> > I'd like to invite you to contribute and use Spark Packages and
>> > provide feedback! As a disclaimer: Spark Packages is a community index
>> > maintained by Databricks and (by design) will include packages outside
>> > of the ASF Spark project. We are excited to help showcase and support
>> > all of the great work going on in the broader Spark community!
>> >
>> > Cheers,
>> > Xiangrui
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: More general submitJob API

2014-12-22 Thread Patrick Wendell
A SparkContext is thread-safe, so you can just have different threads
that create their own RDDs and run actions on them, etc.
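
A minimal sketch of that pattern (the local master and the toy jobs are
just for illustration):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(
    new SparkConf().setAppName("concurrent-jobs-sketch").setMaster("local[*]"))

  // Each thread submits its own action; the scheduler runs the resulting
  // jobs concurrently on the shared SparkContext.
  val threads = (1 to 3).map { i =>
    new Thread {
      override def run(): Unit = {
        val n = sc.parallelize(1 to 1000000).map(_ * i).count()
        println(s"job $i counted $n records")
      }
    }
  }
  threads.foreach(_.start())
  threads.foreach(_.join())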

- Patrick

On Mon, Dec 22, 2014 at 4:15 PM, Alessandro Baretta
 wrote:
> Andrew,
>
> Thanks, yes, this is what I wanted: basically just to start multiple jobs
> concurrently in threads.
>
> Alex
>
> On Mon, Dec 22, 2014 at 4:04 PM, Andrew Ash  wrote:
>>
>> Hi Alex,
>>
>> SparkContext.submitJob() is marked as experimental -- most client programs
>> shouldn't be using it.  What are you looking to do?
>>
>> For multiplexing jobs, one thing you can do is have multiple threads in
>> your client JVM each submit jobs on your SparkContext job.  This is
>> described here in the docs:
>> http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
>>
>> Andrew
>>
>> On Mon, Dec 22, 2014 at 1:32 PM, Alessandro Baretta > > wrote:
>>
>>> Fellow Sparkers,
>>>
>>> I'm rather puzzled at the submitJob API. I can't quite figure out how it
>>> is
>>> supposed to be used. Is there any more documentation about it?
>>>
>>> Also, is there any simpler way to multiplex jobs on the cluster, such as
>>> starting multiple computations in as many threads in the driver and
>>> reaping
>>> all the results when they are available?
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Announcing Spark Packages

2014-12-22 Thread Patrick Wendell
Xiangrui asked me to report that it's back and running :)

On Mon, Dec 22, 2014 at 3:21 PM, peng  wrote:
> Me 2 :)
>
>
> On 12/22/2014 06:14 PM, Andrew Ash wrote:
>
> Hi Xiangrui,
>
> That link is currently returning a 503 Over Quota error message.  Would you
> mind pinging back out when the page is back up?
>
> Thanks!
> Andrew
>
> On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng  wrote:
>>
>> Dear Spark users and developers,
>>
>> I'm happy to announce Spark Packages (http://spark-packages.org), a
>> community package index to track the growing number of open source
>> packages and libraries that work with Apache Spark. Spark Packages
>> makes it easy for users to find, discuss, rate, and install packages
>> for any version of Spark, and makes it easy for developers to
>> contribute packages.
>>
>> Spark Packages will feature integrations with various data sources,
>> management tools, higher level domain-specific libraries, machine
>> learning algorithms, code samples, and other Spark content. Thanks to
>> the package authors, the initial listing of packages includes
>> scientific computing libraries, a job execution server, a connector
>> for importing Avro data, tools for launching Spark on Google Compute
>> Engine, and many others.
>>
>> I'd like to invite you to contribute and use Spark Packages and
>> provide feedback! As a disclaimer: Spark Packages is a community index
>> maintained by Databricks and (by design) will include packages outside
>> of the ASF Spark project. We are excited to help showcase and support
>> all of the great work going on in the broader Spark community!
>>
>> Cheers,
>> Xiangrui
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Use mvn to build Spark 1.2.0 failed

2014-12-22 Thread Patrick Wendell
I also couldn't reproduce this issued.

On Mon, Dec 22, 2014 at 2:24 AM, Sean Owen  wrote:
> I just tried the exact same command and do not see any error. Maybe
> you can make sure you're starting from a clean extraction of the
> distro, and check your environment. I'm on OSX, Maven 3.2, Java 8 but
> I don't know that any of those would be relevant.
>
> On Mon, Dec 22, 2014 at 4:10 AM, wyphao.2007  wrote:
>> Hi all, today I downloaded the Spark source from the
>> http://spark.apache.org/downloads.html page, and I used
>>
>>
>>  ./make-distribution.sh --tgz -Phadoop-2.2 -Pyarn -DskipTests 
>> -Dhadoop.version=2.2.0 -Phive
>>
>>
>> to build the release, but I encountered an exception as follows:
>>
>>
>> [INFO] --- build-helper-maven-plugin:1.8:add-source (add-scala-sources) @ 
>> spark-parent ---
>> [INFO] Source directory: /home/q/spark/spark-1.2.0/src/main/scala added.
>> [INFO]
>> [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
>> spark-parent ---
>> [INFO] 
>> 
>> [INFO] Reactor Summary:
>> [INFO]
>> [INFO] Spark Project Parent POM .. FAILURE [1.015s]
>> [INFO] Spark Project Networking .. SKIPPED
>> [INFO] Spark Project Shuffle Streaming Service ... SKIPPED
>> [INFO] Spark Project Core  SKIPPED
>> [INFO] Spark Project Bagel ... SKIPPED
>> [INFO] Spark Project GraphX .. SKIPPED
>> [INFO] Spark Project Streaming ... SKIPPED
>> [INFO] Spark Project Catalyst  SKIPPED
>> [INFO] Spark Project SQL . SKIPPED
>> [INFO] Spark Project ML Library .. SKIPPED
>> [INFO] Spark Project Tools ... SKIPPED
>> [INFO] Spark Project Hive  SKIPPED
>> [INFO] Spark Project REPL  SKIPPED
>> [INFO] Spark Project YARN Parent POM . SKIPPED
>> [INFO] Spark Project YARN Stable API . SKIPPED
>> [INFO] Spark Project Assembly  SKIPPED
>> [INFO] Spark Project External Twitter  SKIPPED
>> [INFO] Spark Project External Flume Sink . SKIPPED
>> [INFO] Spark Project External Flume .. SKIPPED
>> [INFO] Spark Project External MQTT ... SKIPPED
>> [INFO] Spark Project External ZeroMQ . SKIPPED
>> [INFO] Spark Project External Kafka .. SKIPPED
>> [INFO] Spark Project Examples  SKIPPED
>> [INFO] Spark Project YARN Shuffle Service  SKIPPED
>> [INFO] 
>> 
>> [INFO] BUILD FAILURE
>> [INFO] 
>> 
>> [INFO] Total time: 1.644s
>> [INFO] Finished at: Mon Dec 22 10:56:35 CST 2014
>> [INFO] Final Memory: 21M/481M
>> [INFO] 
>> 
>> [ERROR] Failed to execute goal 
>> org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) 
>> on project spark-parent: Error finding remote resources manifests: 
>> /home/q/spark/spark-1.2.0/target/maven-shared-archive-resources/META-INF/NOTICE
>>  (No such file or directory) -> [Help 1]
>> [ERROR]
>> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
>> switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> [ERROR]
>> [ERROR] For more information about the errors and possible solutions, please 
>> read the following articles:
>> [ERROR] [Help 1] 
>> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
>>
>>
>> but the NOTICE file is in the downloaded Spark release:
>>
>>
>> [wyp@spark  /home/q/spark/spark-1.2.0]$ ll
>> total 248
>> drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 assembly
>> drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 bagel
>> drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 bin
>> drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 conf
>> -rw-rw-r-- 1 1000 1000   663 Dec 10 18:02 CONTRIBUTING.md
>> drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 core
>> drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 data
>> drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 dev
>> drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 docker
>> drwxrwxr-x 7 1000 1000  4096 Dec 10 18:02 docs
>> drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 ec2
>> drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 examples
>> drwxrwxr-x 8 1000 1000  4096 Dec 10 18:02 external
>> drwxrwxr-x 5 1000 1000  4096 Dec 10 18:02 extras
>> drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 graphx
>> -rw-rw-r-- 1 1000 1000 45242 Dec 10 18:02 LICENSE
>> -rwxrwxr-x 1 1000 1000  7941 Dec 10 18:02 make-distribution.sh
>> drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 mllib
>> drwxrwxr-x 5 1000 1000  4096 D

Re: Announcing Spark 1.2!

2014-12-19 Thread Patrick Wendell
Thanks for pointing out the tag issue. I've updated all links to point
to the correct tag (from the vote thread):

a428c446e23e628b746e0626cc02b7b3cadf588e

On Fri, Dec 19, 2014 at 1:55 AM, Sean Owen  wrote:
> Tag 1.2.0 is older than 1.2.0-rc2. I wonder if it just didn't get
> updated. I assume it's going to be 1.2.0-rc2 plus a few commits
> related to the release process.
>
> On Fri, Dec 19, 2014 at 9:50 AM, Shixiong Zhu  wrote:
>> Congrats!
>>
>> A little question about this release: Which commit is this release based on?
>> v1.2.0 and v1.2.0-rc2 are pointed to different commits in
>> https://github.com/apache/spark/releases
>>
>> Best Regards,
>>
>> Shixiong Zhu
>>
>> 2014-12-19 16:52 GMT+08:00 Patrick Wendell :
>>>
>>> I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
>>> the third release on the API-compatible 1.X line. It is Spark's
>>> largest release ever, with contributions from 172 developers and more
>>> than 1,000 commits!
>>>
>>> This release brings operational and performance improvements in Spark
>>> core including a new network transport subsystem designed for very
>>> large shuffles. Spark SQL introduces an API for external data sources
>>> along with Hive 13 support, dynamic partitioning, and the
>>> fixed-precision decimal type. MLlib adds a new pipeline-oriented
>>> package (spark.ml) for composing multiple algorithms. Spark Streaming
>>> adds a Python API and a write ahead log for fault tolerance. Finally,
>>> GraphX has graduated from alpha and introduces a stable API along with
>>> performance improvements.
>>>
>>> Visit the release notes [1] to read about the new features, or
>>> download [2] the release today.
>>>
>>> For errata in the contributions or release notes, please e-mail me
>>> *directly* (not on-list).
>>>
>>> Thanks to everyone involved in creating, testing, and documenting this
>>> release!
>>>
>>> [1] http://spark.apache.org/releases/spark-release-1-2-0.html
>>> [2] http://spark.apache.org/downloads.html
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Announcing Spark 1.2!

2014-12-19 Thread Patrick Wendell
I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
the third release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 172 developers and more
than 1,000 commits!

This release brings operational and performance improvements in Spark
core including a new network transport subsystem designed for very
large shuffles. Spark SQL introduces an API for external data sources
along with Hive 13 support, dynamic partitioning, and the
fixed-precision decimal type. MLlib adds a new pipeline-oriented
package (spark.ml) for composing multiple algorithms. Spark Streaming
adds a Python API and a write ahead log for fault tolerance. Finally,
GraphX has graduated from alpha and introduces a stable API along with
performance improvements.

Visit the release notes [1] to read about the new features, or
download [2] the release today.

For errata in the contributions or release notes, please e-mail me
*directly* (not on-list).

Thanks to everyone involved in creating, testing, and documenting this release!

[1] http://spark.apache.org/releases/spark-release-1-2-0.html
[2] http://spark.apache.org/downloads.html

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [RESULT] [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-18 Thread Patrick Wendell
Update: An Apache infrastructure issue prevented me from pushing this
last night. The issue was resolved today and I should be able to push
the final release artifacts tonight.

On Tue, Dec 16, 2014 at 9:20 PM, Patrick Wendell  wrote:
> This vote has PASSED with 12 +1 votes (8 binding) and no 0 or -1 votes:
>
> +1:
> Matei Zaharia*
> Madhu Siddalingaiah
> Reynold Xin*
> Sandy Ryza
> Josh Rozen*
> Mark Hamstra*
> Denny Lee
> Tom Graves*
> GuiQiang Li
> Nick Pentreath*
> Sean McNamara*
> Patrick Wendell*
>
> 0:
>
> -1:
>
> I'll finalize and package this release in the next 48 hours. Thanks to
> everyone who contributed.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Which committers care about Kafka?

2014-12-18 Thread Patrick Wendell
Hey Cody,

Thanks for reaching out with this. The lead on streaming is TD - he is
traveling this week though so I can respond a bit. To the high level
point of whether Kafka is important - it definitely is. Something like
80% of Spark Streaming deployments (anecdotally) ingest data from
Kafka. Also, good support for Kafka is something we generally want in
Spark itself and not in an external library. In some cases, IIRC, there
were user libraries that used unstable Kafka APIs, and we were somewhat
waiting on Kafka to stabilize them before merging things upstream.
Otherwise users wouldn't be able to use newer Kafka versions. This is a
high-level impression
only though, I haven't talked to TD about this recently so it's worth
revisiting given the developments in Kafka.

Please do bring things up like this on the dev list if there are
blockers for your usage - thanks for pinging it.

- Patrick

On Thu, Dec 18, 2014 at 7:07 AM, Cody Koeninger  wrote:
> Now that 1.2 is finalized...  who are the go-to people to get some
> long-standing Kafka related issues resolved?
>
> The existing api is not sufficiently safe nor flexible for our production
> use.  I don't think we're alone in this viewpoint, because I've seen
> several different patches and libraries to fix the same things we've been
> running into.
>
> Regarding flexibility
>
> https://issues.apache.org/jira/browse/SPARK-3146
>
> has been outstanding since August, and IMHO an equivalent of this is
> absolutely necessary.  We wrote a similar patch ourselves, then found that
> PR and have been running it in production.  We wouldn't be able to get our
> jobs done without it.  It also allows users to solve a whole class of
> problems for themselves (e.g. SPARK-2388, arbitrary delay of messages, etc).
>
> Regarding safety, I understand the motivation behind WriteAheadLog as a
> general solution for streaming unreliable sources, but Kafka already is a
> reliable source.  I think there's a need for an api that treats it as
> such.  Even aside from the performance issues of duplicating the
> write-ahead log in kafka into another write-ahead log in hdfs, I need
> exactly-once semantics in the face of failure (I've had failures that
> prevented reloading a spark streaming checkpoint, for instance).
>
> I've got an implementation i've been using
>
> https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka
> /src/main/scala/org/apache/spark/rdd/kafka
>
> Tresata has something similar at https://github.com/tresata/spark-kafka,
> and I know there were earlier attempts based on Storm code.
>
> Trying to distribute these kinds of fixes as libraries rather than patches
> to Spark is problematic, because large portions of the implementation are
> private[spark].
>
>  I'd like to help, but i need to know whose attention to get.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[ANNOUNCE] Requiring JIRA for inclusion in release credits

2014-12-16 Thread Patrick Wendell
Hey All,

Due to the very high volume of contributions, we're switching to an
automated process for generating release credits. This process relies
on JIRA for categorizing contributions, so it's not possible for us to
provide credits in the case where users submit pull requests with no
associated JIRA.

This needed to be automated because, with more than 1000 commits per
release, finding proper names for every commit and summarizing
contributions was taking on the order of days of time.

For 1.2.0 there were around 100 commits that did not have JIRA's. I'll
try to manually merge these into the credits, but please e-mail me
directly if you are not credited once the release notes are posted.
The notes should be posted within 48 hours of right now.

We already ask that users include a JIRA for pull requests, but now it
will be required for proper attribution. I've updated the contributing
guide on the wiki to reflect this.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-16 Thread Patrick Wendell
I'm closing this vote now, will send results in a new thread.

On Sat, Dec 13, 2014 at 12:47 PM, Sean McNamara
 wrote:
> +1 tested on OS X and deployed+tested our apps via YARN into our staging 
> cluster.
>
> Sean
>
>
>> On Dec 11, 2014, at 10:40 AM, Reynold Xin  wrote:
>>
>> +1
>>
>> Tested on OS X.
>>
>> On Wednesday, December 10, 2014, Patrick Wendell  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.2.0!
>>>
>>> The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
>>>
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.2.0-rc2/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1055/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
>>>
>>> Please vote on releasing this package as Apache Spark 1.2.0!
>>>
>>> The vote is open until Saturday, December 13, at 21:00 UTC and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> == What justifies a -1 vote for this release? ==
>>> This vote is happening relatively late into the QA period, so
>>> -1 votes should only occur for significant regressions from
>>> 1.0.2. Bugs already present in 1.1.X, minor
>>> regressions, or bugs related to new features will not block this
>>> release.
>>>
>>> == What default changes should I be aware of? ==
>>> 1. The default value of "spark.shuffle.blockTransferService" has been
>>> changed to "netty"
>>> --> Old behavior can be restored by switching to "nio"
>>>
>>> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
>>> --> Old behavior can be restored by setting "spark.shuffle.manager" to
>>> "hash".
>>>
>>> == How does this differ from RC1 ==
>>> This has fixes for a handful of issues identified - some of the
>>> notable fixes are:
>>>
>>> [Core]
>>> SPARK-4498: Standalone Master can fail to recognize completed/failed
>>> applications
>>>
>>> [SQL]
>>> SPARK-4552: Query for empty parquet table in spark sql hive get
>>> IllegalArgumentException
>>> SPARK-4753: Parquet2 does not prune based on OR filters on partition
>>> columns
>>> SPARK-4761: With JDBC server, set Kryo as default serializer and
>>> disable reference tracking
>>> SPARK-4785: When called with arguments referring column fields, PMOD
>>> throws NPE
>>>
>>> - Patrick
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>>> For additional commands, e-mail: dev-h...@spark.apache.org 
>>>
>>>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[RESULT] [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-16 Thread Patrick Wendell
This vote has PASSED with 12 +1 votes (8 binding) and no 0 or -1 votes:

+1:
Matei Zaharia*
Madhu Siddalingaiah
Reynold Xin*
Sandy Ryza
Josh Rozen*
Mark Hamstra*
Denny Lee
Tom Graves*
GuiQiang Li
Nick Pentreath*
Sean McNamara*
Patrick Wendell*

0:

-1:

I'll finalize and package this release in the next 48 hours. Thanks to
everyone who contributed.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: RDD data flow

2014-12-16 Thread Patrick Wendell
> Why is that? Shouldn't all Partitions be Iterators? Clearly I'm missing
> something.

The Partition itself doesn't need to be an iterator - the iterator
comes from the result of compute(partition). The Partition is just an
identifier for that partition, not the data itself. Take a look at the
signature for compute() in the RDD class.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L97
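
To make the split concrete, here is a stripped-down custom RDD
(contrived, for illustration only): the Partition objects carry only an
index and some metadata, while compute() is what actually produces the
Iterator for a single partition.

  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // A Partition is just an identifier plus whatever metadata the RDD needs.
  class RangePartition(val index: Int, val start: Int, val end: Int)
    extends Partition

  class RangeRDD(sc: SparkContext, numParts: Int, total: Int)
    extends RDD[Int](sc, Nil) {

    // Metadata only: which partitions exist.
    override protected def getPartitions: Array[Partition] =
      Array.tabulate[Partition](numParts) { i =>
        new RangePartition(i, i * total / numParts, (i + 1) * total / numParts)
      }

    // The data: an Iterator produced on demand for one partition.
    override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
      val p = split.asInstanceOf[RangePartition]
      (p.start until p.end).iterator
    }
  }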

>
> On a related subject, I was thinking of documenting the data flow of RDDs in
> more detail. The code is not hard to follow, but it's nice to have a simple
> picture with the major components and some explanation of the flow.  The
> declaration of Partition is throwing me off.
>
> Thanks!
>
>
>
> -
> --
> Madhu
> https://www.linkedin.com/in/msiddalingaiah
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-data-flow-tp9804.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Scala's Jenkins setup looks neat

2014-12-16 Thread Patrick Wendell
Yeah you can do it - just make sure they understand it is a new
feature so we're asking them to revisit it. They looked at it in the
past and they concluded they couldn't give us access without giving us
push access.

- Patrick

On Tue, Dec 16, 2014 at 6:06 PM, Reynold Xin  wrote:
> It's worth trying :)
>
>
> On Tue, Dec 16, 2014 at 6:02 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>>
>> News flash!
>>
>> From the latest version of the GitHub API
>> :
>>
>> Note that the repo:status OAuth scope
>>  grants targeted access to
>> Statuses *without* also granting access to repository code, while the repo
>> scope grants permission to code as well as statuses.
>>
>> As I understand it, ASF Infra has said no in the past to granting access
>> to statuses because it also granted push access.
>>
>> If so, this no longer appears to be the case.
>>
>> 1) Did I understand correctly and 2) should I open a new request with ASF
>> Infra to give us OAuth keys with repo:status access?
>>
>> Nick
>>
>> On Sat Sep 06 2014 at 1:29:53 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>> Aww, that's a bummer...
>>>
>>>
>>> On Sat, Sep 6, 2014 at 1:10 PM, Reynold Xin  wrote:
>>>
 that would require github hooks permission and unfortunately asf infra
 wouldn't allow that.

 Maybe they will change their mind one day, but so far we asked about
 this and the answer has been no for security reasons.

 On Saturday, September 6, 2014, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> After reading Erik's email, I found this Scala PR
>  and immediately noticed a
> few
> cool things:
>
>- Jenkins is hooked directly into GitHub somehow, so you get the
> "All is
>well" message in the merge status window, presumably based on the
> last test
>status
>- Jenkins is also tagging the PR based on its test status or need for
>review
>- Jenkins is also tagging the PR for a specific milestone
>
> Do any of these things make sense to add to our setup? Or perhaps
> something
> inspired by these features?
>
> Nick
>

>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Governance of the Jenkins whitelist

2014-12-15 Thread Patrick Wendell
Hey Andrew,

The list of admins is maintained by the Amplab as part of their
donation of this infrastructure. The reason why we need to have admins
is that the pull request builder will fetch and then execute arbitrary
user code, so we need to do a security audit before we can approve
testing new patches. Over time when we get to know users we usually
whitelist them so they can test whatever they want.

I can see offline if the Amplab would be open to adding you as an
admin. I think we've added people over time who are very involved in
the community. Just wanted to send this e-mail so people understand
how it works.

- Patrick

On Sat, Dec 13, 2014 at 11:43 PM, Andrew Ash  wrote:
> Jenkins is a really valuable tool for increasing quality of incoming
> patches to Spark, but I've noticed that there are often a lot of patches
> waiting for testing because they haven't been approved for testing.
>
> Certain users can instruct Jenkins to run on a PR, or add other users to a
> whitelist. How does governance work for that list of admins?  Meaning who
> is currently on it, and what are the requirements to be on that list?
>
> Can I be permissioned to allow Jenkins to run on certain PRs?  I've often
> come across well-intentioned PRs that are languishing because Jenkins has
> yet to run on them.
>
> Andrew

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Test failures after Jenkins upgrade

2014-12-15 Thread Patrick Wendell
Ah cool Josh - I think for some reason we are hitting this every time
now. Since this is holding up a bunch of other patches, I just pushed
something ignoring the tests as a hotfix. Even waiting for a couple
hours is really expensive productivity-wise given the frequency with
which we run tests. We should just re-enable them when we merge the
appropriate fix.

On Mon, Dec 15, 2014 at 10:54 AM, Josh Rosen  wrote:
> There's a JIRA for this: https://issues.apache.org/jira/browse/SPARK-4826
>
> And two open PRs:
>
> https://github.com/apache/spark/pull/3695
> https://github.com/apache/spark/pull/3701
>
> We might be close to fixing this via one of those PRs, so maybe we should
> try using one of those instead?
>
> On December 15, 2014 at 10:51:46 AM, Patrick Wendell (pwend...@gmail.com)
> wrote:
>
> Hey All,
>
> It appears that a single test suite is failing after the jenkins
> upgrade: "org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite".
> My guess is the suite is not resilient in some way to differences in
> the environment (JVM, OS version, or something else).
>
> I'm going to disable the suite to get the build passing. This should
> be done in the next 30 minutes or so.
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Test failures after Jenkins upgrade

2014-12-15 Thread Patrick Wendell
Hey All,

It appears that a single test suite is failing after the jenkins
upgrade: "org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite".
My guess is the suite is not resilient in some way to differences in
the environment (JVM, OS version, or something else).

I'm going to disable the suite to get the build passing. This should
be done in the next 30 minutes or so.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: zinc invocation examples

2014-12-12 Thread Patrick Wendell
Hey York - I'm sending some feedback off-list, feel free to open a PR as well.


On Tue, Dec 9, 2014 at 12:05 PM, York, Brennon
 wrote:
> Patrick, I've nearly completed a basic build out for the SPARK-4501 issue
> (at https://github.com/brennonyork/spark/tree/SPARK-4501) and it would be
> great to get your initial read on it. Per this thread I need to add in the
> -scala-home call to zinc, but it's close to ready for a PR.
>
> On 12/5/14, 2:10 PM, "Patrick Wendell"  wrote:
>
>>One thing I created a JIRA for a while back was to have a similar
>>script to "sbt/sbt" that transparently downloads Zinc, Scala, and
>>Maven in a subdirectory of Spark and sets it up correctly. I.e.
>>"build/mvn".
>>
>>Outside of brew for MacOS there aren't good Zinc packages, and it's a
>>pain to figure out how to set it up.
>>
>>https://issues.apache.org/jira/browse/SPARK-4501
>>
>>Prashant Sharma looked at this for a bit but I don't think he's
>>working on it actively any more, so if someone wanted to do this, I'd
>>be extremely grateful.
>>
>>- Patrick
>>
>>On Fri, Dec 5, 2014 at 11:05 AM, Ryan Williams
>> wrote:
>>> fwiw I've been using `zinc -scala-home $SCALA_HOME -nailed -start`
>>>which:
>>>
>>> - starts a nailgun server as well,
>>> - uses my installed scala 2.{10,11}, as opposed to zinc's default 2.9.2
>>> <https://github.com/typesafehub/zinc#scala>: "If no options are passed
>>>to
>>> locate a version of Scala then Scala 2.9.2 is used by default (which is
>>> bundled with zinc)."
>>>
>>> The latter seems like it might be especially important.
>>>
>>>
>>> On Thu Dec 04 2014 at 4:25:32 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Oh, derp. I just assumed from looking at all the options that there was
>>>> something to it. Thanks Sean.
>>>>
>>>> On Thu Dec 04 2014 at 7:47:33 AM Sean Owen  wrote:
>>>>
>>>> > You just run it once with "zinc -start" and leave it running as a
>>>> > background process on your build machine. You don't have to do
>>>> > anything for each build.
>>>> >
>>>> > On Wed, Dec 3, 2014 at 3:44 PM, Nicholas Chammas
>>>> >  wrote:
>>>> > > https://github.com/apache/spark/blob/master/docs/
>>>> > building-spark.md#speeding-up-compilation-with-zinc
>>>> > >
>>>> > > Could someone summarize how they invoke zinc as part of a regular
>>>> > > build-test-etc. cycle?
>>>> > >
>>>> > > I'll add it in to the aforelinked page if appropriate.
>>>> > >
>>>> > > Nick
>>>> >
>>>>
>>
>>-
>>To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Is Apache JIRA down?

2014-12-10 Thread Patrick Wendell
I believe many apache services are/were down due to an outage.

On Wed, Dec 10, 2014 at 5:24 PM, Nicholas Chammas
 wrote:
> Nevermind, seems to be back up now.
>
> On Wed Dec 10 2014 at 7:46:30 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> For example: https://issues.apache.org/jira/browse/SPARK-3431
>>
>> Where do we report/track issues with JIRA itself being down?
>>
>> Nick
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-10 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.0!

The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1055/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/

Please vote on releasing this package as Apache Spark 1.2.0!

The vote is open until Saturday, December 13, at 21:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== What justifies a -1 vote for this release? ==
This vote is happening relatively late into the QA period, so
-1 votes should only occur for significant regressions from
1.0.2. Bugs already present in 1.1.X, minor
regressions, or bugs related to new features will not block this
release.

== What default changes should I be aware of? ==
1. The default value of "spark.shuffle.blockTransferService" has been
changed to "netty"
--> Old behavior can be restored by switching to "nio"

2. The default value of "spark.shuffle.manager" has been changed to "sort".
--> Old behavior can be restored by setting "spark.shuffle.manager" to "hash".
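
For anyone comparing against an existing 1.1.x workload, here is a small
sketch of flipping these two settings back programmatically (the keys and
values are the ones listed above; the app name is just a placeholder):

  import org.apache.spark.{SparkConf, SparkContext}

  // Restore the pre-1.2 shuffle defaults for side-by-side testing.
  val conf = new SparkConf()
    .setAppName("shuffle-defaults-check")
    .set("spark.shuffle.blockTransferService", "nio")
    .set("spark.shuffle.manager", "hash")
  val sc = new SparkContext(conf)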

== How does this differ from RC1 ==
This has fixes for a handful of issues identified - some of the
notable fixes are:

[Core]
SPARK-4498: Standalone Master can fail to recognize completed/failed
applications

[SQL]
SPARK-4552: Query for empty parquet table in spark sql hive get
IllegalArgumentException
SPARK-4753: Parquet2 does not prune based on OR filters on partition columns
SPARK-4761: With JDBC server, set Kryo as default serializer and
disable reference tracking
SPARK-4785: When called with arguments referring column fields, PMOD throws NPE

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[RESULT] [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-10 Thread Patrick Wendell
This vote is closed in favor of RC2.

On Fri, Dec 5, 2014 at 2:02 PM, Patrick Wendell  wrote:
> Hey All,
>
> Thanks all for the continued testing!
>
> The issue I mentioned earlier SPARK-4498 was fixed earlier this week
> (hat tip to Mark Hamstra, who contributed to the fix).
>
> In the interim a few smaller blocker-level issues with Spark SQL were
> found and fixed (SPARK-4753, SPARK-4552, SPARK-4761).
>
> There is currently an outstanding issue (SPARK-4740[1]) in Spark core
> that needs to be fixed.
>
> I want to thank in particular Shopify and Intel China who have
> identified and helped test blocker issues with the release. This type
> of workload testing around releases is really helpful for us.
>
> Once things stabilize I will cut RC2. I think we're pretty close with this 
> one.
>
> - Patrick
>
> On Wed, Dec 3, 2014 at 5:38 PM, Takeshi Yamamuro  
> wrote:
>> +1 (non-binding)
>>
>> Checked on CentOS 6.5, compiled from the source.
>> Ran various examples in stand-alone master and three slaves, and
>> browsed the web UI.
>>
>> On Sat, Nov 29, 2014 at 2:16 PM, Patrick Wendell  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.2.0!
>>>
>>> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>>>
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1048/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>>>
>>> Please vote on releasing this package as Apache Spark 1.2.0!
>>>
>>> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> == What justifies a -1 vote for this release? ==
>>> This vote is happening very late into the QA period compared with
>>> previous votes, so -1 votes should only occur for significant
>>> regressions from 1.0.2. Bugs already present in 1.1.X, minor
>>> regressions, or bugs related to new features will not block this
>>> release.
>>>
>>> == What default changes should I be aware of? ==
>>> 1. The default value of "spark.shuffle.blockTransferService" has been
>>> changed to "netty"
>>> --> Old behavior can be restored by switching to "nio"
>>>
>>> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
>>> --> Old behavior can be restored by setting "spark.shuffle.manager" to
>>> "hash".
>>>
>>> == Other notes ==
>>> Because this vote is occurring over a weekend, I will likely extend
>>> the vote if this RC survives until the end of the vote period.
>>>
>>> - Patrick
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Build Spark 1.2.0-rc1 encounter exceptions when running HiveContext - Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy

2014-12-10 Thread Patrick Wendell
Hi Andrew,

It looks like somehow you are including jars from the upstream Apache
Hive 0.13 project on your classpath. For Spark 1.2 Hive 0.13 support,
we had to modify Hive to use a different version of Kryo that was
compatible with Spark's Kryo version.

https://github.com/pwendell/hive/commit/5b582f242946312e353cfce92fc3f3fa472aedf3

I would look through the actual classpath and make sure you aren't
including your own hive-exec jar somehow.
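
One quick way to eyeball this from the spark-shell (just a sketch, not an
official diagnostic) is to walk the classloader chain and print any jar whose
name contains "hive-exec":

  import java.net.URLClassLoader

  // Print every hive-exec jar visible on the classloader chain.
  var cl: ClassLoader = Thread.currentThread().getContextClassLoader
  while (cl != null) {
    cl match {
      case u: URLClassLoader =>
        u.getURLs.map(_.toString).filter(_.contains("hive-exec")).foreach(println)
      case _ =>
    }
    cl = cl.getParent
  }

If a standalone hive-exec jar shows up there in addition to the Spark
assembly, that is likely the culprit.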

- Patrick

On Wed, Dec 10, 2014 at 9:48 AM, Andrew Lee  wrote:
> Apologies for the format; somehow it got messed up and the linefeeds were removed.
> Here's a reformatted version.
> Hi All,
> I tried to include necessary libraries in SPARK_CLASSPATH in spark-env.sh to 
> include auxiliary JARs and datanucleus*.jars from Hive; however, when I run
> HiveContext, it gives me the following error:
>
> Caused by: java.lang.ClassNotFoundException: 
> com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy
>
> I have checked the JARs with (jar tf), looks like this is already included 
> (shaded) in the assembly JAR (spark-assembly-1.2.0-hadoop2.4.1.jar) which is 
> configured in the System classpath already. I couldn't figure out what is 
> going on with the shading on the esotericsoftware JARs here.  Any help is 
> appreciated.
>
>
> How to reproduce the problem?
> Run the following 3 statements in spark-shell ( This is how I launched my 
> spark-shell. cd /opt/spark; ./bin/spark-shell --master yarn --deploy-mode 
> client --queue research --driver-memory 1024M)
>
> import org.apache.spark.SparkContext
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> hiveContext.hql("CREATE TABLE IF NOT EXISTS spark_hive_test_table (key INT, 
> value STRING)")
>
>
>
> A reference of my environment.
> Apache Hadoop 2.4.1
> Apache Hive 0.13.1
> Apache Spark branch-1.2 (installed under /opt/spark/, and config under 
> /etc/spark/)
> Maven build command:
>
> mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.4.1 
> -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests install
>
> Source Code commit label: eb4d457a870f7a281dc0267db72715cd00245e82
>
> My spark-env.sh have the following contents when I executed spark-shell:
>> HADOOP_HOME=/opt/hadoop/
>> HIVE_HOME=/opt/hive/
>> HADOOP_CONF_DIR=/etc/hadoop/
>> YARN_CONF_DIR=/etc/hadoop/
>> HIVE_CONF_DIR=/etc/hive/
>> HADOOP_SNAPPY_JAR=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name 
>> "snappy-java-*.jar")
>> HADOOP_LZO_JAR=$(find $HADOOP_HOME/share/hadoop/common/lib/ -type f -name 
>> "hadoop-lzo-*.jar")
>> SPARK_YARN_DIST_FILES=/user/spark/libs/spark-assembly-1.2.0-hadoop2.4.1.jar
>> export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
>> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
>> export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
>> export 
>> SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_SNAPPY_JAR:$HADOOP_LZO_JAR:$HIVE_CONF_DIR:/opt/hive/lib/datanucleus-api-jdo-3.2.6.jar:/opt/hive/lib/datanucleus-core-3.2.10.jar:/opt/hive/lib/datanucleus-rdbms-3.2.9.jar
>
>
>> Here's what I see from my stack trace.
>> warning: there were 1 deprecation warning(s); re-run with -deprecation for 
>> details
>> Hive history 
>> file=/home/hive/log/alti-test-01/hive_job_log_b5db9539-4736-44b3-a601-04fa77cb6730_1220828461.txt
>> java.lang.NoClassDefFoundError: 
>> com/esotericsoftware/shaded/org/objenesis/strategy/InstantiatorStrategy
>>   at 
>> org.apache.hadoop.hive.ql.exec.Utilities.(Utilities.java:925)
>>   at 
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.validate(SemanticAnalyzer.java:9718)
>>   at 
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.validate(SemanticAnalyzer.java:9712)
>>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:434)
>>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
>>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:975)
>>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1040)
>>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
>>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
>>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
>>   at 
>> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
>>   at 
>> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
>>   at 
>> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
>>   at 
>> org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
>>   at 
>> org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
>>   at 
>> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
>>   at 
>> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
>>   at 
>> org.apache.spark.sql.SchemaRDDLike$class.$init$(S

Re: Is this a little bug in BlockTransferMessage ?

2014-12-09 Thread Patrick Wendell
Hey Nick,

Thanks for bringing this up. I believe these Java tests are running in
the sbt build right now; the issue is that this particular bug was
flagged by the triggering of a runtime Java "assert" (not a normal
JUnit test assertion), and those are not enabled in our sbt tests. It
would be good to fix it so that assertions run when we do the sbt
tests; for some reason I think the sbt tests disable them by default.
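
A rough sketch of what enabling them could look like in the sbt build
definition (assuming we fork the test JVMs, which is what makes extra JVM
options take effect; this is not the current SparkBuild code):

  // Fork test JVMs and pass -ea/-esa so runtime `assert` statements,
  // like the one in BlockTransferMessage#toByteArray, actually fire.
  fork in Test := true
  javaOptions in Test ++= Seq("-ea", "-esa")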

I think the original issue is fixed now (that Sean found and
reported). It would be good to get assertions running in our tests,
but I'm not sure I'd block the release on it. The normal JUnit
assertions are running correctly.

- Patrick

On Tue, Dec 9, 2014 at 3:35 PM, Nicholas Chammas
 wrote:
> OK. That's concerning. Hopefully that's the only bug we'll dig up once we
> run all the Java tests but who knows.
>
> Patrick,
>
> Shouldn't this be a release blocking bug for 1.2 (mostly just because it
> has already been covered by a unit test)? Well, that, as well as any other
> bugs that come up as we run these Java tests.
>
> Nick
>
> On Tue Dec 09 2014 at 6:32:53 PM Sean Owen  wrote:
>
>> I'm not so sure about SBT, but I'm looking at the output now and do
>> not see things like JavaAPISuite being run. I see them compiled. That
>> I'm not as sure how to fix. I think I have a solution for Maven on
>> SPARK-4159.
>>
>> On Tue, Dec 9, 2014 at 11:30 PM, Nicholas Chammas
>>  wrote:
>> > So all this time the tests that Jenkins has been running via Jenkins and
>> SBT
>> > + ScalaTest... those haven't been running any of the Java unit tests?
>> >
>> > SPARK-4159 only mentions Maven as a problem, but I'm wondering how these
>> > tests got through Jenkins OK.
>> >
>> > On Tue Dec 09 2014 at 5:34:22 PM Sean Owen  wrote:
>> >>
>> >> Yep, will do. The test does catch it -- it's just not being executed.
>> >> I think I have a reasonable start on re-enabling surefire + Java tests
>> >> for SPARK-4159.
>> >>
>> >> On Tue, Dec 9, 2014 at 10:30 PM, Aaron Davidson 
>> >> wrote:
>> >> > Oops, that does look like a bug. Strange that the
>> >> > BlockTransferMessageSuite
>> >> > did not catch this. "+1" sounds like the right solution, would you be
>> >> > able
>> >> > to submit a PR?
>> >> >
>> >> > On Tue, Dec 9, 2014 at 1:53 PM, Sean Owen  wrote:
>> >> >>
>> >> >>
>> >> >>
>> >> >> https://github.com/apache/spark/blob/master/network/
>> shuffle/src/main/java/org/apache/spark/network/shuffle/
>> protocol/BlockTransferMessage.java#L70
>> >> >>
>> >> >> public byte[] toByteArray() {
>> >> >>   ByteBuf buf = Unpooled.buffer(encodedLength());
>> >> >>   buf.writeByte(type().id);
>> >> >>   encode(buf);
>> >> >>   assert buf.writableBytes() == 0 : "Writable bytes remain: " +
>> >> >> buf.writableBytes();
>> >> >>   return buf.array();
>> >> >> }
>> >> >>
>> >> >> Running the Java tests at last might have turned up a little bug
>> here,
>> >> >> but wanted to check. This makes a buffer to hold enough bytes to
>> >> >> encode the message. But it writes 1 byte, plus the message. This
>> makes
>> >> >> the buffer expand, and then does have nonzero capacity afterwards, so
>> >> >> the assert fails.
>> >> >>
>> >> >> So just needs a "+ 1" in the size?
>> >> >
>> >> >
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-05 Thread Patrick Wendell
Hey All,

Thanks all for the continued testing!

The issue I mentioned earlier SPARK-4498 was fixed earlier this week
(hat tip to Mark Hamstra, who contributed to the fix).

In the interim a few smaller blocker-level issues with Spark SQL were
found and fixed (SPARK-4753, SPARK-4552, SPARK-4761).

There is currently an outstanding issue (SPARK-4740[1]) in Spark core
that needs to be fixed.

I want to thank in particular Shopify and Intel China who have
identified and helped test blocker issues with the release. This type
of workload testing around releases is really helpful for us.

Once things stabilize I will cut RC2. I think we're pretty close with this one.

- Patrick

On Wed, Dec 3, 2014 at 5:38 PM, Takeshi Yamamuro  wrote:
> +1 (non-binding)
>
> Checked on CentOS 6.5, compiled from the source.
> Ran various examples in stand-alone master and three slaves, and
> browsed the web UI.
>
> On Sat, Nov 29, 2014 at 2:16 PM, Patrick Wendell  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.2.0!
>>
>> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1048/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.2.0!
>>
>> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.2.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening very late into the QA period compared with
>> previous votes, so -1 votes should only occur for significant
>> regressions from 1.0.2. Bugs already present in 1.1.X, minor
>> regressions, or bugs related to new features will not block this
>> release.
>>
>> == What default changes should I be aware of? ==
>> 1. The default value of "spark.shuffle.blockTransferService" has been
>> changed to "netty"
>> --> Old behavior can be restored by switching to "nio"
>>
>> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
>> --> Old behavior can be restored by setting "spark.shuffle.manager" to
>> "hash".
>>
>> == Other notes ==
>> Because this vote is occurring over a weekend, I will likely extend
>> the vote if this RC survives until the end of the vote period.
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: zinc invocation examples

2014-12-05 Thread Patrick Wendell
One thing I created a JIRA for a while back was to have a similar
script to "sbt/sbt" that transparently downloads Zinc, Scala, and
Maven in a subdirectory of Spark and sets it up correctly. I.e.
"build/mvn".

Outside of brew for MacOS there aren't good Zinc packages, and it's a
pain to figure out how to set it up.

https://issues.apache.org/jira/browse/SPARK-4501

Prashant Sharma looked at this for a bit but I don't think he's
working on it actively any more, so if someone wanted to do this, I'd
be extremely grateful.

- Patrick

On Fri, Dec 5, 2014 at 11:05 AM, Ryan Williams
 wrote:
> fwiw I've been using `zinc -scala-home $SCALA_HOME -nailed -start` which:
>
> - starts a nailgun server as well,
> - uses my installed scala 2.{10,11}, as opposed to zinc's default 2.9.2
> : "If no options are passed to
> locate a version of Scala then Scala 2.9.2 is used by default (which is
> bundled with zinc)."
>
> The latter seems like it might be especially important.
>
>
> On Thu Dec 04 2014 at 4:25:32 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Oh, derp. I just assumed from looking at all the options that there was
>> something to it. Thanks Sean.
>>
>> On Thu Dec 04 2014 at 7:47:33 AM Sean Owen  wrote:
>>
>> > You just run it once with "zinc -start" and leave it running as a
>> > background process on your build machine. You don't have to do
>> > anything for each build.
>> >
>> > On Wed, Dec 3, 2014 at 3:44 PM, Nicholas Chammas
>> >  wrote:
>> > > https://github.com/apache/spark/blob/master/docs/
>> > building-spark.md#speeding-up-compilation-with-zinc
>> > >
>> > > Could someone summarize how they invoke zinc as part of a regular
>> > > build-test-etc. cycle?
>> > >
>> > > I'll add it in to the aforelinked page if appropriate.
>> > >
>> > > Nick
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Exception adding resource files in latest Spark

2014-12-04 Thread Patrick Wendell
Thanks for flagging this. I reverted the relevant YARN fix in the Spark
1.2 release. We can try to debug this in master.
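
For context on the stack trace below: Hadoop raises "Wrong FS" whenever a
Path's scheme does not match the FileSystem it is resolved against. A tiny
illustration of the failure mode (not the actual Spark/YARN code path; the
path below is made up):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // A local (file:///) FileSystem handed an hdfs:// path fails the scheme check.
  val localFs = FileSystem.getLocal(new Configuration())
  val hdfsJar = new Path("hdfs://namenode/user/someone/some-lib.jar")
  localFs.getFileStatus(hdfsJar)
  // => java.lang.IllegalArgumentException: Wrong FS: hdfs://..., expected: file:///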

On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang  wrote:
> I created a ticket for this:
>
>   https://issues.apache.org/jira/browse/SPARK-4757
>
>
> Jianshi
>
> On Fri, Dec 5, 2014 at 1:31 PM, Jianshi Huang 
> wrote:
>>
>> Correction:
>>
>> According to Liancheng, this hotfix might be the root cause:
>>
>>
>> https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce
>>
>> Jianshi
>>
>> On Fri, Dec 5, 2014 at 12:45 PM, Jianshi Huang 
>> wrote:
>>>
>>> Looks like the datanucleus*.jar shouldn't appear in the hdfs path in
>>> Yarn-client mode.
>>>
>>> Maybe this patch broke yarn-client.
>>>
>>>
>>> https://github.com/apache/spark/commit/a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53
>>>
>>> Jianshi
>>>
>>> On Fri, Dec 5, 2014 at 12:02 PM, Jianshi Huang 
>>> wrote:

 Actually my HADOOP_CLASSPATH has already been set to include
 /etc/hadoop/conf/*

 export
 HADOOP_CLASSPATH=/etc/hbase/conf/hbase-site.xml:/usr/lib/hbase/lib/hbase-protocol.jar:$(hbase
 classpath)

 Jianshi

 On Fri, Dec 5, 2014 at 11:54 AM, Jianshi Huang 
 wrote:
>
> Looks like somehow Spark failed to find the core-site.xml in
> /et/hadoop/conf
>
> I've already set the following env variables:
>
> export YARN_CONF_DIR=/etc/hadoop/conf
> export HADOOP_CONF_DIR=/etc/hadoop/conf
> export HBASE_CONF_DIR=/etc/hbase/conf
>
> Should I put $HADOOP_CONF_DIR/* to HADOOP_CLASSPATH?
>
> Jianshi
>
> On Fri, Dec 5, 2014 at 11:37 AM, Jianshi Huang
>  wrote:
>>
>> I got the following error during Spark startup (Yarn-client mode):
>>
>> 14/12/04 19:33:58 INFO Client: Uploading resource
>> file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar
>> ->
>> hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar
>> java.lang.IllegalArgumentException: Wrong FS:
>> hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar,
>> expected: file:///
>> at
>> org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>> at
>> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>> at
>> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>> at
>> org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
>> at
>> org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:257)
>> at
>> org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:242)
>> at scala.Option.foreach(Option.scala:236)
>> at
>> org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:242)
>> at
>> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:35)
>> at
>> org.apache.spark.deploy.yarn.ClientBase$class.createContainerLaunchContext(ClientBase.scala:350)
>> at
>> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:35)
>> at
>> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:80)
>> at
>> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
>> at
>> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:140)
>> at
>> org.apache.spark.SparkContext.(SparkContext.scala:335)
>> at
>> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986)
>> at $iwC$$iwC.(:9)
>> at $iwC.(:18)
>> at (:20)
>> at .(:24)
>>
>> I'm using latest Spark built from master HEAD yesterday. Is this a
>> bug?
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/
>>>
>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Gi

Re: Ooyala Spark JobServer

2014-12-04 Thread Patrick Wendell
Hey Jun,

The Ooyala server is being maintained by its original author (Evan Chan)
here:

https://github.com/spark-jobserver/spark-jobserver

This is likely to stay as a standalone project for now, since it builds
directly on Spark's public APIs.

- Patrick

On Wed, Dec 3, 2014 at 9:02 PM, Jun Feng Liu  wrote:

> Hi, I am wondering about the status of the Ooyala Spark JobServer. Is there any plan to
> get it into the Spark release?
>
> Best Regards
>
>
> *Jun Feng Liu*
> IBM China Systems & Technology Laboratory in Beijing
>
>   --
>  [image: 2D barcode - encoded with contact information] *Phone: 
> *86-10-82452683
>
> * E-mail:* *liuj...@cn.ibm.com* 
> [image: IBM]
>
> BLD 28,ZGC Software Park
> No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
> China
>
>
>
>
>


Re: Spurious test failures, testing best practices

2014-12-02 Thread Patrick Wendell
Hey Ryan,

What if you run a single "mvn install" to install all libraries
locally - then can you "mvn compile -pl core"? I think this may be the
only way to make it work.

- Patrick

On Tue, Dec 2, 2014 at 2:40 PM, Ryan Williams
 wrote:
> Following on Mark's Maven examples, here is another related issue I'm
> having:
>
> I'd like to compile just the `core` module after a `mvn clean`, without
> building an assembly JAR first. Is this possible?
>
> Attempting to do it myself, the steps I performed were:
>
> - `mvn compile -pl core`: fails because `core` depends on `network/common`
> and `network/shuffle`, neither of which is installed in my local maven
> cache (and which don't exist in central Maven repositories, I guess? I
> thought Spark is publishing snapshot releases?)
>
> - `network/shuffle` also depends on `network/common`, so I'll `mvn install`
> the latter first: `mvn install -DskipTests -pl network/common`. That
> succeeds, and I see a newly built 1.3.0-SNAPSHOT jar in my local maven
> repository.
>
> - However, `mvn install -DskipTests -pl network/shuffle` subsequently
> fails, seemingly due to not finding network/core. Here's
>  a sample full
> output from running `mvn install -X -U -DskipTests -pl network/shuffle`
> from such a state (the -U was to get around a previous failure based on
> having cached a failed lookup of network-common-1.3.0-SNAPSHOT).
>
> - Thinking maven might be special-casing "-SNAPSHOT" versions, I tried
> replacing "1.3.0-SNAPSHOT" with "1.3.0.1" globally and repeating these
> steps, but the error seems to be the same
> .
>
> Any ideas?
>
> Thanks,
>
> -Ryan
>
> On Sun Nov 30 2014 at 6:37:28 PM Mark Hamstra 
> wrote:
>
>> >
>> > - Start the SBT interactive console with sbt/sbt
>> > - Build your assembly by running the "assembly" target in the assembly
>> > project: assembly/assembly
>> > - Run all the tests in one module: core/test
>> > - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>> (this
>> > also supports tab completion)
>>
>>
>> The equivalent using Maven:
>>
>> - Start zinc
>> - Build your assembly using the mvn "package" or "install" target
>> ("install" is actually the equivalent of SBT's "publishLocal") -- this step
>> is the first step in
>> http://spark.apache.org/docs/latest/building-with-maven.
>> html#spark-tests-in-maven
>> - Run all the tests in one module: mvn -pl core test
>> - Run a specific suite: mvn -pl core
>> -DwildcardSuites=org.apache.spark.rdd.RDDSuite test (the -pl option isn't
>> strictly necessary if you don't mind waiting for Maven to scan through all
>> the other sub-projects only to do nothing; and, of course, it needs to be
>> something other than "core" if the test you want to run is in another
>> sub-project.)
>>
>> You also typically want to carry along in each subsequent step any relevant
>> command line options you added in the "package"/"install" step.
>>
>> On Sun, Nov 30, 2014 at 3:06 PM, Matei Zaharia 
>> wrote:
>>
>> > Hi Ryan,
>> >
>> > As a tip (and maybe this isn't documented well), I normally use SBT for
>> > development to avoid the slow build process, and use its interactive
>> > console to run only specific tests. The nice advantage is that SBT can
>> keep
>> > the Scala compiler loaded and JITed across builds, making it faster to
>> > iterate. To use it, you can do the following:
>> >
>> > - Start the SBT interactive console with sbt/sbt
>> > - Build your assembly by running the "assembly" target in the assembly
>> > project: assembly/assembly
>> > - Run all the tests in one module: core/test
>> > - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>> (this
>> > also supports tab completion)
>> >
>> > Running all the tests does take a while, and I usually just rely on
>> > Jenkins for that once I've run the tests for the things I believed my
>> patch
>> > could break. But this is because some of them are integration tests (e.g.
>> > DistributedSuite, which creates multi-process mini-clusters). Many of the
>> > individual suites run fast without requiring this, however, so you can
>> pick
>> > the ones you want. Perhaps we should find a way to tag them so people
>> can
>> > do a "quick-test" that skips the integration ones.
>> >
>> > The assembly builds are annoying but they only take about a minute for me
>> > on a MacBook Pro with SBT warmed up. The assembly is actually only
>> required
>> > for some of the "integration" tests (which launch new processes), but I'd
>> > recommend doing it all the time anyway since it would be very confusing
>> to
>> > run those with an old assembly. The Scala compiler crash issue can also
>> be
>> > a problem, but I don't see it very often with SBT. If it happens, I exit
>> > SBT and do sbt clean.
>> >
>> > Anyway, this is useful feedback and I think we should try to improve some
>> > of these suites, but hopefully you can also try

Re: keeping PR titles / descriptions up to date

2014-12-02 Thread Patrick Wendell
Also a note on this for committers - it's possible to re-word the
title during merging, by just running "git commit -a --amend" before
you push the PR.

- Patrick

On Tue, Dec 2, 2014 at 12:50 PM, Mridul Muralidharan  wrote:
> I second that !
> Would also be great if the JIRA was updated accordingly too.
>
> Regards,
> Mridul
>
>
> On Wed, Dec 3, 2014 at 1:53 AM, Kay Ousterhout  
> wrote:
>> Hi all,
>>
>> I've noticed a bunch of times lately where a pull request changes to be
>> pretty different from the original pull request, and the title /
>> description never get updated.  Because the pull request title and
>> description are used as the commit message, the incorrect description lives
>> on forever, making it harder to understand the reason behind a particular
>> commit without going back and reading the entire conversation on the pull
>> request.  If folks could try to keep these up to date (and committers, try
>> to remember to verify that the title and description are correct before
>> making merging pull requests), that would be awesome.
>>
>> -Kay
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-01 Thread Patrick Wendell
Hey All,

Just an update. Josh, Andrew, and others are working to reproduce
SPARK-4498 and fix it. Other than that issue no serious regressions
have been reported so far. If we are able to get a fix in for that
soon, we'll likely cut another RC with the patch.

Continued testing of RC1 is definitely appreciated!

I'll leave this vote open to allow folks to continue posting comments.
It's fine to still give "+1" from your own testing... i.e. you can
assume at this point SPARK-4498 will be fixed before releasing.

- Patrick

On Mon, Dec 1, 2014 at 3:30 PM, Matei Zaharia  wrote:
> +0.9 from me. Tested it on Mac and Windows (someone has to do it) and while 
> things work, I noticed a few recent scripts don't have Windows equivalents, 
> namely https://issues.apache.org/jira/browse/SPARK-4683 and 
> https://issues.apache.org/jira/browse/SPARK-4684. The first one at least 
> would be good to fix if we do another RC. Not blocking the release but useful 
> to fix in docs is https://issues.apache.org/jira/browse/SPARK-4685.
>
> Matei
>
>
>> On Dec 1, 2014, at 11:18 AM, Josh Rosen  wrote:
>>
>> Hi everyone,
>>
>> There's an open bug report related to Spark standalone which could be a 
>> potential release-blocker (pending investigation / a bug fix): 
>> https://issues.apache.org/jira/browse/SPARK-4498.  This issue seems 
>> non-deterministc and only affects long-running Spark standalone deployments, 
>> so it may be hard to reproduce.  I'm going to work on a patch to add 
>> additional logging in order to help with debugging.
>>
>> I just wanted to give an early head's up about this issue and to get more 
>> eyes on it in case anyone else has run into it or wants to help with 
>> debugging.
>>
>> - Josh
>>
>> On November 28, 2014 at 9:18:09 PM, Patrick Wendell (pwend...@gmail.com) 
>> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.2.0!
>>
>> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1048/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.2.0!
>>
>> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.2.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening very late into the QA period compared with
>> previous votes, so -1 votes should only occur for significant
>> regressions from 1.0.2. Bugs already present in 1.1.X, minor
>> regressions, or bugs related to new features will not block this
>> release.
>>
>> == What default changes should I be aware of? ==
>> 1. The default value of "spark.shuffle.blockTransferService" has been
>> changed to "netty"
>> --> Old behavior can be restored by switching to "nio"
>>
>> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
>> --> Old behavior can be restored by setting "spark.shuffle.manager" to 
>> "hash".
>>
>> == Other notes ==
>> Because this vote is occurring over a weekend, I will likely extend
>> the vote if this RC survives until the end of the vote period.
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spurious test failures, testing best practices

2014-11-30 Thread Patrick Wendell
Hi Ilya - you can just submit a pull request and the way we test them
is to run it through jenkins. You don't need to do anything special.

On Sun, Nov 30, 2014 at 8:57 PM, Ganelin, Ilya
 wrote:
> Hi, Patrick - with regards to testing on Jenkins, is the process for this
> to submit a pull request for the branch or is there another interface we
> can use to submit a build to Jenkins for testing?
>
> On 11/30/14, 6:49 PM, "Patrick Wendell"  wrote:
>
>>Hey Ryan,
>>
>>A few more things here. You should feel free to send patches to
>>Jenkins to test them, since this is the reference environment in which
>>we regularly run tests. This is the normal workflow for most
>>developers and we spend a lot of effort provisioning/maintaining a
>>very large jenkins cluster to allow developers access this resource. A
>>common development approach is to locally run tests that you've added
>>in a patch, then send it to jenkins for the full run, and then try to
>>debug locally if you see specific unanticipated test failures.
>>
>>One challenge we have is that given the proliferation of OS versions,
>>Java versions, Python versions, ulimits, etc. there is a combinatorial
>>number of environments in which tests could be run. It is very hard in
>>some cases to figure out post-hoc why a given test is not working in a
>>specific environment. I think a good solution here would be to use a
>>standardized docker container for running Spark tests and asking folks
>>to use that locally if they are trying to run all of the hundreds of
>>Spark tests.
>>
>>Another solution would be to mock out every system interaction in
>>Spark's tests including e.g. filesystem interactions to try and reduce
>>variance across environments. However, that seems difficult.
>>
>>As the number of developers of Spark increases, it's definitely a good
>>idea for us to invest in developer infrastructure including things
>>like snapshot releases, better documentation, etc. Thanks for bringing
>>this up as a pain point.
>>
>>- Patrick
>>
>>
>>On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams
>> wrote:
>>> thanks for the info, Matei and Brennon. I will try to switch my
>>>workflow to
>>> using sbt. Other potential action items:
>>>
>>> - currently the docs only contain information about building with maven,
>>> and even then don't cover many important cases, as I described in my
>>> previous email. If SBT is as much better as you've described then that
>>> should be made much more obvious. Wasn't it the case recently that there
>>> was only a page about building with SBT, and not one about building with
>>> maven? Clearer messaging around this needs to exist in the
>>>documentation,
>>> not just on the mailing list, imho.
>>>
>>> - +1 to better distinguishing between unit and integration tests, having
>>> separate scripts for each, improving documentation around common
>>>workflows,
>>> expectations of brittleness with each kind of test, advisability of just
>>> relying on Jenkins for certain kinds of tests to not waste too much
>>>time,
>>> etc. Things like the compiler crash should be discussed in the
>>> documentation, not just in the mailing list archives, if new
>>>contributors
>>> are likely to run into them through no fault of their own.
>>>
>>> - What is the algorithm you use to decide what tests you might have
>>>broken?
>>> Can we codify it in some scripts that other people can use?
>>>
>>>
>>>
>>> On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia 
>>> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> As a tip (and maybe this isn't documented well), I normally use SBT for
>>>> development to avoid the slow build process, and use its interactive
>>>> console to run only specific tests. The nice advantage is that SBT can
>>>>keep
>>>> the Scala compiler loaded and JITed across builds, making it faster to
>>>> iterate. To use it, you can do the following:
>>>>
>>>> - Start the SBT interactive console with sbt/sbt
>>>> - Build your assembly by running the "assembly" target in the assembly
>>>> project: assembly/assembly
>>>> - Run all the tests in one module: core/test
>>>> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
>>>>(this
>>>> also supports tab completion)
>>>>
>>>> Run
