Re: Handling stale PRs

2014-08-30 Thread Nicholas Chammas
On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell pwend...@gmail.com wrote:

 it's actually procedurally difficult for us to close pull requests


Just an FYI: It seems the GitHub-sanctioned workaround for granting
issues-only permissions is to maintain a second, issues-only repository
(https://help.github.com/articles/issues-only-access-permissions). Not a
very attractive workaround...

Nick


Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-30 Thread Patrick Wendell
Thanks to Nick Chammas and Cheng Lian who pointed out two issues with
the release candidate. I'll cancel this in favor of RC3.

On Fri, Aug 29, 2014 at 1:33 PM, Jeremy Freeman
freeman.jer...@gmail.com wrote:
 +1. Validated several custom analysis pipelines on a private cluster in
 standalone mode. Tested new PySpark support for arbitrary Hadoop input
 formats, works great!

 -- Jeremy







[VOTE] Release Apache Spark 1.1.0 (RC3)

2014-08-30 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.1.0!

The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1030/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/

Please vote on releasing this package as Apache Spark 1.1.0!

The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== Regressions fixed since RC1 ==
- Build issue for SQL support: https://issues.apache.org/jira/browse/SPARK-3234
- EC2 script version bump to 1.1.0.

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with
previous votes, so -1 votes should only occur for significant
regressions from 1.0.2. Bugs already present in 1.0.X will not block
this release.

== What default changes should I be aware of? ==
1. The default value of spark.io.compression.codec is now snappy
-- Old behavior can be restored by switching to lzf

2. PySpark now performs external spilling during aggregations.
-- Old behavior can be restored by setting spark.shuffle.spill to false.
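For reference, here is a minimal sketch of restoring both of the previous
defaults programmatically. It assumes the settings are applied through
SparkConf in an application; the same keys can also go in
conf/spark-defaults.conf or be passed with --conf, and the application name
below is just a placeholder.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: revert the two defaults above to their 1.0.x values.
val conf = new SparkConf()
  .setAppName("legacy-defaults")                // placeholder name
  .set("spark.io.compression.codec", "lzf")     // 1. compression codec back to lzf
  .set("spark.shuffle.spill", "false")          // 2. turn off external spilling
val sc = new SparkContext(conf)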




[SPARK-3324] make yarn module as a unified maven jar project

2014-08-30 Thread Yi Tian
Hi everyone!

I found that the YARN module has a nonstandard path structure:

${SPARK_HOME}
  |--yarn
     |--alpha  (YARN API support for 0.23 and 2.0.x)
     |--stable (YARN API support for 2.2 and later)
     |    |--pom.xml (spark-yarn)
     |--common (common code not depending on a specific Hadoop version)
     |--pom.xml (yarn-parent)

When we use Maven to compile the yarn module, Maven imports either the 'alpha' or the 
'stable' submodule according to the profile setting.
A submodule such as 'stable' then uses build properties defined in yarn/pom.xml 
to add the common code to its source path.
As a result, IntelliJ cannot directly recognize the sources in the common directory 
as part of the source path.

I think we should change the yarn module into a single, unified Maven jar project 
and select the different YARN API versions via Maven profile settings.

I created a JIRA ticket: https://issues.apache.org/jira/browse/SPARK-3324

Any advice would be appreciated.






Fwd: Partitioning strategy changed in Spark 1.0.x?

2014-08-30 Thread Reynold Xin
Sending the response back to the dev list so this is indexable and
searchable by others.

-- Forwarded message --
From: Milos Nikolic milos.nikoli...@gmail.com
Date: Sat, Aug 30, 2014 at 5:50 PM
Subject: Re: Partitioning strategy changed in Spark 1.0.x?
To: Reynold Xin r...@databricks.com


Thank you, your insights were very helpful, and we managed to find a
solution that works for us.

Best,
Milos


On Aug 27, 2014, at 11:20 PM, Reynold Xin r...@databricks.com wrote:

I don't think you can ever rely on a fixed mapping from data to physical nodes
in Spark, even in Spark 0.9. That is because the scheduler needs to be
fault-tolerant: what if a node is busy or down?

What happens is that the partitioning of data is deterministic, i.e. a given
key is always hashed into the same partition (given the same partition
count). And if you don't run foreach twice, but instead simply zip two
RDDs that are both hash-partitioned using the same partitioner, then the
scheduler will not create extra stages.

e.g.

// Let's say I have 10 nodes
val partitioner = new HashPartitioner(10)

// Create RDD
val rdd = sc.parallelize(0 until 10).map(k => (k, computeValue(k)))

// Partition twice using the same partitioner
val p1 = rdd.partitionBy(partitioner)
val p2 = rdd.partitionBy(partitioner)
p1.zip(p2)   // <-- this should work
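
For concreteness, here is a self-contained version of the snippet above that
runs in local mode and checks that corresponding partitions of p1 and p2 hold
the same key sets. computeValue, the object name, and the master setting are
placeholders rather than anything prescribed by Spark.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // implicit conversions for pair RDD operations

object SamePartitionerZipSketch {
  // Placeholder for whatever per-key computation you actually run.
  def computeValue(k: Int): Int = k * 10

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("same-partitioner-zip").setMaster("local[4]"))
    val partitioner = new HashPartitioner(10)

    val rdd = sc.parallelize(0 until 1000).map(k => (k, computeValue(k)))

    // Both RDDs are shuffled with the same partitioner, so partition i of p1
    // and partition i of p2 contain the same set of keys.
    val p1 = rdd.partitionBy(partitioner)
    val p2 = rdd.partitionBy(partitioner)

    // Compare the key sets of corresponding partitions without another shuffle.
    val allMatch = p1.zipPartitions(p2) { (left, right) =>
      Iterator(left.map(_._1).toSet == right.map(_._1).toSet)
    }.collect().forall(identity)

    println("Corresponding partitions share the same keys: " + allMatch)
    sc.stop()
  }
}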




On Wed, Aug 27, 2014 at 1:50 PM, Milos Nikolic milos.nikoli...@gmail.com
wrote:

 Sure.

 Suppose we have two SQL relations, expressed as two RDDs, and we want to
 do a hash join between them. First, we would partition each RDD on the join
 key — that will collocate partitions with the same join key on one node.
 Then, I would zip corresponding partitions from two relations and do a
 local join on each node.

 This approach makes sense only if Spark always places key X on node Y for
 both RDDs, which is not true now. And I have no idea how to circumvent this
 issue with the recent changes in hashing you mentioned.
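
 For illustration, here is a rough sketch of the local join described above,
 using zipPartitions to pair up corresponding partitions. It assumes both
 sides are partitioned with the same HashPartitioner, so that corresponding
 partitions hold the same keys; the data, names, and master setting are made
 up for the example.

 import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
 import org.apache.spark.SparkContext._   // implicit conversions for pair RDD operations

 object LocalHashJoinSketch {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(
       new SparkConf().setAppName("local-hash-join").setMaster("local[4]"))
     val partitioner = new HashPartitioner(10)

     // Two toy relations keyed by the join key, both partitioned the same way.
     val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"))).partitionBy(partitioner)
     val right = sc.parallelize(Seq((1, "x"), (3, "y"), (4, "z"))).partitionBy(partitioner)

     // Per-partition hash join: build a table from one side, probe with the other.
     val joined = left.zipPartitions(right) { (leftIter, rightIter) =>
       val table = leftIter.toSeq.groupBy(_._1)
       rightIter.flatMap { case (k, rv) =>
         table.getOrElse(k, Seq.empty).map { case (_, lv) => (k, (lv, rv)) }
       }
     }

     joined.collect().foreach(println)   // expect (1,(a,x)) and (3,(c,y))
     sc.stop()
   }
 }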

 Milos

 On Aug 27, 2014, at 10:05 PM, Reynold Xin r...@databricks.com wrote:

 Can you elaborate on your problem?

 I am not sure I understand what you mean by "on one node, I get two
 different sets of keys".


 On Tue, Aug 26, 2014 at 2:16 AM, Milos Nikolic milos.nikoli...@gmail.com
 wrote:

 Hi Reynold,

 The problem still exists even with more elements. On one node, I get two
 different sets of keys -- I want these local sets to be the same to be able
 to zip local partitions together later on (rather than RDD.join them, which
 involves shuffling).

 With this recent change in hashing, RDD.zip no longer seems useful,
 as I cannot guarantee that local partitions from two RDDs will share
 the same set of keys on one node.

 Do you have any ideas on how to resolve this problem?

 Thanks,
 Milos



 On Aug 26, 2014, at 10:04 AM, Reynold Xin r...@databricks.com wrote:

 It is better to use a larger number of elements rather than just 10 for
 this test.

 Can you try larger? Like 1000 or 1?

 IIRC, the hash function changed to murmur hash:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala#L205






 On Tue, Aug 26, 2014 at 1:01 AM, Milos Nikolic milos.nikoli...@gmail.com
  wrote:

 Hi guys,

 I’ve noticed some changes in the behavior of partitioning under Spark
 1.0.x.
 I’d appreciate it if someone could explain what has changed in the meantime.

 Here is a small example. I want to create two RDD[(K, V)] objects and
 then
 collocate partitions with the same K on one node. When the same
 partitioner
 for two RDDs is used, partitions with the same K end up being on
 different nodes.

 // Let's say I have 10 nodes
 val partitioner = new HashPartitioner(10)

 // Create RDD
 val rdd = sc.parallelize(0 until 10).map(k => (k, computeValue(k)))

 // Partition twice using the same partitioner
 rdd.partitionBy(partitioner).foreach { case (k, v) =>
   println("Dummy1 - k = " + k) }
 rdd.partitionBy(partitioner).foreach { case (k, v) =>
   println("Dummy2 - k = " + k) }

 The output on one node is:
 Dummy1 - k = 2
 Dummy2 - k = 7

 I was expecting to see the same keys on each node. That was happening
 under Spark 0.9.2, but not under Spark 1.0.x.

 Does anyone have an idea of what has changed in the meantime? Or how to get
 corresponding partitions on one node?

 Thanks in advance,
 Milos