spark git commit: [SPARK-8406] [SQL] Adding UUID to output file name to avoid accidental overwriting

2015-06-22 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/master 47c1d5629 -> 0818fdec3


[SPARK-8406] [SQL] Adding UUID to output file name to avoid accidental 
overwriting

This PR fixes a Parquet output file name collision bug which may cause data 
loss.  Changes made:

1.  Identify each write job issued by `InsertIntoHadoopFsRelation` with a UUID

All concrete data sources that extend `HadoopFsRelation` (Parquet and ORC 
for now) must use this UUID when generating task output file paths, so that 
files from different write jobs can never collide (a minimal naming sketch 
follows right after this list).

2.  Make `TestHive` use a local mode `SparkContext` with 32 threads to increase 
parallelism

The major reason for this is that the original parallelism of 2 is too low 
to reproduce the data loss issue.  Also, higher concurrency may potentially 
catch more concurrency bugs during the testing phase.  (It did help us spot 
SPARK-8501.)

3.  `OrcSourceSuite` was updated to work around SPARK-8501, which we detected 
along the way.
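For illustration only, a minimal Scala sketch of the resulting naming scheme 
(not the actual patch; the real change is visible in the backport diff further 
below).  The driver generates one UUID per write job, ships it to tasks through 
the Hadoop configuration under `spark.sql.sources.writeJobUUID`, and every task 
embeds it, together with its own task ID, in the output file name:

    import java.util.UUID

    // Driver side: one UUID per write job, propagated to tasks via the Hadoop conf.
    val uniqueWriteJobId = UUID.randomUUID().toString

    // Task side: combine the task ID and the job UUID, so files written by
    // different tasks, or by different jobs appending to the same directory,
    // can never share a name.
    def outputFileName(taskId: Int, extension: String): String =
      f"part-r-$taskId%05d-$uniqueWriteJobId$extension"

    // e.g. part-r-00003-<uuid>.gz.parquet
    println(outputFileName(3, ".gz.parquet"))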

NOTE: This PR turned out a little more complicated than expected because we 
hit two other bugs along the way and had to work around them.  See [SPARK-8501] 
[1] and [SPARK-8513] [2].

[1]: https://github.com/liancheng/spark/tree/spark-8501
[2]: https://github.com/liancheng/spark/tree/spark-8513



Some background and a summary of offline discussion with yhuai about this issue 
for better understanding:

In 1.4.0, we added `HadoopFsRelation` to abstract partition support for all data 
sources based on the Hadoop `FileSystem` interface.  Specifically, this makes 
partition discovery, partition pruning, and writing dynamic partitions much 
easier for data sources.

To support appending, the Parquet data source tries to find out the max part 
number among the part-files in the destination directory (i.e., the `<id>` in an 
output file name like `part-r-<id>.gz.parquet`) at the beginning of the write 
job.  In 1.3.0, this step happens on the driver side before any files are 
written.  In 1.4.0, however, it is moved to the task side.  Unfortunately, a 
task scheduled later may observe a larger max part number that already includes 
files newly written by other tasks that finished earlier within the same job.  
This is effectively a race condition.  In most cases it only causes 
nonconsecutive part numbers in output file names.  But when the DataFrame 
contains thousands of RDD partitions, it becomes likely that two tasks choose 
the same part number, and one output file is then overwritten by the other.
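A tiny Scala sketch of the race described above (illustrative only; the removed 
code can be seen in the backport diff further below).  The old scheme derived a 
task's part number from its task ID plus the max part number it happened to 
observe, and that observation depends on scheduling order:

    // Old scheme: part number = taskId + max part number observed at task start + 1.
    def partNumber(taskId: Int, observedMax: Int): Int = taskId + observedMax + 1

    // Task 3 starts early and sees max = 1; task 2 is scheduled later, after some
    // other task has already written part-r-00002, so it sees max = 2.
    val taskThree = partNumber(taskId = 3, observedMax = 1)  // 5
    val taskTwo   = partNumber(taskId = 2, observedMax = 2)  // 5 -- same file name
    assert(taskThree == taskTwo)                             // one output overwrites the other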

Before `HadoopFsRelation`, Spark SQL already supports appending data to Hive 
tables.  From a user's perspective, these two look similar.  However, they 
differ a lot internally.  When data are inserted into Hive tables via Spark 
SQL, `InsertIntoHiveTable` simulates Hive's behaviors:

1.  Write data to a temporary location

2.  Move data in the temporary location to the final destination location using

-   `Hive.loadTable()` for non-partitioned table
-   `Hive.loadPartition()` for static partitions
-   `Hive.loadDynamicPartitions()` for dynamic partitions

The important part is that `Hive.copyFiles()` is invoked in step 2 to move the 
data to the destination directory (the name is a bit confusing since no actual 
"copying" occurs here; files are only moved and renamed).  If a file in the 
source directory and another file in the destination directory happen to have 
the same name, say `part-r-1.parquet`, the former is moved to the destination 
directory and renamed with a `_copy_N` suffix (`part-r-1_copy_1.parquet`).  
That's how Hive handles appending and avoids name collisions between different 
write jobs.
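A rough Scala sketch of that renaming scheme (a hypothetical helper for 
illustration, not Hive's actual code):

    // If the destination already holds a file with the same name, append a
    // `_copy_N` suffix, bumping N until a free name is found.
    def destinationName(name: String, existing: Set[String]): String =
      if (!existing.contains(name)) name
      else {
        val (base, ext) = name.span(_ != '.')
        Iterator.from(1)
          .map(n => s"${base}_copy_$n$ext")
          .find(candidate => !existing.contains(candidate))
          .get
      }

    // part-r-1.parquet already exists, so the moved file becomes part-r-1_copy_1.parquet
    println(destinationName("part-r-1.parquet", Set("part-r-1.parquet")))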

Some alternative fixes considered for this issue:

1.  Use a similar approach as Hive

This approach was not chosen for Spark 1.4.0 mainly because file metadata 
operations on S3 tend to be slow, especially for tables with lots of files 
and/or partitions.  That's why `InsertIntoHadoopFsRelation` writes to the 
destination directory directly, and is often used together with 
`DirectParquetOutputCommitter` to reduce latency when working with S3.  This 
means we never get a chance to rename files afterwards, and must avoid name 
collisions from the very beginning.

2.  Same as 1.3, just move max part number detection back to driver side

This isn't doable because, unlike 1.3, 1.4 also takes dynamic partitioning 
into account.  When inserting into dynamic partitions, we don't know on the 
driver side which partition directories will be touched before issuing the 
write job.  Checking all partition directories is simply too expensive for 
tables with thousands of partitions.

3.  Add extra component to output file names to avoid name collision

This seems to be the only reasonable solution for now.  To be more specific, 
we need a JOB level unique identifier to identify all write jobs issued by 
`InsertIntoHadoopFsRelation`.  Note that TASK level unique identifiers can NOT 
be used, because a speculative task would then write to a different output file 
than the original task.  If both tasks succ

spark git commit: [SPARK-8406] [SQL] Backports SPARK-8406 and PR #6864 to branch-1.4

2015-06-22 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/branch-1.4 b836bac3f -> 451c8722a


[SPARK-8406] [SQL] Backports SPARK-8406 and PR #6864 to branch-1.4

Author: Cheng Lian 

Closes #6932 from liancheng/spark-8406-for-1.4 and squashes the following 
commits:

a0168fe [Cheng Lian] Backports SPARK-8406 and PR #6864 to branch-1.4


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/451c8722
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/451c8722
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/451c8722

Branch: refs/heads/branch-1.4
Commit: 451c8722afea83e8e8f11c438469eea10e5acf4c
Parents: b836bac
Author: Cheng Lian 
Authored: Mon Jun 22 10:04:29 2015 -0700
Committer: Yin Huai 
Committed: Mon Jun 22 10:04:29 2015 -0700

--
 .../apache/spark/sql/parquet/newParquet.scala   | 41 ++
 .../org/apache/spark/sql/sources/commands.scala | 59 
 .../spark/sql/hive/orc/OrcFileOperator.scala|  7 ++-
 .../apache/spark/sql/hive/orc/OrcRelation.scala |  3 +-
 .../apache/spark/sql/hive/test/TestHive.scala   |  2 +-
 .../spark/sql/hive/orc/OrcSourceSuite.scala | 22 
 .../spark/sql/sources/SimpleTextRelation.scala  |  4 +-
 .../sql/sources/hadoopFsRelationSuites.scala| 39 +++--
 8 files changed, 110 insertions(+), 67 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/451c8722/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
index 3328e6f..abf9614 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
@@ -61,50 +61,21 @@ private[sql] class ParquetOutputWriter(path: String, 
context: TaskAttemptContext
   extends OutputWriter {
 
   private val recordWriter: RecordWriter[Void, Row] = {
-val conf = context.getConfiguration
 val outputFormat = {
-  // When appending new Parquet files to an existing Parquet file 
directory, to avoid
-  // overwriting existing data files, we need to find out the max task ID 
encoded in these data
-  // file names.
-  // TODO Make this snippet a utility function for other data source 
developers
-  val maxExistingTaskId = {
-// Note that `path` may point to a temporary location.  Here we 
retrieve the real
-// destination path from the configuration
-val outputPath = new Path(conf.get("spark.sql.sources.output.path"))
-val fs = outputPath.getFileSystem(conf)
-
-if (fs.exists(outputPath)) {
-  // Pattern used to match task ID in part file names, e.g.:
-  //
-  //   part-r-1.gz.parquet
-  //  ^
-  val partFilePattern = """part-.-(\d{1,}).*""".r
-
-  fs.listStatus(outputPath).map(_.getPath.getName).map {
-case partFilePattern(id) => id.toInt
-case name if name.startsWith("_") => 0
-case name if name.startsWith(".") => 0
-case name => throw new AnalysisException(
-  s"Trying to write Parquet files to directory $outputPath, " +
-s"but found items with illegal name '$name'.")
-  }.reduceOption(_ max _).getOrElse(0)
-} else {
-  0
-}
-  }
-
   new ParquetOutputFormat[Row]() {
 // Here we override `getDefaultWorkFile` for two reasons:
 //
-//  1. To allow appending.  We need to generate output file name based 
on the max available
-// task ID computed above.
+//  1. To allow appending.  We need to generate unique output file 
names to avoid
+// overwriting existing files (either exist before the write job, 
or are just written
+// by other tasks within the same write job).
 //
 //  2. To allow dynamic partitioning.  Default `getDefaultWorkFile` 
uses
 // `FileOutputCommitter.getWorkPath()`, which points to the base 
directory of all
 // partitions in the case of dynamic partitioning.
 override def getDefaultWorkFile(context: TaskAttemptContext, 
extension: String): Path = {
-  val split = context.getTaskAttemptID.getTaskID.getId + 
maxExistingTaskId + 1
-  new Path(path, f"part-r-$split%05d$extension")
+  val uniqueWriteJobId = 
context.getConfiguration.get("spark.sql.sources.writeJobUUID")
+  val split = context.getTaskAttemptID.getTaskID.getId
+  new Path(path, f"part-r-$split%05d-$uniqueWriteJobId$extension")
 }
   }
 }

http://git-wip-us.apache.org/repos/asf

spark git commit: [SPARK-8420] [SQL] Fix comparison of timestamps/dates with strings (branch-1.4)

2015-06-22 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/branch-1.4 451c8722a -> 65981619b


[SPARK-8420] [SQL] Fix comparison of timestamps/dates with strings (branch-1.4)

This is branch 1.4 backport of https://github.com/apache/spark/pull/6888.

Below is the original description.

In earlier versions of Spark SQL we cast `TimestampType` and `DateType` to 
`StringType` when they were involved in a binary comparison with a `StringType`.  
This allowed comparing a timestamp with a partial date as a user would expect.
 - `time > "2014-06-10"`
 - `time > "2014"`

In 1.4.0 we tried to cast the string into a Timestamp instead.  However, since a 
partial date is not a valid complete timestamp, this results in `null`, which in 
turn causes the tuple to be filtered out.

This PR restores the earlier behavior.  Note that we still special-case equality 
so that these comparisons are not affected by timestamps not printing zeros for 
subsecond precision.
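For illustration, assuming a DataFrame `logs` with a `TimestampType` column 
`time` (and `import sqlContext.implicits._` in scope), the restored behavior 
looks like this:

    // Ordering comparisons cast the timestamp side to a string, so partial date
    // strings behave as users expect (timestamp strings sort lexicographically).
    logs.filter($"time" > "2014-06-10")   // rows after midnight on 2014-06-10
    logs.filter($"time" > "2014")         // rows from 2014 onwards

    // Equality is special-cased the other way around: the string is cast to a
    // timestamp, so missing subsecond zeros in the string do not break equality.
    logs.filter($"time" === "2014-06-10 12:00:00")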

Author: Michael Armbrust 

Closes #6888 from marmbrus/timeCompareString and squashes the following commits:

bdef29c [Michael Armbrust] test partial date
1f09adf [Michael Armbrust] special handling of equality
1172c60 [Michael Armbrust] more test fixing
4dfc412 [Michael Armbrust] fix tests
aaa9508 [Michael Armbrust] newline
04d908f [Michael Armbrust] [SPARK-8420][SQL] Fix comparision of 
timestamps/dates with strings

Conflicts:

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

Author: Michael Armbrust 

Closes #6914 from yhuai/timeCompareString-1.4 and squashes the following 
commits:

9882915 [Michael Armbrust] [SPARK-8420] [SQL] Fix comparision of 
timestamps/dates with strings


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/65981619
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/65981619
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/65981619

Branch: refs/heads/branch-1.4
Commit: 65981619b26da03f0c5133133e318a180235e96d
Parents: 451c872
Author: Michael Armbrust 
Authored: Mon Jun 22 10:45:33 2015 -0700
Committer: Yin Huai 
Committed: Mon Jun 22 10:45:33 2015 -0700

--
 .../catalyst/analysis/HiveTypeCoercion.scala| 17 --
 .../sql/catalyst/expressions/predicates.scala   |  9 
 .../apache/spark/sql/DataFrameDateSuite.scala   | 56 
 .../org/apache/spark/sql/SQLQuerySuite.scala|  4 ++
 .../scala/org/apache/spark/sql/TestData.scala   |  6 ---
 .../columnar/InMemoryColumnarQuerySuite.scala   |  7 ++-
 6 files changed, 88 insertions(+), 11 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/65981619/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
index fa7968e..6d0f4a0 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
@@ -242,7 +242,16 @@ trait HiveTypeCoercion {
   case a: BinaryArithmetic if a.right.dataType == StringType =>
 a.makeCopy(Array(a.left, Cast(a.right, DoubleType)))
 
-  // we should cast all timestamp/date/string compare into string compare
+  // For equality between string and timestamp we cast the string to a 
timestamp
+  // so that things like rounding of subsecond precision does not affect 
the comparison.
+  case p @ Equality(left @ StringType(), right @ TimestampType()) =>
+p.makeCopy(Array(Cast(left, TimestampType), right))
+  case p @ Equality(left @ TimestampType(), right @ StringType()) =>
+p.makeCopy(Array(left, Cast(right, TimestampType)))
+
+  // We should cast all relative timestamp/date/string comparison into 
string comparisions
+  // This behaves as a user would expect because timestamp strings sort 
lexicographically.
+  // i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true
   case p: BinaryComparison if p.left.dataType == StringType &&
   p.right.dataType == DateType =>
 p.makeCopy(Array(p.left, Cast(p.right, StringType)))
@@ -251,10 +260,12 @@ trait HiveTypeCoercion {
 p.makeCopy(Array(Cast(p.left, StringType), p.right))
   case p: BinaryComparison if p.left.dataType == StringType &&
   p.right.dataType == TimestampType =>
-p.makeCopy(Array(Cast(p.left, TimestampType), p.right))
+p.makeCopy(Array(p.left, Cast(p.right, StringType

spark git commit: [SPARK-8429] [EC2] Add ability to set additional tags

2015-06-22 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master 0818fdec3 -> 42a1f716f


[SPARK-8429] [EC2] Add ability to set additional tags

Adds an `--additional-tags` parameter that allows setting additional tags on all 
the created instances (masters and slaves).

The user can specify multiple tags by separating them with a comma (`,`), while 
each tag name and value are separated by a colon (`:`); for example, 
`Task:MySparkProject,Env:production` would add two tags, `Task` and `Env`, with 
the given values.
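For illustration, a small Scala sketch of how a value in that format can be 
parsed into a tag map (the actual implementation is the Python change to 
`spark_ec2.py` shown in the diff below):

    // "Task:MySparkProject,Env:production"
    //   -> Map("Task" -> "MySparkProject", "Env" -> "production")
    def parseAdditionalTags(spec: String): Map[String, String] =
      spec.split(",").filter(_.nonEmpty).map { tag =>
        val Array(name, value) = tag.split(":", 2).map(_.trim)
        name -> value
      }.toMap

    assert(parseAdditionalTags("Task:MySparkProject,Env:production") ==
      Map("Task" -> "MySparkProject", "Env" -> "production"))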

Author: Stefano Parmesan 

Closes #6857 from armisael/patch-1 and squashes the following commits:

c5ac92c [Stefano Parmesan] python style (pep8)
8e614f1 [Stefano Parmesan] Set multiple tags in a single request
bfc56af [Stefano Parmesan] Address SPARK-7900 by inceasing sleep time
daf8615 [Stefano Parmesan] Add ability to set additional tags


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/42a1f716
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/42a1f716
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/42a1f716

Branch: refs/heads/master
Commit: 42a1f716fa35533507784be5e9117a984a03e62d
Parents: 0818fde
Author: Stefano Parmesan 
Authored: Mon Jun 22 11:43:10 2015 -0700
Committer: Shivaram Venkataraman 
Committed: Mon Jun 22 11:43:10 2015 -0700

--
 ec2/spark_ec2.py | 28 
 1 file changed, 20 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/42a1f716/ec2/spark_ec2.py
--
diff --git a/ec2/spark_ec2.py b/ec2/spark_ec2.py
index 5608749..1037356 100755
--- a/ec2/spark_ec2.py
+++ b/ec2/spark_ec2.py
@@ -290,6 +290,10 @@ def parse_args():
 "--additional-security-group", type="string", default="",
 help="Additional security group to place the machines in")
 parser.add_option(
+"--additional-tags", type="string", default="",
+help="Additional tags to set on the machines; tags are 
comma-separated, while name and " +
+ "value are colon separated; ex: 
\"Task:MySparkProject,Env:production\"")
+parser.add_option(
 "--copy-aws-credentials", action="store_true", default=False,
 help="Add AWS credentials to hadoop configuration to allow Spark to 
access S3")
 parser.add_option(
@@ -684,16 +688,24 @@ def launch_cluster(conn, opts, cluster_name):
 
 # This wait time corresponds to SPARK-4983
 print("Waiting for AWS to propagate instance metadata...")
-time.sleep(5)
-# Give the instances descriptive names
+time.sleep(15)
+
+# Give the instances descriptive names and set additional tags
+additional_tags = {}
+if opts.additional_tags.strip():
+additional_tags = dict(
+map(str.strip, tag.split(':', 1)) for tag in 
opts.additional_tags.split(',')
+)
+
 for master in master_nodes:
-master.add_tag(
-key='Name',
-value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
+master.add_tags(
+dict(additional_tags, 
Name='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
+)
+
 for slave in slave_nodes:
-slave.add_tag(
-key='Name',
-value='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id))
+slave.add_tags(
+dict(additional_tags, 
Name='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id))
+)
 
 # Return all the instances
 return (master_nodes, slave_nodes)





spark git commit: [SPARK-8482] Added M4 instances to the list.

2015-06-22 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master 42a1f716f -> ba8a4537f


[SPARK-8482] Added M4 instances to the list.

AWS recently added M4 instances 
(https://aws.amazon.com/blogs/aws/the-new-m4-instance-type-bonus-price-reduction-on-m3-c4/).

Author: Pradeep Chhetri 

Closes #6899 from pradeepchhetri/master and squashes the following commits:

4f4ea79 [Pradeep Chhetri] Added t2.large instance
3d2bb6c [Pradeep Chhetri] Added M4 instances to the list


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ba8a4537
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ba8a4537
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ba8a4537

Branch: refs/heads/master
Commit: ba8a4537fee7d85f968cccf8d1c607731daae307
Parents: 42a1f71
Author: Pradeep Chhetri 
Authored: Mon Jun 22 11:45:31 2015 -0700
Committer: Shivaram Venkataraman 
Committed: Mon Jun 22 11:45:31 2015 -0700

--
 ec2/spark_ec2.py | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ba8a4537/ec2/spark_ec2.py
--
diff --git a/ec2/spark_ec2.py b/ec2/spark_ec2.py
index 1037356..63e2c79 100755
--- a/ec2/spark_ec2.py
+++ b/ec2/spark_ec2.py
@@ -362,7 +362,7 @@ def get_validate_spark_version(version, repo):
 
 
 # Source: http://aws.amazon.com/amazon-linux-ami/instance-type-matrix/
-# Last Updated: 2015-05-08
+# Last Updated: 2015-06-19
 # For easy maintainability, please keep this manually-inputted dictionary 
sorted by key.
 EC2_INSTANCE_TYPES = {
 "c1.medium":   "pvm",
@@ -404,6 +404,11 @@ EC2_INSTANCE_TYPES = {
 "m3.large":"hvm",
 "m3.xlarge":   "hvm",
 "m3.2xlarge":  "hvm",
+"m4.large":"hvm",
+"m4.xlarge":   "hvm",
+"m4.2xlarge":  "hvm",
+"m4.4xlarge":  "hvm",
+"m4.10xlarge": "hvm",
 "r3.large":"hvm",
 "r3.xlarge":   "hvm",
 "r3.2xlarge":  "hvm",
@@ -413,6 +418,7 @@ EC2_INSTANCE_TYPES = {
 "t2.micro":"hvm",
 "t2.small":"hvm",
 "t2.medium":   "hvm",
+"t2.large":"hvm",
 }
 
 
@@ -923,7 +929,7 @@ def wait_for_cluster_state(conn, opts, cluster_instances, 
cluster_state):
 # Get number of local disks available for a given EC2 instance type.
 def get_num_disks(instance_type):
 # Source: 
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html
-# Last Updated: 2015-05-08
+# Last Updated: 2015-06-19
 # For easy maintainability, please keep this manually-inputted dictionary 
sorted by key.
 disks_by_instance = {
 "c1.medium":   1,
@@ -965,6 +971,11 @@ def get_num_disks(instance_type):
 "m3.large":1,
 "m3.xlarge":   2,
 "m3.2xlarge":  2,
+"m4.large":0,
+"m4.xlarge":   0,
+"m4.2xlarge":  0,
+"m4.4xlarge":  0,
+"m4.10xlarge": 0,
 "r3.large":1,
 "r3.xlarge":   1,
 "r3.2xlarge":  1,
@@ -974,6 +985,7 @@ def get_num_disks(instance_type):
 "t2.micro":0,
 "t2.small":0,
 "t2.medium":   0,
+"t2.large":0,
 }
 if instance_type in disks_by_instance:
 return disks_by_instance[instance_type]





spark git commit: [SPARK-8511] [PYSPARK] Modify a test to remove a saved model in `regression.py`

2015-06-22 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master ba8a4537f -> 5d89d9f00


[SPARK-8511] [PYSPARK] Modify a test to remove a saved model in `regression.py`

[[SPARK-8511] Modify a test to remove a saved model in `regression.py` - ASF 
JIRA](https://issues.apache.org/jira/browse/SPARK-8511)

Author: Yu ISHIKAWA 

Closes #6926 from yu-iskw/SPARK-8511 and squashes the following commits:

7cd0948 [Yu ISHIKAWA] Use `shutil.rmtree()` to temporary directories for saving 
model testings, instead of `os.removedirs()`
4a01c9e [Yu ISHIKAWA] [SPARK-8511][pyspark] Modify a test to remove a saved 
model in `regression.py`


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5d89d9f0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5d89d9f0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5d89d9f0

Branch: refs/heads/master
Commit: 5d89d9f00ba4d6d0767a4c4964d3af324bf6f14b
Parents: ba8a453
Author: Yu ISHIKAWA 
Authored: Mon Jun 22 11:53:11 2015 -0700
Committer: Joseph K. Bradley 
Committed: Mon Jun 22 11:53:11 2015 -0700

--
 python/pyspark/mllib/classification.py |  9 ++---
 python/pyspark/mllib/clustering.py |  3 ++-
 python/pyspark/mllib/recommendation.py |  3 ++-
 python/pyspark/mllib/regression.py | 14 +-
 python/pyspark/mllib/tests.py  |  3 ++-
 5 files changed, 21 insertions(+), 11 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5d89d9f0/python/pyspark/mllib/classification.py
--
diff --git a/python/pyspark/mllib/classification.py 
b/python/pyspark/mllib/classification.py
index 42e4139..758accf 100644
--- a/python/pyspark/mllib/classification.py
+++ b/python/pyspark/mllib/classification.py
@@ -135,8 +135,9 @@ class LogisticRegressionModel(LinearClassificationModel):
 1
 >>> sameModel.predict(SparseVector(2, {0: 1.0}))
 0
+>>> from shutil import rmtree
 >>> try:
-...os.removedirs(path)
+...rmtree(path)
 ... except:
 ...pass
 >>> multi_class_data = [
@@ -387,8 +388,9 @@ class SVMModel(LinearClassificationModel):
 1
 >>> sameModel.predict(SparseVector(2, {0: -1.0}))
 0
+>>> from shutil import rmtree
 >>> try:
-...os.removedirs(path)
+...rmtree(path)
 ... except:
 ...pass
 """
@@ -515,8 +517,9 @@ class NaiveBayesModel(Saveable, Loader):
 >>> sameModel = NaiveBayesModel.load(sc, path)
 >>> sameModel.predict(SparseVector(2, {0: 1.0})) == 
model.predict(SparseVector(2, {0: 1.0}))
 True
+>>> from shutil import rmtree
 >>> try:
-... os.removedirs(path)
+... rmtree(path)
 ... except OSError:
 ... pass
 """

http://git-wip-us.apache.org/repos/asf/spark/blob/5d89d9f0/python/pyspark/mllib/clustering.py
--
diff --git a/python/pyspark/mllib/clustering.py 
b/python/pyspark/mllib/clustering.py
index c382298..e6ef729 100644
--- a/python/pyspark/mllib/clustering.py
+++ b/python/pyspark/mllib/clustering.py
@@ -79,8 +79,9 @@ class KMeansModel(Saveable, Loader):
 >>> sameModel = KMeansModel.load(sc, path)
 >>> sameModel.predict(sparse_data[0]) == model.predict(sparse_data[0])
 True
+>>> from shutil import rmtree
 >>> try:
-... os.removedirs(path)
+... rmtree(path)
 ... except OSError:
 ... pass
 """

http://git-wip-us.apache.org/repos/asf/spark/blob/5d89d9f0/python/pyspark/mllib/recommendation.py
--
diff --git a/python/pyspark/mllib/recommendation.py 
b/python/pyspark/mllib/recommendation.py
index 9c4647d..506ca21 100644
--- a/python/pyspark/mllib/recommendation.py
+++ b/python/pyspark/mllib/recommendation.py
@@ -106,8 +106,9 @@ class MatrixFactorizationModel(JavaModelWrapper, 
JavaSaveable, JavaLoader):
 0.4...
 >>> sameModel.predictAll(testset).collect()
 [Rating(...
+>>> from shutil import rmtree
 >>> try:
-... os.removedirs(path)
+... rmtree(path)
 ... except OSError:
 ... pass
 """

http://git-wip-us.apache.org/repos/asf/spark/blob/5d89d9f0/python/pyspark/mllib/regression.py
--
diff --git a/python/pyspark/mllib/regression.py 
b/python/pyspark/mllib/regression.py
index 0c4d7d3..5ddbbee 100644
--- a/python/pyspark/mllib/regression.py
+++ b/python/pyspark/mllib/regression.py
@@ -133,10 +133,11 @@ class LinearRegressionModel(LinearRegressionModelBase):
 True
 >>> abs(sameModel.predict(SparseVector(1, {0: 1.0})) - 1) < 0.5
 True
+>>> from shutil import rmtree
 >>> try:
-...os.removedirs(path)
+... rmtree(path)

spark git commit: [SPARK-8511] [PYSPARK] Modify a test to remove a saved model in `regression.py`

2015-06-22 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/branch-1.4 65981619b -> 507381d39


[SPARK-8511] [PYSPARK] Modify a test to remove a saved model in `regression.py`

[[SPARK-8511] Modify a test to remove a saved model in `regression.py` - ASF 
JIRA](https://issues.apache.org/jira/browse/SPARK-8511)

Author: Yu ISHIKAWA 

Closes #6926 from yu-iskw/SPARK-8511 and squashes the following commits:

7cd0948 [Yu ISHIKAWA] Use `shutil.rmtree()` to temporary directories for saving 
model testings, instead of `os.removedirs()`
4a01c9e [Yu ISHIKAWA] [SPARK-8511][pyspark] Modify a test to remove a saved 
model in `regression.py`

(cherry picked from commit 5d89d9f00ba4d6d0767a4c4964d3af324bf6f14b)
Signed-off-by: Joseph K. Bradley 

Conflicts:
python/pyspark/mllib/tests.py


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/507381d3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/507381d3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/507381d3

Branch: refs/heads/branch-1.4
Commit: 507381d393cdfee697e8b67f9d1f8a048f383c05
Parents: 6598161
Author: Yu ISHIKAWA 
Authored: Mon Jun 22 11:53:11 2015 -0700
Committer: Joseph K. Bradley 
Committed: Mon Jun 22 11:59:53 2015 -0700

--
 python/pyspark/mllib/classification.py |  9 ++---
 python/pyspark/mllib/clustering.py |  3 ++-
 python/pyspark/mllib/recommendation.py |  3 ++-
 python/pyspark/mllib/regression.py | 14 +-
 python/pyspark/mllib/tests.py  |  3 ++-
 5 files changed, 21 insertions(+), 11 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/507381d3/python/pyspark/mllib/classification.py
--
diff --git a/python/pyspark/mllib/classification.py 
b/python/pyspark/mllib/classification.py
index 42e4139..758accf 100644
--- a/python/pyspark/mllib/classification.py
+++ b/python/pyspark/mllib/classification.py
@@ -135,8 +135,9 @@ class LogisticRegressionModel(LinearClassificationModel):
 1
 >>> sameModel.predict(SparseVector(2, {0: 1.0}))
 0
+>>> from shutil import rmtree
 >>> try:
-...os.removedirs(path)
+...rmtree(path)
 ... except:
 ...pass
 >>> multi_class_data = [
@@ -387,8 +388,9 @@ class SVMModel(LinearClassificationModel):
 1
 >>> sameModel.predict(SparseVector(2, {0: -1.0}))
 0
+>>> from shutil import rmtree
 >>> try:
-...os.removedirs(path)
+...rmtree(path)
 ... except:
 ...pass
 """
@@ -515,8 +517,9 @@ class NaiveBayesModel(Saveable, Loader):
 >>> sameModel = NaiveBayesModel.load(sc, path)
 >>> sameModel.predict(SparseVector(2, {0: 1.0})) == 
model.predict(SparseVector(2, {0: 1.0}))
 True
+>>> from shutil import rmtree
 >>> try:
-... os.removedirs(path)
+... rmtree(path)
 ... except OSError:
 ... pass
 """

http://git-wip-us.apache.org/repos/asf/spark/blob/507381d3/python/pyspark/mllib/clustering.py
--
diff --git a/python/pyspark/mllib/clustering.py 
b/python/pyspark/mllib/clustering.py
index b55583f..15e4527 100644
--- a/python/pyspark/mllib/clustering.py
+++ b/python/pyspark/mllib/clustering.py
@@ -75,8 +75,9 @@ class KMeansModel(Saveable, Loader):
 >>> sameModel = KMeansModel.load(sc, path)
 >>> sameModel.predict(sparse_data[0]) == model.predict(sparse_data[0])
 True
+>>> from shutil import rmtree
 >>> try:
-... os.removedirs(path)
+... rmtree(path)
 ... except OSError:
 ... pass
 """

http://git-wip-us.apache.org/repos/asf/spark/blob/507381d3/python/pyspark/mllib/recommendation.py
--
diff --git a/python/pyspark/mllib/recommendation.py 
b/python/pyspark/mllib/recommendation.py
index 9c4647d..506ca21 100644
--- a/python/pyspark/mllib/recommendation.py
+++ b/python/pyspark/mllib/recommendation.py
@@ -106,8 +106,9 @@ class MatrixFactorizationModel(JavaModelWrapper, 
JavaSaveable, JavaLoader):
 0.4...
 >>> sameModel.predictAll(testset).collect()
 [Rating(...
+>>> from shutil import rmtree
 >>> try:
-... os.removedirs(path)
+... rmtree(path)
 ... except OSError:
 ... pass
 """

http://git-wip-us.apache.org/repos/asf/spark/blob/507381d3/python/pyspark/mllib/regression.py
--
diff --git a/python/pyspark/mllib/regression.py 
b/python/pyspark/mllib/regression.py
index 0c4d7d3..5ddbbee 100644
--- a/python/pyspark/mllib/regression.py
+++ b/python/pyspark/mllib/regression.py
@@ -133,10 +133,11 @@ class LinearRegressionModel(LinearRegressionModelBase):
 True
 >>> abs(sameModel.p

spark git commit: [SPARK-8104] [SQL] auto alias expressions in analyzer

2015-06-22 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/master 5d89d9f00 -> da7bbb943


[SPARK-8104] [SQL] auto alias expressions in analyzer

Currently we auto-alias expressions in the parser.  However, during the parsing 
phase we don't have enough information to generate the right alias.  For 
example, a Generator that produces more than one kind of element needs a 
MultiAlias, and an ExtractValue doesn't need an Alias if it sits in the middle 
of an ExtractValue chain.  This PR moves auto-aliasing into the analyzer.
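A small illustration of the resulting naming behavior (assuming a `sqlContext` 
with a registered table `allTypes` that has an integer column `i`; the Python 
doctest changes below show the same `_c0` style names):

    // An unaliased expression gets an auto-generated name such as "_c0";
    // add an explicit alias when a stable column name is needed.
    sqlContext.sql("SELECT i + 1 FROM allTypes").columns             // Array("_c0")
    sqlContext.sql("SELECT i + 1 AS i_plus_1 FROM allTypes").columns // Array("i_plus_1")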

Author: Wenchen Fan 

Closes #6647 from cloud-fan/alias and squashes the following commits:

552eba4 [Wenchen Fan] fix python
5b5786d [Wenchen Fan] fix agg
73a90cb [Wenchen Fan] fix case-preserve of ExtractValue
4cfd23c [Wenchen Fan] fix order by
d18f401 [Wenchen Fan] refine
9f07359 [Wenchen Fan] address comments
39c1aef [Wenchen Fan] small fix
33640ec [Wenchen Fan] auto alias expressions in analyzer


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/da7bbb94
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/da7bbb94
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/da7bbb94

Branch: refs/heads/master
Commit: da7bbb9435dae9a3bedad578599d96ea858f349e
Parents: 5d89d9f
Author: Wenchen Fan 
Authored: Mon Jun 22 12:13:00 2015 -0700
Committer: Michael Armbrust 
Committed: Mon Jun 22 12:13:00 2015 -0700

--
 python/pyspark/sql/context.py   |  9 ++-
 .../apache/spark/sql/catalyst/SqlParser.scala   | 11 +--
 .../spark/sql/catalyst/analysis/Analyzer.scala  | 77 
 .../sql/catalyst/analysis/CheckAnalysis.scala   |  9 +--
 .../sql/catalyst/analysis/unresolved.scala  | 20 -
 .../sql/catalyst/expressions/ExtractValue.scala | 36 +
 .../spark/sql/catalyst/planning/patterns.scala  |  6 +-
 .../catalyst/plans/logical/LogicalPlan.scala| 11 ++-
 .../catalyst/plans/logical/basicOperators.scala | 20 -
 .../scala/org/apache/spark/sql/Column.scala |  1 -
 .../scala/org/apache/spark/sql/DataFrame.scala  |  6 +-
 .../org/apache/spark/sql/GroupedData.scala  | 43 +--
 .../apache/spark/sql/execution/pythonUdfs.scala |  2 +-
 .../org/apache/spark/sql/SQLQuerySuite.scala|  6 +-
 .../scala/org/apache/spark/sql/TestData.scala   |  1 -
 .../org/apache/spark/sql/hive/HiveQl.scala  |  9 +--
 16 files changed, 150 insertions(+), 117 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/da7bbb94/python/pyspark/sql/context.py
--
diff --git a/python/pyspark/sql/context.py b/python/pyspark/sql/context.py
index 599c9ac..dc23922 100644
--- a/python/pyspark/sql/context.py
+++ b/python/pyspark/sql/context.py
@@ -86,7 +86,8 @@ class SQLContext(object):
 >>> df.registerTempTable("allTypes")
 >>> sqlContext.sql('select i+1, d+1, not b, list[1], dict["s"], time, 
row.a '
 ...'from allTypes where b and i > 0').collect()
-[Row(c0=2, c1=2.0, c2=False, c3=2, c4=0, time=datetime.datetime(2014, 
8, 1, 14, 1, 5), a=1)]
+[Row(_c0=2, _c1=2.0, _c2=False, _c3=2, _c4=0, \
+time=datetime.datetime(2014, 8, 1, 14, 1, 5), a=1)]
 >>> df.map(lambda x: (x.i, x.s, x.d, x.l, x.b, x.time, x.row.a, 
x.list)).collect()
 [(1, u'string', 1.0, 1, True, datetime.datetime(2014, 8, 1, 14, 1, 5), 
1, [1, 2, 3])]
 """
@@ -176,17 +177,17 @@ class SQLContext(object):
 
 >>> sqlContext.registerFunction("stringLengthString", lambda x: len(x))
 >>> sqlContext.sql("SELECT stringLengthString('test')").collect()
-[Row(c0=u'4')]
+[Row(_c0=u'4')]
 
 >>> from pyspark.sql.types import IntegerType
 >>> sqlContext.registerFunction("stringLengthInt", lambda x: len(x), 
IntegerType())
 >>> sqlContext.sql("SELECT stringLengthInt('test')").collect()
-[Row(c0=4)]
+[Row(_c0=4)]
 
 >>> from pyspark.sql.types import IntegerType
 >>> sqlContext.udf.register("stringLengthInt", lambda x: len(x), 
IntegerType())
 >>> sqlContext.sql("SELECT stringLengthInt('test')").collect()
-[Row(c0=4)]
+[Row(_c0=4)]
 """
 func = lambda _, it: map(lambda x: f(*x), it)
 ser = AutoBatchedSerializer(PickleSerializer())

http://git-wip-us.apache.org/repos/asf/spark/blob/da7bbb94/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala
index da3a717..79f526e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala
@@ -99,13 +99,6 @@ class SqlParser extends AbstractSpa

spark git commit: [SPARK-8532] [SQL] In Python's DataFrameWriter, save/saveAsTable/json/parquet/jdbc always override mode

2015-06-22 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/branch-1.4 507381d39 -> 994abbaeb


[SPARK-8532] [SQL] In Python's DataFrameWriter, 
save/saveAsTable/json/parquet/jdbc always override mode

https://issues.apache.org/jira/browse/SPARK-8532

This PR has two changes.  First, it fixes the bug that save actions (i.e. 
`save/saveAsTable/json/parquet/jdbc`) always override the save mode: their 
`mode="error"` default silently replaced any mode set earlier via `mode()`.  
Second, it adds an input argument `partitionBy` to `save/saveAsTable/parquet`.
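For reference, a short Scala sketch of the call pattern that the Python 
`DataFrameWriter` now mirrors (assuming a DataFrame `df` with columns `year` 
and `month`; the path and table name are hypothetical):

    // Set the save mode only when deviating from the default ("error"), and
    // combine it with partition columns as needed.
    df.write
      .mode("append")
      .partitionBy("year", "month")
      .parquet("/tmp/events")

    // partitionBy also works together with saveAsTable.
    df.write.partitionBy("year").saveAsTable("events_by_year")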

Author: Yin Huai 

Closes #6937 from yhuai/SPARK-8532 and squashes the following commits:

f972d5d [Yin Huai] davies's comment.
d37abd2 [Yin Huai] style.
d21290a [Yin Huai] Python doc.
889eb25 [Yin Huai] Minor refactoring and add partitionBy to save, saveAsTable, 
and parquet.
7fbc24b [Yin Huai] Use None instead of "error" as the default value of mode 
since JVM-side already uses "error" as the default value.
d696dff [Yin Huai] Python style.
88eb6c4 [Yin Huai] If mode is "error", do not call mode method.
c40c461 [Yin Huai] Regression test.

(cherry picked from commit 5ab9fcfb01a0ad2f6c103f67c1a785d3b49e33f0)
Signed-off-by: Yin Huai 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/994abbae
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/994abbae
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/994abbae

Branch: refs/heads/branch-1.4
Commit: 994abbaeb3c5444d09548291f865373ba4f1909f
Parents: 507381d
Author: Yin Huai 
Authored: Mon Jun 22 13:51:23 2015 -0700
Committer: Yin Huai 
Committed: Mon Jun 22 13:51:34 2015 -0700

--
 python/pyspark/sql/readwriter.py | 30 +++---
 python/pyspark/sql/tests.py  | 32 
 2 files changed, 51 insertions(+), 11 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/994abbae/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index f036644..1b7bc0f 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -218,7 +218,10 @@ class DataFrameWriter(object):
 
 >>> df.write.mode('append').parquet(os.path.join(tempfile.mkdtemp(), 
'data'))
 """
-self._jwrite = self._jwrite.mode(saveMode)
+# At the JVM side, the default value of mode is already set to "error".
+# So, if the given saveMode is None, we will not call JVM-side's mode 
method.
+if saveMode is not None:
+self._jwrite = self._jwrite.mode(saveMode)
 return self
 
 @since(1.4)
@@ -253,11 +256,12 @@ class DataFrameWriter(object):
 """
 if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
 cols = cols[0]
-self._jwrite = self._jwrite.partitionBy(_to_seq(self._sqlContext._sc, 
cols))
+if len(cols) > 0:
+self._jwrite = 
self._jwrite.partitionBy(_to_seq(self._sqlContext._sc, cols))
 return self
 
 @since(1.4)
-def save(self, path=None, format=None, mode="error", **options):
+def save(self, path=None, format=None, mode=None, partitionBy=(), 
**options):
 """Saves the contents of the :class:`DataFrame` to a data source.
 
 The data source is specified by the ``format`` and a set of 
``options``.
@@ -272,11 +276,12 @@ class DataFrameWriter(object):
 * ``overwrite``: Overwrite existing data.
 * ``ignore``: Silently ignore this operation if data already 
exists.
 * ``error`` (default case): Throw an exception if data already 
exists.
+:param partitionBy: names of partitioning columns
 :param options: all other string options
 
 >>> df.write.mode('append').parquet(os.path.join(tempfile.mkdtemp(), 
'data'))
 """
-self.mode(mode).options(**options)
+self.partitionBy(partitionBy).mode(mode).options(**options)
 if format is not None:
 self.format(format)
 if path is None:
@@ -296,7 +301,7 @@ class DataFrameWriter(object):
 self._jwrite.mode("overwrite" if overwrite else 
"append").insertInto(tableName)
 
 @since(1.4)
-def saveAsTable(self, name, format=None, mode="error", **options):
+def saveAsTable(self, name, format=None, mode=None, partitionBy=(), 
**options):
 """Saves the content of the :class:`DataFrame` as the specified table.
 
 In the case the table already exists, behavior of this function 
depends on the
@@ -312,15 +317,16 @@ class DataFrameWriter(object):
 :param name: the table name
 :param format: the format used to save
 :param mode: one of `append`, `overwrite`, `error`, `ignore` (default: 
error)
+:param partitionBy: names of partitioning columns
 :param options: all other string o

spark git commit: [SPARK-8532] [SQL] In Python's DataFrameWriter, save/saveAsTable/json/parquet/jdbc always override mode

2015-06-22 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/master da7bbb943 -> 5ab9fcfb0


[SPARK-8532] [SQL] In Python's DataFrameWriter, 
save/saveAsTable/json/parquet/jdbc always override mode

https://issues.apache.org/jira/browse/SPARK-8532

This PR has two changes. First, it fixes the bug that save actions (i.e. 
`save/saveAsTable/json/parquet/jdbc`) always override mode. Second, it adds 
input argument `partitionBy` to `save/saveAsTable/parquet`.

Author: Yin Huai 

Closes #6937 from yhuai/SPARK-8532 and squashes the following commits:

f972d5d [Yin Huai] davies's comment.
d37abd2 [Yin Huai] style.
d21290a [Yin Huai] Python doc.
889eb25 [Yin Huai] Minor refactoring and add partitionBy to save, saveAsTable, 
and parquet.
7fbc24b [Yin Huai] Use None instead of "error" as the default value of mode 
since JVM-side already uses "error" as the default value.
d696dff [Yin Huai] Python style.
88eb6c4 [Yin Huai] If mode is "error", do not call mode method.
c40c461 [Yin Huai] Regression test.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5ab9fcfb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5ab9fcfb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5ab9fcfb

Branch: refs/heads/master
Commit: 5ab9fcfb01a0ad2f6c103f67c1a785d3b49e33f0
Parents: da7bbb9
Author: Yin Huai 
Authored: Mon Jun 22 13:51:23 2015 -0700
Committer: Yin Huai 
Committed: Mon Jun 22 13:51:23 2015 -0700

--
 python/pyspark/sql/readwriter.py | 30 +++---
 python/pyspark/sql/tests.py  | 32 
 2 files changed, 51 insertions(+), 11 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5ab9fcfb/python/pyspark/sql/readwriter.py
--
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index f036644..1b7bc0f 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -218,7 +218,10 @@ class DataFrameWriter(object):
 
 >>> df.write.mode('append').parquet(os.path.join(tempfile.mkdtemp(), 
'data'))
 """
-self._jwrite = self._jwrite.mode(saveMode)
+# At the JVM side, the default value of mode is already set to "error".
+# So, if the given saveMode is None, we will not call JVM-side's mode 
method.
+if saveMode is not None:
+self._jwrite = self._jwrite.mode(saveMode)
 return self
 
 @since(1.4)
@@ -253,11 +256,12 @@ class DataFrameWriter(object):
 """
 if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
 cols = cols[0]
-self._jwrite = self._jwrite.partitionBy(_to_seq(self._sqlContext._sc, 
cols))
+if len(cols) > 0:
+self._jwrite = 
self._jwrite.partitionBy(_to_seq(self._sqlContext._sc, cols))
 return self
 
 @since(1.4)
-def save(self, path=None, format=None, mode="error", **options):
+def save(self, path=None, format=None, mode=None, partitionBy=(), 
**options):
 """Saves the contents of the :class:`DataFrame` to a data source.
 
 The data source is specified by the ``format`` and a set of 
``options``.
@@ -272,11 +276,12 @@ class DataFrameWriter(object):
 * ``overwrite``: Overwrite existing data.
 * ``ignore``: Silently ignore this operation if data already 
exists.
 * ``error`` (default case): Throw an exception if data already 
exists.
+:param partitionBy: names of partitioning columns
 :param options: all other string options
 
 >>> df.write.mode('append').parquet(os.path.join(tempfile.mkdtemp(), 
'data'))
 """
-self.mode(mode).options(**options)
+self.partitionBy(partitionBy).mode(mode).options(**options)
 if format is not None:
 self.format(format)
 if path is None:
@@ -296,7 +301,7 @@ class DataFrameWriter(object):
 self._jwrite.mode("overwrite" if overwrite else 
"append").insertInto(tableName)
 
 @since(1.4)
-def saveAsTable(self, name, format=None, mode="error", **options):
+def saveAsTable(self, name, format=None, mode=None, partitionBy=(), 
**options):
 """Saves the content of the :class:`DataFrame` as the specified table.
 
 In the case the table already exists, behavior of this function 
depends on the
@@ -312,15 +317,16 @@ class DataFrameWriter(object):
 :param name: the table name
 :param format: the format used to save
 :param mode: one of `append`, `overwrite`, `error`, `ignore` (default: 
error)
+:param partitionBy: names of partitioning columns
 :param options: all other string options
 """
-self.mode(mode).options(**options)
+self.partitionBy(partitionBy).

spark git commit: [SPARK-8455] [ML] Implement n-gram feature transformer

2015-06-22 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 5ab9fcfb0 -> afe35f051


[SPARK-8455] [ML] Implement n-gram feature transformer

Implementation of n-gram feature transformer for ML.
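A short usage sketch (assuming a DataFrame `wordsDF` with an array-of-strings 
column named "words"):

    import org.apache.spark.ml.feature.NGram

    // Produce space-separated bigrams from the tokenized input column.
    val ngram = new NGram()
      .setN(2)                 // number of elements per n-gram (default: 2)
      .setInputCol("words")
      .setOutputCol("ngrams")

    // Inputs shorter than n yield an empty output array; null values are ignored.
    val ngramsDF = ngram.transform(wordsDF)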

Author: Feynman Liang 

Closes #6887 from feynmanliang/ngram-featurizer and squashes the following 
commits:

d2c839f [Feynman Liang] Make n > input length yield empty output
9fadd36 [Feynman Liang] Add empty and corner test cases, fix names and spaces
fe93873 [Feynman Liang] Implement n-gram feature transformer


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/afe35f05
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/afe35f05
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/afe35f05

Branch: refs/heads/master
Commit: afe35f0519bc7dcb85010a7eedcff854d4fc313a
Parents: 5ab9fcf
Author: Feynman Liang 
Authored: Mon Jun 22 14:15:35 2015 -0700
Committer: Joseph K. Bradley 
Committed: Mon Jun 22 14:15:35 2015 -0700

--
 .../org/apache/spark/ml/feature/NGram.scala | 69 ++
 .../apache/spark/ml/feature/NGramSuite.scala| 94 
 2 files changed, 163 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/afe35f05/mllib/src/main/scala/org/apache/spark/ml/feature/NGram.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/NGram.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/NGram.scala
new file mode 100644
index 000..8de10eb
--- /dev/null
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/NGram.scala
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.UnaryTransformer
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.sql.types.{ArrayType, DataType, StringType}
+
+/**
+ * :: Experimental ::
+ * A feature transformer that converts the input array of strings into an 
array of n-grams. Null
+ * values in the input array are ignored.
+ * It returns an array of n-grams where each n-gram is represented by a 
space-separated string of
+ * words.
+ *
+ * When the input is empty, an empty array is returned.
+ * When the input array length is less than n (number of elements per n-gram), 
no n-grams are
+ * returned.
+ */
+@Experimental
+class NGram(override val uid: String)
+  extends UnaryTransformer[Seq[String], Seq[String], NGram] {
+
+  def this() = this(Identifiable.randomUID("ngram"))
+
+  /**
+   * Minimum n-gram length, >= 1.
+   * Default: 2, bigram features
+   * @group param
+   */
+  val n: IntParam = new IntParam(this, "n", "number elements per n-gram (>=1)",
+ParamValidators.gtEq(1))
+
+  /** @group setParam */
+  def setN(value: Int): this.type = set(n, value)
+
+  /** @group getParam */
+  def getN: Int = $(n)
+
+  setDefault(n -> 2)
+
+  override protected def createTransformFunc: Seq[String] => Seq[String] = {
+_.iterator.sliding($(n)).withPartial(false).map(_.mkString(" ")).toSeq
+  }
+
+  override protected def validateInputType(inputType: DataType): Unit = {
+require(inputType.sameType(ArrayType(StringType)),
+  s"Input type must be ArrayType(StringType) but got $inputType.")
+  }
+
+  override protected def outputDataType: DataType = new ArrayType(StringType, 
false)
+}

http://git-wip-us.apache.org/repos/asf/spark/blob/afe35f05/mllib/src/test/scala/org/apache/spark/ml/feature/NGramSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/NGramSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/feature/NGramSuite.scala
new file mode 100644
index 000..ab97e3d
--- /dev/null
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/NGramSuite.scala
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with

spark git commit: [SPARK-8537] [SPARKR] Add a validation rule about the curly braces in SparkR to `.lintr`

2015-06-22 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master afe35f051 -> b1f3a489e


[SPARK-8537] [SPARKR] Add a validation rule about the curly braces in SparkR to 
`.lintr`

[[SPARK-8537] Add a validation rule about the curly braces in SparkR to 
`.lintr` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8537)

Author: Yu ISHIKAWA 

Closes #6940 from yu-iskw/SPARK-8537 and squashes the following commits:

7eec1a0 [Yu ISHIKAWA] [SPARK-8537][SparkR] Add a validation rule about the 
curly braces in SparkR to `.lintr`


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b1f3a489
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b1f3a489
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b1f3a489

Branch: refs/heads/master
Commit: b1f3a489efc6f4f9d172344c3345b9b38ae235e0
Parents: afe35f0
Author: Yu ISHIKAWA 
Authored: Mon Jun 22 14:35:38 2015 -0700
Committer: Shivaram Venkataraman 
Committed: Mon Jun 22 14:35:38 2015 -0700

--
 R/pkg/.lintr | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b1f3a489/R/pkg/.lintr
--
diff --git a/R/pkg/.lintr b/R/pkg/.lintr
index b10ebd3..038236f 100644
--- a/R/pkg/.lintr
+++ b/R/pkg/.lintr
@@ -1,2 +1,2 @@
-linters: with_defaults(line_length_linter(100), camel_case_linter = NULL)
+linters: with_defaults(line_length_linter(100), camel_case_linter = NULL, 
open_curly_linter(allow_single_line = TRUE), 
closed_curly_linter(allow_single_line = TRUE))
 exclusions: list("inst/profile/general.R" = 1, "inst/profile/shell.R")





spark git commit: [SPARK-8095][BACKPORT] Resolve dependencies of --packages in local ivy cache

2015-06-22 Thread andrewor14
Repository: spark
Updated Branches:
  refs/heads/branch-1.3 0b8dce0c0 -> 45b4527e3


[SPARK-8095][BACKPORT] Resolve dependencies of --packages in local ivy cache

Backported PR #6788

cc andrewor14

Author: Burak Yavuz 

Closes #6923 from brkyvz/backport-local-ivy and squashes the following commits:

eb17384 [Burak Yavuz] [SPARK-8095][BACKPORT] Resolve dependencies of --packages 
in local ivy cache


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/45b4527e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/45b4527e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/45b4527e

Branch: refs/heads/branch-1.3
Commit: 45b4527e34d67a0143eeaa70660feee04ab8d85a
Parents: 0b8dce0
Author: Burak Yavuz 
Authored: Mon Jun 22 14:45:52 2015 -0700
Committer: Andrew Or 
Committed: Mon Jun 22 14:45:52 2015 -0700

--
 .../org/apache/spark/deploy/SparkSubmit.scala   |  24 ++--
 .../org/apache/spark/deploy/IvyTestUtils.scala  | 127 ---
 .../spark/deploy/SparkSubmitUtilsSuite.scala|  22 ++--
 3 files changed, 138 insertions(+), 35 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/45b4527e/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala 
b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
index 02fb360..831679e 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@@ -35,7 +35,8 @@ import org.apache.ivy.core.resolve.ResolveOptions
 import org.apache.ivy.core.retrieve.RetrieveOptions
 import org.apache.ivy.core.settings.IvySettings
 import org.apache.ivy.plugins.matcher.GlobPatternMatcher
-import org.apache.ivy.plugins.resolver.{ChainResolver, IBiblioResolver}
+import org.apache.ivy.plugins.repository.file.FileRepository
+import org.apache.ivy.plugins.resolver.{FileSystemResolver, ChainResolver, 
IBiblioResolver}
 
 import org.apache.spark.SPARK_VERSION
 import org.apache.spark.deploy.rest._
@@ -677,8 +678,14 @@ private[spark] object SparkSubmitUtils {
   }
 
   /** Path of the local Maven cache. */
-  private[spark] def m2Path: File = new File(System.getProperty("user.home"),
-".m2" + File.separator + "repository" + File.separator)
+  private[spark] def m2Path: File = {
+if (Utils.isTesting) {
+  // test builds delete the maven cache, and this can cause flakiness
+  new File("dummy", ".m2" + File.separator + "repository")
+} else {
+  new File(System.getProperty("user.home"), ".m2" + File.separator + 
"repository")
+}
+  }
 
   /**
* Extracts maven coordinates from a comma-delimited string
@@ -700,12 +707,13 @@ private[spark] object SparkSubmitUtils {
 localM2.setName("local-m2-cache")
 cr.add(localM2)
 
-val localIvy = new IBiblioResolver
-localIvy.setRoot(new File(ivySettings.getDefaultIvyUserDir,
-  "local" + File.separator).toURI.toString)
+val localIvy = new FileSystemResolver
+val localIvyRoot = new File(ivySettings.getDefaultIvyUserDir, "local")
+localIvy.setLocal(true)
+localIvy.setRepository(new FileRepository(localIvyRoot))
 val ivyPattern = Seq("[organisation]", "[module]", "[revision]", "[type]s",
   "[artifact](-[classifier]).[ext]").mkString(File.separator)
-localIvy.setPattern(ivyPattern)
+localIvy.addIvyPattern(localIvyRoot.getAbsolutePath + File.separator + 
ivyPattern)
 localIvy.setName("local-ivy-cache")
 cr.add(localIvy)
 
@@ -769,7 +777,7 @@ private[spark] object SparkSubmitUtils {
   md.addDependency(dd)
 }
   }
-  
+
   /** Add exclusion rules for dependencies already included in the 
spark-assembly */
   private[spark] def addExclusionRules(
   ivySettings: IvySettings,

http://git-wip-us.apache.org/repos/asf/spark/blob/45b4527e/core/src/test/scala/org/apache/spark/deploy/IvyTestUtils.scala
--
diff --git a/core/src/test/scala/org/apache/spark/deploy/IvyTestUtils.scala 
b/core/src/test/scala/org/apache/spark/deploy/IvyTestUtils.scala
index 7d39984..8cb5228 100644
--- a/core/src/test/scala/org/apache/spark/deploy/IvyTestUtils.scala
+++ b/core/src/test/scala/org/apache/spark/deploy/IvyTestUtils.scala
@@ -24,6 +24,8 @@ import com.google.common.io.{Files, ByteStreams}
 
 import org.apache.commons.io.FileUtils
 
+import org.apache.ivy.core.settings.IvySettings
+
 import org.apache.spark.TestUtils.{createCompiledClass, JavaSourceFromString}
 import org.apache.spark.deploy.SparkSubmitUtils.MavenCoordinate
 
@@ -44,13 +46,30 @@ private[deploy] object IvyTestUtils {
   if (!useIvyLayout) {
 Seq(groupDirs, artifactDirs, artifact.ve

spark git commit: [SPARK-8356] [SQL] Reconcile callUDF and callUdf

2015-06-22 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/master b1f3a489e -> 50d3242d6


[SPARK-8356] [SQL] Reconcile callUDF and callUdf

Deprecates `callUdf` in favor of `callUDF`.
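A brief sketch of the pattern the deprecation messages point to (assuming a 
DataFrame `df` with a string column `name`):

    import org.apache.spark.sql.functions.udf

    // Instead of the deprecated callUDF(f, returnType, cols...) overloads,
    // wrap the Scala function once with udf() and apply it like any Column.
    val strLen = udf((s: String) => s.length)
    df.select(strLen(df("name")).as("name_length"))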

Author: BenFradet 

Closes #6902 from BenFradet/SPARK-8356 and squashes the following commits:

ef4e9d8 [BenFradet] deprecated callUDF, use udf instead
9b1de4d [BenFradet] reinstated unit test for the deprecated callUdf
cbd80a5 [BenFradet] deprecated callUdf in favor of callUDF


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/50d3242d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/50d3242d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/50d3242d

Branch: refs/heads/master
Commit: 50d3242d6a5530a51dacab249e3f3d49e2d50635
Parents: b1f3a48
Author: BenFradet 
Authored: Mon Jun 22 15:06:47 2015 -0700
Committer: Michael Armbrust 
Committed: Mon Jun 22 15:06:47 2015 -0700

--
 .../scala/org/apache/spark/sql/functions.scala  | 45 
 .../org/apache/spark/sql/DataFrameSuite.scala   | 11 -
 2 files changed, 55 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/50d3242d/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
index 7e7a099..8cea826 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
@@ -1448,7 +1448,9 @@ object functions {
  *
  * @group udf_funcs
  * @since 1.3.0
+ * @deprecated As of 1.5.0, since it's redundant with udf()
  */
+@deprecated("Use udf", "1.5.0")
 def callUDF(f: Function$x[$fTypes], returnType: DataType${if (args.length 
> 0) ", " + args else ""}): Column = {
   ScalaUdf(f, returnType, Seq($argsInUdf))
 }""")
@@ -1584,7 +1586,9 @@ object functions {
*
* @group udf_funcs
* @since 1.3.0
+   * @deprecated As of 1.5.0, since it's redundant with udf()
*/
+  @deprecated("Use udf", "1.5.0")
   def callUDF(f: Function0[_], returnType: DataType): Column = {
 ScalaUdf(f, returnType, Seq())
   }
@@ -1595,7 +1599,9 @@ object functions {
*
* @group udf_funcs
* @since 1.3.0
+   * @deprecated As of 1.5.0, since it's redundant with udf()
*/
+  @deprecated("Use udf", "1.5.0")
   def callUDF(f: Function1[_, _], returnType: DataType, arg1: Column): Column 
= {
 ScalaUdf(f, returnType, Seq(arg1.expr))
   }
@@ -1606,7 +1612,9 @@ object functions {
*
* @group udf_funcs
* @since 1.3.0
+   * @deprecated As of 1.5.0, since it's redundant with udf()
*/
+  @deprecated("Use udf", "1.5.0")
   def callUDF(f: Function2[_, _, _], returnType: DataType, arg1: Column, arg2: 
Column): Column = {
 ScalaUdf(f, returnType, Seq(arg1.expr, arg2.expr))
   }
@@ -1617,7 +1625,9 @@ object functions {
*
* @group udf_funcs
* @since 1.3.0
+   * @deprecated As of 1.5.0, since it's redundant with udf()
*/
+  @deprecated("Use udf", "1.5.0")
   def callUDF(f: Function3[_, _, _, _], returnType: DataType, arg1: Column, 
arg2: Column, arg3: Column): Column = {
 ScalaUdf(f, returnType, Seq(arg1.expr, arg2.expr, arg3.expr))
   }
@@ -1628,7 +1638,9 @@ object functions {
*
* @group udf_funcs
* @since 1.3.0
+   * @deprecated As of 1.5.0, since it's redundant with udf()
*/
+  @deprecated("Use udf", "1.5.0")
   def callUDF(f: Function4[_, _, _, _, _], returnType: DataType, arg1: Column, 
arg2: Column, arg3: Column, arg4: Column): Column = {
 ScalaUdf(f, returnType, Seq(arg1.expr, arg2.expr, arg3.expr, arg4.expr))
   }
@@ -1639,7 +1651,9 @@ object functions {
*
* @group udf_funcs
* @since 1.3.0
+   * @deprecated As of 1.5.0, since it's redundant with udf()
*/
+  @deprecated("Use udf", "1.5.0")
   def callUDF(f: Function5[_, _, _, _, _, _], returnType: DataType, arg1: 
Column, arg2: Column, arg3: Column, arg4: Column, arg5: Column): Column = {
 ScalaUdf(f, returnType, Seq(arg1.expr, arg2.expr, arg3.expr, arg4.expr, 
arg5.expr))
   }
@@ -1650,7 +1664,9 @@ object functions {
*
* @group udf_funcs
* @since 1.3.0
+   * @deprecated As of 1.5.0, since it's redundant with udf()
*/
+  @deprecated("Use udf", "1.5.0")
   def callUDF(f: Function6[_, _, _, _, _, _, _], returnType: DataType, arg1: 
Column, arg2: Column, arg3: Column, arg4: Column, arg5: Column, arg6: Column): 
Column = {
 ScalaUdf(f, returnType, Seq(arg1.expr, arg2.expr, arg3.expr, arg4.expr, 
arg5.expr, arg6.expr))
   }
@@ -1661,7 +1677,9 @@ object functions {
*
* @group udf_funcs
* @since 1.3.0
+   * @deprecated As of 1.5.0, since it's redundant with udf()
 

spark git commit: [SPARK-8492] [SQL] support binaryType in UnsafeRow

2015-06-22 Thread davies
Repository: spark
Updated Branches:
  refs/heads/master 50d3242d6 -> 96aa01378


[SPARK-8492] [SQL] support binaryType in UnsafeRow

Support BinaryType in UnsafeRow, just like StringType.

Also changes the layout of StringType and BinaryType in UnsafeRow by combining
offset and size into a single Long, which limits the size of a Row to under 2G
(given the fact that no single buffer can be bigger than 2G in the JVM).
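
A minimal sketch of the packing idea described above (which 32-bit half holds the offset and which holds the size is an assumption here, not read off the actual UnsafeRow code):

```scala
// Combine a 32-bit offset and a 32-bit size into one 64-bit word so that a
// variable-length field costs a single 8-byte slot in the fixed-width region.
def pack(offset: Int, size: Int): Long =
  (offset.toLong << 32) | (size.toLong & 0xFFFFFFFFL)

def offsetOf(word: Long): Int = (word >>> 32).toInt  // high 32 bits
def sizeOf(word: Long): Int = word.toInt             // low 32 bits

// Round trip: a field starting at byte 64 with 13 bytes of data.
val word = pack(64, 13)
assert(offsetOf(word) == 64 && sizeOf(word) == 13)
```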

Author: Davies Liu 

Closes #6911 from davies/unsafe_bin and squashes the following commits:

d68706f [Davies Liu] update comment
519f698 [Davies Liu] address comment
98a964b [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
unsafe_bin
180b49d [Davies Liu] fix zero-out
22e4c0a [Davies Liu] zero-out padding bytes
6abfe93 [Davies Liu] fix style
447dea0 [Davies Liu] support binaryType in UnsafeRow


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/96aa0137
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/96aa0137
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/96aa0137

Branch: refs/heads/master
Commit: 96aa01378e3b3dbb4601d31c7312a311cb65b22e
Parents: 50d3242
Author: Davies Liu 
Authored: Mon Jun 22 15:22:17 2015 -0700
Committer: Davies Liu 
Committed: Mon Jun 22 15:22:17 2015 -0700

--
 .../UnsafeFixedWidthAggregationMap.java |  8 ---
 .../sql/catalyst/expressions/UnsafeRow.java | 34 ++-
 .../expressions/UnsafeRowConverter.scala| 60 +++-
 .../expressions/UnsafeRowConverterSuite.scala   | 16 +++---
 4 files changed, 72 insertions(+), 46 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/96aa0137/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeFixedWidthAggregationMap.java
--
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeFixedWidthAggregationMap.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeFixedWidthAggregationMap.java
index f7849eb..83f2a31 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeFixedWidthAggregationMap.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeFixedWidthAggregationMap.java
@@ -17,7 +17,6 @@
 
 package org.apache.spark.sql.catalyst.expressions;
 
-import java.util.Arrays;
 import java.util.Iterator;
 
 import org.apache.spark.sql.catalyst.InternalRow;
@@ -142,14 +141,7 @@ public final class UnsafeFixedWidthAggregationMap {
 final int groupingKeySize = 
groupingKeyToUnsafeRowConverter.getSizeRequirement(groupingKey);
 // Make sure that the buffer is large enough to hold the key. If it's not, 
grow it:
 if (groupingKeySize > groupingKeyConversionScratchSpace.length) {
-  // This new array will be initially zero, so there's no need to zero it 
out here
   groupingKeyConversionScratchSpace = new byte[groupingKeySize];
-} else {
-  // Zero out the buffer that's used to hold the current row. This is 
necessary in order
-  // to ensure that rows hash properly, since garbage data from the 
previous row could
-  // otherwise end up as padding in this row. As a performance 
optimization, we only zero out
-  // the portion of the buffer that we'll actually write to.
-  Arrays.fill(groupingKeyConversionScratchSpace, 0, groupingKeySize, 
(byte) 0);
 }
 final int actualGroupingKeySize = groupingKeyToUnsafeRowConverter.writeRow(
   groupingKey,

http://git-wip-us.apache.org/repos/asf/spark/blob/96aa0137/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
--
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
index ed04d2e..bb2f207 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
@@ -47,7 +47,8 @@ import static org.apache.spark.sql.types.DataTypes.*;
  * In the `values` region, we store one 8-byte word per field. For fields that 
hold fixed-length
  * primitive types, such as long, double, or int, we store the value directly 
in the word. For
  * fields with non-primitive or variable-length values, we store a relative 
offset (w.r.t. the
- * base address of the row) that points to the beginning of the 
variable-length field.
+ * base address of the row) that points to the beginning of the 
variable-length field, and length
+ * (they are combined into a long).
  *
  * Instances of `UnsafeRow` act as pointers to row data store

spark git commit: [HOTFIX] [TESTS] Typo mqqt -> mqtt

2015-06-22 Thread andrewor14
Repository: spark
Updated Branches:
  refs/heads/master 96aa01378 -> 1dfb0f7b2


[HOTFIX] [TESTS] Typo mqqt -> mqtt

This was introduced in #6866.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1dfb0f7b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1dfb0f7b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1dfb0f7b

Branch: refs/heads/master
Commit: 1dfb0f7b2aed5ee6d07543fdeac8ff7c777b63b9
Parents: 96aa013
Author: Andrew Or 
Authored: Mon Jun 22 16:16:26 2015 -0700
Committer: Andrew Or 
Committed: Mon Jun 22 16:16:26 2015 -0700

--
 dev/run-tests.py | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1dfb0f7b/dev/run-tests.py
--
diff --git a/dev/run-tests.py b/dev/run-tests.py
index 2cccfed..de1b453 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -179,14 +179,14 @@ streaming_twitter = Module(
 )
 
 
-streaming_mqqt = Module(
-name="streaming-mqqt",
+streaming_mqtt = Module(
+name="streaming-mqtt",
 dependencies=[streaming],
 source_file_regexes=[
-"external/mqqt",
+"external/mqtt",
 ],
 sbt_test_goals=[
-"streaming-mqqt/test",
+"streaming-mqtt/test",
 ]
 )
 





spark git commit: [SPARK-7153] [SQL] support all integral type ordinal in GetArrayItem

2015-06-22 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/master 1dfb0f7b2 -> 860a49ef2


[SPARK-7153] [SQL] support all integral type ordinal in GetArrayItem

First convert `ordinal` to `Number`, then convert it to an int.
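
A small standalone illustration of why widening through `Number` accepts any integral ordinal (the values are hypothetical):

```scala
// After analysis the ordinal may be a boxed Byte, Short, Int, or Long;
// java.lang.Number.intValue() gives one widening path for all of them,
// whereas asInstanceOf[Int] would throw ClassCastException for a boxed Long.
val ordinals: Seq[Any] = Seq(2.toByte, 2.toShort, 2, 2L)
val asInts = ordinals.map(_.asInstanceOf[Number].intValue())
assert(asInts.forall(_ == 2))
```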

Author: Wenchen Fan 

Closes #5706 from cloud-fan/7153 and squashes the following commits:

915db79 [Wenchen Fan] fix 7153


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/860a49ef
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/860a49ef
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/860a49ef

Branch: refs/heads/master
Commit: 860a49ef20cea5711a7f54de0053ea33647e56a7
Parents: 1dfb0f7
Author: Wenchen Fan 
Authored: Mon Jun 22 17:37:35 2015 -0700
Committer: Michael Armbrust 
Committed: Mon Jun 22 17:37:35 2015 -0700

--
 .../sql/catalyst/expressions/ExtractValue.scala |  2 +-
 .../expressions/complexTypeCreator.scala| 80 +++
 .../sql/catalyst/expressions/complexTypes.scala | 81 
 .../catalyst/expressions/ComplexTypeSuite.scala | 20 +
 4 files changed, 101 insertions(+), 82 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/860a49ef/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExtractValue.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExtractValue.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExtractValue.scala
index 013027b..4d6c1c2 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExtractValue.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExtractValue.scala
@@ -186,7 +186,7 @@ case class GetArrayItem(child: Expression, ordinal: 
Expression)
 // TODO: consider using Array[_] for ArrayType child to avoid
 // boxing of primitives
 val baseValue = value.asInstanceOf[Seq[_]]
-val index = ordinal.asInstanceOf[Int]
+val index = ordinal.asInstanceOf[Number].intValue()
 if (index >= baseValue.size || index < 0) {
   null
 } else {

http://git-wip-us.apache.org/repos/asf/spark/blob/860a49ef/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
new file mode 100644
index 000..e0bf07e
--- /dev/null
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.types._
+
+
+/**
+ * Returns an Array containing the evaluation of all children expressions.
+ */
+case class CreateArray(children: Seq[Expression]) extends Expression {
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  lazy val childTypes = children.map(_.dataType).distinct
+
+  override lazy val resolved =
+childrenResolved && childTypes.size <= 1
+
+  override def dataType: DataType = {
+assert(resolved, s"Invalid dataType of mixed ArrayType 
${childTypes.mkString(",")}")
+ArrayType(
+  childTypes.headOption.getOrElse(NullType),
+  containsNull = children.exists(_.nullable))
+  }
+
+  override def nullable: Boolean = false
+
+  override def eval(input: InternalRow): Any = {
+children.map(_.eval(input))
+  }
+
+  override def toString: String = s"Array(${children.mkString(",")})"
+}
+
+/**
+ * Returns a Row containing the evaluation of all children expressions.
+ * TODO: [[CreateStruct]] does not support codegen.
+ */
+case class CreateStruct(children: Seq[Expression]) extends Expression {
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override lazy val resolved: Boolean

spark git commit: [SPARK-8307] [SQL] improve timestamp from parquet

2015-06-22 Thread davies
Repository: spark
Updated Branches:
  refs/heads/master 860a49ef2 -> 6b7f2ceaf


[SPARK-8307] [SQL] improve timestamp from parquet

This PR changes the code to convert Julian day to Unix timestamp directly
(without going through Calendar and Timestamp).
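
A rough sketch of the direct conversion, assuming the Parquet INT96 convention of (Julian day, nanoseconds within the day) and that Julian day 2440588 lines up with 1970-01-01 UTC; the constants and helper name are illustrative rather than Spark's actual DateTimeUtils code:

```scala
// Convert a Parquet (julianDay, nanosOfDay) pair straight to microseconds since
// the Unix epoch, with no java.util.Calendar or java.sql.Timestamp allocation.
val JulianDayOfEpoch = 2440588L                 // assumed epoch alignment
val MicrosPerDay     = 24L * 60 * 60 * 1000 * 1000

def julianDayToMicros(julianDay: Int, nanosOfDay: Long): Long =
  (julianDay - JulianDayOfEpoch) * MicrosPerDay + nanosOfDay / 1000L
```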

cc adrian-wang rxin

Author: Davies Liu 

Closes #6759 from davies/improve_ts and squashes the following commits:

849e301 [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
improve_ts
b0e4cad [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
improve_ts
8e2d56f [Davies Liu] address comments
634b9f5 [Davies Liu] fix mima
4891efb [Davies Liu] address comment
bfc437c [Davies Liu] fix build
ae5979c [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
improve_ts
602b969 [Davies Liu] remove jodd
2f2e48c [Davies Liu] fix test
8ace611 [Davies Liu] fix mima
212143b [Davies Liu] fix mina
c834108 [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
improve_ts
a3171b8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into 
improve_ts
5233974 [Davies Liu] fix scala style
361fd62 [Davies Liu] address comments
ea196d4 [Davies Liu] improve timestamp from parquet


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6b7f2cea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6b7f2cea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6b7f2cea

Branch: refs/heads/master
Commit: 6b7f2ceafdcbb014791909747c2210b527305df9
Parents: 860a49e
Author: Davies Liu 
Authored: Mon Jun 22 18:03:59 2015 -0700
Committer: Davies Liu 
Committed: Mon Jun 22 18:03:59 2015 -0700

--
 pom.xml |   1 -
 project/MimaExcludes.scala  |  12 +-
 .../sql/catalyst/CatalystTypeConverters.scala   |  14 +-
 .../spark/sql/catalyst/expressions/Cast.scala   |  16 +-
 .../sql/catalyst/expressions/literals.scala |   6 +-
 .../spark/sql/catalyst/util/DateTimeUtils.scala | 151 +++
 .../spark/sql/catalyst/util/DateUtils.scala | 120 ---
 .../sql/catalyst/expressions/CastSuite.scala|  11 +-
 .../catalyst/expressions/PredicateSuite.scala   |   6 +-
 .../expressions/UnsafeRowConverterSuite.scala   |  10 +-
 .../sql/catalyst/util/DateTimeUtilsSuite.scala  |  51 +++
 .../sql/catalyst/util/DateUtilsSuite.scala  |  39 -
 sql/core/pom.xml|   5 -
 .../apache/spark/sql/execution/pythonUdfs.scala |  10 +-
 .../org/apache/spark/sql/jdbc/JDBCRDD.scala |  12 +-
 .../apache/spark/sql/json/JacksonParser.scala   |   6 +-
 .../org/apache/spark/sql/json/JsonRDD.scala |   8 +-
 .../spark/sql/parquet/ParquetConverter.scala|  86 ++-
 .../spark/sql/parquet/ParquetTableSupport.scala |  19 ++-
 .../spark/sql/parquet/timestamp/NanoTime.scala  |  69 -
 .../org/apache/spark/sql/json/JsonSuite.scala   |  20 +--
 .../spark/sql/parquet/ParquetIOSuite.scala  |   4 +-
 .../spark/sql/sources/TableScanSuite.scala  |  11 +-
 .../apache/spark/sql/hive/HiveInspectors.scala  |  20 +--
 .../org/apache/spark/sql/hive/TableReader.scala |   8 +-
 .../spark/sql/hive/hiveWriterContainers.scala   |   4 +-
 26 files changed, 321 insertions(+), 398 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/6b7f2cea/pom.xml
--
diff --git a/pom.xml b/pom.xml
index 6d4f717..80cacb5 100644
--- a/pom.xml
+++ b/pom.xml
@@ -156,7 +156,6 @@
 2.10
 ${scala.version}
 org.scala-lang
-3.6.3
 1.9.13
 2.4.4
 1.1.1.7

http://git-wip-us.apache.org/repos/asf/spark/blob/6b7f2cea/project/MimaExcludes.scala
--
diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index 015d029..7a748fb 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -54,7 +54,17 @@ object MimaExcludes {
 ProblemFilters.exclude[MissingMethodProblem](
   
"org.apache.spark.streaming.kafka.KafkaTestUtils.waitUntilLeaderOffset"),
 // SQL execution is considered private.
-excludePackage("org.apache.spark.sql.execution")
+excludePackage("org.apache.spark.sql.execution"),
+// NanoTime and CatalystTimestampConverter is only used inside 
catalyst,
+// not needed anymore
+ProblemFilters.exclude[MissingClassProblem](
+  "org.apache.spark.sql.parquet.timestamp.NanoTime"),
+  ProblemFilters.exclude[MissingClassProblem](
+  "org.apache.spark.sql.parquet.timestamp.NanoTime$"),
+ProblemFilters.exclude[MissingClassProblem](
+  "org.apache.spark.sql.parquet.CatalystTimestampConverter"),
+ProblemFilters.exclude[MissingC

spark git commit: [SPARK-7859] [SQL] Collect_set() behavior differences which fails the unit test under jdk8

2015-06-22 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/master 6b7f2ceaf -> 13321e655


[SPARK-7859] [SQL] Collect_set() behavior differences which fails the unit test 
under jdk8

To reproduce that:
```
JAVA_HOME=/home/hcheng/Java/jdk1.8.0_45 | build/sbt -Phadoop-2.3 -Phive  
'test-only 
org.apache.spark.sql.hive.execution.HiveWindowFunctionQueryWithoutCodeGenSuite'
```

A simple workaround is to update the original query to check the output size
instead of the exact elements of the array (output by collect_set()).
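
The root cause is that hash-based sets give no iteration-order guarantee, and the order happens to differ between JDK 7 and JDK 8, so a golden file that pins the raw order of collect_set() output is fragile. A small plain-Scala illustration of the safer comparisons (unrelated to the Hive golden files themselves):

```scala
import scala.collection.JavaConverters._

val s = new java.util.HashSet[String]()
Seq("2", "6", "34", "28").foreach(s.add)

// Fragile: s.asScala.mkString(",") depends on HashSet iteration order.
// Robust: compare the size, or a sorted view of the contents.
assert(s.size == 4)
val deterministic = s.asScala.toSeq.sorted
```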

Author: Cheng Hao 

Closes #6402 from chenghao-intel/windowing and squashes the following commits:

99312ad [Cheng Hao] add order by for the select clause
edf8ce3 [Cheng Hao] update the code as suggested
7062da7 [Cheng Hao] fix the collect_set() behaviour differences under different 
versions of JDK


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/13321e65
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/13321e65
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/13321e65

Branch: refs/heads/master
Commit: 13321e65559f6354ec1287a690580fd6f498ef89
Parents: 6b7f2ce
Author: Cheng Hao 
Authored: Mon Jun 22 20:04:49 2015 -0700
Committer: Yin Huai 
Committed: Mon Jun 22 20:04:49 2015 -0700

--
 .../HiveWindowFunctionQuerySuite.scala  |  8 ++
 ... testSTATs-0-6dfcd7925fb267699c4bf82737d4609 | 97 
 ...testSTATs-0-da0e0cca69e42118a96b8609b8fa5838 | 26 --
 3 files changed, 105 insertions(+), 26 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/13321e65/sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionQuerySuite.scala
--
diff --git 
a/sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionQuerySuite.scala
 
b/sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionQuerySuite.scala
index 934452f..31a49a3 100644
--- 
a/sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionQuerySuite.scala
+++ 
b/sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionQuerySuite.scala
@@ -526,8 +526,14 @@ abstract class HiveWindowFunctionQueryBaseSuite extends 
HiveComparisonTest with
   | rows between 2 preceding and 2 following);
 """.stripMargin, reset = false)
 
+  // collect_set() output array in an arbitrary order, hence causes different 
result
+  // when running this test suite under Java 7 and 8.
+  // We change the original sql query a little bit for making the test suite 
passed
+  // under different JDK
   createQueryTest("windowing.q -- 20. testSTATs",
 """
+  |select p_mfgr,p_name, p_size, sdev, sdev_pop, uniq_data, var, cor, 
covarp
+  |from (
   |select  p_mfgr,p_name, p_size,
   |stddev(p_retailprice) over w1 as sdev,
   |stddev_pop(p_retailprice) over w1 as sdev_pop,
@@ -538,6 +544,8 @@ abstract class HiveWindowFunctionQueryBaseSuite extends 
HiveComparisonTest with
   |from part
   |window w1 as (distribute by p_mfgr sort by p_mfgr, p_name
   | rows between 2 preceding and 2 following)
+  |) t lateral view explode(uniq_size) d as uniq_data
+  |order by p_mfgr,p_name, p_size, sdev, sdev_pop, uniq_data, var, cor, 
covarp
 """.stripMargin, reset = false)
 
   createQueryTest("windowing.q -- 21. testDISTs",

http://git-wip-us.apache.org/repos/asf/spark/blob/13321e65/sql/hive/src/test/resources/golden/windowing.q
 -- 20. testSTATs-0-6dfcd7925fb267699c4bf82737d4609
--
diff --git a/sql/hive/src/test/resources/golden/windowing.q -- 20. 
testSTATs-0-6dfcd7925fb267699c4bf82737d4609 
b/sql/hive/src/test/resources/golden/windowing.q -- 20. 
testSTATs-0-6dfcd7925fb267699c4bf82737d4609
new file mode 100644
index 000..7e5fcee
--- /dev/null
+++ b/sql/hive/src/test/resources/golden/windowing.q -- 20. 
testSTATs-0-6dfcd7925fb267699c4bf82737d4609 
@@ -0,0 +1,97 @@
+Manufacturer#1 almond antique burnished rose metallic  2   
258.10677784349235  258.10677784349235  2   66619.10876874991   
0.811328754177887   2801.70745
+Manufacturer#1 almond antique burnished rose metallic  2   
258.10677784349235  258.10677784349235  6   66619.10876874991   
0.811328754177887   2801.70745
+Manufacturer#1 almond antique burnished rose metallic  2   
258.10677784349235  258.10677784349235  34  66619.10876874991   
0.811328754177887   2801.70745
+Manufacturer#1 almond antique burnished rose metallic  2   
273.70217881648074  273.70217881648074  2   74912.882688

spark git commit: [SPARK-7859] [SQL] Collect_set() behavior differences which fails the unit test under jdk8

2015-06-22 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/branch-1.4 994abbaeb -> d73900a90


[SPARK-7859] [SQL] Collect_set() behavior differences which fails the unit test 
under jdk8

To reproduce that:
```
JAVA_HOME=/home/hcheng/Java/jdk1.8.0_45 | build/sbt -Phadoop-2.3 -Phive  
'test-only 
org.apache.spark.sql.hive.execution.HiveWindowFunctionQueryWithoutCodeGenSuite'
```

A simple workaround is to update the original query to check the output size
instead of the exact elements of the array (output by collect_set()).

Author: Cheng Hao 

Closes #6402 from chenghao-intel/windowing and squashes the following commits:

99312ad [Cheng Hao] add order by for the select clause
edf8ce3 [Cheng Hao] update the code as suggested
7062da7 [Cheng Hao] fix the collect_set() behaviour differences under different 
versions of JDK

(cherry picked from commit 13321e65559f6354ec1287a690580fd6f498ef89)
Signed-off-by: Yin Huai 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d73900a9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d73900a9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d73900a9

Branch: refs/heads/branch-1.4
Commit: d73900a9034b22191e1440b18ee63b1835f09582
Parents: 994abba
Author: Cheng Hao 
Authored: Mon Jun 22 20:04:49 2015 -0700
Committer: Yin Huai 
Committed: Mon Jun 22 20:05:00 2015 -0700

--
 .../HiveWindowFunctionQuerySuite.scala  |  8 ++
 ... testSTATs-0-6dfcd7925fb267699c4bf82737d4609 | 97 
 ...testSTATs-0-da0e0cca69e42118a96b8609b8fa5838 | 26 --
 3 files changed, 105 insertions(+), 26 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d73900a9/sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionQuerySuite.scala
--
diff --git 
a/sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionQuerySuite.scala
 
b/sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionQuerySuite.scala
index 934452f..31a49a3 100644
--- 
a/sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionQuerySuite.scala
+++ 
b/sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionQuerySuite.scala
@@ -526,8 +526,14 @@ abstract class HiveWindowFunctionQueryBaseSuite extends 
HiveComparisonTest with
   | rows between 2 preceding and 2 following);
 """.stripMargin, reset = false)
 
+  // collect_set() output array in an arbitrary order, hence causes different 
result
+  // when running this test suite under Java 7 and 8.
+  // We change the original sql query a little bit for making the test suite 
passed
+  // under different JDK
   createQueryTest("windowing.q -- 20. testSTATs",
 """
+  |select p_mfgr,p_name, p_size, sdev, sdev_pop, uniq_data, var, cor, 
covarp
+  |from (
   |select  p_mfgr,p_name, p_size,
   |stddev(p_retailprice) over w1 as sdev,
   |stddev_pop(p_retailprice) over w1 as sdev_pop,
@@ -538,6 +544,8 @@ abstract class HiveWindowFunctionQueryBaseSuite extends 
HiveComparisonTest with
   |from part
   |window w1 as (distribute by p_mfgr sort by p_mfgr, p_name
   | rows between 2 preceding and 2 following)
+  |) t lateral view explode(uniq_size) d as uniq_data
+  |order by p_mfgr,p_name, p_size, sdev, sdev_pop, uniq_data, var, cor, 
covarp
 """.stripMargin, reset = false)
 
   createQueryTest("windowing.q -- 21. testDISTs",

http://git-wip-us.apache.org/repos/asf/spark/blob/d73900a9/sql/hive/src/test/resources/golden/windowing.q
 -- 20. testSTATs-0-6dfcd7925fb267699c4bf82737d4609
--
diff --git a/sql/hive/src/test/resources/golden/windowing.q -- 20. 
testSTATs-0-6dfcd7925fb267699c4bf82737d4609 
b/sql/hive/src/test/resources/golden/windowing.q -- 20. 
testSTATs-0-6dfcd7925fb267699c4bf82737d4609
new file mode 100644
index 000..7e5fcee
--- /dev/null
+++ b/sql/hive/src/test/resources/golden/windowing.q -- 20. 
testSTATs-0-6dfcd7925fb267699c4bf82737d4609 
@@ -0,0 +1,97 @@
+Manufacturer#1 almond antique burnished rose metallic  2   
258.10677784349235  258.10677784349235  2   66619.10876874991   
0.811328754177887   2801.70745
+Manufacturer#1 almond antique burnished rose metallic  2   
258.10677784349235  258.10677784349235  6   66619.10876874991   
0.811328754177887   2801.70745
+Manufacturer#1 almond antique burnished rose metallic  2   
258.10677784349235  258.10677784349235  34  66619.10876874991   
0.811328754177887   2801.70745
+Manufacturer#1 almond antique

spark git commit: MAINTENANCE: Automated closing of pull requests.

2015-06-22 Thread pwendell
Repository: spark
Updated Branches:
  refs/heads/master 13321e655 -> c4d234396


MAINTENANCE: Automated closing of pull requests.

This commit exists to close the following pull requests on Github:

Closes #2849 (close requested by 'srowen')
Closes #2786 (close requested by 'andrewor14')
Closes #4678 (close requested by 'JoshRosen')
Closes #5457 (close requested by 'andrewor14')
Closes #3346 (close requested by 'andrewor14')
Closes #6518 (close requested by 'andrewor14')
Closes #5403 (close requested by 'pwendell')
Closes #2110 (close requested by 'srowen')


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c4d23439
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c4d23439
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c4d23439

Branch: refs/heads/master
Commit: c4d2343966cbae40a8271a2e6cad66227d2f8249
Parents: 13321e6
Author: Patrick Wendell 
Authored: Mon Jun 22 20:25:32 2015 -0700
Committer: Patrick Wendell 
Committed: Mon Jun 22 20:25:32 2015 -0700

--

--






spark git commit: [SPARK-8548] [SPARKR] Remove the trailing whitespaces from the SparkR files

2015-06-22 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/branch-1.4 d73900a90 -> 250179485


[SPARK-8548] [SPARKR] Remove the trailing whitespaces from the SparkR files

[[SPARK-8548] Remove the trailing whitespaces from the SparkR files - ASF 
JIRA](https://issues.apache.org/jira/browse/SPARK-8548)

- This is the result of `lint-r`
https://gist.github.com/yu-iskw/0019b37a2c1167f33986

Author: Yu ISHIKAWA 

Closes #6945 from yu-iskw/SPARK-8548 and squashes the following commits:

0bd567a [Yu ISHIKAWA] [SPARK-8548][SparkR] Remove the trailing whitespaces from 
the SparkR files

(cherry picked from commit 44fa7df64daa55bd6eb1f2c219a9701b34e1c2a3)
Signed-off-by: Shivaram Venkataraman 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/25017948
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/25017948
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/25017948

Branch: refs/heads/branch-1.4
Commit: 250179485b59f3015fd2f44934b6cb1d3669de80
Parents: d73900a
Author: Yu ISHIKAWA 
Authored: Mon Jun 22 20:55:38 2015 -0700
Committer: Shivaram Venkataraman 
Committed: Mon Jun 22 20:55:55 2015 -0700

--
 R/pkg/R/DataFrame.R | 96 ++--
 R/pkg/R/RDD.R   | 48 +++---
 R/pkg/R/SQLContext.R| 14 ++--
 R/pkg/R/broadcast.R |  6 +-
 R/pkg/R/deserialize.R   |  2 +-
 R/pkg/R/generics.R  | 15 ++---
 R/pkg/R/group.R |  1 -
 R/pkg/R/jobj.R  |  2 +-
 R/pkg/R/pairRDD.R   |  4 +-
 R/pkg/R/schema.R|  2 +-
 R/pkg/R/serialize.R |  2 +-
 R/pkg/R/sparkR.R|  6 +-
 R/pkg/R/utils.R | 48 +++---
 R/pkg/R/zzz.R   |  1 -
 R/pkg/inst/tests/test_binaryFile.R  |  7 +-
 R/pkg/inst/tests/test_binary_function.R | 28 
 R/pkg/inst/tests/test_rdd.R | 12 ++--
 R/pkg/inst/tests/test_shuffle.R | 28 
 R/pkg/inst/tests/test_sparkSQL.R| 28 
 R/pkg/inst/tests/test_take.R|  1 -
 R/pkg/inst/tests/test_textFile.R|  7 +-
 R/pkg/inst/tests/test_utils.R   | 12 ++--
 22 files changed, 182 insertions(+), 188 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/25017948/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 0af5cb8..6feabf4 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -38,7 +38,7 @@ setClass("DataFrame",
 setMethod("initialize", "DataFrame", function(.Object, sdf, isCached) {
   .Object@env <- new.env()
   .Object@env$isCached <- isCached
-  
+
   .Object@sdf <- sdf
   .Object
 })
@@ -55,11 +55,11 @@ dataFrame <- function(sdf, isCached = FALSE) {
  DataFrame Methods 
##
 
 #' Print Schema of a DataFrame
-#' 
+#'
 #' Prints out the schema in tree format
-#' 
+#'
 #' @param x A SparkSQL DataFrame
-#' 
+#'
 #' @rdname printSchema
 #' @export
 #' @examples
@@ -78,11 +78,11 @@ setMethod("printSchema",
   })
 
 #' Get schema object
-#' 
+#'
 #' Returns the schema of this DataFrame as a structType object.
-#' 
+#'
 #' @param x A SparkSQL DataFrame
-#' 
+#'
 #' @rdname schema
 #' @export
 #' @examples
@@ -100,9 +100,9 @@ setMethod("schema",
   })
 
 #' Explain
-#' 
+#'
 #' Print the logical and physical Catalyst plans to the console for debugging.
-#' 
+#'
 #' @param x A SparkSQL DataFrame
 #' @param extended Logical. If extended is False, explain() only prints the 
physical plan.
 #' @rdname explain
@@ -200,11 +200,11 @@ setMethod("show", "DataFrame",
   })
 
 #' DataTypes
-#' 
+#'
 #' Return all column names and their data types as a list
-#' 
+#'
 #' @param x A SparkSQL DataFrame
-#' 
+#'
 #' @rdname dtypes
 #' @export
 #' @examples
@@ -224,11 +224,11 @@ setMethod("dtypes",
   })
 
 #' Column names
-#' 
+#'
 #' Return all column names as a list
-#' 
+#'
 #' @param x A SparkSQL DataFrame
-#' 
+#'
 #' @rdname columns
 #' @export
 #' @examples
@@ -256,12 +256,12 @@ setMethod("names",
   })
 
 #' Register Temporary Table
-#' 
+#'
 #' Registers a DataFrame as a Temporary Table in the SQLContext
-#' 
+#'
 #' @param x A SparkSQL DataFrame
 #' @param tableName A character vector containing the name of the table
-#' 
+#'
 #' @rdname registerTempTable
 #' @export
 #' @examples
@@ -306,11 +306,11 @@ setMethod("insertInto",
   })
 
 #' Cache
-#' 
+#'
 #' Persist with the default storage level (MEMORY_ONLY).
-#' 
+#'
 #' @param x A SparkSQL DataFrame
-#' 
+#'
 #' @rdname cache-methods
 #' @export
 #' 

spark git commit: [SPARK-8548] [SPARKR] Remove the trailing whitespaces from the SparkR files

2015-06-22 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master c4d234396 -> 44fa7df64


[SPARK-8548] [SPARKR] Remove the trailing whitespaces from the SparkR files

[[SPARK-8548] Remove the trailing whitespaces from the SparkR files - ASF 
JIRA](https://issues.apache.org/jira/browse/SPARK-8548)

- This is the result of `lint-r`
https://gist.github.com/yu-iskw/0019b37a2c1167f33986

Author: Yu ISHIKAWA 

Closes #6945 from yu-iskw/SPARK-8548 and squashes the following commits:

0bd567a [Yu ISHIKAWA] [SPARK-8548][SparkR] Remove the trailing whitespaces from 
the SparkR files


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/44fa7df6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/44fa7df6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/44fa7df6

Branch: refs/heads/master
Commit: 44fa7df64daa55bd6eb1f2c219a9701b34e1c2a3
Parents: c4d2343
Author: Yu ISHIKAWA 
Authored: Mon Jun 22 20:55:38 2015 -0700
Committer: Shivaram Venkataraman 
Committed: Mon Jun 22 20:55:38 2015 -0700

--
 R/pkg/R/DataFrame.R | 96 ++--
 R/pkg/R/RDD.R   | 48 +++---
 R/pkg/R/SQLContext.R| 14 ++--
 R/pkg/R/broadcast.R |  6 +-
 R/pkg/R/deserialize.R   |  2 +-
 R/pkg/R/generics.R  | 15 ++---
 R/pkg/R/group.R |  1 -
 R/pkg/R/jobj.R  |  2 +-
 R/pkg/R/pairRDD.R   |  4 +-
 R/pkg/R/schema.R|  2 +-
 R/pkg/R/serialize.R |  2 +-
 R/pkg/R/sparkR.R|  6 +-
 R/pkg/R/utils.R | 48 +++---
 R/pkg/R/zzz.R   |  1 -
 R/pkg/inst/tests/test_binaryFile.R  |  7 +-
 R/pkg/inst/tests/test_binary_function.R | 28 
 R/pkg/inst/tests/test_rdd.R | 12 ++--
 R/pkg/inst/tests/test_shuffle.R | 28 
 R/pkg/inst/tests/test_sparkSQL.R| 28 
 R/pkg/inst/tests/test_take.R|  1 -
 R/pkg/inst/tests/test_textFile.R|  7 +-
 R/pkg/inst/tests/test_utils.R   | 12 ++--
 22 files changed, 182 insertions(+), 188 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/44fa7df6/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 0af5cb8..6feabf4 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -38,7 +38,7 @@ setClass("DataFrame",
 setMethod("initialize", "DataFrame", function(.Object, sdf, isCached) {
   .Object@env <- new.env()
   .Object@env$isCached <- isCached
-  
+
   .Object@sdf <- sdf
   .Object
 })
@@ -55,11 +55,11 @@ dataFrame <- function(sdf, isCached = FALSE) {
  DataFrame Methods 
##
 
 #' Print Schema of a DataFrame
-#' 
+#'
 #' Prints out the schema in tree format
-#' 
+#'
 #' @param x A SparkSQL DataFrame
-#' 
+#'
 #' @rdname printSchema
 #' @export
 #' @examples
@@ -78,11 +78,11 @@ setMethod("printSchema",
   })
 
 #' Get schema object
-#' 
+#'
 #' Returns the schema of this DataFrame as a structType object.
-#' 
+#'
 #' @param x A SparkSQL DataFrame
-#' 
+#'
 #' @rdname schema
 #' @export
 #' @examples
@@ -100,9 +100,9 @@ setMethod("schema",
   })
 
 #' Explain
-#' 
+#'
 #' Print the logical and physical Catalyst plans to the console for debugging.
-#' 
+#'
 #' @param x A SparkSQL DataFrame
 #' @param extended Logical. If extended is False, explain() only prints the 
physical plan.
 #' @rdname explain
@@ -200,11 +200,11 @@ setMethod("show", "DataFrame",
   })
 
 #' DataTypes
-#' 
+#'
 #' Return all column names and their data types as a list
-#' 
+#'
 #' @param x A SparkSQL DataFrame
-#' 
+#'
 #' @rdname dtypes
 #' @export
 #' @examples
@@ -224,11 +224,11 @@ setMethod("dtypes",
   })
 
 #' Column names
-#' 
+#'
 #' Return all column names as a list
-#' 
+#'
 #' @param x A SparkSQL DataFrame
-#' 
+#'
 #' @rdname columns
 #' @export
 #' @examples
@@ -256,12 +256,12 @@ setMethod("names",
   })
 
 #' Register Temporary Table
-#' 
+#'
 #' Registers a DataFrame as a Temporary Table in the SQLContext
-#' 
+#'
 #' @param x A SparkSQL DataFrame
 #' @param tableName A character vector containing the name of the table
-#' 
+#'
 #' @rdname registerTempTable
 #' @export
 #' @examples
@@ -306,11 +306,11 @@ setMethod("insertInto",
   })
 
 #' Cache
-#' 
+#'
 #' Persist with the default storage level (MEMORY_ONLY).
-#' 
+#'
 #' @param x A SparkSQL DataFrame
-#' 
+#'
 #' @rdname cache-methods
 #' @export
 #' @examples
@@ -400,7 +400,7 @@ setMethod("repartition",
   signature(x = "DataFrame", numPartitions = "numeri

spark git commit: [BUILD] Preparing Spark release 1.4.1

2015-06-22 Thread pwendell
Repository: spark
Updated Branches:
  refs/heads/branch-1.4 250179485 -> 48d683014


[BUILD] Preparing Spark release 1.4.1


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/48d68301
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/48d68301
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/48d68301

Branch: refs/heads/branch-1.4
Commit: 48d683014458d48453b59911b472224879746533
Parents: 2501794
Author: Patrick Wendell 
Authored: Mon Jun 22 22:18:52 2015 -0700
Committer: Patrick Wendell 
Committed: Mon Jun 22 22:18:52 2015 -0700

--
 CHANGES.txt | 603 +++
 R/pkg/DESCRIPTION   |   2 +-
 assembly/pom.xml|   2 +-
 bagel/pom.xml   |   2 +-
 core/pom.xml|   2 +-
 .../main/scala/org/apache/spark/package.scala   |   2 +-
 dev/create-release/generate-changelist.py   |   4 +-
 docs/_config.yml|   4 +-
 ec2/spark_ec2.py|   4 +-
 examples/pom.xml|   2 +-
 external/flume-sink/pom.xml |   2 +-
 external/flume/pom.xml  |   2 +-
 external/kafka-assembly/pom.xml |   2 +-
 external/kafka/pom.xml  |   2 +-
 external/mqtt/pom.xml   |   2 +-
 external/twitter/pom.xml|   2 +-
 external/zeromq/pom.xml |   2 +-
 extras/java8-tests/pom.xml  |   2 +-
 extras/kinesis-asl/pom.xml  |   2 +-
 extras/spark-ganglia-lgpl/pom.xml   |   2 +-
 graphx/pom.xml  |   2 +-
 launcher/pom.xml|   2 +-
 mllib/pom.xml   |   2 +-
 network/common/pom.xml  |   2 +-
 network/shuffle/pom.xml |   2 +-
 network/yarn/pom.xml|   2 +-
 pom.xml |   2 +-
 repl/pom.xml|   2 +-
 sql/catalyst/pom.xml|   2 +-
 sql/core/pom.xml|   2 +-
 sql/hive-thriftserver/pom.xml   |   2 +-
 sql/hive/pom.xml|   2 +-
 streaming/pom.xml   |   2 +-
 tools/pom.xml   |   2 +-
 unsafe/pom.xml  |   2 +-
 yarn/pom.xml|   2 +-
 36 files changed, 642 insertions(+), 37 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/48d68301/CHANGES.txt
--
diff --git a/CHANGES.txt b/CHANGES.txt
index 8c99404..d6fe305 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,6 +1,609 @@
 Spark Change Log
 
 
+Release 1.4.1
+
+  [SPARK-8548] [SPARKR] Remove the trailing whitespaces from the SparkR files
+  Yu ISHIKAWA 
+  2015-06-22 20:55:38 -0700
+  Commit: 2501794, github.com/apache/spark/pull/6945
+
+  [SPARK-7859] [SQL] Collect_set() behavior differences which fails the unit 
test under jdk8
+  Cheng Hao 
+  2015-06-22 20:04:49 -0700
+  Commit: d73900a, github.com/apache/spark/pull/6402
+
+  [SPARK-8532] [SQL] In Python's DataFrameWriter, 
save/saveAsTable/json/parquet/jdbc always override mode
+  Yin Huai 
+  2015-06-22 13:51:23 -0700
+  Commit: 994abba, github.com/apache/spark/pull/6937
+
+  [SPARK-8511] [PYSPARK] Modify a test to remove a saved model in 
`regression.py`
+  Yu ISHIKAWA 
+  2015-06-22 11:53:11 -0700
+  Commit: 507381d, github.com/apache/spark/pull/6926
+
+  [SPARK-8420] [SQL] Fix comparision of timestamps/dates with strings 
(branch-1.4)
+  Michael Armbrust , Michael Armbrust 

+  2015-06-22 10:45:33 -0700
+  Commit: 6598161, github.com/apache/spark/pull/6888
+
+  [SPARK-8406] [SQL] Backports SPARK-8406 and PR #6864 to branch-1.4
+  Cheng Lian 
+  2015-06-22 10:04:29 -0700
+  Commit: 451c872, github.com/apache/spark/pull/6932
+
+  [HOTFIX] Hotfix branch-1.4 building by removing avgMetrics in 
CrossValidatorSuite
+  Liang-Chi Hsieh 
+  2015-06-21 22:25:08 -0700
+  Commit: b836bac, github.com/apache/spark/pull/6929
+
+  [SPARK-7715] [MLLIB] [ML] [DOC] Updated MLlib programming guide for release 
1.4
+  Joseph K. Bradley 
+  2015-06-21 16:25:25 -0700
+  Commit: 2a7ea31, github.com/apache/spark/pull/6897
+
+  [SPARK-8379] [SQL] avoid speculative tasks write to the same file
+  jeanlyn 
+  2015-06-21 00:13:40 -0700
+  Commit: f0e4040, github.com/apache/spark/pull/6833
+
+  [SPARK-8468] [ML] Take the negative of some metrics in Regress

Git Push Summary

2015-06-22 Thread pwendell
Repository: spark
Updated Tags:  refs/tags/v1.4.1-rc1 [created] d0a5560ce




[2/2] spark git commit: Preparing development version 1.4.2-SNAPSHOT

2015-06-22 Thread pwendell
Preparing development version 1.4.2-SNAPSHOT


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1cfa7302
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1cfa7302
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1cfa7302

Branch: refs/heads/branch-1.4
Commit: 1cfa7302ee9ac4ec31728f74c63714d53adf27b7
Parents: d0a5560
Author: Patrick Wendell 
Authored: Mon Jun 22 22:21:31 2015 -0700
Committer: Patrick Wendell 
Committed: Mon Jun 22 22:21:31 2015 -0700

--
 assembly/pom.xml  | 2 +-
 bagel/pom.xml | 2 +-
 core/pom.xml  | 2 +-
 examples/pom.xml  | 2 +-
 external/flume-sink/pom.xml   | 2 +-
 external/flume/pom.xml| 2 +-
 external/kafka-assembly/pom.xml   | 2 +-
 external/kafka/pom.xml| 2 +-
 external/mqtt/pom.xml | 2 +-
 external/twitter/pom.xml  | 2 +-
 external/zeromq/pom.xml   | 2 +-
 extras/java8-tests/pom.xml| 2 +-
 extras/kinesis-asl/pom.xml| 2 +-
 extras/spark-ganglia-lgpl/pom.xml | 2 +-
 graphx/pom.xml| 2 +-
 launcher/pom.xml  | 2 +-
 mllib/pom.xml | 2 +-
 network/common/pom.xml| 2 +-
 network/shuffle/pom.xml   | 2 +-
 network/yarn/pom.xml  | 2 +-
 pom.xml   | 2 +-
 repl/pom.xml  | 2 +-
 sql/catalyst/pom.xml  | 2 +-
 sql/core/pom.xml  | 2 +-
 sql/hive-thriftserver/pom.xml | 2 +-
 sql/hive/pom.xml  | 2 +-
 streaming/pom.xml | 2 +-
 tools/pom.xml | 2 +-
 unsafe/pom.xml| 2 +-
 yarn/pom.xml  | 2 +-
 30 files changed, 30 insertions(+), 30 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1cfa7302/assembly/pom.xml
--
diff --git a/assembly/pom.xml b/assembly/pom.xml
index ba233e7..228db59 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1
+1.4.2-SNAPSHOT
 ../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1cfa7302/bagel/pom.xml
--
diff --git a/bagel/pom.xml b/bagel/pom.xml
index c5e9183..ce791a6 100644
--- a/bagel/pom.xml
+++ b/bagel/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1
+1.4.2-SNAPSHOT
 ../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1cfa7302/core/pom.xml
--
diff --git a/core/pom.xml b/core/pom.xml
index f0d236d..176ea9b 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1
+1.4.2-SNAPSHOT
 ../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1cfa7302/examples/pom.xml
--
diff --git a/examples/pom.xml b/examples/pom.xml
index e9a9cc2..877c2fb 100644
--- a/examples/pom.xml
+++ b/examples/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1
+1.4.2-SNAPSHOT
 ../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1cfa7302/external/flume-sink/pom.xml
--
diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml
index 9669507..37f2b1b 100644
--- a/external/flume-sink/pom.xml
+++ b/external/flume-sink/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1
+1.4.2-SNAPSHOT
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1cfa7302/external/flume/pom.xml
--
diff --git a/external/flume/pom.xml b/external/flume/pom.xml
index b3ad09a..9789435 100644
--- a/external/flume/pom.xml
+++ b/external/flume/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1
+1.4.2-SNAPSHOT
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1cfa7302/external/kafka-assembly/pom.xml
--
diff --git a/external/kafka-assembly/pom.xml b/external/kafka-assembly/pom.xml
index c05bd1b..18b1d86 100644
--- a/external/kafka-assembly/pom.xml
+++ b/external/kafka-assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1
+1.4.2-SNAPSHOT
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/1cfa7302/external/kafka/pom.xml

[1/2] spark git commit: Preparing Spark release v1.4.1-rc1

2015-06-22 Thread pwendell
Repository: spark
Updated Branches:
  refs/heads/branch-1.4 48d683014 -> 1cfa7302e


Preparing Spark release v1.4.1-rc1


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d0a5560c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d0a5560c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d0a5560c

Branch: refs/heads/branch-1.4
Commit: d0a5560ce4b898ef931c37a315ad53e1d5fbe6c6
Parents: 48d6830
Author: Patrick Wendell 
Authored: Mon Jun 22 22:21:26 2015 -0700
Committer: Patrick Wendell 
Committed: Mon Jun 22 22:21:26 2015 -0700

--
 assembly/pom.xml  | 2 +-
 bagel/pom.xml | 2 +-
 core/pom.xml  | 2 +-
 examples/pom.xml  | 2 +-
 external/flume-sink/pom.xml   | 2 +-
 external/flume/pom.xml| 2 +-
 external/kafka-assembly/pom.xml   | 2 +-
 external/kafka/pom.xml| 2 +-
 external/mqtt/pom.xml | 2 +-
 external/twitter/pom.xml  | 2 +-
 external/zeromq/pom.xml   | 2 +-
 extras/java8-tests/pom.xml| 2 +-
 extras/kinesis-asl/pom.xml| 2 +-
 extras/spark-ganglia-lgpl/pom.xml | 2 +-
 graphx/pom.xml| 2 +-
 launcher/pom.xml  | 2 +-
 mllib/pom.xml | 2 +-
 network/common/pom.xml| 2 +-
 network/shuffle/pom.xml   | 2 +-
 network/yarn/pom.xml  | 2 +-
 pom.xml   | 2 +-
 repl/pom.xml  | 2 +-
 sql/catalyst/pom.xml  | 2 +-
 sql/core/pom.xml  | 2 +-
 sql/hive-thriftserver/pom.xml | 2 +-
 sql/hive/pom.xml  | 2 +-
 streaming/pom.xml | 2 +-
 tools/pom.xml | 2 +-
 unsafe/pom.xml| 2 +-
 yarn/pom.xml  | 2 +-
 30 files changed, 30 insertions(+), 30 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d0a5560c/assembly/pom.xml
--
diff --git a/assembly/pom.xml b/assembly/pom.xml
index b53d7c3..ba233e7 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1-SNAPSHOT
+1.4.1
 ../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d0a5560c/bagel/pom.xml
--
diff --git a/bagel/pom.xml b/bagel/pom.xml
index dce13c4..c5e9183 100644
--- a/bagel/pom.xml
+++ b/bagel/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1-SNAPSHOT
+1.4.1
 ../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d0a5560c/core/pom.xml
--
diff --git a/core/pom.xml b/core/pom.xml
index 4f3375d..f0d236d 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1-SNAPSHOT
+1.4.1
 ../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d0a5560c/examples/pom.xml
--
diff --git a/examples/pom.xml b/examples/pom.xml
index eee557a..e9a9cc2 100644
--- a/examples/pom.xml
+++ b/examples/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1-SNAPSHOT
+1.4.1
 ../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d0a5560c/external/flume-sink/pom.xml
--
diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml
index fc0ef59..9669507 100644
--- a/external/flume-sink/pom.xml
+++ b/external/flume-sink/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1-SNAPSHOT
+1.4.1
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d0a5560c/external/flume/pom.xml
--
diff --git a/external/flume/pom.xml b/external/flume/pom.xml
index ee70d8e..b3ad09a 100644
--- a/external/flume/pom.xml
+++ b/external/flume/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1-SNAPSHOT
+1.4.1
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d0a5560c/external/kafka-assembly/pom.xml
--
diff --git a/external/kafka-assembly/pom.xml b/external/kafka-assembly/pom.xml
index 72d3547..c05bd1b 100644
--- a/external/kafka-assembly/pom.xml
+++ b/external/kafka-assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.10
-1.4.1-SNAPSHOT
+1.4.1
 ../../pom.xml
   
 

http://git-wip-us.apache.org/repos/asf/spar

spark git commit: [SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max bins

2015-06-22 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 44fa7df64 -> 164fe2aa4


[SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max bins

Author: Holden Karau 

Closes #6331 from 
holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and 
squashes the following commits:

2894695 [Holden Karau] remove extra blank line
2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the 
test a bit nicer too
3a09170 [Holden Karau] add maxBins to to the train method as well
af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and 
correctly mention the default of 32 in other places where it mentioned 100
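
For context, the Scala-side knob that the new Python `maxBins` argument feeds into, as a hedged sketch (the training data is a placeholder supplied by the caller):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.rdd.RDD

def trainRegressor(trainingData: RDD[LabeledPoint], maxBins: Int): GradientBoostedTreesModel = {
  val boostingStrategy = BoostingStrategy.defaultParams("Regression")
  boostingStrategy.setNumIterations(4)
  boostingStrategy.treeStrategy.setMaxDepth(4)
  boostingStrategy.treeStrategy.setMaxBins(maxBins)  // the setting this patch exposes to Python
  GradientBoostedTrees.train(trainingData, boostingStrategy)
}
```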


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/164fe2aa
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/164fe2aa
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/164fe2aa

Branch: refs/heads/master
Commit: 164fe2aa44993da6c77af6de5efdae47a8b3958c
Parents: 44fa7df
Author: Holden Karau 
Authored: Mon Jun 22 22:40:19 2015 -0700
Committer: Joseph K. Bradley 
Committed: Mon Jun 22 22:40:19 2015 -0700

--
 .../spark/mllib/api/python/PythonMLLibAPI.scala |  4 +++-
 python/pyspark/mllib/tests.py   |  7 +++
 python/pyspark/mllib/tree.py| 22 +---
 3 files changed, 24 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/164fe2aa/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
index 634d56d..f9a271f 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
@@ -696,12 +696,14 @@ private[python] class PythonMLLibAPI extends Serializable 
{
   lossStr: String,
   numIterations: Int,
   learningRate: Double,
-  maxDepth: Int): GradientBoostedTreesModel = {
+  maxDepth: Int,
+  maxBins: Int): GradientBoostedTreesModel = {
 val boostingStrategy = BoostingStrategy.defaultParams(algoStr)
 boostingStrategy.setLoss(Losses.fromString(lossStr))
 boostingStrategy.setNumIterations(numIterations)
 boostingStrategy.setLearningRate(learningRate)
 boostingStrategy.treeStrategy.setMaxDepth(maxDepth)
+boostingStrategy.treeStrategy.setMaxBins(maxBins)
 boostingStrategy.treeStrategy.categoricalFeaturesInfo = 
categoricalFeaturesInfo.asScala.toMap
 
 val cached = data.rdd.persist(StorageLevel.MEMORY_AND_DISK)

http://git-wip-us.apache.org/repos/asf/spark/blob/164fe2aa/python/pyspark/mllib/tests.py
--
diff --git a/python/pyspark/mllib/tests.py b/python/pyspark/mllib/tests.py
index b13159e..c8d61b9 100644
--- a/python/pyspark/mllib/tests.py
+++ b/python/pyspark/mllib/tests.py
@@ -463,6 +463,13 @@ class ListTests(MLlibTestCase):
 except ValueError:
 self.fail()
 
+# Verify that maxBins is being passed through
+GradientBoostedTrees.trainRegressor(
+rdd, categoricalFeaturesInfo=categoricalFeaturesInfo, 
numIterations=4, maxBins=32)
+with self.assertRaises(Exception) as cm:
+GradientBoostedTrees.trainRegressor(
+rdd, categoricalFeaturesInfo=categoricalFeaturesInfo, 
numIterations=4, maxBins=1)
+
 
 class StatTests(MLlibTestCase):
 # SPARK-4023

http://git-wip-us.apache.org/repos/asf/spark/blob/164fe2aa/python/pyspark/mllib/tree.py
--
diff --git a/python/pyspark/mllib/tree.py b/python/pyspark/mllib/tree.py
index cfcbea5..372b86a 100644
--- a/python/pyspark/mllib/tree.py
+++ b/python/pyspark/mllib/tree.py
@@ -299,7 +299,7 @@ class RandomForest(object):
  1 internal node + 2 leaf nodes. (default: 4)
 :param maxBins: maximum number of bins used for splitting
  features
- (default: 100)
+ (default: 32)
 :param seed: Random seed for bootstrapping and choosing feature
  subsets.
 :return: RandomForestModel that can be used for prediction
@@ -377,7 +377,7 @@ class RandomForest(object):
  1 leaf node; depth 1 means 1 internal node + 2 leaf
  nodes. (default: 4)
 :param maxBins: maximum number of bins used for splitting
- features (default: 100)
+ features (default: 32)
 :param seed: Random seed for bootstrapping and choosing feature
  

spark git commit: [SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max bins

2015-06-22 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/branch-1.4 1cfa7302e -> 22cc1ab66


[SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max bins

Author: Holden Karau 

Closes #6331 from 
holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and 
squashes the following commits:

2894695 [Holden Karau] remove extra blank line
2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the 
test a bit nicer too
3a09170 [Holden Karau] add maxBins to to the train method as well
af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and 
correctly mention the default of 32 in other places where it mentioned 100

(cherry picked from commit 164fe2aa44993da6c77af6de5efdae47a8b3958c)
Signed-off-by: Joseph K. Bradley 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/22cc1ab6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/22cc1ab6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/22cc1ab6

Branch: refs/heads/branch-1.4
Commit: 22cc1ab66ec26aeac9dacaeb176f94d3bdaccbf4
Parents: 1cfa730
Author: Holden Karau 
Authored: Mon Jun 22 22:40:19 2015 -0700
Committer: Joseph K. Bradley 
Committed: Mon Jun 22 22:40:31 2015 -0700

--
 .../spark/mllib/api/python/PythonMLLibAPI.scala |  4 +++-
 python/pyspark/mllib/tests.py   |  7 +++
 python/pyspark/mllib/tree.py| 22 +---
 3 files changed, 24 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/22cc1ab6/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
index 16f3131..d1b2c98 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
@@ -685,12 +685,14 @@ private[python] class PythonMLLibAPI extends Serializable 
{
   lossStr: String,
   numIterations: Int,
   learningRate: Double,
-  maxDepth: Int): GradientBoostedTreesModel = {
+  maxDepth: Int,
+  maxBins: Int): GradientBoostedTreesModel = {
 val boostingStrategy = BoostingStrategy.defaultParams(algoStr)
 boostingStrategy.setLoss(Losses.fromString(lossStr))
 boostingStrategy.setNumIterations(numIterations)
 boostingStrategy.setLearningRate(learningRate)
 boostingStrategy.treeStrategy.setMaxDepth(maxDepth)
+boostingStrategy.treeStrategy.setMaxBins(maxBins)
 boostingStrategy.treeStrategy.categoricalFeaturesInfo = 
categoricalFeaturesInfo.asScala.toMap
 
 val cached = data.rdd.persist(StorageLevel.MEMORY_AND_DISK)

http://git-wip-us.apache.org/repos/asf/spark/blob/22cc1ab6/python/pyspark/mllib/tests.py
--
diff --git a/python/pyspark/mllib/tests.py b/python/pyspark/mllib/tests.py
index 7a113f8..4335143 100644
--- a/python/pyspark/mllib/tests.py
+++ b/python/pyspark/mllib/tests.py
@@ -444,6 +444,13 @@ class ListTests(MLlibTestCase):
 except ValueError:
 self.fail()
 
+# Verify that maxBins is being passed through
+GradientBoostedTrees.trainRegressor(
+rdd, categoricalFeaturesInfo=categoricalFeaturesInfo, 
numIterations=4, maxBins=32)
+with self.assertRaises(Exception) as cm:
+GradientBoostedTrees.trainRegressor(
+rdd, categoricalFeaturesInfo=categoricalFeaturesInfo, 
numIterations=4, maxBins=1)
+
 
 class StatTests(MLlibTestCase):
 # SPARK-4023

http://git-wip-us.apache.org/repos/asf/spark/blob/22cc1ab6/python/pyspark/mllib/tree.py
--
diff --git a/python/pyspark/mllib/tree.py b/python/pyspark/mllib/tree.py
index cfcbea5..372b86a 100644
--- a/python/pyspark/mllib/tree.py
+++ b/python/pyspark/mllib/tree.py
@@ -299,7 +299,7 @@ class RandomForest(object):
  1 internal node + 2 leaf nodes. (default: 4)
 :param maxBins: maximum number of bins used for splitting
  features
- (default: 100)
+ (default: 32)
 :param seed: Random seed for bootstrapping and choosing feature
  subsets.
 :return: RandomForestModel that can be used for prediction
@@ -377,7 +377,7 @@ class RandomForest(object):
  1 leaf node; depth 1 means 1 internal node + 2 leaf
  nodes. (default: 4)
 :param maxBins: maximum number of bins used for splitting
- features (default: 100)
+ features (default: 32)

spark git commit: [SPARK-8431] [SPARKR] Add in operator to DataFrame Column in SparkR

2015-06-22 Thread davies
Repository: spark
Updated Branches:
  refs/heads/master 164fe2aa4 -> d4f633514


[SPARK-8431] [SPARKR] Add in operator to DataFrame Column in SparkR

[[SPARK-8431] Add in operator to DataFrame Column in SparkR - ASF 
JIRA](https://issues.apache.org/jira/browse/SPARK-8431)

Author: Yu ISHIKAWA 

Closes #6941 from yu-iskw/SPARK-8431 and squashes the following commits:

1f64423 [Yu ISHIKAWA] Modify the comment
f4309a7 [Yu ISHIKAWA] Make a `setMethod` for `%in%` be independent
6e37936 [Yu ISHIKAWA] Modify a variable name
c196173 [Yu ISHIKAWA] [SPARK-8431][SparkR] Add in operator to DataFrame Column 
in SparkR


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d4f63351
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d4f63351
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d4f63351

Branch: refs/heads/master
Commit: d4f633514a393320c9ae64c00a75f702e6f58c67
Parents: 164fe2a
Author: Yu ISHIKAWA 
Authored: Mon Jun 22 23:04:36 2015 -0700
Committer: Davies Liu 
Committed: Mon Jun 22 23:04:36 2015 -0700

--
 R/pkg/R/column.R | 16 
 R/pkg/inst/tests/test_sparkSQL.R | 10 ++
 2 files changed, 26 insertions(+)
--
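
The new SparkR forms `filter(df, "age in (19, 30)")` and `where(df, df$age %in% c(19, 30))` mirror the SQL-expression filter already available on the JVM side. A small illustrative Scala sketch (the spark-shell `sqlContext` and the sample rows are assumptions for illustration only):

    import sqlContext.implicits._

    // Hypothetical sample data similar to the fixtures used in the R tests.
    val df = Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)).toDF("name", "age")

    val filtered3 = df.filter("age in (19)")      // 1 row
    val filtered4 = df.filter("age in (19, 30)")  // 2 rows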


http://git-wip-us.apache.org/repos/asf/spark/blob/d4f63351/R/pkg/R/column.R
--
diff --git a/R/pkg/R/column.R b/R/pkg/R/column.R
index 80e92d3..8e4b0f5 100644
--- a/R/pkg/R/column.R
+++ b/R/pkg/R/column.R
@@ -210,6 +210,22 @@ setMethod("cast",
 }
   })
 
+#' Match a column with given values.
+#'
+#' @rdname column
+#' @return the matched values as a result of comparing with the given values.
+#' \dontrun{
+#'   filter(df, "age in (10, 30)")
+#'   where(df, df$age %in% c(10, 30))
+#' }
+setMethod("%in%",
+  signature(x = "Column"),
+  function(x, table) {
+table <- listToSeq(as.list(table))
+jc <- callJMethod(x@jc, "in", table)
+return(column(jc))
+  })
+
 #' Approx Count Distinct
 #'
 #' @rdname column

http://git-wip-us.apache.org/repos/asf/spark/blob/d4f63351/R/pkg/inst/tests/test_sparkSQL.R
--
diff --git a/R/pkg/inst/tests/test_sparkSQL.R b/R/pkg/inst/tests/test_sparkSQL.R
index fc7f3f0..417153d 100644
--- a/R/pkg/inst/tests/test_sparkSQL.R
+++ b/R/pkg/inst/tests/test_sparkSQL.R
@@ -693,6 +693,16 @@ test_that("filter() on a DataFrame", {
   filtered2 <- where(df, df$name != "Michael")
   expect_true(count(filtered2) == 2)
   expect_true(collect(filtered2)$age[2] == 19)
+
+  # test suites for %in%
+  filtered3 <- filter(df, "age in (19)")
+  expect_equal(count(filtered3), 1)
+  filtered4 <- filter(df, "age in (19, 30)")
+  expect_equal(count(filtered4), 2)
+  filtered5 <- where(df, df$age %in% c(19))
+  expect_equal(count(filtered5), 1)
+  filtered6 <- where(df, df$age %in% c(19, 30))
+  expect_equal(count(filtered6), 2)
 })
 
 test_that("join() on a DataFrame", {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-8359] [SQL] Fix incorrect decimal precision after multiplication

2015-06-22 Thread davies
Repository: spark
Updated Branches:
  refs/heads/master d4f633514 -> 31bd30687


[SPARK-8359] [SQL] Fix incorrect decimal precision after multiplication

JIRA: https://issues.apache.org/jira/browse/SPARK-8359

Author: Liang-Chi Hsieh 

Closes #6814 from viirya/fix_decimal2 and squashes the following commits:

071a757 [Liang-Chi Hsieh] Remove maximum precision and use 
MathContext.UNLIMITED.
df217d4 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into 
fix_decimal2
a43bfc3 [Liang-Chi Hsieh] Add MathContext with maximum supported precision.
72eeb3f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into 
fix_decimal2
44c9348 [Liang-Chi Hsieh] Fix incorrect decimal precision after multiplication.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/31bd3068
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/31bd3068
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/31bd3068

Branch: refs/heads/master
Commit: 31bd30687bc29c0e457c37308d489ae2b6e5b72a
Parents: d4f6335
Author: Liang-Chi Hsieh 
Authored: Mon Jun 22 23:11:56 2015 -0700
Committer: Davies Liu 
Committed: Mon Jun 22 23:11:56 2015 -0700

--
 .../src/main/scala/org/apache/spark/sql/types/Decimal.scala| 6 --
 .../org/apache/spark/sql/types/decimal/DecimalSuite.scala  | 5 +
 2 files changed, 9 insertions(+), 2 deletions(-)
--
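
The underlying issue is that scala.math.BigDecimal carries MathContext.DECIMAL128 (34 significant digits) by default, so the 38-digit product of two Long.MaxValue-sized values is silently rounded; attaching MathContext.UNLIMITED, as the patch does in Decimal.toBigDecimal, keeps the exact value. A small sketch of the difference:

    import java.math.MathContext

    val rounded = BigDecimal(Long.MaxValue) * BigDecimal(Long.MaxValue)
    val exact   = BigDecimal(Long.MaxValue, MathContext.UNLIMITED) *
                  BigDecimal(Long.MaxValue, MathContext.UNLIMITED)

    println(rounded.precision)               // expected to be capped at 34 by DECIMAL128
    println(exact.underlying.unscaledValue)  // 85070591730234615847396907784232501249,
                                             // the value asserted in DecimalSuite below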


http://git-wip-us.apache.org/repos/asf/spark/blob/31bd3068/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
index a85af9e..bd9823b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
@@ -17,6 +17,8 @@
 
 package org.apache.spark.sql.types
 
+import java.math.{MathContext, RoundingMode}
+
 import org.apache.spark.annotation.DeveloperApi
 
 /**
@@ -137,9 +139,9 @@ final class Decimal extends Ordered[Decimal] with 
Serializable {
 
   def toBigDecimal: BigDecimal = {
 if (decimalVal.ne(null)) {
-  decimalVal
+  decimalVal(MathContext.UNLIMITED)
 } else {
-  BigDecimal(longVal, _scale)
+  BigDecimal(longVal, _scale)(MathContext.UNLIMITED)
 }
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/31bd3068/sql/catalyst/src/test/scala/org/apache/spark/sql/types/decimal/DecimalSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/decimal/DecimalSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/decimal/DecimalSuite.scala
index 4c0365c..ccc29c0 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/types/decimal/DecimalSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/types/decimal/DecimalSuite.scala
@@ -162,4 +162,9 @@ class DecimalSuite extends SparkFunSuite with 
PrivateMethodTester {
 assert(new Decimal().set(100L, 10, 0).toUnscaledLong === 100L)
 assert(Decimal(Long.MaxValue, 100, 0).toUnscaledLong === Long.MaxValue)
   }
+
+  test("accurate precision after multiplication") {
+val decimal = (Decimal(Long.MaxValue, 38, 0) * Decimal(Long.MaxValue, 38, 
0)).toJavaBigDecimal
+assert(decimal.unscaledValue.toString === 
"85070591730234615847396907784232501249")
+  }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-8483] [STREAMING] Remove commons-lang3 dependency from Flume Si…

2015-06-22 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/master 31bd30687 -> 9b618fb0d


[SPARK-8483] [STREAMING] Remove commons-lang3 dependency from Flume Si…

…nk. Also bump Flume version to 1.6.0

Author: Hari Shreedharan 

Closes #6910 from harishreedharan/remove-commons-lang3 and squashes the 
following commits:

9875f7d [Hari Shreedharan] Revert back to Flume 1.4.0
ca35eb0 [Hari Shreedharan] [SPARK-8483][Streaming] Remove commons-lang3 
dependency from Flume Sink. Also bump Flume version to 1.6.0


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b618fb0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b618fb0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b618fb0

Branch: refs/heads/master
Commit: 9b618fb0d2536121d2784ff5341d74723e810fc5
Parents: 31bd306
Author: Hari Shreedharan 
Authored: Mon Jun 22 23:34:17 2015 -0700
Committer: Tathagata Das 
Committed: Mon Jun 22 23:34:17 2015 -0700

--
 external/flume-sink/pom.xml  | 4 
 .../spark/streaming/flume/sink/SparkAvroCallbackHandler.scala| 4 ++--
 2 files changed, 2 insertions(+), 6 deletions(-)
--
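
The replacement relies only on the JDK: the first eight characters of a random UUID string are hexadecimal digits (the first dash sits at index 8), which still yields 16^8, about 4.3 billion, distinct prefixes for guarding the sequence numbers. A minimal sketch of the same idea outside the sink:

    import java.util.UUID

    val seqBase = UUID.randomUUID().toString.substring(0, 8)
    println(seqBase)         // e.g. "3f2b8c1a" (random per run)
    println(seqBase.length)  // 8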


http://git-wip-us.apache.org/repos/asf/spark/blob/9b618fb0/external/flume-sink/pom.xml
--
diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml
index 7a7dccc..0664cfb 100644
--- a/external/flume-sink/pom.xml
+++ b/external/flume-sink/pom.xml
@@ -36,10 +36,6 @@
 
   
 
-  <groupId>org.apache.commons</groupId>
-  <artifactId>commons-lang3</artifactId>
-
-
   <groupId>org.apache.flume</groupId>
   <artifactId>flume-ng-sdk</artifactId>
   

http://git-wip-us.apache.org/repos/asf/spark/blob/9b618fb0/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
--
diff --git 
a/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
 
b/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
index dc2a4ab..719fca0 100644
--- 
a/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
+++ 
b/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
@@ -16,13 +16,13 @@
  */
 package org.apache.spark.streaming.flume.sink
 
+import java.util.UUID
 import java.util.concurrent.{CountDownLatch, Executors}
 import java.util.concurrent.atomic.AtomicLong
 
 import scala.collection.mutable
 
 import org.apache.flume.Channel
-import org.apache.commons.lang3.RandomStringUtils
 
 /**
  * Class that implements the SparkFlumeProtocol, that is used by the Avro 
Netty Server to process
@@ -53,7 +53,7 @@ private[flume] class SparkAvroCallbackHandler(val threads: 
Int, val channel: Cha
   // Since the new txn may not have the same sequence number we must guard 
against accidentally
   // committing a new transaction. To reduce the probability of that happening 
a random string is
   // prepended to the sequence number. Does not change for life of sink
-  private val seqBase = RandomStringUtils.randomAlphanumeric(8)
+  private val seqBase = UUID.randomUUID().toString.substring(0, 8)
   private val seqCounter = new AtomicLong(0)
 
   // Protected by `sequenceNumberToProcessor`


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-8541] [PYSPARK] test the absolute error in approx doctests

2015-06-22 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/master 9b618fb0d -> f0dcbe8a7


[SPARK-8541] [PYSPARK] test the absolute error in approx doctests

A minor change but one which is (presumably) visible on the public api docs 
webpage.

Author: Scott Taylor 

Closes #6942 from megatron-me-uk/patch-3 and squashes the following commits:

fbed000 [Scott Taylor] test the absolute error in approx doctests


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f0dcbe8a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f0dcbe8a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f0dcbe8a

Branch: refs/heads/master
Commit: f0dcbe8a7c2de510b47a21eb45cde34777638758
Parents: 9b618fb
Author: Scott Taylor 
Authored: Mon Jun 22 23:37:56 2015 -0700
Committer: Josh Rosen 
Committed: Mon Jun 22 23:37:56 2015 -0700

--
 python/pyspark/rdd.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--
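
The abs() matters because a large underestimate from the approximate algorithms yields a negative signed relative error, which trivially satisfies the old `< 0.05` comparison. A small Scala sketch of that failure mode (the 50%-low estimate is a hypothetical stand-in for a bad approximation):

    val r = (0 until 1000).sum.toDouble          // 499500.0, the exact sum used in the doctests
    val badApprox = 0.5 * r                      // hypothetical estimate that is 50% too low

    println((badApprox - r) / r < 0.05)          // true:  the old check would not flag this
    println(math.abs(badApprox - r) / r < 0.05)  // false: the updated check does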


http://git-wip-us.apache.org/repos/asf/spark/blob/f0dcbe8a/python/pyspark/rdd.py
--
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 20c0bc9..1b64be2 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -2198,7 +2198,7 @@ class RDD(object):
 
 >>> rdd = sc.parallelize(range(1000), 10)
 >>> r = sum(range(1000))
->>> (rdd.sumApprox(1000) - r) / r < 0.05
+>>> abs(rdd.sumApprox(1000) - r) / r < 0.05
 True
 """
 jrdd = self.mapPartitions(lambda it: 
[float(sum(it))])._to_java_object_rdd()
@@ -2215,7 +2215,7 @@ class RDD(object):
 
 >>> rdd = sc.parallelize(range(1000), 10)
 >>> r = sum(range(1000)) / 1000.0
->>> (rdd.meanApprox(1000) - r) / r < 0.05
+>>> abs(rdd.meanApprox(1000) - r) / r < 0.05
 True
 """
 jrdd = self.map(float)._to_java_object_rdd()


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-8541] [PYSPARK] test the absolute error in approx doctests

2015-06-22 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/branch-1.4 22cc1ab66 -> d0943afbc


[SPARK-8541] [PYSPARK] test the absolute error in approx doctests

A minor change but one which is (presumably) visible on the public api docs 
webpage.

Author: Scott Taylor 

Closes #6942 from megatron-me-uk/patch-3 and squashes the following commits:

fbed000 [Scott Taylor] test the absolute error in approx doctests

(cherry picked from commit f0dcbe8a7c2de510b47a21eb45cde34777638758)
Signed-off-by: Josh Rosen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d0943afb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d0943afb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d0943afb

Branch: refs/heads/branch-1.4
Commit: d0943afbcffec5d8b668794dedc8d85fb10b0596
Parents: 22cc1ab
Author: Scott Taylor 
Authored: Mon Jun 22 23:37:56 2015 -0700
Committer: Josh Rosen 
Committed: Mon Jun 22 23:38:21 2015 -0700

--
 python/pyspark/rdd.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d0943afb/python/pyspark/rdd.py
--
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 20c0bc9..1b64be2 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -2198,7 +2198,7 @@ class RDD(object):
 
 >>> rdd = sc.parallelize(range(1000), 10)
 >>> r = sum(range(1000))
->>> (rdd.sumApprox(1000) - r) / r < 0.05
+>>> abs(rdd.sumApprox(1000) - r) / r < 0.05
 True
 """
 jrdd = self.mapPartitions(lambda it: 
[float(sum(it))])._to_java_object_rdd()
@@ -2215,7 +2215,7 @@ class RDD(object):
 
 >>> rdd = sc.parallelize(range(1000), 10)
 >>> r = sum(range(1000)) / 1000.0
->>> (rdd.meanApprox(1000) - r) / r < 0.05
+>>> abs(rdd.meanApprox(1000) - r) / r < 0.05
 True
 """
 jrdd = self.map(float)._to_java_object_rdd()


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-8541] [PYSPARK] test the absolute error in approx doctests

2015-06-22 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/branch-1.3 45b4527e3 -> 716dcf631


[SPARK-8541] [PYSPARK] test the absolute error in approx doctests

A minor change but one which is (presumably) visible on the public api docs 
webpage.

Author: Scott Taylor 

Closes #6942 from megatron-me-uk/patch-3 and squashes the following commits:

fbed000 [Scott Taylor] test the absolute error in approx doctests

(cherry picked from commit f0dcbe8a7c2de510b47a21eb45cde34777638758)
Signed-off-by: Josh Rosen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/716dcf63
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/716dcf63
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/716dcf63

Branch: refs/heads/branch-1.3
Commit: 716dcf631558920c080cb824dcd617789b9f96d5
Parents: 45b4527
Author: Scott Taylor 
Authored: Mon Jun 22 23:37:56 2015 -0700
Committer: Josh Rosen 
Committed: Mon Jun 22 23:39:39 2015 -0700

--
 python/pyspark/rdd.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/716dcf63/python/pyspark/rdd.py
--
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index d80366a..bd18cb3 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -2130,7 +2130,7 @@ class RDD(object):
 
 >>> rdd = sc.parallelize(range(1000), 10)
 >>> r = sum(xrange(1000))
->>> (rdd.sumApprox(1000) - r) / r < 0.05
+>>> abs(rdd.sumApprox(1000) - r) / r < 0.05
 True
 """
 jrdd = self.mapPartitions(lambda it: 
[float(sum(it))])._to_java_object_rdd()
@@ -2147,7 +2147,7 @@ class RDD(object):
 
 >>> rdd = sc.parallelize(range(1000), 10)
 >>> r = sum(xrange(1000)) / 1000.0
->>> (rdd.meanApprox(1000) - r) / r < 0.05
+>>> abs(rdd.meanApprox(1000) - r) / r < 0.05
 True
 """
 jrdd = self.map(float)._to_java_object_rdd()


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-8483] [STREAMING] Remove commons-lang3 dependency from Flume Sink

2015-06-22 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/branch-1.4 d0943afbc -> 929479675


[SPARK-8483] [STREAMING] Remove commons-lang3 dependency from Flume Sink

Author: Hari Shreedharan 

Closes #6910 from harishreedharan/remove-commons-lang3 and squashes the 
following commits:

9875f7d [Hari Shreedharan] Revert back to Flume 1.4.0
ca35eb0 [Hari Shreedharan] [SPARK-8483][Streaming] Remove commons-lang3 
dependency from Flume Sink. Also bump Flume version to 1.6.0


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/92947967
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/92947967
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/92947967

Branch: refs/heads/branch-1.4
Commit: 9294796750f9c9330ab113f025763e68b624abc9
Parents: d0943af
Author: Hari Shreedharan 
Authored: Mon Jun 22 23:34:17 2015 -0700
Committer: Tathagata Das 
Committed: Mon Jun 22 23:41:35 2015 -0700

--
 external/flume-sink/pom.xml  | 4 
 .../spark/streaming/flume/sink/SparkAvroCallbackHandler.scala| 4 ++--
 2 files changed, 2 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/92947967/external/flume-sink/pom.xml
--
diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml
index 37f2b1b..ad431fa 100644
--- a/external/flume-sink/pom.xml
+++ b/external/flume-sink/pom.xml
@@ -36,10 +36,6 @@
 
   
 
-  <groupId>org.apache.commons</groupId>
-  <artifactId>commons-lang3</artifactId>
-
-
   <groupId>org.apache.flume</groupId>
   <artifactId>flume-ng-sdk</artifactId>
   

http://git-wip-us.apache.org/repos/asf/spark/blob/92947967/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
--
diff --git 
a/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
 
b/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
index dc2a4ab..719fca0 100644
--- 
a/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
+++ 
b/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
@@ -16,13 +16,13 @@
  */
 package org.apache.spark.streaming.flume.sink
 
+import java.util.UUID
 import java.util.concurrent.{CountDownLatch, Executors}
 import java.util.concurrent.atomic.AtomicLong
 
 import scala.collection.mutable
 
 import org.apache.flume.Channel
-import org.apache.commons.lang3.RandomStringUtils
 
 /**
  * Class that implements the SparkFlumeProtocol, that is used by the Avro 
Netty Server to process
@@ -53,7 +53,7 @@ private[flume] class SparkAvroCallbackHandler(val threads: 
Int, val channel: Cha
   // Since the new txn may not have the same sequence number we must guard 
against accidentally
   // committing a new transaction. To reduce the probability of that happening 
a random string is
   // prepended to the sequence number. Does not change for life of sink
-  private val seqBase = RandomStringUtils.randomAlphanumeric(8)
+  private val seqBase = UUID.randomUUID().toString.substring(0, 8)
   private val seqCounter = new AtomicLong(0)
 
   // Protected by `sequenceNumberToProcessor`


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org